Robust and Clustered Standard Errors

Week 7

Published

February 20, 2025

Before we start

  • Any questions?
Download script

Download data

Review of the previous week

Review

Load tidyverse library

library(tidyverse)

Load George Ward’s replication data
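A minimal sketch, assuming the downloaded file is saved as data/ward.csv (a hypothetical path; adjust it and the reader function to match the file you downloaded):

ward = read_csv("data/ward.csv") # hypothetical path; use read_dta() from haven if it is a Stata file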

Using ggpairs() from the GGally library, visualize the relationships between the variables. What can you see?
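One way to do this (a sketch; ward is the assumed data object from above, and the selected columns are just examples):

library(GGally)
ward %>%
  select(satislfe_survey_mean, parties_ingov) %>%
  ggpairs()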

Explore the relationship between subjective well-being (satislfe_survey_mean) and the number of parties in government (parties_ingov). Draw a boxplot. Does it look right? If not, correct it!
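A sketch of one possible correction: parties_ingov is numeric, so by default ggplot() draws a single box; mapping it to a factor (or setting group = parties_ingov) gives one box per value.

ggplot(ward, aes(x = factor(parties_ingov), y = satislfe_survey_mean)) +
  geom_boxplot()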

Set up a pooled model. Predict satislfe_survey_mean by parties_ingov. Save the model to the model_pooled object. Present the summary. What do you think about the model?
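A minimal sketch, assuming the data object is called ward:

model_pooled = lm(satislfe_survey_mean ~ parties_ingov, ward)
summary(model_pooled)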

Introduce country fixed effects. Save the model to the model_countryfe object. Present the summary. Compare it to the pooled model, paying attention to \(R^2\). Why are they different?
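A sketch, assuming the country identifier column in the data is named country:

model_countryfe = lm(satislfe_survey_mean ~ parties_ingov + country, ward)
summary(model_countryfe)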

Using ggpredict() from the ggeffects library, visualize model_countryfe. Plot parties_ingov and the following country values: BEL, DNK, SWE. Take a moment to understand the graph. Insert the argument connect_line = TRUE to make the graph easier to read.
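A sketch of the ggpredict() call (the bracket syntax selects specific factor levels); pass the connect-lines argument from the prompt to plot() if the isolated points are hard to read:

library(ggeffects)
pred = ggpredict(model_countryfe, terms = c("parties_ingov", "country [BEL, DNK, SWE]"))
plot(pred)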

Review of the homework

The set.seed() function makes a random process reproducible (“controllable”). To get a sense of what’s going on, let’s experiment a bit.

Let’s randomly generate 10 integers. Each time you run the chunk below, a different vector is generated.

sample.int(100, 10)
 [1] 78 29 62  2 50 80 69 34 65 93

Now, if we want to get the same results each time, we can set the seed.

set.seed(123)

sample.int(100, 10)
 [1] 31 79 51 14 67 42 50 43 97 25

Be careful when you include set.seed() within functions.

wrongfunc = function(x){
  # the seed is reset on every call, so every call returns the same draws
  set.seed(123)

  sample.int(x, 10)
}

In a simulation this defeats the purpose: because the seed is reset on every call, every iteration produces exactly the same draws.

for(i in 1:10){
  print(wrongfunc(100))
}
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 31 79 51 14 67 42 50 43 97 25

Instead, you want each iteration of the loop to produce new random draws while the overall sequence stays reproducible. Set the seed once, outside the function.

corrfunc = function(x){
  # no set.seed() inside the function: each call produces different draws
  sample.int(x, 10)
}

set.seed(123)
for(i in 1:10){
  print(corrfunc(100))
}
 [1] 31 79 51 14 67 42 50 43 97 25
 [1] 90 91 69 99 57 92  9 93 72 26
 [1]  7 42  9 83 36 78 81 43 76 15
 [1] 32  7  9 41 74 23 27 60 53 99
 [1] 53 27 96 38 89 34 93 69 72 76
 [1] 63 13 82 97 91 25 38 21 79 41
 [1] 47 90 60 95 16 94  6 72 86 92
 [1] 39 31 81 50 34  4 13 69 25 52
 [1] 22 89 32 25 87 35 40 30 12 31
 [1] 30 64 14 93 96 71 67 23 79 85

Agenda

  • Introduction to diagnostics

  • Fixing heteroscedasticity

  • If we can’t fix it, we will turn to robust standard errors

  • Finally, dealing with clustered standard errors

Country-Year Fixed Effects Model

Let’s explore the Comparative Political Data Set (CPDS). It consists of political and institutional country-level data. Take a look at its codebook.

Today we are working with the following variables.

  • prefisc_gini - Gini index. What is it?

  • openc - Openness of the economy (trade as % of GDP)

  • servadmi_pmp - Public and mandatory private employment services and administration as a percentage of GDP.

  • country and year

First of all, let’s load the data

library(readxl)
cpds = read_excel("data/cpds.xlsx")

Imagine you are interested in explaining inequality (prefisc_gini) by the amount of trade (measured as openness of the economy – openc). Set up a simple linear regression (SLR).

model_slr = lm(prefisc_gini ~ openc, cpds)
summary(model_slr)

Call:
lm(formula = prefisc_gini ~ openc, data = cpds)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.7307  -2.8912   0.5475   2.8113  12.9869 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 41.457621   0.295214 140.432   <2e-16 ***
openc       -0.001798   0.002652  -0.678    0.498    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.141 on 547 degrees of freedom
  (1317 observations deleted due to missingness)
Multiple R-squared:  0.00084,   Adjusted R-squared:  -0.0009866 
F-statistic: 0.4599 on 1 and 547 DF,  p-value: 0.498

Given how complex inequality is, it’s fair to assume there are confounders. Let’s control for labour market policy (measured by public employment services spending as % of GDP – servadmi_pmp). Set up a multiple linear regression (MLR) below. What do you think about the model?

model_mlr = lm(prefisc_gini ~ openc + servadmi_pmp, cpds)
summary(model_mlr)

Call:
lm(formula = prefisc_gini ~ openc + servadmi_pmp, data = cpds)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.2821  -2.5147   0.3219   2.6891  13.0194 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  42.216729   0.454877  92.809  < 2e-16 ***
openc        -0.007419   0.002691  -2.758  0.00605 ** 
servadmi_pmp  1.717317   1.862425   0.922  0.35695    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.934 on 477 degrees of freedom
  (1386 observations deleted due to missingness)
Multiple R-squared:  0.01985,   Adjusted R-squared:  0.01574 
F-statistic: 4.829 on 2 and 477 DF,  p-value: 0.008388

How can we check if the homoscedasticity assumption is satisfied? Proceed with a simple analysis: extract the residuals and fitted (predicted) values and plot them against each other. What do you think?

res = model_mlr$residuals
fit_val = model_mlr$fitted.values

ggplot() +
  geom_point(aes(x = fit_val, y = res)) +
  geom_hline(yintercept = 0) 

The same plot can be produced with base R functions. We’ll get to this next week in more detail.

plot(model_mlr, which = 1)

We know that our data has a country-year structure. Let’s introduce fixed effects to the model. First, make sure country and year are factors.

cpds$country = as.factor(cpds$country)
cpds$year = as.factor(cpds$year)

What has changed?

model_fe = lm(prefisc_gini ~ openc + servadmi_pmp + country + year, cpds)
summary(model_fe)

Call:
lm(formula = prefisc_gini ~ openc + servadmi_pmp + country + 
    year, data = cpds)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.4138 -1.1115 -0.0014  0.9614  6.5810 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            35.437237   1.560752  22.705  < 2e-16 ***
openc                   0.011568   0.004632   2.497 0.012898 *  
servadmi_pmp            2.446305   1.717145   1.425 0.155022    
countryAustria         -3.621612   0.771798  -4.692 3.68e-06 ***
countryBelgium         -3.636327   0.879871  -4.133 4.35e-05 ***
countryCanada          -1.604814   0.711066  -2.257 0.024538 *  
countryCzech Republic  -6.077881   1.015740  -5.984 4.75e-09 ***
countryDenmark         -4.564718   0.887802  -5.142 4.22e-07 ***
countryEstonia         -4.319239   1.184771  -3.646 0.000301 ***
countryFinland         -2.176054   0.923626  -2.356 0.018942 *  
countryFrance           0.007526   0.746337   0.010 0.991959    
countryGermany         -2.894690   0.735793  -3.934 9.80e-05 ***
countryGreece          -0.290260   1.005907  -0.289 0.773067    
countryHungary          1.280997   1.088680   1.177 0.240016    
countryIceland        -11.037538   1.339412  -8.241 2.32e-15 ***
countryIreland          2.232621   0.956071   2.335 0.020013 *  
countryItaly           -1.154673   0.991811  -1.164 0.245016    
countryJapan           -7.228360   1.331619  -5.428 9.75e-08 ***
countryLuxembourg      -6.698123   1.312531  -5.103 5.11e-07 ***
countryNetherlands     -3.895918   0.951399  -4.095 5.09e-05 ***
countryNorway          -5.508208   0.901830  -6.108 2.34e-09 ***
countryPoland          -0.011346   0.801278  -0.014 0.988709    
countryRomania         -2.871224   0.836577  -3.432 0.000660 ***
countrySlovakia        -9.720522   1.029048  -9.446  < 2e-16 ***
countrySlovenia        -8.015692   1.168882  -6.858 2.58e-11 ***
countrySpain           -0.276633   0.735491  -0.376 0.707022    
countrySweden          -5.162188   0.776541  -6.648 9.49e-11 ***
countrySwitzerland    -10.486322   0.834016 -12.573  < 2e-16 ***
countryUnited Kingdom   1.696744   0.724065   2.343 0.019587 *  
countryUSA              2.433579   0.735544   3.309 0.001020 ** 
year1981               -1.544550   2.429892  -0.636 0.525361    
year1982                1.455276   2.429893   0.599 0.549566    
year1983                6.820538   2.008574   3.396 0.000751 ***
year1984                1.041995   2.430511   0.429 0.668356    
year1985                3.685598   1.599656   2.304 0.021721 *  
year1986                1.800695   1.637568   1.100 0.272143    
year1987                4.646000   1.556861   2.984 0.003013 ** 
year1988                3.458806   1.634901   2.116 0.034979 *  
year1989                3.340307   1.635580   2.042 0.041762 *  
year1990                3.675834   1.579358   2.327 0.020428 *  
year1991                3.600641   1.610814   2.235 0.025935 *  
year1992                5.821592   1.534969   3.793 0.000171 ***
year1993                6.358422   1.605394   3.961 8.81e-05 ***
year1994                7.223842   1.568944   4.604 5.53e-06 ***
year1995                7.112686   1.506798   4.720 3.23e-06 ***
year1996                7.593346   1.540853   4.928 1.21e-06 ***
year1997                7.216507   1.560473   4.625 5.04e-06 ***
year1998                7.325058   1.562397   4.688 3.75e-06 ***
year1999                6.804528   1.540786   4.416 1.29e-05 ***
year2000                6.705205   1.502256   4.463 1.04e-05 ***
year2001                7.792546   1.609835   4.841 1.84e-06 ***
year2002                7.324104   1.556131   4.707 3.45e-06 ***
year2003                7.466408   1.546082   4.829 1.94e-06 ***
year2004                8.118553   1.482882   5.475 7.64e-08 ***
year2005                8.049804   1.526365   5.274 2.16e-07 ***
year2006                7.865085   1.521659   5.169 3.69e-07 ***
year2007                7.320005   1.484491   4.931 1.19e-06 ***
year2008                7.695709   1.508191   5.103 5.13e-07 ***
year2009                8.741259   1.518804   5.755 1.69e-08 ***
year2010                8.388088   1.487661   5.638 3.19e-08 ***
year2011                8.787078   1.525340   5.761 1.64e-08 ***
year2012                9.121104   1.514125   6.024 3.78e-09 ***
year2013                8.949772   1.491692   6.000 4.34e-09 ***
year2014                8.730829   1.519104   5.747 1.77e-08 ***
year2015                8.248437   1.512792   5.452 8.59e-08 ***
year2016                7.999621   1.495341   5.350 1.47e-07 ***
year2017                7.881164   1.522311   5.177 3.53e-07 ***
year2018                7.639807   1.516810   5.037 7.10e-07 ***
year2019                7.610938   1.526318   4.986 9.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.972 on 411 degrees of freedom
  (1386 observations deleted due to missingness)
Multiple R-squared:  0.7877,    Adjusted R-squared:  0.7526 
F-statistic: 22.43 on 68 and 411 DF,  p-value: < 2.2e-16

Now, let’s check if the heteroscedasticity problem persists.

plot(model_fe, which = 1)

Draw the same graph using ggplot(). Do you think there is a problem?
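A sketch mirroring the residuals-vs-fitted plot we built for model_mlr above:

ggplot() +
  geom_point(aes(x = model_fe$fitted.values, y = model_fe$residuals)) +
  geom_hline(yintercept = 0)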

Let’s conduct a formal test using bptest() from the lmtest library; it stands for the Breusch-Pagan test. How would you approach interpreting the results?

library(lmtest)
bptest(model_fe)

    studentized Breusch-Pagan test

data:  model_fe
BP = 182.68, df = 68, p-value = 1.928e-12

Robust Standard Errors

Now, introduce robust SEs. They are also referred to as heteroskedasticity-consistent standard errors (HC SEs).

First, load the estimatr library. If you don’t have it, take a moment to install it.

library(estimatr)

Now we can use the lm_robust() function to fit the model with robust standard errors that account for the heteroscedasticity.

model_hc = lm_robust(prefisc_gini ~ openc + servadmi_pmp, cpds,
                     fixed_effects = country + year,
                     se_type = "HC2")
summary(model_hc)

Call:
lm_robust(formula = prefisc_gini ~ openc + servadmi_pmp, data = cpds, 
    fixed_effects = country + year, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  CI Lower CI Upper  DF
openc         0.01157   0.004431   2.611 0.009364  0.002858  0.02028 411
servadmi_pmp  2.44631   1.769259   1.383 0.167516 -1.031621  5.92423 411

Multiple R-squared:  0.7877 ,   Adjusted R-squared:  0.7526
Multiple R-squared (proj. model):  0.01949 ,    Adjusted R-squared (proj. model):  -0.1427 
F-statistic (proj. model): 4.644 on 2 and 411 DF,  p-value: 0.01013

Let’s compare the two models side by side. What’s the difference? Did we account for heteroscedasticity? Remember that the estimates for the fixed effects are not displayed.

library(modelsummary)
modelsummary(list("Fixed Effects Model" = model_fe, 
                  "Fixed Effects Model (HC)" = model_hc), 
             stars = T,
             coef_omit = "country|year",
             gof_omit = "AIC|BIC|Log.Lik.|RMSE",
             coef_rename = c("Intercept",
                             "Economy Openness",
                             "Public Sector"))
                    Fixed Effects Model   Fixed Effects Model (HC)
Intercept           35.437***
                    (1.561)
Economy Openness     0.012*                0.012**
                    (0.005)               (0.004)
Public Sector        2.446                 2.446
                    (1.717)               (1.769)
Num.Obs.             480                   480
R2                   0.788                 0.788
R2 Adj.              0.753                 0.753
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Clustered Standard Errors

During the lecture we discussed clustered standard errors. Let’s try to implement them.

Assuming that observations for the same country are correlated across years (year \(t\) and year \(t+1\)), it is reasonable to cluster the standard errors by country. However, be careful if you don’t have enough observations per cluster (or only a few clusters). Note a couple of things:

  • \(R^2\) did not change

  • \(\beta\) coefficients did not change

model_clust = lm_robust(prefisc_gini ~ openc + servadmi_pmp, cpds,
                      fixed_effects = country + year,
                      clusters = country)

modelsummary(list("Fixed Effects Model" = model_fe, 
                  "Fixed Effects Model (HC)" = model_hc,
                  "Fixed Effects Model (HC + Cl)" = model_clust), 
             stars = T,
             coef_omit = "country|year",
             gof_omit = "AIC|BIC|Log.Lik.|RMSE",
             coef_rename = c("Intercept",
                             "Economy Openness",
                             "Public Sector"))
                    Fixed Effects Model   Fixed Effects Model (HC)   Fixed Effects Model (HC + Cl)
Intercept           35.437***
                    (1.561)
Economy Openness     0.012*                0.012**                    0.012
                    (0.005)               (0.004)                    (0.021)
Public Sector        2.446                 2.446                      2.446
                    (1.717)               (1.769)                    (2.180)
Num.Obs.             480                   480                        480
R2                   0.788                 0.788                      0.788
R2 Adj.              0.753                 0.753                      0.753
Std.Errors                                                            by: country
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Let’s compare the models visually. First, extract the information using tidy() from the broom library. The code below is a bit convoluted; take a moment in your free time to get a sense of what’s going on.

library(broom)
model_comparison = bind_rows(cbind(model = "Fixed Effects (HC)", tidy(model_hc, conf.int = T)),
                             cbind(model = "Fixed Effects (HC + Cl)", tidy(model_clust, conf.int = T)))
model_comparison
                    model         term   estimate   std.error statistic
1      Fixed Effects (HC)        openc 0.01156786 0.004430807 2.6107801
2      Fixed Effects (HC) servadmi_pmp 2.44630535 1.769259191 1.3826721
3 Fixed Effects (HC + Cl)        openc 0.01156786 0.021259151 0.5441357
4 Fixed Effects (HC + Cl) servadmi_pmp 2.44630535 2.179715863 1.1223047
      p.value     conf.low  conf.high         df      outcome
1 0.009363945  0.002857992 0.02027773 411.000000 prefisc_gini
2 0.167516215 -1.031620650 5.92423134 411.000000 prefisc_gini
3 0.642254837 -0.082433704 0.10556943   1.944857 prefisc_gini
4 0.292747596 -2.537707419 7.43031811   8.410874 prefisc_gini

This is a variation of an effects plot: instead of comparing different estimates within one model, we compare the same estimate across models. Clustering the standard errors has widened the confidence interval for openc, which now crosses zero.

model_comparison %>%
  filter(term == "openc") %>%
  ggplot(aes(y = model, x = estimate)) + 
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, height = 0.1), position = "dodge") + 
  geom_vline(xintercept = 0, linetype = 2) + 
  geom_point(position = position_dodge(width = 0.1)) + 
  theme_bw()

A small note on robust standard errors: they are not a panacea. If you face heteroscedasticity, consider reporting several model specifications and, as we did, trying different standard errors and clustering choices to assess how robust your results are. If your results do not hold across specifications, that is a signal that there may be no effect, or that you should explore alternative solutions, for example (see the sketch after this list):

  • Use another model (e.g., if you’re dealing with a binary dependent variable, use logit instead of a linear probability model)

  • Transform the dependent variable (e.g., take logs)
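Two illustrative sketches of these options; the variable names here (binary_dv, dv, x1, x2, df) are hypothetical stand-ins, not CPDS columns:

# logit instead of a linear probability model (binary_dv is a hypothetical 0/1 outcome)
model_logit = glm(binary_dv ~ x1 + x2, family = binomial(link = "logit"), data = df)

# transforming the DV, e.g. taking logs of a right-skewed, strictly positive outcome
model_logdv = lm(log(dv) ~ x1 + x2, data = df)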

Exercises

Let’s continue exploring the CPDS dataset. The goal is to identify which factors are associated with the popularity of right-wing parties. Say you operationalize it as the parliamentary seat share of right-wing parties in government (gov_right3).

Is political fragmentation associated with right-wing popularity? Political fragmentation is frequently measured using the effective number of parties index; feel free to explore it in your free time here. CPDS has an index for this: effpar_leg.

Solution

YOUR SOLUTION HERE

Set up a simple linear regression with gov_right3 as the dependent variable and effpar_leg as the independent variable. Present the summary. Take a moment to interpret the results.

Solution

YOUR SOLUTION HERE

Let’s check if the homoscedasticity assumption is satisfied. First, plot the residuals vs fitted graph, either using plot() or ggplot(). Does the graph show homoscedasticity?

Solution

YOUR SOLUTION HERE

Let’s formally test it. Run the Breusch-Pagan test (bptest() from the lmtest library). Interpret the results.

Solution

YOUR SOLUTION HERE

Ok, apparently there is something wrong with the data. Let’s first understand the relationship between the dependent and independent variables. Using geom_point(), plot effpar_leg against gov_right3.

Solution

YOUR SOLUTION HERE

It doesn’t look linear, right? Let’s experiment. We can see heavy zero inflation. We don’t know how to deal with that yet, so let’s drop the zeroes: using filter(), keep only the observations where gov_right3 is not equal to 0.

Solution
cpds_nozeroes = ...

Now, create a new variable sqrt_effpar_leg that is the square root of effpar_leg. You can use mutate() for this task.

Solution

YOUR SOLUTION HERE

Then draw a scatterplot with sqrt_effpar_leg on the X axis and gov_right3 on the Y axis. Draw a regression line. Were we able to make the relationship linear?

Solution

YOUR SOLUTION HERE

Set up a model where gov_right3 is the dependent variable and sqrt_effpar_leg is the independent variable. Present the summary. Compare this adjusted model to the previous one.

Solution

YOUR SOLUTION HERE

Let’s check if the homoscedasticity assumption is satisfied for the adjusted model. Plot the residuals vs fitted graph, either using plot() or ggplot(). How does it compare?

Solution

YOUR SOLUTION HERE

Now, using lm_robust(), cluster the standard errors by country. Present the summary. How did the results change? Pay special attention to the standard errors.

Solution

YOUR SOLUTION HERE

Add country and year fixed effects. Make sure that year is of class factor. How did the p-value change? Why?

Solution

YOUR SOLUTION HERE

Present the results for the models using modelsummary(). Report confidence intervals. Are the models robust? What do we account for, and what do we not?

Solution

YOUR SOLUTION HERE

Check List

I know what the standard error is

I remember the homoscedasticity assumption

I know when I might need to use robust standard errors

I know when I might need to use clustered standard errors

I know how to use lm_robust() to account for fixed effects and various standard errors