library(tidyverse)
Robust and Clustered Standard Errors
Week 7
Before we start
- Any questions?
Review of the previous week
Load tidyverse
library
Load the George Ward’s replication data
using ggpairs()
from GGally
library, visualize the relationships between the variables. What can you see?
Explore the relationship between subjective well being (satislfe_survey_mean
) and number of parties in government (parties_ingov
). Draw a boxplot. Does it look right? If not, correct!
Set up a pooled model. Predict satislfe_survey_mean
by parties_ingov
. Save the model to model_pooled
object. Present summary. What do you think about the model?
Introduce country fixed effects. Save it to the model_countryfe
object. Present summary. Compare it to the pooled model. Pay attention to \(R^2\). Why are they different?
Using ggpredict()
from ggeffects
library visualize model_countryfe
. Plot parties_ingov
and the following country
values: BEL, DNK, SWE
. Take a moment to understand the graph. Insert the argument connect_line = TRUE
to make the comprehension of the graph easier.
Review of the homework
The set.seed()
function allows you to make the random process “controllable”. To get the sense of what’s going on, let’s experiment a bit.
Let’s randomly generate 10 integers. Each time you run the chunk below, a different vector is generated.
sample.int(100, 10)
[1] 78 29 62 2 50 80 69 34 65 93
Now, if we want to get the same results each time, we can set the seed.
set.seed(123)
sample.int(100, 10)
[1] 31 79 51 14 67 42 50 43 97 25
Be careful when you include the set.seed()
within functions.
= function(x){
wrongfunc set.seed(123)
sample.int(x, 10)
}
In simulations it wouldn’t allow you to randomly sample/generate data.
for(i in 1:10){
print(wrongfunc(100))
}
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
[1] 31 79 51 14 67 42 50 43 97 25
Instead, you would want that the loop would produce the constant result.
= function(x){
corrfunc sample.int(x, 10)
}
set.seed(123)
for(i in 1:10){
print(corrfunc(100))
}
[1] 31 79 51 14 67 42 50 43 97 25
[1] 90 91 69 99 57 92 9 93 72 26
[1] 7 42 9 83 36 78 81 43 76 15
[1] 32 7 9 41 74 23 27 60 53 99
[1] 53 27 96 38 89 34 93 69 72 76
[1] 63 13 82 97 91 25 38 21 79 41
[1] 47 90 60 95 16 94 6 72 86 92
[1] 39 31 81 50 34 4 13 69 25 52
[1] 22 89 32 25 87 35 40 30 12 31
[1] 30 64 14 93 96 71 67 23 79 85
Agenda
Introduction to diagnostics
Fixing heteroscedasticity
And if we can’t, then we will be working with robust standard errors
Finally, dealing with clustered standard errors
Country-Year Fixed Effects Model
Let’s explore Comparative Political Dataset. It consists of political and institutional country-level data. Take a look on their codebook.
Today we are working with the following variables.
prefisc_gini
- Gini index. What is it?openc
- Openness of the economy (trade as % of GDP)servadmi_pmp
- Public and mandatory private employment services and administration as a percentage of GDP.country
andyear
First of all, let’s load the data
library(readxl)
= read_excel("data/cpds.xlsx") cpds
Imagine you are interested in explaining inequality (prefisc_gini
) by the amount of trade (measured as oppenness of the economy – openc
). Set up a simple linear regression (slr).
= lm(prefisc_gini ~ openc, cpds)
model_slr summary(model_slr)
Call:
lm(formula = prefisc_gini ~ openc, data = cpds)
Residuals:
Min 1Q Median 3Q Max
-11.7307 -2.8912 0.5475 2.8113 12.9869
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.457621 0.295214 140.432 <2e-16 ***
openc -0.001798 0.002652 -0.678 0.498
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.141 on 547 degrees of freedom
(1317 observations deleted due to missingness)
Multiple R-squared: 0.00084, Adjusted R-squared: -0.0009866
F-statistic: 0.4599 on 1 and 547 DF, p-value: 0.498
Given how complex the inequality, it’s fair to assume we have some confounders. Let’s control for labour market policies (measured by Public employment as % of GDP – servadmi_pmp
). Set up a multiple linear regression (MLR) below. What do you think about the model?
= lm(prefisc_gini ~ openc + servadmi_pmp, cpds)
model_mlr summary(model_mlr)
Call:
lm(formula = prefisc_gini ~ openc + servadmi_pmp, data = cpds)
Residuals:
Min 1Q Median 3Q Max
-12.2821 -2.5147 0.3219 2.6891 13.0194
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.216729 0.454877 92.809 < 2e-16 ***
openc -0.007419 0.002691 -2.758 0.00605 **
servadmi_pmp 1.717317 1.862425 0.922 0.35695
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.934 on 477 degrees of freedom
(1386 observations deleted due to missingness)
Multiple R-squared: 0.01985, Adjusted R-squared: 0.01574
F-statistic: 4.829 on 2 and 477 DF, p-value: 0.008388
How can we check if the homoscedasticity assumption is satisfied? Proceed with simple analysis. Extract residuals and fitted (predicted) values. Plot it. What do you think?
= model_mlr$residuals
res = model_mlr$fitted.values
fit_val
ggplot() +
geom_point(aes(x = fit_val, y = res)) +
geom_hline(yintercept = 0)
The same plot can be easily accessed using the base R functions. We’ll get to this next week in a more detail.
plot(model_mlr, which = 1)
We know that our data is of country
-year
structure. Let’s introduce the fixed effects to the model. First, make sure these are factors.
$country = as.factor(cpds$country)
cpds$year = as.factor(cpds$year) cpds
What has changed?
= lm(prefisc_gini ~ openc + servadmi_pmp + country + year, cpds)
model_fe summary(model_fe)
Call:
lm(formula = prefisc_gini ~ openc + servadmi_pmp + country +
year, data = cpds)
Residuals:
Min 1Q Median 3Q Max
-7.4138 -1.1115 -0.0014 0.9614 6.5810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.437237 1.560752 22.705 < 2e-16 ***
openc 0.011568 0.004632 2.497 0.012898 *
servadmi_pmp 2.446305 1.717145 1.425 0.155022
countryAustria -3.621612 0.771798 -4.692 3.68e-06 ***
countryBelgium -3.636327 0.879871 -4.133 4.35e-05 ***
countryCanada -1.604814 0.711066 -2.257 0.024538 *
countryCzech Republic -6.077881 1.015740 -5.984 4.75e-09 ***
countryDenmark -4.564718 0.887802 -5.142 4.22e-07 ***
countryEstonia -4.319239 1.184771 -3.646 0.000301 ***
countryFinland -2.176054 0.923626 -2.356 0.018942 *
countryFrance 0.007526 0.746337 0.010 0.991959
countryGermany -2.894690 0.735793 -3.934 9.80e-05 ***
countryGreece -0.290260 1.005907 -0.289 0.773067
countryHungary 1.280997 1.088680 1.177 0.240016
countryIceland -11.037538 1.339412 -8.241 2.32e-15 ***
countryIreland 2.232621 0.956071 2.335 0.020013 *
countryItaly -1.154673 0.991811 -1.164 0.245016
countryJapan -7.228360 1.331619 -5.428 9.75e-08 ***
countryLuxembourg -6.698123 1.312531 -5.103 5.11e-07 ***
countryNetherlands -3.895918 0.951399 -4.095 5.09e-05 ***
countryNorway -5.508208 0.901830 -6.108 2.34e-09 ***
countryPoland -0.011346 0.801278 -0.014 0.988709
countryRomania -2.871224 0.836577 -3.432 0.000660 ***
countrySlovakia -9.720522 1.029048 -9.446 < 2e-16 ***
countrySlovenia -8.015692 1.168882 -6.858 2.58e-11 ***
countrySpain -0.276633 0.735491 -0.376 0.707022
countrySweden -5.162188 0.776541 -6.648 9.49e-11 ***
countrySwitzerland -10.486322 0.834016 -12.573 < 2e-16 ***
countryUnited Kingdom 1.696744 0.724065 2.343 0.019587 *
countryUSA 2.433579 0.735544 3.309 0.001020 **
year1981 -1.544550 2.429892 -0.636 0.525361
year1982 1.455276 2.429893 0.599 0.549566
year1983 6.820538 2.008574 3.396 0.000751 ***
year1984 1.041995 2.430511 0.429 0.668356
year1985 3.685598 1.599656 2.304 0.021721 *
year1986 1.800695 1.637568 1.100 0.272143
year1987 4.646000 1.556861 2.984 0.003013 **
year1988 3.458806 1.634901 2.116 0.034979 *
year1989 3.340307 1.635580 2.042 0.041762 *
year1990 3.675834 1.579358 2.327 0.020428 *
year1991 3.600641 1.610814 2.235 0.025935 *
year1992 5.821592 1.534969 3.793 0.000171 ***
year1993 6.358422 1.605394 3.961 8.81e-05 ***
year1994 7.223842 1.568944 4.604 5.53e-06 ***
year1995 7.112686 1.506798 4.720 3.23e-06 ***
year1996 7.593346 1.540853 4.928 1.21e-06 ***
year1997 7.216507 1.560473 4.625 5.04e-06 ***
year1998 7.325058 1.562397 4.688 3.75e-06 ***
year1999 6.804528 1.540786 4.416 1.29e-05 ***
year2000 6.705205 1.502256 4.463 1.04e-05 ***
year2001 7.792546 1.609835 4.841 1.84e-06 ***
year2002 7.324104 1.556131 4.707 3.45e-06 ***
year2003 7.466408 1.546082 4.829 1.94e-06 ***
year2004 8.118553 1.482882 5.475 7.64e-08 ***
year2005 8.049804 1.526365 5.274 2.16e-07 ***
year2006 7.865085 1.521659 5.169 3.69e-07 ***
year2007 7.320005 1.484491 4.931 1.19e-06 ***
year2008 7.695709 1.508191 5.103 5.13e-07 ***
year2009 8.741259 1.518804 5.755 1.69e-08 ***
year2010 8.388088 1.487661 5.638 3.19e-08 ***
year2011 8.787078 1.525340 5.761 1.64e-08 ***
year2012 9.121104 1.514125 6.024 3.78e-09 ***
year2013 8.949772 1.491692 6.000 4.34e-09 ***
year2014 8.730829 1.519104 5.747 1.77e-08 ***
year2015 8.248437 1.512792 5.452 8.59e-08 ***
year2016 7.999621 1.495341 5.350 1.47e-07 ***
year2017 7.881164 1.522311 5.177 3.53e-07 ***
year2018 7.639807 1.516810 5.037 7.10e-07 ***
year2019 7.610938 1.526318 4.986 9.09e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.972 on 411 degrees of freedom
(1386 observations deleted due to missingness)
Multiple R-squared: 0.7877, Adjusted R-squared: 0.7526
F-statistic: 22.43 on 68 and 411 DF, p-value: < 2.2e-16
Now, let’s check if the heteroscedasticity problem persists.
plot(model_fe, which = 1)
Draw the same graph using ggplot()
. What do you think, is there a problem?
Let’s conduct a formal test using bptest()
from lmtest
library. It stands for Breusch-Pagan Test. How would you approach interpreting the results?
library(lmtest)
bptest(model_fe)
studentized Breusch-Pagan test
data: model_fe
BP = 182.68, df = 68, p-value = 1.928e-12
Robust Standard Errors
Now, introduce robust SEs. They are also referred to as Heteroskedasticity-consistent standard errors (HC SEs).
First, load the library estimatr
. If you don’t have it, take a moment to install.
library(estimatr)
Now, we can use lm_robust()
function to introduce the robust standard errors to attempt to account for the heteroscedasticity.
= lm_robust(prefisc_gini ~ openc + servadmi_pmp, cpds,
model_hc fixed_effects = country + year,
se_type = "HC2")
summary(model_hc)
Call:
lm_robust(formula = prefisc_gini ~ openc + servadmi_pmp, data = cpds,
fixed_effects = country + year, se_type = "HC2")
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
openc 0.01157 0.004431 2.611 0.009364 0.002858 0.02028 411
servadmi_pmp 2.44631 1.769259 1.383 0.167516 -1.031621 5.92423 411
Multiple R-squared: 0.7877 , Adjusted R-squared: 0.7526
Multiple R-squared (proj. model): 0.01949 , Adjusted R-squared (proj. model): -0.1427
F-statistic (proj. model): 4.644 on 2 and 411 DF, p-value: 0.01013
Let’s compare two models side by side. What’s the difference? Did we account for heteroscedasticity? Remember that we have not displayed estimates for fixed effects.
library(modelsummary)
modelsummary(list("Fixed Effects Model" = model_fe,
"Fixed Effects Model (HC)" = model_hc),
stars = T,
coef_omit = "country|year",
gof_omit = "AIC|BIC|Log.Lik.|RMSE",
coef_rename = c("Intercept",
"Economy Openness",
"Public Sector"))
Fixed Effects Model | Fixed Effects Model (HC) | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
Intercept | 35.437*** | |
(1.561) | ||
Economy Openness | 0.012* | 0.012** |
(0.005) | (0.004) | |
Public Sector | 2.446 | 2.446 |
(1.717) | (1.769) | |
Num.Obs. | 480 | 480 |
R2 | 0.788 | 0.788 |
R2 Adj. | 0.753 | 0.753 |
Clustered Standard Errors
During the lecture we have discussed clustered standard errors. Let’s try to implement it.
Assuming there is a correlation between country X in year \(t\) and year \(t+1\), it’s valid to introduce clustered standard errors to the model. However, be careful if you don’t have enough obervations. Note a couple of things:
\(R^2\) did not change
\(\beta\) coefficients did not change
= lm_robust(prefisc_gini ~ openc + servadmi_pmp, cpds,
model_clust fixed_effects = country + year,
clusters = country)
modelsummary(list("Fixed Effects Model" = model_fe,
"Fixed Effects Model (HC)" = model_hc,
"Fixed Effects Model (HC + Cl)" = model_clust),
stars = T,
coef_omit = "country|year",
gof_omit = "AIC|BIC|Log.Lik.|RMSE",
coef_rename = c("Intercept",
"Economy Openness",
"Public Sector"))
Fixed Effects Model | Fixed Effects Model (HC) | Fixed Effects Model (HC + Cl) | |
---|---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |||
Intercept | 35.437*** | ||
(1.561) | |||
Economy Openness | 0.012* | 0.012** | 0.012 |
(0.005) | (0.004) | (0.021) | |
Public Sector | 2.446 | 2.446 | 2.446 |
(1.717) | (1.769) | (2.180) | |
Num.Obs. | 480 | 480 | 480 |
R2 | 0.788 | 0.788 | 0.788 |
R2 Adj. | 0.753 | 0.753 | 0.753 |
Std.Errors | by: country |
Let’s compare the models visually. First, extract the information using tidy()
. The code below is a bit confusing, take a moment in your free time to get a sense what’s going on.
library(broom)
= bind_rows(cbind(model = "Fixed Effects (HC)", tidy(model_hc, conf.int = T)),
model_comparison cbind(model = "Fixed Effects (HC + Cl)", tidy(model_clust, conf.int = T)))
model_comparison
model term estimate std.error statistic
1 Fixed Effects (HC) openc 0.01156786 0.004430807 2.6107801
2 Fixed Effects (HC) servadmi_pmp 2.44630535 1.769259191 1.3826721
3 Fixed Effects (HC + Cl) openc 0.01156786 0.021259151 0.5441357
4 Fixed Effects (HC + Cl) servadmi_pmp 2.44630535 2.179715863 1.1223047
p.value conf.low conf.high df outcome
1 0.009363945 0.002857992 0.02027773 411.000000 prefisc_gini
2 0.167516215 -1.031620650 5.92423134 411.000000 prefisc_gini
3 0.642254837 -0.082433704 0.10556943 1.944857 prefisc_gini
4 0.292747596 -2.537707419 7.43031811 8.410874 prefisc_gini
This is a variation of an effects plot, but instead of comparing different estimates, we compare the estimate across models. Clustering standard errors has widened our confidence intervals, which now cross zero.
%>%
model_comparison filter(term == "openc") %>%
ggplot(aes(y = model, x = estimate)) +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, height = 0.1), position = "dodge") +
geom_vline(xintercept = 0, linetype = 2) +
geom_point(position = position_dodge(width = 0.1)) +
theme_bw()
However, a small note on robust standard errors: they are not a panacea. If faced with heteroscedasticity, consider reporting different model specifications. Similarly, as we did—try out different standard errors, clustering methods, and assess how robust your model is. If your results do not hold across different model specifications, this is a signal that there may be no effect or an indication to explore alternative solutions.
Use another model (e.g., if you’re dealing with binary dependent variable, you can use logit insted of linear probability model)
Transform DV
Exercises
Let’s continue exploring CPDS dataset. The goal to identify what factors are associated with right-wing parties popularity. Say, you operatianlize it as parliamentary seat share of right-wing parties in government (gov_right3
).
Is political fragmentation associated with right-wing popularity? Quite frequently political fragmentation is measured using the effective number of parties index. Feel free to explore it in your free time here. CPDS has index for this: effpar_leg
.
Set up a simple linear regression. Set gov_right3
as dependent variable, and effpar_leg
as independent. Present the summary. Take a moment to interpret the results.
Let’s check if the homoscedasticity assumption is satisfied. First, plot the residuals vs fitted graph eather using plot()
or ggplot()
. Does the graph show homoscedasticity?
Let’s formally test it. Run Breusch-Pagan test (bptest()
from lmtest
library). Interpret the results.
Ok, apparently there is something wrong with the data. Let’s first understand the relationship between dependent and independent variable. Using geom_point()
plot effpar_leg
against gov_right3
.
Doesn’t look linear, right? Let’s experiment. We see that there is a high zero inflation. We don’t know how to solve it yet, so let’s get rid of all zeroes for gov_right3
. Using filter()
, leave observations where gov_right3
is not equal to 0.
Now, create a new variable sqrt_effpar_leg
that would be a square root of effpar_leg
. You can use mutate()
for this task.
Then, draw a scatterplot, where sqrt_effpar_leg
is on the X axis, and effpar_leg
on the Y. Draw a regression line. Were we able to make the relationsip linear?
Set up a model, where gov_right3
is dependent variable, and sqrt_effpar_leg
is independent. Present the summary. Compare this adjusted model to the previous one.
Let’s check if the homoscedasticity assumption is satisfied for the adjusted model. Plot the residuals vs fitted graph eather using plot()
or ggplot()
. How does it compare?
Now, using lm_robust()
cluster standard errors by country
. Present summary. How did the results change? Pay special attention to the standard error.
Add the country
and year
fixed effects. Make sure that year
is of class factor. How did p-value change? Why?
Present the results for the models using modelsummary()
. Indicate confidence intervals. Are the models robust? What do we account for, and what don’t we?
Check List
I know what the standard error is
I remember homoscedasticity assumption
I know when I might need to use robust standard errors
I know when I might need to use clustered standard errors
I know how to use lm_robust()
to account for fixed effects and various standard errors