library(tidyverse)
= read.csv("data/sipri.csv") sipri
Practice and Replication
Week 6
Before we start
- Any questions?
Review of the previous week
Load the tidyverse
library and sipri dataset
Using GGally
library, visualize the relationship between Import
and Regime
variables
Set up a linear model. Predict Import
by Regime
.
What is the reference category in this case?
Using relevel()
, set the reference category to Minimally Democratic
Set-up the model again, and plot the ggpredict()
. Did anything change?
Extract the model information using tidy()
function from broom
library.
Visualize the effects plot. Is it different to the plot generated by ggpredict()
?
Review of the homework
- Although the DV in the problem set was binary, we asked to set up a linear model. We didn’t discuss it, but it’s called Linear Probability Model. Alternative is logistic regression, which we have not covered yet.
Agenda
- Replicating a paper, applying tools we learned
The paper
Today we are replicating George Ward’s paper.1
Ward is interested in a quite straightforward question: Is subjective well-being a predictor of election results?, or more specifically: Do government parties get more votes when the electorate is happy? Ward uses observational data to find an answer to this question. He did both micro and macro level analyses. We will focus on the macro level analysis.
Part of his data comes from a great source – Eurobarometer. Feel free to explore in your free time.
First, load the data.
load("data/ward2020.Rdata")
head(dta)
country year vote_share_cab vote_share_cab_tmin1 satislfe_survey_mean
1 AUT 1995 66.40 62.60 3.240443
2 AUT 1999 60.10 66.40 3.125498
3 AUT 2002 52.31 53.80 3.119266
4 AUT 2006 38.40 42.30 3.016882
5 AUT 2008 55.24 69.60 2.996976
6 AUT 2013 50.80 55.24 3.026157
parties_ingov seatshare_cabinet cab_ideol_sd ENEP_tmin1
1 2 0.6393443 1.940301 3.87
2 2 0.6721312 1.940301 3.59
3 2 0.5683060 1.326603 3.82
4 2 0.4972678 1.666439 3.02
5 2 0.7322404 1.940301 3.71
6 2 0.5901639 1.940301 4.79
The variables are the following:
country
= countryyear
= election yearvote_share_cab
= % votes won by cabinet partiessatislfe_survey_mean
= Life Satisfaction, Country-Survey mean (1-4 scale)
Let’s pause for a second, and think about the data structure.
Say, you find a correlation between citizen’s happiness (satislfe_survey_mean
) before an election and the electoral outcome (e.g. vote_share_cab
). Would you consider this as strong evidence for the claim that happiness causes voters to vote for incumbent parties?
There may be ommitted confounding variables. Because the data structure is hierarchical, we can think of potential counfounders on the country level (variables that vary between countries but not constant over time, e.g., GDP per capita) and on the election level (variables that vary within countries, e.g. voter turnout).
Thus, we need to control for these confounding variables.
Simple linear regression model
Our dependent variables is vote share of cabinet parties (vote_share_cab
) and we are interested in the effect of happiness (satislfe_survey_mean
).
= lm(vote_share_cab ~ satislfe_survey_mean, data = dta) lm_pooling
Present the summary.
The model can be written as follows:
\[ Y_{it} = \beta_0 + \beta_1X_{it} + u_{it} \]
Subscripts:
- \(i\) indicates countries.
- \(t\) indicates time points, i.e. elections.
This approach is sometimes called complete pooling, because we throw all the data from all the countries and all the elections into one and the same pool. In other words: we ignore the data structure in the model specification.
Take a moment to plot the regression using ggplot()
.
ggplot()
Regression with fixed effects
There are different sorts of fixed effects, broadly speaking unit and time fixed effects. Let’s discuss it step-by-step.
Unit fixed effects
First of all, the fixed effects should be of class factor()
. Let’s check it and correct if needed.
class(dta$country)
[1] "character"
$country = as.factor(dta$country) dta
Unit fixed effects regression models can be very useful to solve one of the problems we already identified: omitted confounders on some higher level unit of observation. This higher level unit, in our case, are countries. Thus, let’s add country fixed effects. Notably, the variable satislfe_survey_mean
.
= lm(vote_share_cab ~ satislfe_survey_mean + country, data = dta)
fe_model
summary(fe_model)
Call:
lm(formula = vote_share_cab ~ satislfe_survey_mean + country,
data = dta)
Residuals:
Min 1Q Median 3Q Max
-26.5698 -3.3365 0.7488 4.0599 20.5531
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.1445 17.3984 -0.238 0.812112
satislfe_survey_mean 18.7915 5.5556 3.382 0.000964 ***
countryBEL -4.9655 3.5399 -1.403 0.163220
countryDEU -2.9294 3.6301 -0.807 0.421237
countryDNK -25.9045 4.2882 -6.041 1.68e-08 ***
countryESP -12.4643 3.9014 -3.195 0.001778 **
countryFIN -0.3649 4.6387 -0.079 0.937433
countryFRA -15.7479 3.9968 -3.940 0.000136 ***
countryGBR -18.0236 3.7206 -4.844 3.74e-06 ***
countryGRC -4.0953 4.7534 -0.862 0.390613
countryIRL -14.2652 3.6413 -3.918 0.000147 ***
countryITA -4.0091 4.2554 -0.942 0.347980
countryLUX -4.2656 3.9385 -1.083 0.280907
countryNLD -16.5458 3.9249 -4.216 4.78e-05 ***
countryPRT -3.6596 4.7765 -0.766 0.445044
countrySWE -19.3069 4.6118 -4.186 5.35e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.133 on 123 degrees of freedom
Multiple R-squared: 0.4973, Adjusted R-squared: 0.436
F-statistic: 8.113 on 15 and 123 DF, p-value: 1.679e-12
Let’s visualize the model. Load the ggeffects
library.
library(ggeffects)
Warning: package 'ggeffects' was built under R version 4.4.2
To simplify the perception, let’s choose only the following countries: AUT, DNK, IRL
. You can see the effect of adding the country
variable quite vividly, right? Compare it to the previous graph.
ggpredict(..., terms = c("...", "... [AUT, DNK, IRL]")) %>%
plot()
Including country fixed effects has the following pros and cons:
Advantage: Confounders that only vary between countries are of no more concern.
Disadvantage: We lose all the higher level information. So if we were interested in why governing parties in country A seem to systematically receive more votes than governing parties in B, then we cannot answer them anymore as soon as we include country fixed effects. Why? Perfect colinearity between country fixed effects and time-invariant country characteristics (e.g, political system)
Two way fixed effects
Now, let’s add year fixed effects. Usually, when we have two fixed effects, we refer to it as two way fixed effects.
Again, the fixed effects should be of class factor()
. Let’s check it and correct if needed.
class(dta$year)
[1] "numeric"
$year = as.factor(dta$year) dta
Just as country fixed effects help us to account for all unobserved confounders that only vary across countries (but not within countries), time fixed effects help us to account for all unobserved confounders that only vary across time (but not within time, i.e. between countries at one point in time).
= lm(vote_share_cab ~ satislfe_survey_mean + year + country, data = dta)
twfe_model
summary(twfe_model)
Call:
lm(formula = vote_share_cab ~ satislfe_survey_mean + year + country,
data = dta)
Residuals:
Min 1Q Median 3Q Max
-18.706 -2.384 0.000 2.998 16.267
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.422 21.550 -1.319 0.190868
satislfe_survey_mean 23.277 6.744 3.452 0.000883 ***
year1974 17.376 8.051 2.158 0.033832 *
year1975 -2.300 9.568 -0.240 0.810632
year1976 13.916 8.764 1.588 0.116159
year1977 10.132 7.836 1.293 0.199644
year1978 7.845 8.841 0.887 0.377457
year1979 15.994 7.814 2.047 0.043886 *
year1980 16.637 10.084 1.650 0.102804
year1981 11.353 7.621 1.490 0.140136
year1982 9.686 8.324 1.164 0.247988
year1983 19.800 8.243 2.402 0.018560 *
year1984 15.034 8.459 1.777 0.079235 .
year1985 12.587 8.853 1.422 0.158894
year1986 14.395 8.283 1.738 0.085977 .
year1987 12.513 7.516 1.665 0.099740 .
year1988 9.391 8.511 1.103 0.273085
year1989 12.756 7.890 1.617 0.109763
year1990 10.808 8.507 1.270 0.207502
year1991 16.207 8.921 1.817 0.072918 .
year1992 14.106 8.310 1.697 0.093404 .
year1993 6.628 8.332 0.796 0.428601
year1994 13.128 7.845 1.673 0.098067 .
year1995 11.993 8.353 1.436 0.154896
year1996 13.466 8.889 1.515 0.133614
year1997 6.247 8.266 0.756 0.451941
year1998 13.457 7.840 1.716 0.089872 .
year1999 14.511 7.855 1.847 0.068297 .
year2000 11.505 9.075 1.268 0.208460
year2001 10.043 8.114 1.238 0.219377
year2002 8.987 7.665 1.172 0.244415
year2003 11.397 8.285 1.376 0.172684
year2004 9.556 8.362 1.143 0.256456
year2005 5.740 7.931 0.724 0.471288
year2006 5.110 8.076 0.633 0.528670
year2007 7.857 7.727 1.017 0.312219
year2008 12.285 8.343 1.472 0.144733
year2009 14.633 7.962 1.838 0.069694 .
year2010 2.023 8.086 0.250 0.803072
year2011 -2.143 7.727 -0.277 0.782207
year2012 7.787 8.270 0.942 0.349149
year2013 9.693 8.295 1.169 0.245988
year2014 11.236 8.940 1.257 0.212381
countryBEL -6.146 3.853 -1.595 0.114542
countryDEU -4.945 4.000 -1.236 0.219819
countryDNK -26.270 5.079 -5.172 1.61e-06 ***
countryESP -11.341 4.437 -2.556 0.012428 *
countryFIN 1.702 5.034 0.338 0.736101
countryFRA -13.166 4.581 -2.874 0.005159 **
countryGBR -20.140 4.225 -4.767 8.00e-06 ***
countryGRC -2.053 5.628 -0.365 0.716261
countryIRL -13.351 4.074 -3.278 0.001537 **
countryITA -5.090 4.602 -1.106 0.271927
countryLUX -8.416 4.344 -1.937 0.056159 .
countryNLD -17.457 4.400 -3.967 0.000155 ***
countryPRT -1.255 5.315 -0.236 0.813944
countrySWE -18.345 4.951 -3.705 0.000382 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.766 on 82 degrees of freedom
Multiple R-squared: 0.6985, Adjusted R-squared: 0.4926
F-statistic: 3.392 on 56 and 82 DF, p-value: 2.595e-07
Let’s visualize to get the sense of what’s going on. Check the following countries AUT, DNK, IRL
in the following years 1987, 1992, 2011
. As it is a fixed effects model, we can interpret the same style as “holding everything else constant, on average one unit increase …”.
ggpredict(twfe_model, terms = c("satislfe_survey_mean", "country [AUT, DNK, IRL]", "year [1987, 1992, 2011]")) %>%
plot()
Usually, we do not report estimates for the fixed effects. You would expect to see something like this:
library(modelsummary)
Warning: package 'modelsummary' was built under R version 4.4.2
modelsummary(list("Pooled Model" = lm_pooling,
"Fixed Effects Model" = fe_model,
"Two-Way Fixed Effects Model" = twfe_model),
coef_omit = "country|year",
stars = TRUE,
gof_omit = "AIC|BIC|Log.Lik.|F|RMSE",
add_rows = data.frame(term = "Fixed effects",
pooled = "N",
fixed = "Y",
twfe = "Y"))
Pooled Model | Fixed Effects Model | Two-Way Fixed Effects Model | |
---|---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |||
(Intercept) | 31.040*** | -4.145 | -28.422 |
(8.025) | (17.398) | (21.550) | |
satislfe_survey_mean | 3.936 | 18.792*** | 23.277*** |
(2.580) | (5.556) | (6.744) | |
Num.Obs. | 139 | 139 | 139 |
R2 | 0.017 | 0.497 | 0.698 |
R2 Adj. | 0.010 | 0.436 | 0.493 |
Fixed effects | N | Y | Y |
Including control variables
By including country and year fixed effects, we made sure that our effect estimate is not biased due to unobserved confounders that are time-invariant on the country level in a given year. However, we still need to think about potential time-variant confounders we need to control for. Take a moment to think this through.
Here, we are back in the familiar game: we must think about potential covariates, try to get the data and include them in our model. Ward includes a series of control variables:
parties_ingov
= Number of parties in governmentseatshare_cabinet
= % seats held by governing coalitioncab_ideol_sd
= Gov Ideologial Disparity (Standard deviation of government party positions on left-right scale)ENEP_tmin1
= Party Fractionalisation - Last Election (Gallagher Fractionalization Index)
Let us set up the model.
= lm(vote_share_cab ~ satislfe_survey_mean + year + country + parties_ingov + seatshare_cabinet + cab_ideol_sd + ENEP_tmin1, data = dta) twfe_controls_model
By now, the summary below is unreadable.
summary(twfe_controls_model)
Let’s play around the modelsummary()
to include the most important information.
modelsummary(twfe_controls_model,
coef_omit = "country|year",
stars = TRUE,
output = "markdown")
(1) | |
---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
(Intercept) | -46.353* |
(22.003) | |
satislfe_survey_mean | 24.545*** |
(6.325) | |
parties_ingov | 3.222* |
(1.231) | |
seatshare_cabinet | 22.265** |
(8.411) | |
cab_ideol_sd | -0.825 |
(1.359) | |
ENEP_tmin1 | -0.069 |
(0.897) | |
Num.Obs. | 139 |
R2 | 0.766 |
R2 Adj. | 0.587 |
AIC | 941.1 |
BIC | 1123.0 |
Log.Lik. | -408.542 |
F | 4.267 |
RMSE | 4.57 |
However, often authors present only the main explanatory variable. You should mention your controls, but you don’t have to report it directly (usually, in this case the extended table goes to appendix). Let’s present all of the models. And customize the display. This is what you can see in Table 1(column 1) of the article with one difference. The author standardized numeric variables. Why would you standardize the variable in this case?
modelsummary(list("Pooled Model" = lm_pooling,
"Fixed Effects Model" = fe_model,
"Two-Way Fixed Effects Model" = twfe_model,
"Two-Way Fixed Effects Model w Controls" = twfe_controls_model),
coef_rename = c("(Intercept)", "Well-Being"),
coef_omit = "country|year|parties_ingov|seatshare_cabinet|cab_ideol_sd|ENEP_tmin1",
stars = TRUE,
gof_omit = "AIC|BIC|Log.Lik.|F|RMSE",
add_rows = data.frame(term = c("Fixed effects", "Conrols"),
pooled = c("N", "N"),
fixed = c("Y", "N"),
twfe = c("Y", "N"),
twfe_c = c("Y", "Y")))
Pooled Model | Fixed Effects Model | Two-Way Fixed Effects Model | Two-Way Fixed Effects Model w Controls | |
---|---|---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||||
(Intercept) | 31.040*** | -4.145 | -28.422 | -46.353* |
(8.025) | (17.398) | (21.550) | (22.003) | |
Well-Being | 3.936 | 18.792*** | 23.277*** | 24.545*** |
(2.580) | (5.556) | (6.744) | (6.325) | |
Num.Obs. | 139 | 139 | 139 | 139 |
R2 | 0.017 | 0.497 | 0.698 | 0.766 |
R2 Adj. | 0.010 | 0.436 | 0.493 | 0.587 |
Fixed effects | N | Y | Y | Y |
Conrols | N | N | N | Y |
Exercises
Let’s explore Comparative Political Dataset. Take a look on their codebook.
Load the data from the Canvas, or by clicking this button:
Using the readxl
library, load the dataset.
The dataset contains the information on political systems of different countries in different years. Let’s explore the following variables:
country
nameyear
of the observationgov_right1
- cabinet posts of right-wing parties in percentage of total cabinet postsoutlays
- total outlays (disbursements) of general government as a percentage of GDP
Subset these variables from the dataset.
Visualize the distribution of the gov_right1
variable.
Check the class of outlays
variable. Using the head()
function, explore the first 6 rows of the cpds_sub
dataframe. Does the class seem right? If not, correct.
Using ggpairs()
from GGally
library, explore the distribution of gov_right1
and outlays
variables, and their relationship. What do you notice?
Set up a linear model. Predict gov_right1
by outlays
. Interpret the results.
Using \(\LaTeX\) write out the formula of the regression. Don’t forget about the subscripts.
Include the year
fixed effects. Make sure it is of right class. Does the result hold? Interpret the coefficient for outlays
.
Now, let’s set up a two-way fixed effect. Introduce the country
fixed effects. Make sure it is of right class. Did the result change? Interpret the results for outlays
.
Let’s explore this model. Using ggpredict()
plot the terms in the following order: outlays
, country
, year
. Choose the following countries: Austria, Canada, Sweden
, and the following years: 1961, 1996, 2016
. Do the slopes and confidence intervals explain the significance of the outlays
variable?
Using modelsummary()
, present the results for all the models. Don’t forget to customize the output.
Check List
I understand that the structure of the data can be different
I know what pooled regressions means
I know what the fixed effects types are
I am not scared of such terms as two way fixed effect (TWFE)
I know how to present only selected cases using ggpredict()
I know how to customize modelsummary()
Footnotes
The material for this lab is based on the University of Mannheim’s Quantitative Methods class. The paper replicated is Ward, G., 2020. Happiness and voting: Evidence from four decades of elections in Europe. American Journal of Political Science, 64(3), pp.504-518.↩︎