Practice and Replication

Week 6

Published

February 13, 2025

Before we start

  • Any questions?
Download script

Download data

Review of the previous week

Review

Load the tidyverse library and sipri dataset

library(tidyverse)

sipri = read.csv("data/sipri.csv")

Using GGally library, visualize the relationship between Import and Regime variables

Set up a linear model. Predict Import by Regime.

What is the reference category in this case?

Using relevel(), set the reference category to Minimally Democratic

Set-up the model again, and plot the ggpredict(). Did anything change?

Extract the model information using tidy() function from broom library.

Visualize the effects plot. Is it different to the plot generated by ggpredict()?

Review of the homework

  • Although the DV in the problem set was binary, we asked to set up a linear model. We didn’t discuss it, but it’s called Linear Probability Model. Alternative is logistic regression, which we have not covered yet.

Agenda

  • Replicating a paper, applying tools we learned

The paper

Today we are replicating George Ward’s paper.1

Ward is interested in a quite straightforward question: Is subjective well-being a predictor of election results?, or more specifically: Do government parties get more votes when the electorate is happy? Ward uses observational data to find an answer to this question. He did both micro and macro level analyses. We will focus on the macro level analysis.

Part of his data comes from a great source – Eurobarometer. Feel free to explore in your free time.

First, load the data.

load("data/ward2020.Rdata")

head(dta)
  country year vote_share_cab vote_share_cab_tmin1 satislfe_survey_mean
1     AUT 1995          66.40                62.60             3.240443
2     AUT 1999          60.10                66.40             3.125498
3     AUT 2002          52.31                53.80             3.119266
4     AUT 2006          38.40                42.30             3.016882
5     AUT 2008          55.24                69.60             2.996976
6     AUT 2013          50.80                55.24             3.026157
  parties_ingov seatshare_cabinet cab_ideol_sd ENEP_tmin1
1             2         0.6393443     1.940301       3.87
2             2         0.6721312     1.940301       3.59
3             2         0.5683060     1.326603       3.82
4             2         0.4972678     1.666439       3.02
5             2         0.7322404     1.940301       3.71
6             2         0.5901639     1.940301       4.79

The variables are the following:

  • country = country

  • year = election year

  • vote_share_cab = % votes won by cabinet parties

  • satislfe_survey_mean = Life Satisfaction, Country-Survey mean (1-4 scale)

Let’s pause for a second, and think about the data structure.

Say, you find a correlation between citizen’s happiness (satislfe_survey_mean) before an election and the electoral outcome (e.g. vote_share_cab). Would you consider this as strong evidence for the claim that happiness causes voters to vote for incumbent parties?

There may be ommitted confounding variables. Because the data structure is hierarchical, we can think of potential counfounders on the country level (variables that vary between countries but not constant over time, e.g., GDP per capita) and on the election level (variables that vary within countries, e.g. voter turnout).

Thus, we need to control for these confounding variables.

Simple linear regression model

Our dependent variables is vote share of cabinet parties (vote_share_cab) and we are interested in the effect of happiness (satislfe_survey_mean).

lm_pooling = lm(vote_share_cab ~ satislfe_survey_mean, data = dta)

Present the summary.

The model can be written as follows:

\[ Y_{it} = \beta_0 + \beta_1X_{it} + u_{it} \]

Subscripts:

  • \(i\) indicates countries.
  • \(t\) indicates time points, i.e. elections.

This approach is sometimes called complete pooling, because we throw all the data from all the countries and all the elections into one and the same pool. In other words: we ignore the data structure in the model specification.

Take a moment to plot the regression using ggplot().

ggplot()

Regression with fixed effects

There are different sorts of fixed effects, broadly speaking unit and time fixed effects. Let’s discuss it step-by-step.

Unit fixed effects

First of all, the fixed effects should be of class factor(). Let’s check it and correct if needed.

class(dta$country)
[1] "character"
dta$country = as.factor(dta$country)

Unit fixed effects regression models can be very useful to solve one of the problems we already identified: omitted confounders on some higher level unit of observation. This higher level unit, in our case, are countries. Thus, let’s add country fixed effects. Notably, the variable satislfe_survey_mean.

fe_model = lm(vote_share_cab ~ satislfe_survey_mean + country, data = dta)

summary(fe_model)

Call:
lm(formula = vote_share_cab ~ satislfe_survey_mean + country, 
    data = dta)

Residuals:
     Min       1Q   Median       3Q      Max 
-26.5698  -3.3365   0.7488   4.0599  20.5531 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -4.1445    17.3984  -0.238 0.812112    
satislfe_survey_mean  18.7915     5.5556   3.382 0.000964 ***
countryBEL            -4.9655     3.5399  -1.403 0.163220    
countryDEU            -2.9294     3.6301  -0.807 0.421237    
countryDNK           -25.9045     4.2882  -6.041 1.68e-08 ***
countryESP           -12.4643     3.9014  -3.195 0.001778 ** 
countryFIN            -0.3649     4.6387  -0.079 0.937433    
countryFRA           -15.7479     3.9968  -3.940 0.000136 ***
countryGBR           -18.0236     3.7206  -4.844 3.74e-06 ***
countryGRC            -4.0953     4.7534  -0.862 0.390613    
countryIRL           -14.2652     3.6413  -3.918 0.000147 ***
countryITA            -4.0091     4.2554  -0.942 0.347980    
countryLUX            -4.2656     3.9385  -1.083 0.280907    
countryNLD           -16.5458     3.9249  -4.216 4.78e-05 ***
countryPRT            -3.6596     4.7765  -0.766 0.445044    
countrySWE           -19.3069     4.6118  -4.186 5.35e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.133 on 123 degrees of freedom
Multiple R-squared:  0.4973,    Adjusted R-squared:  0.436 
F-statistic: 8.113 on 15 and 123 DF,  p-value: 1.679e-12

Let’s visualize the model. Load the ggeffects library.

library(ggeffects)
Warning: package 'ggeffects' was built under R version 4.4.2

To simplify the perception, let’s choose only the following countries: AUT, DNK, IRL. You can see the effect of adding the country variable quite vividly, right? Compare it to the previous graph.

ggpredict(..., terms = c("...", "... [AUT, DNK, IRL]")) %>%
  plot() 

Including country fixed effects has the following pros and cons:

  • Advantage: Confounders that only vary between countries are of no more concern.

  • Disadvantage: We lose all the higher level information. So if we were interested in why governing parties in country A seem to systematically receive more votes than governing parties in B, then we cannot answer them anymore as soon as we include country fixed effects. Why? Perfect colinearity between country fixed effects and time-invariant country characteristics (e.g, political system)

Two way fixed effects

Now, let’s add year fixed effects. Usually, when we have two fixed effects, we refer to it as two way fixed effects.

Again, the fixed effects should be of class factor(). Let’s check it and correct if needed.

class(dta$year)
[1] "numeric"
dta$year = as.factor(dta$year)

Just as country fixed effects help us to account for all unobserved confounders that only vary across countries (but not within countries), time fixed effects help us to account for all unobserved confounders that only vary across time (but not within time, i.e. between countries at one point in time).

twfe_model = lm(vote_share_cab ~ satislfe_survey_mean + year + country, data = dta)

summary(twfe_model)

Call:
lm(formula = vote_share_cab ~ satislfe_survey_mean + year + country, 
    data = dta)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.706  -2.384   0.000   2.998  16.267 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -28.422     21.550  -1.319 0.190868    
satislfe_survey_mean   23.277      6.744   3.452 0.000883 ***
year1974               17.376      8.051   2.158 0.033832 *  
year1975               -2.300      9.568  -0.240 0.810632    
year1976               13.916      8.764   1.588 0.116159    
year1977               10.132      7.836   1.293 0.199644    
year1978                7.845      8.841   0.887 0.377457    
year1979               15.994      7.814   2.047 0.043886 *  
year1980               16.637     10.084   1.650 0.102804    
year1981               11.353      7.621   1.490 0.140136    
year1982                9.686      8.324   1.164 0.247988    
year1983               19.800      8.243   2.402 0.018560 *  
year1984               15.034      8.459   1.777 0.079235 .  
year1985               12.587      8.853   1.422 0.158894    
year1986               14.395      8.283   1.738 0.085977 .  
year1987               12.513      7.516   1.665 0.099740 .  
year1988                9.391      8.511   1.103 0.273085    
year1989               12.756      7.890   1.617 0.109763    
year1990               10.808      8.507   1.270 0.207502    
year1991               16.207      8.921   1.817 0.072918 .  
year1992               14.106      8.310   1.697 0.093404 .  
year1993                6.628      8.332   0.796 0.428601    
year1994               13.128      7.845   1.673 0.098067 .  
year1995               11.993      8.353   1.436 0.154896    
year1996               13.466      8.889   1.515 0.133614    
year1997                6.247      8.266   0.756 0.451941    
year1998               13.457      7.840   1.716 0.089872 .  
year1999               14.511      7.855   1.847 0.068297 .  
year2000               11.505      9.075   1.268 0.208460    
year2001               10.043      8.114   1.238 0.219377    
year2002                8.987      7.665   1.172 0.244415    
year2003               11.397      8.285   1.376 0.172684    
year2004                9.556      8.362   1.143 0.256456    
year2005                5.740      7.931   0.724 0.471288    
year2006                5.110      8.076   0.633 0.528670    
year2007                7.857      7.727   1.017 0.312219    
year2008               12.285      8.343   1.472 0.144733    
year2009               14.633      7.962   1.838 0.069694 .  
year2010                2.023      8.086   0.250 0.803072    
year2011               -2.143      7.727  -0.277 0.782207    
year2012                7.787      8.270   0.942 0.349149    
year2013                9.693      8.295   1.169 0.245988    
year2014               11.236      8.940   1.257 0.212381    
countryBEL             -6.146      3.853  -1.595 0.114542    
countryDEU             -4.945      4.000  -1.236 0.219819    
countryDNK            -26.270      5.079  -5.172 1.61e-06 ***
countryESP            -11.341      4.437  -2.556 0.012428 *  
countryFIN              1.702      5.034   0.338 0.736101    
countryFRA            -13.166      4.581  -2.874 0.005159 ** 
countryGBR            -20.140      4.225  -4.767 8.00e-06 ***
countryGRC             -2.053      5.628  -0.365 0.716261    
countryIRL            -13.351      4.074  -3.278 0.001537 ** 
countryITA             -5.090      4.602  -1.106 0.271927    
countryLUX             -8.416      4.344  -1.937 0.056159 .  
countryNLD            -17.457      4.400  -3.967 0.000155 ***
countryPRT             -1.255      5.315  -0.236 0.813944    
countrySWE            -18.345      4.951  -3.705 0.000382 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.766 on 82 degrees of freedom
Multiple R-squared:  0.6985,    Adjusted R-squared:  0.4926 
F-statistic: 3.392 on 56 and 82 DF,  p-value: 2.595e-07

Let’s visualize to get the sense of what’s going on. Check the following countries AUT, DNK, IRL in the following years 1987, 1992, 2011. As it is a fixed effects model, we can interpret the same style as “holding everything else constant, on average one unit increase …”.

ggpredict(twfe_model, terms = c("satislfe_survey_mean", "country [AUT, DNK, IRL]", "year [1987, 1992, 2011]")) %>%
  plot()

Usually, we do not report estimates for the fixed effects. You would expect to see something like this:

library(modelsummary)
Warning: package 'modelsummary' was built under R version 4.4.2
modelsummary(list("Pooled Model" = lm_pooling,
                  "Fixed Effects Model" = fe_model,
                  "Two-Way Fixed Effects Model" = twfe_model),
             coef_omit = "country|year",
             stars = TRUE,
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE",
             add_rows = data.frame(term = "Fixed effects",
                                   pooled = "N",
                                   fixed = "Y",
                                   twfe = "Y"))
Pooled Model Fixed Effects Model Two-Way Fixed Effects Model
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 31.040*** -4.145 -28.422
(8.025) (17.398) (21.550)
satislfe_survey_mean 3.936 18.792*** 23.277***
(2.580) (5.556) (6.744)
Num.Obs. 139 139 139
R2 0.017 0.497 0.698
R2 Adj. 0.010 0.436 0.493
Fixed effects N Y Y

Including control variables

By including country and year fixed effects, we made sure that our effect estimate is not biased due to unobserved confounders that are time-invariant on the country level in a given year. However, we still need to think about potential time-variant confounders we need to control for. Take a moment to think this through.

Here, we are back in the familiar game: we must think about potential covariates, try to get the data and include them in our model. Ward includes a series of control variables:

  • parties_ingov = Number of parties in government

  • seatshare_cabinet = % seats held by governing coalition

  • cab_ideol_sd = Gov Ideologial Disparity (Standard deviation of government party positions on left-right scale)

  • ENEP_tmin1 = Party Fractionalisation - Last Election (Gallagher Fractionalization Index)

Let us set up the model.

twfe_controls_model = lm(vote_share_cab ~ satislfe_survey_mean + year + country + parties_ingov + seatshare_cabinet + cab_ideol_sd + ENEP_tmin1, data = dta)

By now, the summary below is unreadable.

summary(twfe_controls_model)

Let’s play around the modelsummary() to include the most important information.

modelsummary(twfe_controls_model,
             coef_omit = "country|year",
             stars = TRUE,
             output = "markdown")
(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) -46.353*
(22.003)
satislfe_survey_mean 24.545***
(6.325)
parties_ingov 3.222*
(1.231)
seatshare_cabinet 22.265**
(8.411)
cab_ideol_sd -0.825
(1.359)
ENEP_tmin1 -0.069
(0.897)
Num.Obs. 139
R2 0.766
R2 Adj. 0.587
AIC 941.1
BIC 1123.0
Log.Lik. -408.542
F 4.267
RMSE 4.57

However, often authors present only the main explanatory variable. You should mention your controls, but you don’t have to report it directly (usually, in this case the extended table goes to appendix). Let’s present all of the models. And customize the display. This is what you can see in Table 1(column 1) of the article with one difference. The author standardized numeric variables. Why would you standardize the variable in this case?

modelsummary(list("Pooled Model" = lm_pooling,
                  "Fixed Effects Model" = fe_model,
                  "Two-Way Fixed Effects Model" = twfe_model,
                  "Two-Way Fixed Effects Model w Controls" = twfe_controls_model),
             coef_rename = c("(Intercept)", "Well-Being"),
             coef_omit = "country|year|parties_ingov|seatshare_cabinet|cab_ideol_sd|ENEP_tmin1",
             stars = TRUE,
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE",
             add_rows = data.frame(term = c("Fixed effects", "Conrols"),
                                   pooled = c("N", "N"),
                                   fixed = c("Y", "N"),
                                   twfe = c("Y", "N"),
                                   twfe_c = c("Y", "Y")))
Pooled Model Fixed Effects Model Two-Way Fixed Effects Model Two-Way Fixed Effects Model w Controls
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 31.040*** -4.145 -28.422 -46.353*
(8.025) (17.398) (21.550) (22.003)
Well-Being 3.936 18.792*** 23.277*** 24.545***
(2.580) (5.556) (6.744) (6.325)
Num.Obs. 139 139 139 139
R2 0.017 0.497 0.698 0.766
R2 Adj. 0.010 0.436 0.493 0.587
Fixed effects N Y Y Y
Conrols N N N Y

Exercises

Let’s explore Comparative Political Dataset. Take a look on their codebook.

Load the data from the Canvas, or by clicking this button:

Download data

Using the readxl library, load the dataset.

Solution
library(...)

cpds = read_excel("data/cpds.xlsx")

The dataset contains the information on political systems of different countries in different years. Let’s explore the following variables:

  • country name

  • year of the observation

  • gov_right1 - cabinet posts of right-wing parties in percentage of total cabinet posts

  • outlays - total outlays (disbursements) of general government as a percentage of GDP

Subset these variables from the dataset.

Solution
cpds_sub = ...

Visualize the distribution of the gov_right1 variable.

Solution

YOUR SOLUTION HERE

Check the class of outlays variable. Using the head() function, explore the first 6 rows of the cpds_sub dataframe. Does the class seem right? If not, correct.

Solution

YOUR SOLUTION HERE

Using ggpairs() from GGally library, explore the distribution of gov_right1 and outlays variables, and their relationship. What do you notice?

Solution

YOUR SOLUTION HERE

Set up a linear model. Predict gov_right1 by outlays. Interpret the results.

Solution

YOUR SOLUTION HERE

Using \(\LaTeX\) write out the formula of the regression. Don’t forget about the subscripts.

Solution

YOUR SOLUTION HERE

Include the year fixed effects. Make sure it is of right class. Does the result hold? Interpret the coefficient for outlays.

Solution

YOUR SOLUTION HERE

Now, let’s set up a two-way fixed effect. Introduce the country fixed effects. Make sure it is of right class. Did the result change? Interpret the results for outlays.

Solution

YOUR SOLUTION HERE

Let’s explore this model. Using ggpredict() plot the terms in the following order: outlays, country, year. Choose the following countries: Austria, Canada, Sweden, and the following years: 1961, 1996, 2016. Do the slopes and confidence intervals explain the significance of the outlays variable?

Solution

YOUR SOLUTION HERE

Using modelsummary(), present the results for all the models. Don’t forget to customize the output.

Solution

YOUR SOLUTION HERE

Check List

I understand that the structure of the data can be different

I know what pooled regressions means

I know what the fixed effects types are

I am not scared of such terms as two way fixed effect (TWFE)

I know how to present only selected cases using ggpredict()

I know how to customize modelsummary()

Footnotes

  1. The material for this lab is based on the University of Mannheim’s Quantitative Methods class. The paper replicated is Ward, G., 2020. Happiness and voting: Evidence from four decades of elections in Europe. American Journal of Political Science, 64(3), pp.504-518.↩︎