library(tidyverse)
= read.csv("data/WhoGov.csv") whogov
Regression Extensions
Week 6
Before we start
- Questions regarding Lab Assignment or the class?
Agenda
Using dummy variables in the OLS
Introducing fixed effects and interactions
Attempting to fix problems identified during diagnostics
Set Up
Today we are working with WhoGov dataset. The data provides information on elites in various countries. As usual, I recommend taking a look at their codebook.
First of all, these are the following variables we are going to work with today:
country_name
is a country namen_individuals
number of unique persons in the cabinetleaderexperience_continuous
the number of years the person has been leader of the country in total.leader_party
party of the leader
Start with exploring the distribution of number of unique persons in the cabinet (n_individuals
)
ggplot(whogov) +
geom_histogram(aes(x = n_individuals))
Dummy Variables
Let’s examine whether a country’s leader being independent from a political party is associated with having more or fewer members in their cabinet. First, let’s create a dummy variable indicating if a leader is independent or non-independent. You can use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.
= whogov %>%
whogov mutate(indep = ifelse(leader_party == "independent", "Independent", "Partisan"))
Now, build a simple model and explore the effect. On average, being a non-independent (i.e. partisan) leader is associated with having 2.28 more members in their cabinet compared to independent leaders.
lm(n_individuals ~ indep, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-25.342 -7.342 -2.342 5.658 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.0592 0.2761 87.148 <2e-16 ***
indepPartisan 2.2833 0.3058 7.466 9e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.006071, Adjusted R-squared: 0.005962
F-statistic: 55.75 on 1 and 9127 DF, p-value: 9.003e-14
What if we want to know the effect relative to Partisan leader? Let’s relevel()
the variable!
$indep = relevel(whogov$indep, ref = "Partisan") whogov
Oops! This is why classes of data are important. Fix it!
$indep = as.factor(whogov$indep) whogov
Now we can relevel the variable
$indep = relevel(whogov$indep, ref = "Partisan") whogov
Compare the models. Does the result sound reasonable? Pretty much. This is simply an inverse. But things get way more interesting if a categorical variable has more than 2 levels. You will see this later on.
lm(n_individuals ~ indep, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-25.342 -7.342 -2.342 5.658 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.3425 0.1315 200.264 <2e-16 ***
indepIndependent -2.2833 0.3058 -7.466 9e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.006071, Adjusted R-squared: 0.005962
F-statistic: 55.75 on 1 and 9127 DF, p-value: 9.003e-14
Fixed Effects
Let’s explore how leader’s tenure is associated with the number of individuals in the government. We start with the simple linear regression.
lm(n_individuals ~ leaderexperience_continuous, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ leaderexperience_continuous, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-28.666 -6.937 -1.937 5.301 109.063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.59948 0.16232 145.39 <2e-16 ***
leaderexperience_continuous 0.33702 0.01627 20.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.21 on 9151 degrees of freedom
Multiple R-squared: 0.04477, Adjusted R-squared: 0.04467
F-statistic: 428.9 on 1 and 9151 DF, p-value: < 2.2e-16
Now, let’s add a categorical variable, indep
, to the model. By doing so, we assume that the association between the leader’s tenure and the number of individuals in the government differs depending on whether the leader is independent or partisan.
Practically, this could be done in multiple ways. First, let’s discuss introduction of fixed effects to our model.
= lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
model_fe summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.01096 0.16605 144.60 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepIndependent -3.02909 0.29988 -10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
We will use ggeffects
library for visualization of regression with the fixed effects. This is sort of an addition to ggplot2
library from tidyverse.
Don’t forget to install it using install.packages()
!
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
plot()
Let’s customize the plot. It should be relatively straightforward given we know ggplot
functions. Details for the customization of plot()
function can be found on ggeffects website.
ggpredict(model_fe, terms= c("leaderexperience_continuous", "indep")) %>%
plot(show_ci = F) +
labs(title = "Fixed Effects Regression",
x = "Tenure of a Leader",
y = "Number of Individuals in a Cabinet",
color = "Leader's Status") +
theme_bw()
Some common fixed effects include:
Country/Region/State
Individual leaders/Parties
Year/Time
Policy presence or absence
By introducing fixed effects, we are able to control for unobserved confounders that vary across the units (not within!).
Interactions
Often dummy variables are used to introduce an interaction term in the model. We will explore the association between Perceptions_of_corruption
and number of people in the cabinet (n_individuals
) depending on the independence of the party leader.
The task isn’t trivial as now we planning to use data from two datasets, Let’s subset those.
Load the World Happiness Report we used in the First Meeting.
= read.csv("data/WHR.csv") whr
Let’s prepare the data
= whr %>%
whr_subset select(Country_name, Perceptions_of_corruption)
= whogov %>%
whogov_subset filter(year == 2021) %>%
select(country_name, n_individuals, indep)
Now, we are merging them.
= whr_subset %>%
whr_whogov left_join(whogov_subset, by = c("Country_name" = "country_name"))
Check the result
head(whr_whogov)
Country_name Perceptions_of_corruption n_individuals indep
1 Finland 0.182 23 Partisan
2 Denmark 0.196 24 Partisan
3 Iceland 0.668 15 Partisan
4 Israel 0.708 33 Partisan
5 Netherlands 0.379 18 Partisan
6 Sweden 0.202 26 Partisan
Now, to interact variables we need to use asterisk *
, i.e. multiplication.
= lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
model_in summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep,
data = whr_whogov)
Residuals:
Min 1Q Median 3Q Max
-0.54000 -0.05737 0.04300 0.10864 0.24382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575192 0.061646 9.331 5.59e-16 ***
n_individuals 0.004818 0.002151 2.240 0.0269 *
indepIndependent 0.386220 0.159570 2.420 0.0170 *
n_individuals:indepIndependent -0.010703 0.005501 -1.946 0.0540 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.08324, Adjusted R-squared: 0.06088
F-statistic: 3.723 on 3 and 123 DF, p-value: 0.01326
Let’s plot the result.
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
plot(show_ci = FALSE) +
labs(title = "Regression with Interaction Term",
x = "Number of Individuals in a Cabinet",
y = "Perception of Corruption",
color = "Leader's Status") +
theme_bw()
And you can easily simulate the data (i.e., calculate the marginal effect) using ggpredict()
. For example,
ggpredict(model_in, terms= c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted | 95% CI
--------------------------------------
12 | 0.89 | 0.71, 1.07
Diagnostics
Let’s first identify if we have a problem. Load the ggfortify
library to draw the diagnostic plots.
library(ggfortify)
And let’s see the interaction model. What stands out?
autoplot(model_in)
Let’s visualize that by the independence/partisanship of the leader. Any patterns?
autoplot(model_in, colour = "indep")
Check List
I know what dummy variables are
I know why I might need fixed effects
I have an intuition regarding implementation of interaction effects
I know how to visualize complex regressions using ggeffects()
.