Regression Extensions

Week 6

Published

May 8, 2025

Before we start

Questions regarding Lab Assignment or the class?

Download script
Download data

Agenda

Using dummy variables in the OLS
Introducing fixed effects and interactions
Attempting to fix problems identified during diagnostics

Set Up

Today we are working with WhoGov dataset. The data provides information on elites in various countries. As usual, I recommend taking a look at their codebook.

library(tidyverse)

whogov = read.csv("data/WhoGov.csv")

First of all, these are the following variables we are going to work with today:

country_name is a country name
n_individuals number of unique persons in the cabinet
leaderexperience_continuous the number of years the person has been leader of the country in total.
leader_party party of the leader

Start with exploring the distribution of number of unique persons in the cabinet (n_individuals)

ggplot(whogov) +
  geom_histogram(aes(x = n_individuals))

Dummy Variables

Let’s examine whether a country’s leader being independent from a political party is associated with having more or fewer members in their cabinet. First, let’s create a dummy variable indicating if a leader is independent or non-independent. You can use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.

whogov = whogov %>%
  mutate(indep = ifelse(leader_party == "independent", "Independent", "Partisan"))

Now, build a simple model and explore the effect. On average, being a non-independent (i.e. partisan) leader is associated with having 2.28 more members in their cabinet compared to independent leaders.

lm(n_individuals ~ indep, whogov) %>%
  summary()


Call:
lm(formula = n_individuals ~ indep, data = whogov)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.342  -7.342  -2.342   5.658 106.658 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24.0592     0.2761  87.148   <2e-16 ***
indepPartisan   2.2833     0.3058   7.466    9e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.35 on 9127 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.006071,  Adjusted R-squared:  0.005962 
F-statistic: 55.75 on 1 and 9127 DF,  p-value: 9.003e-14

What if we want to know the effect relative to Partisan leader? Let’s relevel() the variable!

whogov$indep = relevel(whogov$indep, ref = "Partisan")

Oops! This is why classes of data are important. Fix it!

whogov$indep = as.factor(whogov$indep)

Now we can relevel the variable

whogov$indep = relevel(whogov$indep, ref = "Partisan")

Compare the models. Does the result sound reasonable? Pretty much. This is simply an inverse. But things get way more interesting if a categorical variable has more than 2 levels. You will see this later on.

lm(n_individuals ~ indep, whogov) %>%
  summary()


Call:
lm(formula = n_individuals ~ indep, data = whogov)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.342  -7.342  -2.342   5.658 106.658 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       26.3425     0.1315 200.264   <2e-16 ***
indepIndependent  -2.2833     0.3058  -7.466    9e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.35 on 9127 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.006071,  Adjusted R-squared:  0.005962 
F-statistic: 55.75 on 1 and 9127 DF,  p-value: 9.003e-14

Fixed Effects

Let’s explore how leader’s tenure is associated with the number of individuals in the government. We start with the simple linear regression.

lm(n_individuals ~ leaderexperience_continuous, whogov) %>%
  summary()


Call:
lm(formula = n_individuals ~ leaderexperience_continuous, data = whogov)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.666  -6.937  -1.937   5.301 109.063 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 23.59948    0.16232  145.39   <2e-16 ***
leaderexperience_continuous  0.33702    0.01627   20.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.21 on 9151 degrees of freedom
Multiple R-squared:  0.04477,   Adjusted R-squared:  0.04467 
F-statistic: 428.9 on 1 and 9151 DF,  p-value: < 2.2e-16

Coding Task

Take a moment and draw a scatterplot for n_individuals and leaderexperience_continuous. Add a regression line to the plot.

Now, let’s add a categorical variable, indep, to the model. By doing so, we assume that the association between the leader’s tenure and the number of individuals in the government differs depending on whether the leader is independent or partisan.

Practically, this could be done in multiple ways. First, let’s discuss introduction of fixed effects to our model.

model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov) 
summary(model_fe)


Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep, 
    data = whogov)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.437  -7.078  -1.796   5.377 108.632 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 24.01096    0.16605  144.60   <2e-16 ***
leaderexperience_continuous  0.35700    0.01617   22.08   <2e-16 ***
indepIndependent            -3.02909    0.29988  -10.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.06 on 9126 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.05649,   Adjusted R-squared:  0.05628 
F-statistic: 273.2 on 2 and 9126 DF,  p-value: < 2.2e-16

We will use ggeffects library for visualization of regression with the fixed effects. This is sort of an addition to ggplot2 library from tidyverse. Don’t forget to install it using install.packages()!

library(ggeffects)

Then, visualize the result. What can we see?

ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
  plot()

Let’s customize the plot. It should be relatively straightforward given we know ggplot functions. Details for the customization of plot() function can be found on ggeffects website.

ggpredict(model_fe, terms= c("leaderexperience_continuous", "indep")) %>%
  plot(show_ci = F) +
  labs(title = "Fixed Effects Regression",
       x = "Tenure of a Leader",
       y = "Number of Individuals in a Cabinet",
       color = "Leader's Status") +
  theme_bw()

Some common fixed effects include:

Country/Region/State
Individual leaders/Parties
Year/Time
Policy presence or absence

By introducing fixed effects, we are able to control for unobserved confounders that vary across the units (not within!).

Interactions

Often dummy variables are used to introduce an interaction term in the model. We will explore the association between Perceptions_of_corruption and number of people in the cabinet (n_individuals) depending on the independence of the party leader.

The task isn’t trivial as now we planning to use data from two datasets, Let’s subset those.

Load the World Happiness Report we used in the First Meeting.

whr = read.csv("data/WHR.csv")

Let’s prepare the data

whr_subset = whr %>%
  select(Country_name, Perceptions_of_corruption)

whogov_subset = whogov %>%
  filter(year == 2021) %>%
  select(country_name, n_individuals, indep)

Now, we are merging them.

whr_whogov = whr_subset %>%
  left_join(whogov_subset, by = c("Country_name" = "country_name"))

Check the result

head(whr_whogov)

  Country_name Perceptions_of_corruption n_individuals    indep
1      Finland                     0.182            23 Partisan
2      Denmark                     0.196            24 Partisan
3      Iceland                     0.668            15 Partisan
4       Israel                     0.708            33 Partisan
5  Netherlands                     0.379            18 Partisan
6       Sweden                     0.202            26 Partisan

Now, to interact variables we need to use asterisk *, i.e. multiplication.

model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)


Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep, 
    data = whr_whogov)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54000 -0.05737  0.04300  0.10864  0.24382 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     0.575192   0.061646   9.331 5.59e-16 ***
n_individuals                   0.004818   0.002151   2.240   0.0269 *  
indepIndependent                0.386220   0.159570   2.420   0.0170 *  
n_individuals:indepIndependent -0.010703   0.005501  -1.946   0.0540 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1737 on 123 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.08324,   Adjusted R-squared:  0.06088 
F-statistic: 3.723 on 3 and 123 DF,  p-value: 0.01326

Let’s plot the result.

ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
  plot(show_ci = FALSE) +
  labs(title = "Regression with Interaction Term",
       x = "Number of Individuals in a Cabinet",
       y = "Perception of Corruption",
       color = "Leader's Status") +
  theme_bw()

And you can easily simulate the data (i.e., calculate the marginal effect) using ggpredict(). For example,

ggpredict(model_in, terms= c("n_individuals [12]", "indep [Independent]"))

# Predicted values of Perceptions_of_corruption

n_individuals | Predicted |     95% CI
--------------------------------------
           12 |      0.89 | 0.71, 1.07

Diagnostics

Let’s first identify if we have a problem. Load the ggfortify library to draw the diagnostic plots.

library(ggfortify)

And let’s see the interaction model. What stands out?

autoplot(model_in)

Let’s visualize that by the independence/partisanship of the leader. Any patterns?

autoplot(model_in, colour = "indep")

Coding Task

Let’s try to fix that.

Add system_category from WhoGov dataset as a control variable
Apply a quadratic transformation to the dependent variable
Draw diagnostics plots. What have changed?

Check List

I know what dummy variables are

I know why I might need fixed effects

I have an intuition regarding implementation of interaction effects

I know how to visualize complex regressions using ggeffects().