library(tidyverse)
library(modelsummary)
Multiple Linear Regression
Week 3
Before we start
Congrats on submitting the first HW! How are you feeling?
A reminder of the discussion section structure: review, comments on the HW, new material, and exercises. I would be happy to hear your feedback after class or via email.
Review of the previous week
Load the tidyverse and modelsummary libraries
Load the World Happiness Report data (whr.csv)
= read.csv("data/WHR.csv") whr
Using ggplot(), draw a histogram of the Logged_GDP_per_capita variable. Add a geom_vline() to indicate the average
Build a linear model, where Happiness Score (Ladder_score) is the dependent variable and Logged_GDP_per_capita is the independent variable
Create a scatterplot using ggplot() and geom_point(), with Logged_GDP_per_capita on the X-axis and Ladder_score on the Y-axis. Add a linear regression line to the plot (geom_smooth()). Don’t forget to give it a title and label the axes.
Present the regression output
modelsummary(list("Model" = ...),
... = c("Intercept",
"GDP per capita (log)"),
... = "BIC|AIC|RMSE|Log.Lik.",
stars = TRUE)
Agenda for Today
Using dummy variables in the OLS
Introducing fixed effects and interactions
Discussing polynomials
A brief note on joins
Dummy Variables
Quite often we deal with binary or dummy variables in statistics. Those can take the following shape:
data.frame(
Percentage = c(35, 32, 21, 12),
Green_Party = c(1, 0, 0, 0),
Socialist_Party = c(0, 1, 0, 0),
Conservative_Party = c(0, 0, 1, 0)
)
Percentage Green_Party Socialist_Party Conservative_Party
1 35 1 0 0
2 32 0 1 0
3 21 0 0 1
4 12 0 0 0
However, more frequently, we work with representations like the one shown below:
data.frame(
Percentage = c(35, 32, 21, 12),
Parties = c("Green Party", "Socialist Party", "Conservative Party", "Other Parties")
)
Percentage Parties
1 35 Green Party
2 32 Socialist Party
3 21 Conservative Party
4 12 Other Parties
What is the class of such data in R?
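A quick sketch of the answer (the parties object name here is just for illustration): the Parties column is stored as character text, and model.matrix() shows the dummy encoding that lm() would build from it behind the scenes.
parties = data.frame(
  Percentage = c(35, 32, 21, 12),
  Parties = c("Green Party", "Socialist Party", "Conservative Party", "Other Parties")
)
# Character, not factor; lm() and relevel() care about this distinction
class(parties$Parties)
# One dummy column per party, minus the (alphabetically first) reference level
model.matrix(~ Parties, parties)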
Dummy variables and OLS regression
Today we are working with the WhoGov dataset. As usual, I recommend taking a look at its codebook.
= read.csv("data/WhoGov.csv") whogov
These are the variables we are going to work with today:
country_name: the country name
n_individuals: the number of unique persons in the cabinet
leaderexperience_continuous: the number of years the person has been leader of the country in total
leader_party: the party of the leader
Start with exploring the distribution of the number of unique persons in the cabinet (n_individuals)
ggplot(whogov) +
geom_histogram(aes(x = n_individuals))
Present the descriptive statistics of the n_individuals variable.
...
Let’s examine whether a country’s leader being independent from a political party is associated with having more or fewer members in their cabinet. First, let’s create a dummy variable indicating if a leader is independent or non-independent. You can use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.
whogov = whogov %>%
  mutate(indep = ifelse(leader_party == "independent", "Independent", "Non-independent"))
Now, build a simple model and explore the effect. On average, being a non-independent (i.e. partisan) leader is associated with having 2.28 more members in their cabinet compared to independent leaders.
lm(n_individuals ~ indep, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-25.342 -7.342 -2.342 5.658 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.0592 0.2761 87.148 <2e-16 ***
indepNon-independent 2.2833 0.3058 7.466 9e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.006071, Adjusted R-squared: 0.005962
F-statistic: 55.75 on 1 and 9127 DF, p-value: 9.003e-14
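As a sanity check (a sketch using dplyr, which is already loaded with the tidyverse): with a single dummy regressor, the fitted values are just the group means, so the coefficients above can be recovered, up to how missing values are handled, with a quick group_by().
whogov %>%
  group_by(indep) %>%
  summarise(avg_individuals = mean(n_individuals, na.rm = TRUE))
# Independent should be close to the intercept (24.06);
# Non-independent close to 24.06 + 2.28 = 26.34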
What if we want to know the effect relative to non-independent leaders? Let’s relevel() the variable!
whogov$indep = relevel(whogov$indep, ref = "Non-independent")
Oops! This is why classes of data are important. Fix it!
whogov$indep = as.factor(whogov$indep)
Now we can relevel the variable
whogov$indep = relevel(whogov$indep, ref = "Non-independent")
Compare the models. Does the result sound reasonable? Pretty much: this is simply the inverse, the coefficient just flips its sign. But things get far more interesting when a categorical variable has more than two levels. You will see this later on.
lm(n_individuals ~ indep, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-25.342 -7.342 -2.342 5.658 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.3425 0.1315 200.264 <2e-16 ***
indepIndependent -2.2833 0.3058 -7.466 9e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.006071, Adjusted R-squared: 0.005962
F-statistic: 55.75 on 1 and 9127 DF, p-value: 9.003e-14
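One way to see both parameterizations side by side is modelsummary(), which we used in the review above (a sketch; the object names m_ref_nonind and m_ref_ind are mine):
m_ref_nonind = lm(n_individuals ~ indep, whogov)
# relevel() can also be applied inside the formula for a one-off comparison
m_ref_ind = lm(n_individuals ~ relevel(indep, ref = "Independent"), whogov)
modelsummary(list("Ref: Non-independent" = m_ref_nonind,
                  "Ref: Independent" = m_ref_ind),
             stars = TRUE)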
Fixed Effects
Let’s explore how a leader’s tenure is associated with the number of individuals in the government. We start with a simple linear regression. Take a moment to interpret the result and \(R^2\).
lm(n_individuals ~ leaderexperience_continuous, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ leaderexperience_continuous, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-28.666 -6.937 -1.937 5.301 109.063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.59948 0.16232 145.39 <2e-16 ***
leaderexperience_continuous 0.33702 0.01627 20.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.21 on 9151 degrees of freedom
Multiple R-squared: 0.04477, Adjusted R-squared: 0.04467
F-statistic: 428.9 on 1 and 9151 DF, p-value: < 2.2e-16
Take a moment and draw a scatterplot for n_individuals and leaderexperience_continuous. Add a regression line to the plot.
...
Now, let’s add a categorical variable, indep, to the model. By doing so, we allow the expected number of individuals in the government to differ between independent and partisan leaders, while the association with the leader’s tenure is assumed to be the same for both groups.
Practically, this can be done in multiple ways. First, let’s discuss introducing fixed effects into our model. Note that this is now a multiple linear regression!
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.01096 0.16605 144.60 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepIndependent -3.02909 0.29988 -10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
We will use the ggeffects library to visualize regressions with fixed effects. It is sort of an addition to the ggplot2 library from the tidyverse. Don’t forget to install it using install.packages()!
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
plot()
Let’s customize the plot. It should be relatively straightforward given we know ggplot functions. Details for customizing the plot() function can be found on the ggeffects website.
ggpredict(model_fe, terms= c("leaderexperience_continuous", "indep")) %>%
plot(show_ci = F) +
labs(title = "Fixed Effects Regression",
x = "Tenure of a Leader",
y = "Number of Individuals in a Cabinet",
color = "Leader's Status") +
theme_bw()
Some common fixed effects include:
Country/Region/State
Individual leaders/Parties
Year/Time
Policy presence or absence
By introducing fixed effects, we are able to control for unobserved confounders that vary across the units (not within!).
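For example, country fixed effects add one dummy per country and absorb anything constant within a country. A minimal sketch (assuming the country_name column listed above; with hundreds of countries the full summary() gets long, so only the tenure coefficient is pulled out here):
model_country_fe = lm(n_individuals ~ leaderexperience_continuous + as.factor(country_name),
                      data = whogov)
# The tenure slope is now identified from variation within countries only
summary(model_country_fe)$coefficients["leaderexperience_continuous", ]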
Interactions
Often dummy variables are used to introduce an interaction term into the model. We will explore the association between Perceptions_of_corruption and the number of people in the cabinet (n_individuals), depending on whether the leader is independent of a party.
The task isn’t trivial, as we are now planning to use data from two datasets. Let’s subset them first.
whr_subset = whr %>%
  select(Country_name, Perceptions_of_corruption)
whogov_subset = whogov %>%
  filter(year == 2021) %>%
  select(country_name, n_individuals, indep)
Now, we are merging them. Joins are not always easy to follow at first, but here is a great resource on them (check “the {dplyr} way”)!
whr_whogov = whr_subset %>%
  left_join(whogov_subset, by = c("Country_name" = "country_name"))
Check the result
head(whr_whogov)
Country_name Perceptions_of_corruption n_individuals indep
1 Finland 0.182 23 Non-independent
2 Denmark 0.196 24 Non-independent
3 Iceland 0.668 15 Non-independent
4 Israel 0.708 33 Non-independent
5 Netherlands 0.379 18 Non-independent
6 Sweden 0.202 26 Non-independent
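Country names rarely match perfectly across datasets, so it is worth checking which rows failed to find a partner. A hedged sketch using anti_join() with the same by specification as above:
# Rows of whr_subset with no matching country in whogov_subset
whr_subset %>%
  anti_join(whogov_subset, by = c("Country_name" = "country_name"))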
Now, to interact variables we use the asterisk *, i.e. multiplication. In an R formula, n_individuals * indep expands to n_individuals + indep + n_individuals:indep, so both main effects and the interaction term enter the model.
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep,
data = whr_whogov)
Residuals:
Min 1Q Median 3Q Max
-0.54000 -0.05737 0.04300 0.10864 0.24382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575192 0.061646 9.331 5.59e-16 ***
n_individuals 0.004818 0.002151 2.240 0.0269 *
indepIndependent 0.386220 0.159570 2.420 0.0170 *
n_individuals:indepIndependent -0.010703 0.005501 -1.946 0.0540 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.08324, Adjusted R-squared: 0.06088
F-statistic: 3.723 on 3 and 123 DF, p-value: 0.01326
Let’s plot the result. Try to change show_ci to TRUE. Does it explain the p-value now?
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
plot(show_ci = FALSE) +
labs(title = "Regression with Interaction Term",
x = "Number of Individuals in a Cabinet",
y = "Perception of Corruption",
color = "Leader's Status") +
theme_bw()
And you can easily compute predicted values (i.e., calculate the marginal effect) using ggpredict(). For example,
ggpredict(model_in, terms= c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted | 95% CI
--------------------------------------
12 | 0.89 | 0.71, 1.07
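The same number can be reproduced by hand from the coefficients, which makes explicit what ggpredict() is doing (a sketch using coef(); values rounded):
b = coef(model_in)
# Prediction for an independent leader with a 12-person cabinet:
# intercept + slope*12 + independent shift + interaction*12
b["(Intercept)"] + b["n_individuals"] * 12 +
  b["indepIndependent"] + b["n_individuals:indepIndependent"] * 12
# roughly 0.575 + 0.058 + 0.386 - 0.128 = 0.89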
IV Transformations
One of the assumptions of OLS is that there is a linear relationship between the dependent and independent variables. However, this is quite often not the case. Let’s reverse the transformation made in the World Happiness Report data. See below. Does it look linear?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
labs(x = "GDP per capita")
It doesn’t. It’s hard to describe this relationship in a linear manner. Natural log would explain this better, right?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
geom_function(fun = function(x){log(x) - 4}) +
labs(x = "GDP per capita")
This is why we use the natural logarithm to transform GDP per capita. The transformation reveals a linear relationship between the two variables, allowing us to capture non-linear patterns in a linear format when using OLS regression. Another commonly used transformation is the quadratic (\(x^2\)), which serves the same purpose of addressing non-linear relationships. We call the latter polynomial terms.
ggplot(whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
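In an lm() formula a quadratic term has to be wrapped in I() (or built with poly()), because ^ has a special meaning inside formulas. A sketch on the back-transformed GDP variable (the gdp_per_capita column is created here just for illustration):
# Undo the log so the relationship is visibly non-linear
whr$gdp_per_capita = exp(whr$Logged_GDP_per_capita)
model_poly = lm(Ladder_score ~ gdp_per_capita + I(gdp_per_capita^2), data = whr)
summary(model_poly)
# An equivalent specification with orthogonal polynomials:
# lm(Ladder_score ~ poly(gdp_per_capita, 2), data = whr)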
Exercises
Extract the following variables from the whr dataset: Country_name, Ladder_score, and Continent
Now subset the whogov dataset, leaving only the year 2020.
Now, extract the following variables from the whogov_sub dataset: country_name, n_individuals, system_category
Using left_join(), merge whr_sub and whogov_sub. Don’t forget to specify the by argument. You can take a look at how we did it above!
Explore which political regime (system_category) is associated with more happiness (Ladder_score). Name the political regime that is associated with the most happiness.
However, we know that those coefficients are calculated relative to the reference category. We didn’t specify it, so we have no idea against which regime we are comparing the result. Thus, to gain control back, relevel system_category so that the reference category is Mixed democratic. Don’t forget to transform the variable to a factor!
Run the regression again. The coefficients changed. But did the regime associated with the most happiness change?
Now, run a model explaining happiness (Ladder_score) by the number of individuals in the cabinet (n_individuals). Interpret the results.
Introduce the categorical variable Continent into this model. What do we call it when we introduce categorical variables into the model? How did it change the effect of the n_individuals variable? Why?
Create a new subset, leaving only the following continents (Continent): Europe (Europe) and North America (North America)
Create a model, interacting n_individuals and Continent. Plot the results using ggpredict() and plot(). Interpret the result.
Now, change show_ci to TRUE. Visualize the results again. Are they meaningful? Why might that be the case?
Check List
I know that the wrong class of a variable can hinder statistical models and visualizations
I know what dummy variables can look like
I know how to relevel a categorical variable
I know how to introduce fixed effects into the model (and I know that I should convert the variable with as.factor()!)
I know how to introduce interaction effects into the model
I know how to customize ggeffects plots
I know how to introduce polynomials to OLS regression