library(tidyverse)
library(modelsummary)
Multiple Linear Regression
Week 3
Before we start
Congrats on submitting the first HW! How are you feeling?
A reminder of the discussion section structure: review, comments on the HW, new material, and exercises. I would be happy to hear your feedback after class or via email.
Review of the previous week
Load the tidyverse and modelsummary libraries
Load the World Happiness Report data (whr.csv)
= read.csv("data/WHR.csv") whr
Using ggplot(), draw a histogram of the Logged_GDP_per_capita variable. Add a geom_vline() to indicate the average
Build a linear model, where Happiness Score (Ladder_score) is the dependent variable and Logged_GDP_per_capita is the independent variable
Create a scatterplot using ggplot() and geom_point(), with Logged_GDP_per_capita on the X-axis and Ladder_score on the Y-axis. Add a linear regression line to the plot (geom_smooth()). Don’t forget to give it a title and label the axes.
Present the regression output
modelsummary(list("Model" = ...),
... = c("Intercept",
"GDP per capita (log)"),
... = "BIC|AIC|RMSE|Log.Lik.",
stars = TRUE)
Agenda for Today
Using dummy variables in the OLS
Introducing fixed effects and interactions
Discussing polynomials
A brief note on joins
Dummy Variables
Quite often we deal with binary or dummy variables in statistics. Those can take the following shape:
data.frame(
Percentage = c(35, 32, 21, 12),
Green_Party = c(1, 0, 0, 0),
Socialist_Party = c(0, 1, 0, 0),
Conservative_Party = c(0, 0, 1, 0)
)
Percentage Green_Party Socialist_Party Conservative_Party
1 35 1 0 0
2 32 0 1 0
3 21 0 0 1
4 12 0 0 0
However, more frequently, we work with representations like the one shown below:
data.frame(
Percentage = c(35, 32, 21, 12),
Parties = c("Green Party", "Socialist Party", "Conservative Party", "Other Parties")
)
Percentage Parties
1 35 Green Party
2 32 Socialist Party
3 21 Conservative Party
4 12 Other Parties
What is the class of such data in R?
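A quick sketch of the answer (the parties object name here is just for illustration): the Parties column is stored as character text, and model.matrix() shows the dummy encoding that lm() would build from it behind the scenes.
parties = data.frame(
  Percentage = c(35, 32, 21, 12),
  Parties = c("Green Party", "Socialist Party", "Conservative Party", "Other Parties")
)
# Character, not factor; lm() and relevel() care about this distinction
class(parties$Parties)
# One dummy column per party, minus the (alphabetically first) reference level
model.matrix(~ Parties, parties)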
Dummy variables and OLS regression
Today we are working with the WhoGov dataset. As usual, I recommend taking a look at its codebook.
= read.csv("data/WhoGov.csv") whogov
These are the variables we are going to work with today:
country_name: the country name
n_individuals: the number of unique persons in the cabinet
leaderexperience_continuous: the number of years the person has been leader of the country in total
leader_party: the party of the leader
Start with exploring the distribution of the number of unique persons in the cabinet (n_individuals)
ggplot(whogov) +
geom_histogram(aes(x = n_individuals))
Present the descriptive statistics of the n_individuals variable.
...
Let’s examine whether a country’s leader being independent from a political party is associated with having more or fewer members in their cabinet. First, let’s create a dummy variable indicating if a leader is independent or non-independent. You can use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.
whogov = whogov %>%
  mutate(indep = ifelse(leader_party == "independent", "Independent", "Non-independent"))
Now, build a simple model and explore the effect. On average, being a non-independent (i.e. partisan) leader is associated with having 2.28 more members in their cabinet compared to independent leaders.
lm(n_individuals ~ indep, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-25.342 -7.342 -2.342 5.658 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.0592 0.2761 87.148 <2e-16 ***
indepNon-independent 2.2833 0.3058 7.466 9e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.006071, Adjusted R-squared: 0.005962
F-statistic: 55.75 on 1 and 9127 DF, p-value: 9.003e-14
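As a sanity check (a sketch using dplyr, which is already loaded with the tidyverse): with a single dummy regressor, the fitted values are just the group means, so the coefficients above can be recovered, up to how missing values are handled, with a quick group_by().
whogov %>%
  group_by(indep) %>%
  summarise(avg_individuals = mean(n_individuals, na.rm = TRUE))
# Independent should be close to the intercept (24.06);
# Non-independent close to 24.06 + 2.28 = 26.34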
What if we want to know the effect relative to non-independent leaders? Let’s relevel() the variable!
whogov$indep = relevel(whogov$indep, ref = "Non-independent")
Oops! This is why classes of data are important. Fix it!
whogov$indep = as.factor(whogov$indep)
Now we can relevel the variable
whogov$indep = relevel(whogov$indep, ref = "Non-independent")
Compare the models. Does the result sound reasonable? Pretty much: this is simply the inverse, the coefficient just flips its sign. But things get far more interesting when a categorical variable has more than two levels. You will see this later on.
lm(n_individuals ~ indep, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-25.342 -7.342 -2.342 5.658 106.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.3425 0.1315 200.264 <2e-16 ***
indepIndependent -2.2833 0.3058 -7.466 9e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.006071, Adjusted R-squared: 0.005962
F-statistic: 55.75 on 1 and 9127 DF, p-value: 9.003e-14
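One way to see both parameterizations side by side is modelsummary(), which we used in the review above (a sketch; the object names m_ref_nonind and m_ref_ind are mine):
m_ref_nonind = lm(n_individuals ~ indep, whogov)
# relevel() can also be applied inside the formula for a one-off comparison
m_ref_ind = lm(n_individuals ~ relevel(indep, ref = "Independent"), whogov)
modelsummary(list("Ref: Non-independent" = m_ref_nonind,
                  "Ref: Independent" = m_ref_ind),
             stars = TRUE)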
Fixed Effects
Let’s explore how a leader’s tenure is associated with the number of individuals in the government. We start with a simple linear regression. Take a moment to interpret the result and \(R^2\).
lm(n_individuals ~ leaderexperience_continuous, whogov) %>%
summary()
Call:
lm(formula = n_individuals ~ leaderexperience_continuous, data = whogov)
Residuals:
Min 1Q Median 3Q Max
-28.666 -6.937 -1.937 5.301 109.063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.59948 0.16232 145.39 <2e-16 ***
leaderexperience_continuous 0.33702 0.01627 20.71 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.21 on 9151 degrees of freedom
Multiple R-squared: 0.04477, Adjusted R-squared: 0.04467
F-statistic: 428.9 on 1 and 9151 DF, p-value: < 2.2e-16
Take a moment and draw a scatterplot for n_individuals and leaderexperience_continuous. Add a regression line to the plot.
...
Now, let’s add a categorical variable, indep, to the model. By doing so, we allow the expected number of individuals in the government to differ between independent and partisan leaders, while the association with the leader’s tenure is assumed to be the same for both groups.
Practically, this can be done in multiple ways. First, let’s discuss introducing fixed effects into our model. Note that this is now a multiple linear regression!
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.01096 0.16605 144.60 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepIndependent -3.02909 0.29988 -10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
We will use the ggeffects library to visualize regressions with fixed effects. It is sort of an addition to the ggplot2 library from the tidyverse. Don’t forget to install it using install.packages()!
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
plot()
Let’s customize the plot. It should be relatively straightforward given we know ggplot functions. Details for customizing the plot() function can be found on the ggeffects website.
ggpredict(model_fe, terms= c("leaderexperience_continuous", "indep")) %>%
plot(show_ci = F) +
labs(title = "Fixed Effects Regression",
x = "Tenure of a Leader",
y = "Number of Individuals in a Cabinet",
color = "Leader's Status") +
theme_bw()
Some common fixed effects include:
Country/Region/State
Individual leaders/Parties
Year/Time
Policy presence or absence
By introducing fixed effects, we are able to control for unobserved confounders that vary across the units (not within!).
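For example, country fixed effects add one dummy per country and absorb anything constant within a country. A minimal sketch (assuming the country_name column listed above; with hundreds of countries the full summary() gets long, so only the tenure coefficient is pulled out here):
model_country_fe = lm(n_individuals ~ leaderexperience_continuous + as.factor(country_name),
                      data = whogov)
# The tenure slope is now identified from variation within countries only
summary(model_country_fe)$coefficients["leaderexperience_continuous", ]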
Interactions
Often dummy variables are used to introduce an interaction term into the model. We will explore the association between Perceptions_of_corruption and the number of people in the cabinet (n_individuals), depending on whether the leader is independent of a party.
The task isn’t trivial, as we are now planning to use data from two datasets. Let’s subset them first.
whr_subset = whr %>%
  select(Country_name, Perceptions_of_corruption)
whogov_subset = whogov %>%
  filter(year == 2021) %>%
  select(country_name, n_individuals, indep)
Now, we are merging them. Joins are not always easy to follow at first, but here is a great resource on them (check “the {dplyr} way”)!
whr_whogov = whr_subset %>%
  left_join(whogov_subset, by = c("Country_name" = "country_name"))
Check the result
head(whr_whogov)
Country_name Perceptions_of_corruption n_individuals indep
1 Finland 0.182 23 Non-independent
2 Denmark 0.196 24 Non-independent
3 Iceland 0.668 15 Non-independent
4 Israel 0.708 33 Non-independent
5 Netherlands 0.379 18 Non-independent
6 Sweden 0.202 26 Non-independent
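Country names rarely match perfectly across datasets, so it is worth checking which rows failed to find a partner. A hedged sketch using anti_join() with the same by specification as above:
# Rows of whr_subset with no matching country in whogov_subset
whr_subset %>%
  anti_join(whogov_subset, by = c("Country_name" = "country_name"))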
Now, to interact variables we use the asterisk *, i.e. multiplication. In an R formula, n_individuals * indep expands to n_individuals + indep + n_individuals:indep, so both main effects and the interaction term enter the model.
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep,
data = whr_whogov)
Residuals:
Min 1Q Median 3Q Max
-0.54000 -0.05737 0.04300 0.10864 0.24382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575192 0.061646 9.331 5.59e-16 ***
n_individuals 0.004818 0.002151 2.240 0.0269 *
indepIndependent 0.386220 0.159570 2.420 0.0170 *
n_individuals:indepIndependent -0.010703 0.005501 -1.946 0.0540 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.08324, Adjusted R-squared: 0.06088
F-statistic: 3.723 on 3 and 123 DF, p-value: 0.01326
Let’s plot the result. Try to change show_ci to TRUE. Does it explain the p-value now?
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
plot(show_ci = FALSE) +
labs(title = "Regression with Interaction Term",
x = "Number of Individuals in a Cabinet",
y = "Perception of Corruption",
color = "Leader's Status") +
theme_bw()
And you can easily compute predicted values (i.e., calculate the marginal effect) using ggpredict(). For example,
ggpredict(model_in, terms= c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted | 95% CI
--------------------------------------
12 | 0.89 | 0.71, 1.07
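The same number can be reproduced by hand from the coefficients, which makes explicit what ggpredict() is doing (a sketch using coef(); values rounded):
b = coef(model_in)
# Prediction for an independent leader with a 12-person cabinet:
# intercept + slope*12 + independent shift + interaction*12
b["(Intercept)"] + b["n_individuals"] * 12 +
  b["indepIndependent"] + b["n_individuals:indepIndependent"] * 12
# roughly 0.575 + 0.058 + 0.386 - 0.128 = 0.89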
IV Transformations
One of the assumptions of OLS is that there is a linear relationship between the dependent and independent variables. However, this is quite often not the case. Let’s reverse the transformation made in the World Happiness Report data. See below. Does it look linear?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
labs(x = "GDP per capita")
It doesn’t. It’s hard to describe this relationship in a linear manner. Natural log would explain this better, right?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
geom_function(fun = function(x){log(x) - 4}) +
labs(x = "GDP per capita")
This is why we use the natural logarithm to transform GDP per capita. The transformation reveals a linear relationship between the two variables, allowing us to capture non-linear patterns in a linear format when using OLS regression. Another commonly used transformation is the quadratic (\(x^2\)), which serves the same purpose of addressing non-linear relationships. We call the latter polynomial terms.
ggplot(whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
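In an lm() formula a quadratic term has to be wrapped in I() (or built with poly()), because ^ has a special meaning inside formulas. A sketch on the back-transformed GDP variable (the gdp_per_capita column is created here just for illustration):
# Undo the log so the relationship is visibly non-linear
whr$gdp_per_capita = exp(whr$Logged_GDP_per_capita)
model_poly = lm(Ladder_score ~ gdp_per_capita + I(gdp_per_capita^2), data = whr)
summary(model_poly)
# An equivalent specification with orthogonal polynomials:
# lm(Ladder_score ~ poly(gdp_per_capita, 2), data = whr)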
Exercises
Extract the following variables from the whr dataset: Country_name, Ladder_score, and Continent
Now subset the whogov dataset, leaving only the year 2020.
Now, extract the following variables from the whogov_sub dataset: country_name, n_individuals, system_category
Using left_join(), merge whr_sub and whogov_sub. Don’t forget to specify the by argument. You can take a look at how we did it above!
Explore which political regime (system_category) is associated with more happiness (Ladder_score). Name the political regime that is associated with the most happiness.
However, we know that those coefficients are calculated relative to the reference category. We didn’t specify it, so we have no idea against which regime we are comparing the result. Thus, to gain control back, relevel system_category so that the reference category is Mixed democratic. Don’t forget to transform the variable to a factor!
Run the regression again. The coefficients changed. But did the regime associated with the most happiness change?
Now, run a model explaining happiness (Ladder_score) by the number of individuals in the cabinet (n_individuals). Interpret the results.
Introduce the categorical variable Continent into this model. What do we call it when we introduce categorical variables into the model? How did it change the effect of the n_individuals variable? Why?
Create a new subset, leaving only the following continents (Continent): Europe (Europe) and North America (North America)
Create a model, interacting n_individuals and Continent. Plot the results using ggpredict() and plot(). Interpret the result.
Now, change show_ci to TRUE. Visualize the results again. Are they meaningful? Why might that be the case?
Check List
I know that the wrong class of a variable can hinder statistical models and visualizations
I know what dummy variables can look like
I know how to relevel a categorical variable
I know how to introduce fixed effects into the model (and I know that I should convert the variable with as.factor()!)
I know how to introduce interaction effects into the model
I know how to customize ggeffects plots
I know how to introduce polynomials to OLS regression