library(tidyverse)
library(modelsummary)
Multiple Linear Regression
Week 3
Before we start
- Congrats on submitting the first HW! How are you feeling? 
- The discussion section structure: review, comments on the HW, new material, and exercises. I would be happy to hear your feedback after class or via email. 
Review of the previous week
Load the tidyverse and modelsummary libraries
Load the World Happiness Report data (whr.csv)
whr = read.csv("data/WHR.csv")
Using ggplot(), draw a histogram of the Logged_GDP_per_capita variable. Add a geom_vline() to indicate the average.
Build a linear model where Happiness Score (Ladder_score) is the dependent variable and Logged_GDP_per_capita is the independent variable.
Create a scatterplot using ggplot() and geom_point(), with Logged_GDP_per_capita on the X-axis and Ladder_score on the Y-axis. Add a linear regression line to the plot (geom_smooth()). Don’t forget to give it a title and label the axes.
Present the regression output
modelsummary(list("Model" = ...),
             ... = c("Intercept", 
                     "GDP per capita (log)"),
             ... = "BIC|AIC|RMSE|Log.Lik.",    
             stars = TRUE)
Agenda for Today
- Using dummy variables in the OLS 
- Introducing fixed effects and interactions 
- Discussing polynomials 
- A brief note on joins 
Dummy Variables
Quite often we deal with binary or dummy variables in statistics. Those can take the following shape:
data.frame(
  Percentage = c(35, 32, 21, 12),
  Green_Party = c(1, 0, 0, 0),
  Socialist_Party = c(0, 1, 0, 0),
  Conservative_Party = c(0, 0, 1, 0)
)
  Percentage Green_Party Socialist_Party Conservative_Party
1         35           1               0                  0
2         32           0               1                  0
3         21           0               0                  1
4         12           0               0                  0
However, more frequently, we work with representations like the one shown below:
data.frame(
  Percentage = c(35, 32, 21, 12),
  Parties = c("Green Party", "Socialist Party", "Conservative Party", "Other Parties")
)
  Percentage            Parties
1         35        Green Party
2         32    Socialist Party
3         21 Conservative Party
4         12      Other Parties
What is the class of such data in R?
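A quick way to check in the console (a sketch; the parties object is ours, rebuilt from the table above):

```r
# Rebuild the small example and inspect its class
parties <- data.frame(
  Percentage = c(35, 32, 21, 12),
  Parties = c("Green Party", "Socialist Party", "Conservative Party", "Other Parties")
)
class(parties$Parties)  # "character" (in R >= 4.0)

# lm() converts character vectors to factors on the fly,
# but functions like relevel() need an explicit factor:
parties$Parties <- as.factor(parties$Parties)
levels(parties$Parties)  # ordered alphabetically by default
```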
Dummy variables and OLS regression
Today we are working with the WhoGov dataset. As usual, I recommend taking a look at their codebook.
whogov = read.csv("data/WhoGov.csv")
First of all, these are the variables we are going to work with today:
- country_name is the country name
- n_individuals is the number of unique persons in the cabinet
- leaderexperience_continuous is the number of years the person has been leader of the country in total
- leader_party is the party of the leader
Start with exploring the distribution of number of unique persons in the cabinet (n_individuals)
ggplot(whogov) +
  geom_histogram(aes(x = n_individuals))
Present the descriptive statistics of the n_individuals variable.
...
Let's examine whether a country's leader being independent of a political party is associated with having more or fewer members in their cabinet. First, let's create a dummy variable indicating whether a leader is independent or non-independent. You could use 1 and 0 instead, but to make it more readable here we stick to more transparent labels.
whogov = whogov %>%
  mutate(indep = ifelse(leader_party == "independent", "Independent", "Non-independent"))
Now, build a simple model and explore the effect. On average, being a non-independent (i.e., partisan) leader is associated with having 2.28 more members in the cabinet compared to independent leaders.
lm(n_individuals ~ indep, whogov) %>%
  summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
    Min      1Q  Median      3Q     Max 
-25.342  -7.342  -2.342   5.658 106.658 
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           24.0592     0.2761  87.148   <2e-16 ***
indepNon-independent   2.2833     0.3058   7.466    9e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.006071,  Adjusted R-squared:  0.005962 
F-statistic: 55.75 on 1 and 9127 DF,  p-value: 9.003e-14
What if we want to know the effect relative to Non-independent leaders? Let's relevel() the variable!
whogov$indep = relevel(whogov$indep, ref = "Non-independent")
Oops! This is why classes of data are important. Fix it!
whogov$indep = as.factor(whogov$indep)
Now we can relevel the variable:
whogov$indep = relevel(whogov$indep, ref = "Non-independent")
Compare the models. Does the result sound reasonable? Pretty much: it is simply the inverse. But things get far more interesting when a categorical variable has more than two levels. You will see this later on.
lm(n_individuals ~ indep, whogov) %>%
  summary()
Call:
lm(formula = n_individuals ~ indep, data = whogov)
Residuals:
    Min      1Q  Median      3Q     Max 
-25.342  -7.342  -2.342   5.658 106.658 
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       26.3425     0.1315 200.264   <2e-16 ***
indepIndependent  -2.2833     0.3058  -7.466    9e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.35 on 9127 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.006071,  Adjusted R-squared:  0.005962 
F-statistic: 55.75 on 1 and 9127 DF,  p-value: 9.003e-14
Fixed Effects
Let's explore how a leader's tenure is associated with the number of individuals in the government. We start with a simple linear regression. Take a moment to interpret the result and \(R^2\).
lm(n_individuals ~ leaderexperience_continuous, whogov) %>%
  summary()
Call:
lm(formula = n_individuals ~ leaderexperience_continuous, data = whogov)
Residuals:
    Min      1Q  Median      3Q     Max 
-28.666  -6.937  -1.937   5.301 109.063 
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 23.59948    0.16232  145.39   <2e-16 ***
leaderexperience_continuous  0.33702    0.01627   20.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.21 on 9151 degrees of freedom
Multiple R-squared:  0.04477,   Adjusted R-squared:  0.04467 
F-statistic: 428.9 on 1 and 9151 DF,  p-value: < 2.2e-16
Take a moment and draw a scatterplot of n_individuals against leaderexperience_continuous. Add a regression line to the plot.
...
Now, let's add a categorical variable, indep, to the model. By doing so, we allow the expected number of individuals in the government to differ depending on whether the leader is independent or partisan, while keeping a common slope for tenure.
Practically, this can be done in multiple ways. First, let's discuss introducing fixed effects into our model. Note that this is now a multiple linear regression!
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov) 
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep, 
    data = whogov)
Residuals:
    Min      1Q  Median      3Q     Max 
-29.437  -7.078  -1.796   5.377 108.632 
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 24.01096    0.16605  144.60   <2e-16 ***
leaderexperience_continuous  0.35700    0.01617   22.08   <2e-16 ***
indepIndependent            -3.02909    0.29988  -10.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.05649,   Adjusted R-squared:  0.05628 
F-statistic: 273.2 on 2 and 9126 DF,  p-value: < 2.2e-16
We will use the ggeffects library to visualize the regression with fixed effects. It is sort of an addition to the ggplot2 library from the tidyverse. Don't forget to install it using install.packages()!
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
  plot()
Let's customize the plot. It should be relatively straightforward given that we know the ggplot functions. Details on customizing the plot() function can be found on the ggeffects website.
ggpredict(model_fe, terms= c("leaderexperience_continuous", "indep")) %>%
  plot(show_ci = F) +
  labs(title = "Fixed Effects Regression",
       x = "Tenure of a Leader",
       y = "Number of Individuals in a Cabinet",
       color = "Leader's Status") +
  theme_bw()
Some common fixed effects include:
- Country/Region/State 
- Individual leaders/Parties 
- Year/Time 
- Policy presence or absence 
By introducing fixed effects, we are able to control for unobserved confounders that vary across units (but not within them!).
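As a minimal sketch of the idea (toy data with hypothetical names, not the WhoGov set): each unit gets its own intercept, and the slope is then identified from variation within units.

```r
# Toy panel: two countries, three years each; country B sits 10 higher
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  tenure  = c(1, 2, 3, 1, 2, 3),
  cabinet = c(10, 11, 12, 20, 21, 22)
)

# factor(country) adds one dummy per country (minus the reference),
# absorbing level differences between countries
fe <- lm(cabinet ~ tenure + factor(country), data = toy)
coef(fe)
# tenure: the within-country slope (1 here);
# factor(country)B: the level shift of country B (10 here)
```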
Interactions
Dummy variables are often used to introduce an interaction term into the model. We will explore how the association between the number of people in the cabinet (n_individuals) and Perceptions_of_corruption depends on whether the leader is independent.
The task isn't trivial, as we are now planning to use data from two datasets. Let's subset them.
whr_subset = whr %>%
  select(Country_name, Perceptions_of_corruption)
whogov_subset = whogov %>%
  filter(year == 2021) %>%
  select(country_name, n_individuals, indep)
Now, we are merging them. It's not easy to follow what's going on at first, but there is a great resource on joins (check "the {dplyr} way")!
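If joins feel opaque, here is a standalone toy illustration of what left_join() does (invented data, not our subsets): every row of the left table is kept, and rows without a match on the right get NA.

```r
library(dplyr)

left  <- data.frame(country = c("Finland", "Denmark", "Narnia"),
                    score   = c(7.8, 7.6, 9.9))
right <- data.frame(name    = c("Finland", "Denmark"),
                    cabinet = c(23, 24))

# The key columns have different names, so we map them explicitly
left_join(left, right, by = c("country" = "name"))
# All three left rows survive; Narnia has no match on the right,
# so its cabinet value is NA
```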
whr_whogov = whr_subset %>%
  left_join(whogov_subset, by = c("Country_name" = "country_name"))
Check the result:
head(whr_whogov)
  Country_name Perceptions_of_corruption n_individuals           indep
1      Finland                     0.182            23 Non-independent
2      Denmark                     0.196            24 Non-independent
3      Iceland                     0.668            15 Non-independent
4       Israel                     0.708            33 Non-independent
5  Netherlands                     0.379            18 Non-independent
6       Sweden                     0.202            26 Non-independent
Now, to interact variables, we use the asterisk *, i.e., multiplication.
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep, 
    data = whr_whogov)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.54000 -0.05737  0.04300  0.10864  0.24382 
Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     0.575192   0.061646   9.331 5.59e-16 ***
n_individuals                   0.004818   0.002151   2.240   0.0269 *  
indepIndependent                0.386220   0.159570   2.420   0.0170 *  
n_individuals:indepIndependent -0.010703   0.005501  -1.946   0.0540 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
  (10 observations deleted due to missingness)
Multiple R-squared:  0.08324,   Adjusted R-squared:  0.06088 
F-statistic: 3.723 on 3 and 123 DF,  p-value: 0.01326
Let's plot the result. Try changing show_ci to TRUE. Does it explain the p-value now?
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
  plot(show_ci = FALSE) +
  labs(title = "Regression with Interaction Term",
       x = "Number of Individuals in a Cabinet",
       y = "Perception of Corruption",
       color = "Leader's Status") +
  theme_bw()
And you can easily simulate data (i.e., calculate a marginal effect) using ggpredict(). For example:
ggpredict(model_in, terms = c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted |     95% CI
--------------------------------------
           12 |      0.89 | 0.71, 1.07
IV Transformations
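To see where that number comes from, we can reproduce the prediction by hand from the coefficient table above: for an independent leader, both the dummy and the interaction term switch on.

```r
# Coefficients copied from the model output above
b0 <-  0.575192   # intercept
b1 <-  0.004818   # n_individuals
b2 <-  0.386220   # indepIndependent
b3 <- -0.010703   # n_individuals:indepIndependent

# Independent leader with 12 cabinet members
pred <- b0 + b1 * 12 + b2 + b3 * 12
round(pred, 2)  # 0.89, matching ggpredict()
```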
One of the assumptions of OLS is that there is a linear relationship between the dependent and independent variables. However, this is quite often not the case. Let's reverse the log transformation applied in the World Happiness Report data. See below. Does it look linear?
ggplot(whr) +
  geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
  labs(x = "GDP per capita")
It doesn't. It's hard to describe this relationship in a linear manner. A natural log would describe it better, right?
ggplot(whr) +
  geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
  geom_function(fun = function(x){log(x) - 4}) +
  labs(x = "GDP per capita")This is why we use the natural logarithm to transform GDP per capita. The transformation reveals a linear relationship between the two variables, allowing us to capture non-linear patterns in a linear format when using OLS regression. Another commonly used transformation is quadratic (\(x^2\)), which serves the same purpose of addressing non-linear relationships. We call latter ones polynomial terms.
ggplot(whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Exercises
Extract the following variables from the whr dataset: Country_name, Ladder_score, and Continent
Subset the whogov dataset into whogov_sub, leaving only the year 2020.
Now, extract the following variables from the whogov_sub dataset: country_name, n_individuals, system_category
Using left_join(), merge whr_sub and whogov_sub. Don't forget to specify the by argument. You can take a look at how we did it above!
Explore which political regime (system_category) is associated with more happiness (Ladder_score). Name the political regime that is associated with the most happiness.
However, we know that those coefficients are calculated relative to the reference category. We didn’t specify it, so we have no idea against which regime we are comparing the result. Thus, to gain control back, relevel system_category so that the reference category is Mixed democratic. Don’t forget to transform the variable to factor!
Run the regression again. The coefficients did change. But did the regime that is associated with the most happiness change?
Now, run a model explaining happiness (Ladder_score) by the number of individuals in the cabinet (n_individuals). Interpret the results.
Introduce the categorical variable Continent into this model. What do we call it when we introduce categorical variables into the model? How did it change the effect of n_individuals variable? Why?
Create a new subset, leaving only the following continents (Continent): Europe (Europe) and North America (North America)
Create a model, interacting n_individuals and Continent. Plot the results using ggpredict() and plot(). Interpret the result.
Now, change show_ci to TRUE. Visualize the results again. Are they meaningful? Why can it be the case?
Check List
I know that wrong class of a variable can hinder statistical models and visualizations
I know what dummy variables can look like
I know how to relevel a categorical variable
I know how to introduce fixed effects into the model (and I know that I should convert the variable with as.factor()!)
I know how to introduce interaction effects into the model
I know how to customize ggeffects plots
I know how to introduce polynomials to OLS regression