Simple Linear Regression

Week 2

Published

January 16, 2025

Before we start

  • Download the data

  • Organize your directory


Review of the Previous Session

Let’s refresh what we did last week.

Review

First, load the tidyverse library

library(tidyverse)

Then, load the V-Dem data

load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))

You are interested in corruption in various geographical regions. Select year (year), region (e_regionpol_7C) and corruption (v2x_corr) variables from the vdem dataset.

corruption_data = ... %>%
  ...(year, ..., v2x_corr) 

As we are working with the V-Dem data, let’s rename the variables so they are more straightforward.

corruption_data = corruption_data ...
  ...(region = e_regionpol_7C,
      corruption = ...)

Calculate the average of the corruption variable. Don’t forget about the na.rm = TRUE argument!

...

Using group_by() and summarize() calculate the average corruption for all regions in the dataset.

corruption_data ...
  (...) %>%
  ...(average = ...(..., na.rm = T))

Lastly, let’s see the distributions of corruption across regions. Draw a boxplot below with the region variable on the X axis and the corruption variable on the Y axis.

ggplot(data = corruption_data) +
  ...(aes(x = ..., y = ...))

Something is wrong, right? We haven’t checked the classes of the variables, and apparently the region variable has to be changed. Let’s check its class first.

...(corruption_data$...)

What class should it be? Let’s recode it directly on the plot.

...(...) +
  geom_boxplot(aes(x = ...(region), y = ...))

Agenda for Today

  • Clearing the environment

  • Loading CSV data to R

  • Building a simple OLS regression
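For the first agenda item, a common way to clear the environment is to pass everything `ls()` reports to `rm()`. A minimal sketch; note this deletes every object in the session, so re-run your data-loading code afterwards:

```r
# Remove all objects from the global environment
rm(list = ls())

# Confirm the environment is now empty
ls()  # character(0)
```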

Simple Linear Regression

First of all, don’t forget to download the data! Today we are working with the World Happiness Report. For loading the dataset in R, getwd() and setwd() can be helpful. The codebook is here to help us.

  • Country_name is the name of the country

  • Ladder_score is the happiness score

  • Logged_GDP_per_capita is the log of GDP per capita

  • Social_support is an index that measures the extent to which an individual has someone to rely on in times of trouble.

  • Healthy_life_expectancy is the expected age of healthy living.

And many others.
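Before read.csv() can find the file, R’s working directory must contain the data/ folder. A minimal sketch of checking this; the setwd() path is hypothetical — adjust it to your own machine:

```r
# Where is R currently looking for files?
getwd()

# Point R at your project folder if needed (hypothetical path)
# setwd("~/stats-course/week2")

# TRUE if data/WHR.csv is reachable from the working directory
file.exists("data/WHR.csv")
```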

whr = read.csv("data/WHR.csv")

Explore what we have in the dataset by accessing the column names

colnames(whr)
 [1] "Country_name"                             
 [2] "Ladder_score"                             
 [3] "Standard_error_of_ladder_score"           
 [4] "upperwhisker"                             
 [5] "lowerwhisker"                             
 [6] "Logged_GDP_per_capita"                    
 [7] "Social_support"                           
 [8] "Healthy_life_expectancy"                  
 [9] "Freedom_to_make_life_choices"             
[10] "Generosity"                               
[11] "Perceptions_of_corruption"                
[12] "Ladder_score_in_Dystopia"                 
[13] "Explained_by_Log_GDP_per_capita"          
[14] "Explained_by_Social_support"              
[15] "Explained_by_Healthy_life_expectancy"     
[16] "Explained_by_Freedom_to_make_life_choices"
[17] "Explained_by_Generosity"                  
[18] "Explained_by_Perceptions_of_corruption"   
[19] "Dystopia_residual"                        
[20] "Continent"                                

First, let’s draw a histogram for Ladder_score variable.

ggplot(whr) +
  geom_histogram(aes(x = Ladder_score)) +
  labs(x = "Happiness Score") +
  scale_y_continuous(breaks = c(0:10)) +
  theme_classic()

Now, let’s plot happiness scores (Ladder_score) against Social_support. What can we see?

ggplot(whr) +
  geom_point(aes(x = Social_support, y = Ladder_score))

Model Building

Let’s run a simple model, and then check its summary.

basic_model = lm(Ladder_score ~ Social_support, whr)
  
summary(basic_model)

Call:
lm(formula = Ladder_score ~ Social_support, data = whr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76562 -0.36701  0.01165  0.46577  1.49971 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.3428     0.3386  -1.013    0.313    
Social_support   7.3618     0.4183  17.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared:  0.6964,    Adjusted R-squared:  0.6942 
F-statistic: 309.7 on 1 and 135 DF,  p-value: < 2.2e-16

A one-unit increase in Social Support is associated with a 7.4-point increase in the happiness score. What is the maximum value the Happiness Score can take?

max(whr$Ladder_score)
[1] 7.804
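To see why a slope of 7.36 is suspicious here, plug values into the fitted line by hand (using the rounded estimates from the summary above):

```r
b0 <- -0.3428  # intercept from the summary above
b1 <-  7.3618  # slope on Social_support

# Predicted happiness at full social support (Social_support = 1)
b0 + b1 * 1    # about 7.02, close to the observed maximum of 7.804

# Predicted happiness at Social_support = 0.5
b0 + b1 * 0.5  # about 3.34
```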

And now, let’s draw a histogram of the Social Support. So, how much does this model tell us?

ggplot(whr) +
  ...(aes(x = Social_support))

Let’s correct the Social_support variable a bit, transforming it to 0-100 scale. What do you think about the model now? What do you think about \(R^2\)?

whr = whr %>%
  mutate(Social_support_percentage = Social_support * 100)

adjusted_model = lm(Ladder_score ~ Social_support_percentage, whr)
  
summary(adjusted_model)

Call:
lm(formula = Ladder_score ~ Social_support_percentage, data = whr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76562 -0.36701  0.01165  0.46577  1.49971 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -0.342811   0.338568  -1.013    0.313    
Social_support_percentage  0.073618   0.004183  17.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared:  0.6964,    Adjusted R-squared:  0.6942 
F-statistic: 309.7 on 1 and 135 DF,  p-value: < 2.2e-16
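Why are all the fit statistics identical? Rescaling a predictor only rescales its coefficient. A quick sketch with the built-in mtcars data (not the WHR data) demonstrates this:

```r
# Same regression with the predictor on two different scales
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ I(wt * 100), data = mtcars)

# The slope shrinks by a factor of 100 when the predictor grows by 100...
coef(m1)["wt"] / coef(m2)["I(wt * 100)"]  # ~100

# ...while R-squared is unchanged (up to floating point)
all.equal(summary(m1)$r.squared, summary(m2)$r.squared)
```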

Let’s write this regression formula out. Do you remember the general form?

\[ Y = \beta_0 + \beta_1X_1+\epsilon \]

In our case, this can be presented as

\[ \text{Happiness} = -0.34 + 0.07\text{ Social Support} + e \]

Alternatively,

\[ Y = -0.34+0.07x+u \]
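Rather than typing the estimates by hand, you can pull them out of the model object with coef(). A minimal sketch using the built-in mtcars data as a stand-in for adjusted_model:

```r
# Stand-in model on built-in data (replace with adjusted_model in your session)
m <- lm(mpg ~ wt, data = mtcars)

# Named vector with the intercept and slope (roughly 37.29 and -5.34)
b <- coef(m)

# Build the fitted-equation string programmatically
sprintf("mpg = %.2f + %.2f * wt", b[1], b[2])
```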

Now, visualize the regression.

ggplot(whr, aes(x = Social_support_percentage, y = Ladder_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Social Support (%)",
       y = "Happiness Score")

Diagnostics

Let’s analyze the regression. First, extract the residuals and plot their distribution. Does it follow \(N(0, \sigma^2)\)?

res = adjusted_model$residuals

ggplot() +
  geom_histogram(aes(x = res), bins = 20) +
  geom_vline(xintercept = mean(res), color = "red", linewidth = 1.5)
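A normal Q-Q plot is another common normality check. A minimal base-R sketch with simulated residuals (fake_res here is only a stand-in for res above):

```r
# Simulated N(0, 0.63^2) residuals as a stand-in for the model residuals
set.seed(42)
fake_res <- rnorm(137, mean = 0, sd = 0.63)

# If the points hug the reference line, the residuals look roughly normal
qqnorm(fake_res)
qqline(fake_res)
```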

Now we need to check the constant variance assumption. Does it hold? What term is used to describe this satisfied assumption?

yhat = adjusted_model$fitted.values

ggplot() +
  geom_point(aes(x = yhat, y = res)) +
  geom_hline(yintercept = 0, color = "blue") +
  labs(title = "Residuals vs fitted values plot")

Explore different patterns below
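One way to build intuition for these patterns is to simulate them. A base-R sketch of a heteroscedastic “funnel” next to a well-behaved case (all names and numbers here are illustrative):

```r
set.seed(1)
x <- runif(200, 0, 10)

# Residuals whose spread grows with x -- the classic funnel shape
res_hetero <- rnorm(200, mean = 0, sd = 0.3 * x)
plot(x, res_hetero, main = "Heteroscedastic pattern")
abline(h = 0, col = "blue")

# Compare with constant-variance (homoscedastic) residuals
res_homo <- rnorm(200, mean = 0, sd = 1)
plot(x, res_homo, main = "Homoscedastic pattern")
abline(h = 0, col = "blue")
```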

Reporting Regression

This is a brief reminder on how you can present the regression output.

First, let’s load the library modelsummary.

library(modelsummary)

Now, simply run the chunk below. Note the syntax. Let’s spend some time analyzing what’s going on. Compare it with summary().

summary(basic_model)

Call:
lm(formula = Ladder_score ~ Social_support, data = whr)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76562 -0.36701  0.01165  0.46577  1.49971 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.3428     0.3386  -1.013    0.313    
Social_support   7.3618     0.4183  17.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared:  0.6964,    Adjusted R-squared:  0.6942 
F-statistic: 309.7 on 1 and 135 DF,  p-value: < 2.2e-16

modelsummary(basic_model)

                           (1)
(Intercept)                -0.343
                           (0.339)
Social_support              7.362
                           (0.418)
Num.Obs.                  137
R2                          0.696
R2 Adj.                     0.694
AIC                       266.3
BIC                       275.1
Log.Lik.                 -130.170
F                         309.727
RMSE                        0.63

We can present several models easily.

modelsummary(list(basic_model, adjusted_model))
                           (1)        (2)
(Intercept)                -0.343     -0.343
                           (0.339)    (0.339)
Social_support              7.362
                           (0.418)
Social_support_percentage              0.074
                                      (0.004)
Num.Obs.                  137        137
R2                          0.696      0.696
R2 Adj.                     0.694      0.694
AIC                       266.3      266.3
BIC                       275.1      275.1
Log.Lik.                 -130.170   -130.170
F                         309.727    309.727
RMSE                        0.63       0.63

OK, there is a lot going on here. Let’s customize the output so it presents only the relevant information. Once again, pay attention to the syntax and the arguments.

  • coef_rename renames the \(\beta\) coefficients

  • gof_omit removes goodness-of-fit statistics; note that if you want to remove multiple statistics you should separate them with |

  • stars = TRUE adds p-value significance stars

modelsummary(list("Basic model" = basic_model, 
                  "Adjusted model" = adjusted_model),
             coef_rename = c("Intercept", 
                             "Social Support", 
                             "Social Support (%)"),
             gof_omit = "BIC|AIC|RMSE|Log.Lik.",    
             stars = TRUE) 
                     Basic model   Adjusted model
Intercept            -0.343        -0.343
                     (0.339)       (0.339)
Social Support        7.362***
                     (0.418)
Social Support (%)                  0.074***
                                   (0.004)
Num.Obs.            137           137
R2                    0.696         0.696
R2 Adj.               0.694         0.694
F                   309.727       309.727
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Exercises

Now we are working with healthy life expectancy (Healthy_life_expectancy) in the whr dataset. First, draw a histogram. What is the variable’s scale?

Solution

YOUR SOLUTION HERE

...

Using geom_vline() add an average for Healthy_life_expectancy to the plot.

Solution

YOUR SOLUTION HERE

Insert a chunk below and build an lm() model where the dependent variable is Happiness score (Ladder_score) and the independent variable is Healthy_life_expectancy. Present the summary.

Solution

YOUR SOLUTION HERE

Interpret coefficients and \(R^2\).

Solution

YOUR SOLUTION HERE

Using \(\LaTeX\) write out the regression function below.

Solution

\[ \text{Happiness Score} = ... \]

Using ggplot() plot Happiness score (Ladder_score) against Healthy_life_expectancy. Add a regression line.

Solution
ggplot(...) +
  geom_...() +
  geom_smooth()

Extract the residuals from the model and plot their distribution.

Solution

YOUR SOLUTION HERE

Name X and Y axes. Title the plot. Add a theme_bw() to the plot.

Solution

YOUR SOLUTION HERE

Are residuals distributed approximately normally?

Solution

YOUR SOLUTION HERE

Draw residuals vs fitted plot. Is the data homoscedastic or heteroscedastic? Why?

Solution

YOUR SOLUTION HERE

Using modelsummary() present the regression output. Don’t forget to customize it.

Solution

YOUR SOLUTION HERE

Check List

I know that wrong class of a variable can hinder statistical models and visualizations

I know how directories work and how to load data in R

I know how to calculate and visualize a simple regression

I know that I can adjust the scales of the variable to ease the interpretation

I know how to access the residuals of a model for further diagnostics