library(tidyverse)
Simple Linear Regression
Week 2
Before we start
- Download the data 
- Organize your directory 
Review of the Previous Session
Let’s refresh what we did last week.
First, load the tidyverse library
Then, load the V-Dem data
load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))
You are interested in corruption in various geographical regions. Select the year (year), region (e_regionpol_7C), and corruption (v2x_corr) variables from the vdem dataset.
corruption_data = ... %>%
  ...(year, ..., v2x_corr)
As we are working with the V-Dem data, let’s rename the variables so they are more straightforward.
corruption_data = corruption_data ...
  ...(region = e_regionpol_7C,
      corruption = ...)
Calculate the average of the corruption variable. Don’t forget the na.rm = TRUE argument!
...
Using group_by() and summarize(), calculate the average corruption for all regions in the dataset.
corruption_data ...
  (...) %>%
  ...(average = ...(..., na.rm = T))
Lastly, let’s look at the distributions of corruption across regions. Draw a boxplot below with the region variable on the X axis and the corruption variable on the Y axis.
ggplot(data = corruption_data) +
  ...(aes(x = ..., y = ...))
Something is wrong, right? We haven’t checked the classes of the variables, and apparently the region variable has to be changed. Let’s check its class first.
...(corruption_data$...)
What class should it be? Let’s recode it directly on the plot.
...(...) +
  geom_boxplot(aes(x = ...(region), y = ...))
Agenda for Today
- Clearing the environment 
- Loading CSV data to R 
- Building a simple OLS regression 
Simple Linear Regression
First of all, don’t forget to download the data! Today we are working with the World Happiness Report. When loading the dataset into R, getwd() and setwd() can be helpful. The codebook is here to help us.
- Country_name is the name of the country
- Ladder_score is the happiness score
- Logged_GDP_per_capita is the log of GDP per capita
- Social_support is an index that measures the extent to which an individual has someone to rely on in times of trouble.
- Healthy_life_expectancy is the expected age of healthy living.
And many others.
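The getwd()/setwd() workflow mentioned above can look like the sketch below (the path is hypothetical; adjust it to your own machine):

```r
getwd()                        # where R is currently looking for files
# setwd("C:/Users/me/week2")   # hypothetical path: point R at your project folder
file.exists("data/WHR.csv")    # TRUE if the file can be found from here
```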
whr = read.csv("data/WHR.csv")
Explore what we have in the dataset by accessing the column names.
colnames(whr)
 [1] "Country_name"                             
 [2] "Ladder_score"                             
 [3] "Standard_error_of_ladder_score"           
 [4] "upperwhisker"                             
 [5] "lowerwhisker"                             
 [6] "Logged_GDP_per_capita"                    
 [7] "Social_support"                           
 [8] "Healthy_life_expectancy"                  
 [9] "Freedom_to_make_life_choices"             
[10] "Generosity"                               
[11] "Perceptions_of_corruption"                
[12] "Ladder_score_in_Dystopia"                 
[13] "Explained_by_Log_GDP_per_capita"          
[14] "Explained_by_Social_support"              
[15] "Explained_by_Healthy_life_expectancy"     
[16] "Explained_by_Freedom_to_make_life_choices"
[17] "Explained_by_Generosity"                  
[18] "Explained_by_Perceptions_of_corruption"   
[19] "Dystopia_residual"                        
[20] "Continent"                                
First, let’s draw a histogram for the Ladder_score variable.
ggplot(whr) +
  geom_histogram(aes(x = Ladder_score)) +
  labs(x = "Happiness Score") +
  scale_y_continuous(breaks = c(0:10)) +
  theme_classic()
Now, let’s plot happiness scores (Ladder_score) against Social_support. What can we see?
ggplot(whr) +
  geom_point(aes(x = Social_support, y = Ladder_score))
Model Building
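Before fitting a model, the strength of the linear association visible in the scatterplot can be quantified with cor() (a quick sketch; use = "complete.obs" drops rows with missing values):

```r
# Pearson correlation between social support and happiness;
# "complete.obs" ignores observations with missing values
cor(whr$Social_support, whr$Ladder_score, use = "complete.obs")
```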
Let’s run a simple model, and then check its summary.
basic_model = lm(Ladder_score ~ Social_support, whr)
  
summary(basic_model)
Call:
lm(formula = Ladder_score ~ Social_support, data = whr)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.76562 -0.36701  0.01165  0.46577  1.49971 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.3428     0.3386  -1.013    0.313    
Social_support   7.3618     0.4183  17.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared:  0.6964,    Adjusted R-squared:  0.6942 
F-statistic: 309.7 on 1 and 135 DF,  p-value: < 2.2e-16
A one-unit increase in Social Support is associated with a 7.4-point increase in the happiness score. What is the maximum value the Happiness Score can take?
max(whr$Ladder_score)
[1] 7.804
And now, let’s draw a histogram of Social Support. So, how much does this model tell us?
ggplot(whr) +
  ...(aes(x = Social_support))
Let’s adjust the Social_support variable a bit, transforming it to a 0–100 scale. What do you think about the model now? What do you think about \(R^2\)?
whr = whr %>%
  mutate(Social_support_percentage = Social_support * 100)
adjusted_model = lm(Ladder_score ~ Social_support_percentage, whr)
  
summary(adjusted_model)
Call:
lm(formula = Ladder_score ~ Social_support_percentage, data = whr)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.76562 -0.36701  0.01165  0.46577  1.49971 
Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -0.342811   0.338568  -1.013    0.313    
Social_support_percentage  0.073618   0.004183  17.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared:  0.6964,    Adjusted R-squared:  0.6942 
F-statistic: 309.7 on 1 and 135 DF,  p-value: < 2.2e-16
Let’s write this regression formula out. Do you remember the general form?
\[ Y = \beta_0 + \beta_1X_1+\epsilon \]
In our case, this can be presented as
\[ \text{Happiness} = -0.34 + 0.07\,\text{Social Support} + e \]
Alternatively,
\[ Y = -0.34+0.07x+u \]
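As a sanity check on the fitted equation, we can compute a prediction by hand from the coefficients and compare it with predict() (a sketch; 80% social support is an arbitrary example value):

```r
b = coef(adjusted_model)   # intercept and slope
b[1] + b[2] * 80           # predicted happiness at 80% social support, by hand
predict(adjusted_model,
        newdata = data.frame(Social_support_percentage = 80))  # same via predict()
```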
Now, visualize the regression.
ggplot(whr, aes(x = Social_support_percentage, y = Ladder_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Social Support (%)",
       y = "Happiness Score")
Diagnostics
Let’s analyze the regression. First, extract the residuals and plot their distribution. Does it follow \(N(0, \sigma^2)\)?
res = adjusted_model$residuals
ggplot() +
  geom_histogram(aes(x = res), bins = 20) +
  geom_vline(xintercept = mean(res), color = "red", linewidth = 1.5)
Now we need to check the constant variance assumption. Does it hold? What is the term for data that satisfy this assumption?
yhat = adjusted_model$fitted.values
ggplot() +
  geom_point(aes(x = yhat, y = res)) +
  geom_hline(yintercept = 0, color = "blue") +
  labs(title = "Residuals vs fitted values plot")
Explore different patterns below.
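The visual check of the residuals-vs-fitted plot can be complemented by a formal test for heteroscedasticity, for example the Breusch–Pagan test from the lmtest package (assuming the package is installed):

```r
# install.packages("lmtest")   # if not installed yet
library(lmtest)
bptest(adjusted_model)   # Breusch-Pagan: a low p-value suggests heteroscedasticity
```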
Reporting Regression
This is a brief reminder on how you can present the regression output.
First, let’s load the library modelsummary.
library(modelsummary)
Now, simply run the chunk below. Note the syntax. Let’s spend some time analyzing what’s going on. Compare it with summary().
summary(basic_model)
Call:
lm(formula = Ladder_score ~ Social_support, data = whr)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.76562 -0.36701  0.01165  0.46577  1.49971 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.3428     0.3386  -1.013    0.313    
Social_support   7.3618     0.4183  17.599   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared:  0.6964,    Adjusted R-squared:  0.6942 
F-statistic: 309.7 on 1 and 135 DF,  p-value: < 2.2e-16
modelsummary(basic_model)
|                | (1)      |
|----------------|----------|
| (Intercept)    | -0.343   |
|                | (0.339)  |
| Social_support | 7.362    |
|                | (0.418)  |
| Num.Obs.       | 137      |
| R2             | 0.696    |
| R2 Adj.        | 0.694    |
| AIC            | 266.3    |
| BIC            | 275.1    |
| Log.Lik.       | -130.170 |
| F              | 309.727  |
| RMSE           | 0.63     |
We can present several models easily.
modelsummary(list(basic_model, adjusted_model))
|                           | (1)      | (2)      |
|---------------------------|----------|----------|
| (Intercept)               | -0.343   | -0.343   |
|                           | (0.339)  | (0.339)  |
| Social_support            | 7.362    |          |
|                           | (0.418)  |          |
| Social_support_percentage |          | 0.074    |
|                           |          | (0.004)  |
| Num.Obs.                  | 137      | 137      |
| R2                        | 0.696    | 0.696    |
| R2 Adj.                   | 0.694    | 0.694    |
| AIC                       | 266.3    | 266.3    |
| BIC                       | 275.1    | 275.1    |
| Log.Lik.                  | -130.170 | -130.170 |
| F                         | 309.727  | 309.727  |
| RMSE                      | 0.63     | 0.63     |
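The goodness-of-fit statistics shown in these tables can also be pulled programmatically, for example with glance() from the broom package (installed as part of the tidyverse):

```r
library(broom)
glance(basic_model)   # R2, adjusted R2, AIC, BIC, etc. as a one-row tibble
```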
Ok, there is a lot going on. Let’s customize the output so it presents only the relevant information. Once again, pay attention to the syntax and the arguments.
- coef_rename renames the \(\beta\) coefficients
- gof_omit removes statistics; note that if you want to remove multiple statistics, you should separate them with |
- stars = TRUE adds p-value stars
modelsummary(list("Basic model" = basic_model, 
                  "Adjusted model" = adjusted_model),
             coef_rename = c("Intercept", 
                             "Social Support", 
                             "Social Support (%)"),
             gof_omit = "BIC|AIC|RMSE|Log.Lik.",    
             stars = TRUE)
|                    | Basic model | Adjusted model |
|--------------------|-------------|----------------|
| Intercept          | -0.343      | -0.343         |
|                    | (0.339)     | (0.339)        |
| Social Support     | 7.362***    |                |
|                    | (0.418)     |                |
| Social Support (%) |             | 0.074***       |
|                    |             | (0.004)        |
| Num.Obs.           | 137         | 137            |
| R2                 | 0.696       | 0.696          |
| R2 Adj.            | 0.694       | 0.694          |
| F                  | 309.727     | 309.727        |
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
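The same modelsummary() call can write the table straight to a file via the output argument (a sketch; the filename is arbitrary):

```r
modelsummary(list("Basic model" = basic_model,
                  "Adjusted model" = adjusted_model),
             stars = TRUE,
             output = "regression_table.html")  # also supports .docx, .tex, .md
```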
Exercises
Now we are working with healthy life expectancy (Healthy_life_expectancy) in the whr dataset. First, draw a histogram. What is the variable’s scale?
Using geom_vline(), add the average of Healthy_life_expectancy to the plot.
Insert a chunk below and build an lm() model where the dependent variable is the Happiness score (Ladder_score) and the independent variable is Healthy_life_expectancy. Present the summary.
Interpret coefficients and \(R^2\).
Using \(\LaTeX\), write out the regression function below.
Using ggplot(), plot the Happiness score (Ladder_score) against Healthy_life_expectancy. Add a regression line.
Extract the residuals from the model and plot their distribution.
Name the X and Y axes. Title the plot. Add theme_bw() to the plot.
Are residuals distributed approximately normally?
Draw a residuals vs fitted plot. Are the data homoscedastic or heteroscedastic? Why?
Using modelsummary() present the regression output. Don’t forget to customize it.
Check List
I know that the wrong class of a variable can hinder statistical models and visualizations
I know how directories work and how to load data in R
I know how to calculate and visualize a simple regression
I know that I can adjust the scales of variables to ease interpretation
I know how to access the residuals of a model for further diagnostics