Simple Linear Regression
Week 2
Before we start
Download the data
Organize your directory
Review of the Previous Session
Let’s refresh what we did last week.
First, load the tidyverse.
library(...)
Then, load the V-Dem data
load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))
You are interested in corruption in various geographical regions. Select the year (year), region (e_regionpol_7C), and corruption (v2x_corr) variables from the vdem dataset.
corruption_data = ... %>%
  ...(year, ..., v2x_corr)
As we are working with the V-Dem data, let’s rename variables so it’s more straightforward.
corruption_data = corruption_data %>%
  ...(region = e_regionpol_7C,
      corruption = ...)
Calculate the average of the corruption variable. Don’t forget about the na.rm = TRUE argument!
...
Using group_by() and summarize(), calculate the average corruption for all regions in the dataset.
corruption_data %>%
  ...(...) %>%
  ...(average = ...(..., na.rm = T))
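If you want to see the full pattern on a neutral dataset before filling in the blanks, here is a minimal sketch using the built-in mtcars data (none of the vdem variables are involved):

```r
library(dplyr)

# group_by() splits the rows by a grouping variable;
# summarize() then collapses each group into a single row
mtcars %>%
  group_by(cyl) %>%
  summarize(average = mean(mpg, na.rm = TRUE))
```

The same two verbs, with vdem variable names in place of cyl and mpg, fill in the chunk above.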
Lastly, let’s see the distributions of corruption across regions. Draw a boxplot below with the region variable on the X axis and the corruption variable on the Y axis.
ggplot(data = corruption_data) +
...(aes(x = ..., y = ...))
Something is wrong, right? We haven’t checked the classes of the variables, and apparently the region variable has to be changed. Let’s check its class first.
...(corruption_data$...)
What class should it be? Let’s recode it directly on the plot.
...(...) +
geom_boxplot(aes(x = ...(region), y = ...))
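As a reminder of why the class matters: geom_boxplot() needs a discrete variable on the X axis, so a numeric region code has to be coerced with factor(). A minimal sketch with the built-in mtcars data, where cyl plays the role of the numeric region code:

```r
library(ggplot2)

class(mtcars$cyl)          # "numeric" -- ggplot would treat it as continuous
class(factor(mtcars$cyl))  # "factor"  -- now treated as discrete categories

ggplot(mtcars) +
  geom_boxplot(aes(x = factor(cyl), y = mpg))  # one box per category
```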
Agenda for Today
Clearing the environment
Loading CSV data to R
Building a simple OLS regression
Simple Linear Regression
First of all, don’t forget to download the data! Today we are working with the World Happiness Report. For loading the dataset in R, getwd()
and setwd()
can be helpful. The codebook is here to help us.
Country_name is the name of the country.
Ladder_score is the happiness score.
Logged_GDP_per_capita is the log of GDP per capita.
Social_support is an index that measures the extent to which an individual has someone to rely on in times of trouble.
Healthy_life_expectancy is the expected age of healthy living.
And many others.
whr = read.csv("data/WHR.csv")
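If read.csv() complains that it cannot open the file, R is probably looking in the wrong folder. A quick sketch of how getwd() and setwd() help (the path below is hypothetical; substitute the folder that actually contains data/WHR.csv):

```r
getwd()                     # where R currently resolves relative paths

# setwd("~/courses/week2")  # hypothetical path; point it at your project folder

list.files("data")          # should list "WHR.csv" once the directory is right
```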
Explore what we have in the dataset by accessing the column names
colnames(whr)
[1] "Country_name"
[2] "Ladder_score"
[3] "Standard_error_of_ladder_score"
[4] "upperwhisker"
[5] "lowerwhisker"
[6] "Logged_GDP_per_capita"
[7] "Social_support"
[8] "Healthy_life_expectancy"
[9] "Freedom_to_make_life_choices"
[10] "Generosity"
[11] "Perceptions_of_corruption"
[12] "Ladder_score_in_Dystopia"
[13] "Explained_by_Log_GDP_per_capita"
[14] "Explained_by_Social_support"
[15] "Explained_by_Healthy_life_expectancy"
[16] "Explained_by_Freedom_to_make_life_choices"
[17] "Explained_by_Generosity"
[18] "Explained_by_Perceptions_of_corruption"
[19] "Dystopia_residual"
[20] "Continent"
First, let’s draw a histogram for the Ladder_score variable.
ggplot(whr) +
geom_histogram(aes(x = Ladder_score)) +
labs(x = "Happiness Score") +
scale_y_continuous(breaks = c(0:10)) +
theme_classic()
Now, let’s plot happiness scores (Ladder_score) against Social_support. What can we see?
ggplot(whr) +
geom_point(aes(x = Social_support, y = Ladder_score))
Model Building
Let’s run a simple model, and then check its summary.
basic_model = lm(Ladder_score ~ Social_support, whr)
summary(basic_model)
Call:
lm(formula = Ladder_score ~ Social_support, data = whr)
Residuals:
Min 1Q Median 3Q Max
-1.76562 -0.36701 0.01165 0.46577 1.49971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3428 0.3386 -1.013 0.313
Social_support 7.3618 0.4183 17.599 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared: 0.6964, Adjusted R-squared: 0.6942
F-statistic: 309.7 on 1 and 135 DF, p-value: < 2.2e-16
A one unit increase in Social Support is associated with a 7.4 increase in the happiness score. What is the maximum value the Happiness Score can take?
max(whr$Ladder_score)
[1] 7.804
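To see why the raw slope reads oddly, note that Social_support is an index bounded between 0 and 1, so the fitted line only ever spans one "unit" of X. A quick check with the rounded coefficients reported in the summary above:

```r
b0 <- -0.3428   # intercept from the summary above
b1 <-  7.3618   # Social_support slope from the summary above

b0 + b1 * 1     # predicted happiness at the index maximum of 1: about 7.02
```

The full-scale jump from 0 to 1 predicts a score of roughly 7.02, which is below the observed maximum of 7.804, so the "7.4 per unit" coefficient is not as absurd as it first looks.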
And now, let’s draw a histogram of Social Support. So, how much does this model tell us?
ggplot(whr) +
...(aes(x = Social_support))
Let’s correct the Social_support variable a bit, transforming it to a 0–100 scale. What do you think about the model now? What do you think about \(R^2\)?
whr = whr %>%
  mutate(Social_support_percentage = Social_support * 100)

adjusted_model = lm(Ladder_score ~ Social_support_percentage, whr)
summary(adjusted_model)
Call:
lm(formula = Ladder_score ~ Social_support_percentage, data = whr)
Residuals:
Min 1Q Median 3Q Max
-1.76562 -0.36701 0.01165 0.46577 1.49971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.342811 0.338568 -1.013 0.313
Social_support_percentage 0.073618 0.004183 17.599 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared: 0.6964, Adjusted R-squared: 0.6942
F-statistic: 309.7 on 1 and 135 DF, p-value: < 2.2e-16
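The point that rescaling X changes only the slope, not the quality of the fit, can be verified on any dataset; a minimal sketch with the built-in mtcars data:

```r
m1 <- lm(mpg ~ wt, data = mtcars)              # weight in its original units
mtcars$wt100 <- mtcars$wt * 100                # same predictor, multiplied by 100
m2 <- lm(mpg ~ wt100, data = mtcars)

coef(m1)["wt"] / coef(m2)["wt100"]             # slopes differ by exactly the factor 100
summary(m1)$r.squared - summary(m2)$r.squared  # R-squared is identical (difference ~ 0)
```

This mirrors what happened above: the slope went from 7.362 to 0.074, while \(R^2\), the residuals, and the F-statistic stayed exactly the same.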
Let’s write this regression formula out. Do you remember the general form?
\[ Y = \beta_0 + \beta_1X_1+\epsilon \]
In our case, this can be presented as
\[ \text{Happiness} = -0.34 + 0.07\text{ Social Support} + e \]
Alternatively,
\[ Y = -0.34+0.07x+u \]
Now, visualize the regression.
ggplot(whr, aes(x = Social_support_percentage, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Social Support (%)",
y = "Happiness Score")
Diagnostics
Let’s analyze the regression. First, extract the residuals and plot their distribution. Does it follow \(N(0, \sigma^2)\)?
res = adjusted_model$residuals
ggplot() +
geom_histogram(aes(x = res), bins = 20) +
geom_vline(xintercept = mean(res), color = "red", size = 1.5)
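Two quick numerical companions to the histogram, shown here on a stand-in model fitted to the built-in mtcars data since the logic is identical: with an intercept in the model, OLS residuals average to zero by construction, and shapiro.test() offers a formal normality check to back up the eyeball test.

```r
m <- lm(mpg ~ wt, data = mtcars)

mean(residuals(m))          # essentially zero whenever an intercept is included
shapiro.test(residuals(m))  # H0: residuals are normal; a small p-value rejects it
```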
Now we need to check the constant variance assumption. Does it hold? What term is used to describe this satisfied assumption?
yhat = adjusted_model$fitted.values
ggplot() +
geom_point(aes(x = yhat, y = res)) +
geom_hline(yintercept = 0, color = "blue") +
labs(title = "Residuals vs fitted values plot")
Explore different patterns below
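To build intuition for those patterns, you can simulate data where the error variance grows with X and look at the resulting residuals-vs-fitted plot; this is a sketch with made-up names and parameters, not part of the WHR analysis:

```r
library(ggplot2)

set.seed(1)
x <- runif(200, 0, 10)
y <- 1 + 2 * x + rnorm(200, sd = 0.3 * x)  # noise grows with x: heteroscedastic by design

m_het <- lm(y ~ x)

ggplot() +
  geom_point(aes(x = m_het$fitted.values, y = m_het$residuals)) +
  geom_hline(yintercept = 0, color = "blue")  # a fan shape, not an even band
```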
Reporting Regression
This is a brief reminder on how you can present the regression output.
First, let’s load the library modelsummary
.
library(modelsummary)
Now, simply run the chunk below. Note the syntax. Let’s spend some time analyzing what’s going on. Compare it with summary()
.
summary(basic_model)
Call:
lm(formula = Ladder_score ~ Social_support, data = whr)
Residuals:
Min 1Q Median 3Q Max
-1.76562 -0.36701 0.01165 0.46577 1.49971
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3428 0.3386 -1.013 0.313
Social_support 7.3618 0.4183 17.599 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6304 on 135 degrees of freedom
Multiple R-squared: 0.6964, Adjusted R-squared: 0.6942
F-statistic: 309.7 on 1 and 135 DF, p-value: < 2.2e-16
modelsummary(basic_model)
(1) | |
---|---|
(Intercept) | -0.343 |
(0.339) | |
Social_support | 7.362 |
(0.418) | |
Num.Obs. | 137 |
R2 | 0.696 |
R2 Adj. | 0.694 |
AIC | 266.3 |
BIC | 275.1 |
Log.Lik. | -130.170 |
F | 309.727 |
RMSE | 0.63 |
We can present several models easily.
modelsummary(list(basic_model, adjusted_model))
(1) | (2) | |
---|---|---|
(Intercept) | -0.343 | -0.343 |
(0.339) | (0.339) | |
Social_support | 7.362 | |
(0.418) | ||
Social_support_percentage | 0.074 | |
(0.004) | ||
Num.Obs. | 137 | 137 |
R2 | 0.696 | 0.696 |
R2 Adj. | 0.694 | 0.694 |
AIC | 266.3 | 266.3 |
BIC | 275.1 | 275.1 |
Log.Lik. | -130.170 | -130.170 |
F | 309.727 | 309.727 |
RMSE | 0.63 | 0.63 |
Ok, there is a lot going on. Let’s customize the output so it presents only the relevant information. Once again, pay attention to the syntax and the arguments.
coef_rename renames the \(\beta\) coefficients
gof_omit removes goodness-of-fit statistics; note that if you want to remove multiple statistics you should separate them with |
stars = TRUE adds significance stars based on p-values
modelsummary(list("Basic model" = basic_model,
"Adjusted model" = adjusted_model),
coef_rename = c("Intercept",
"Social Support",
"Social Support (%)"),
gof_omit = "BIC|AIC|RMSE|Log.Lik.",
stars = TRUE)
Basic model | Adjusted model | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
Intercept | -0.343 | -0.343 |
(0.339) | (0.339) | |
Social Support | 7.362*** | |
(0.418) | ||
Social Support (%) | 0.074*** | |
(0.004) | ||
Num.Obs. | 137 | 137 |
R2 | 0.696 | 0.696 |
R2 Adj. | 0.694 | 0.694 |
F | 309.727 | 309.727 |
Exercises
Now we are working with healthy life expectancy (Healthy_life_expectancy) in the whr dataset. First, draw a histogram. What is the variable’s scale?
Using geom_vline(), add the average of Healthy_life_expectancy to the plot.
Insert a chunk below and build an lm() model where the dependent variable is the happiness score (Ladder_score) and the independent variable is Healthy_life_expectancy. Present the summary.
Interpret coefficients and \(R^2\).
Using \(\LaTeX\) write out the regression function below.
Using ggplot(), plot the happiness score (Ladder_score) against Healthy_life_expectancy. Add a regression line.
Extract the residuals from the model and plot their distribution.
Name the X and Y axes. Title the plot. Add theme_bw() to the plot.
Are residuals distributed approximately normally?
Draw residuals vs fitted plot. Is the data homoscedastic or heteroscedastic? Why?
Using modelsummary(), present the regression output. Don’t forget to customize it.
Check List
I know that wrong class of a variable can hinder statistical models and visualizations
I know how directories work and how to load data in R
I know how to calculate and visualize a simple regression
I know that I can adjust the scales of variables to ease interpretation
I know how to access the residuals of a model for further diagnostics