library(tidyverse)
Model Diagnostics
Week 8
Before we start
Any questions?
We’re almost done with the class! What do you think we should recap? Feel free to leave any thoughts in this form after the class. When thinking about what we have covered, don’t forget that each lab has a “Check List” at the end.
To get the extra points, you only need to complete 5 of the exercises.
Review of the previous week
Load the tidyverse library.
Load the Comparative Political Dataset data (cpds.xlsx).
Let’s attempt to explain Total expenditure on health as a percentage of GDP (health_pmp). Regress it against Total labour force as a percentage of population (labfopar) and Total reported union members (grossu). Set up a model using lm().
Using plot(), present the residuals vs fitted graph. Is it homoscedastic?
Add country and year fixed effects. Don’t forget about the variables’ class!
Now, plot the residuals vs fitted. How did it change? Is it homoscedastic?
Using bptest() from the lmtest library, test your assumption. Interpret the result.
Using lm_robust() from the estimatr library, run the same model with HC2 standard errors. Did the result change?
model_se = ...
Run the same model, but cluster the standard errors by country.
model_cl = ...
Using modelsummary(), present three models: with fixed effects, with fixed effects and robust standard errors, and with fixed effects and robust standard errors clustered by country. Briefly compare them.
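A minimal sketch of this review workflow, assuming cpds.xlsx lives in a data/ folder and contains the columns named in the prompts above:

```r
library(readxl)    # read_excel()
library(lmtest)    # bptest()
library(estimatr)  # lm_robust()

# Assumed file location and column names, per the prompts above
cpds = read_excel("data/cpds.xlsx")

# Baseline model and residuals vs fitted plot
model = lm(health_pmp ~ labfopar + grossu, cpds)
plot(model, 1)

# Country and year fixed effects: coerce both variables to factor
model_fe = lm(health_pmp ~ labfopar + grossu +
                factor(country) + factor(year), cpds)
bptest(model_fe)  # formal test for heteroscedasticity

# HC2 robust and country-clustered standard errors
model_se = lm_robust(health_pmp ~ labfopar + grossu +
                       factor(country) + factor(year),
                     data = cpds, se_type = "HC2")
model_cl = lm_robust(health_pmp ~ labfopar + grossu +
                       factor(country) + factor(year),
                     data = cpds, clusters = country)
```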
Agenda
Refreshing homoscedasticity assumption
Discuss Influential Points and outliers
Check for Normality of Errors
Test models for Multicollinearity
Homoscedasticity
Today we are working with the European Social Survey. You can access their documentation here. I have pre-processed the data for you; you can find how I did it here. We are interested in explaining why people trust or distrust each other.
We are going to use the following variables:
pplfair: Most people try to take advantage of you, or try to be fair (0: Most people try to take advantage of me; 10: Most people try to be fair)
trstplt: Trust in politicians (0: no trust; 10: complete trust)
trstprt: Trust in political parties
edulvlb: Education level
Load the dataset.
ess = read.csv("data/ess.csv")
Let’s wrangle the data first. To ease the comprehension, rename the variables.
ess = ess %>%
  rename(trust = pplfair,
         trust_politicians = trstplt,
         trust_parties = trstprt,
         education = edulvlb)
head(ess)
cntry trust trust_politicians trust_parties rlgdgr happy education
1 AT 6.385696 3.773545 3.723961 4.612917 7.781570 1.243568
2 BE 6.076585 4.172808 3.942565 4.145283 7.778545 1.642675
3 CH 6.431571 5.540824 5.292646 4.482909 8.154348 1.498912
4 CY 4.428363 2.681481 2.497041 6.756598 6.969118 1.469173
5 DE 6.177006 4.004165 3.974069 3.922599 7.762929 1.435993
6 ES 5.578890 2.784035 2.776864 4.104518 7.847991 1.438520
Set up the model. Analyze the regression output.
model_trust = lm(trust ~ trust_politicians + education + trust_parties, ess)
summary(model_trust)
Call:
lm(formula = trust ~ trust_politicians + education + trust_parties,
data = ess)
Residuals:
Min 1Q Median 3Q Max
-0.75756 -0.26177 -0.07697 0.29635 1.03112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.01778 0.88503 1.150 0.264
trust_politicians 0.04732 0.66492 0.071 0.944
education 1.97227 0.73235 2.693 0.014 *
trust_parties 0.45800 0.66839 0.685 0.501
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4472 on 20 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared: 0.7114
F-statistic: 19.9 on 3 and 20 DF, p-value: 3.22e-06
Let’s plot the residuals vs fitted graph. Is it homoscedastic?
plot(model_trust, 1)
Alternatively, feel free to use ggplot() for this purpose.
ggplot(model_trust, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(se = FALSE)
Let’s test our assumption formally. All good! Refer to Lab 7 if you are interested in how to approach fixing a heteroscedasticity problem.
library(lmtest)
bptest(model_trust)
studentized Breusch-Pagan test
data: model_trust
BP = 3.8039, df = 3, p-value = 0.2834
Influential Points
Influential points can drive our results significantly! Let’s try to find out whether we have any in our small dataset. Here we are going to use Cook’s distance. Generally, there are various threshold suggestions, but in this class we stick to the following formula for the cutoff:
\[ D_{cutoff} = \frac{4}{N-k-1} \] Where \(N\) is the number of observations and \(k\) is the number of covariates. You don’t have to memorize it, but the denominator is simply the residual degrees of freedom.
Calculate the cutoff
cutoff = 4 / model_trust$df.residual
cutoff
[1] 0.2
Calculate Cook’s distance for each observation.
cooks.distance(model_trust)
1 2 3 4 5 6
0.2719435794 0.0145865721 0.0367187308 0.2316452893 0.0064423418 0.0166924949
7 8 9 10 11 12
0.0148949821 0.2684953279 0.0513101291 0.0090977971 0.1002006909 0.0527893685
13 14 15 16 17 18
0.0053112368 0.0341848255 0.0593545749 0.0015258481 0.0096398014 0.0393548089
19 20 21 22 23 24
0.0397060110 0.0001259869 0.0126482655 0.0012011773 0.0041391820 0.0795677570
Visualize what we’ve got. With the given cutoff, there are three influential observations.
plot(model_trust, 4)
abline(h = cutoff, lty = 2)
Alternatively, we can use ggplot()
.
ggplot(model_trust, aes(seq_along(.cooksd), .cooksd)) +
geom_col() +
geom_hline(yintercept = cutoff)
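To list the flagged observations rather than read them off the plot, you can compare the distances to the cutoff directly:

```r
# Row indices of observations whose Cook's distance exceeds the cutoff
d = cooks.distance(model_trust)
which(d > cutoff)  # here: observations 1, 4 and 8
```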
Set up models with and without the influential points. Our goal is to understand their impact: do they drive the results? What has changed? General advice: stick to the more conservative model.
library(modelsummary)
model_trust_wo = lm(trust ~ trust_politicians + education + trust_parties, ess[-c(1, 4, 8),])

models = list(
  "Full Model" = model_trust,
  "Without Outliers" = model_trust_wo)
modelsummary(models,
gof_omit = "AIC|BIC|Log.Lik.|F|RMSE",
stars = T)
|                   | Full Model | Without Outliers |
|-------------------|------------|------------------|
| (Intercept)       | 1.018      | 0.125            |
|                   | (0.885)    | (0.677)          |
| trust_politicians | 0.047      | -0.234           |
|                   | (0.665)    | (0.610)          |
| education         | 1.972*     | 2.885***         |
|                   | (0.732)    | (0.580)          |
| trust_parties     | 0.458      | 0.617            |
|                   | (0.668)    | (0.628)          |
| Num.Obs.          | 24         | 21               |
| R2                | 0.749      | 0.871            |
| R2 Adj.           | 0.711      | 0.849            |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Normality of Errors
Another important prerequisite is the normal distribution of errors (i.e., normality). Once again, we can use plot(). First, visualize the plot without outliers.
plot(model_trust_wo, 2)
Now, with outliers. Compare the two graphs.
plot(model_trust, 2)
How different are they? Visual inspection alone is not enough; we need formal tests. How would you interpret the Shapiro-Wilk tests below?
shapiro.test(model_trust$residuals)
Shapiro-Wilk normality test
data: model_trust$residuals
W = 0.96829, p-value = 0.6249
shapiro.test(model_trust_wo$residuals)
Shapiro-Wilk normality test
data: model_trust_wo$residuals
W = 0.94018, p-value = 0.2197
Of course, you can visualize this with ggplot(), too. More details on visualizing base R diagnostic plots in ggplot are available here.
ggplot(model_trust) +
stat_qq(aes(sample = .stdresid)) +
geom_abline()
Ok, what should we do if residuals are not normally distributed? Transform the DV and/or IVs, in much the same way we did in Lab 5. A slightly more formal way of doing this is available here.
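As a hypothetical illustration (our residuals above pass the Shapiro-Wilk test, so no fix is actually needed here), one could log-transform the DV and re-run the normality checks:

```r
# Hypothetical: log-transform the DV, then re-check the diagnostics.
# trust is strictly positive in this dataset, so log() is defined.
model_trust_log = lm(log(trust) ~ trust_politicians + education + trust_parties, ess)
plot(model_trust_log, 2)                  # QQ plot of the new residuals
shapiro.test(residuals(model_trust_log))  # formal normality test
```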
Let’s present the summary diagnostics.
par(mfrow = c(2, 2))
plot(model_trust)
Check the other possible diagnostic plots below.
?plot.lm
Multicollinearity
Finally, multicollinearity. We do not want to have perfectly collinear variables. Let’s check this using ggpairs() from the GGally library.
library(GGally)
ggpairs(ess, columns = c("trust", "trust_politicians", "trust_parties", "education"))
Apparently, trust in politicians and trust in political parties capture overlapping concepts. Indeed, their correlation is \(\rho = 0.99\). More formally, we can calculate the variance inflation factor (VIF). Use the vif() function from the car library. The rule of thumb is: if VIF is greater than 10, we have multicollinearity.
library(car)
vif(model_trust)
trust_politicians education trust_parties
49.546656 1.503323 49.792787
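If you wonder where these numbers come from: a variable’s VIF is \(1/(1-R^2)\) from regressing that predictor on the other predictors. A quick sketch reproducing the value for trust_politicians:

```r
# VIF by hand: regress one predictor on the remaining predictors
aux = lm(trust_politicians ~ education + trust_parties, ess)
1 / (1 - summary(aux)$r.squared)  # matches vif(model_trust)["trust_politicians"]
```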
Let’s exclude trust in political parties (trust_parties). Model specification is crucial!
model_trust_excl = lm(trust ~ trust_politicians + education, ess)
modelsummary(list("Model Base" = model_trust,
"Model Without Multicollinearity" = model_trust_excl),
stars = T,
gof_omit = "AIC|BIC|Log.Lik.|F|RMSE")
|                   | Model Base | Model Without Multicollinearity |
|-------------------|------------|---------------------------------|
| (Intercept)       | 1.018      | 0.983                           |
|                   | (0.885)    | (0.872)                         |
| trust_politicians | 0.047      | 0.496***                        |
|                   | (0.665)    | (0.114)                         |
| education         | 1.972*     | 2.010*                          |
|                   | (0.732)    | (0.721)                         |
| trust_parties     | 0.458      |                                 |
|                   | (0.668)    |                                 |
| Num.Obs.          | 24         | 24                              |
| R2                | 0.749      | 0.743                           |
| R2 Adj.           | 0.711      | 0.719                           |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Ok, we excluded one variable. What’s next? Go over the diagnostics again with the corrected model!
Exercises
Let’s explore other variables’ effects. First, rename the variable rlgdgr (How religious are you) to religion.
Using ggplot(), draw the distribution of the religion variable.
Set up a linear model: predict trust by trust_politicians, education and religion. Briefly describe the results: which variables are statistically significant and what is the direction of association (negative/positive)?
Draw Cook’s distance. Use the same threshold formula as we used in the lab. Are there any influential points?
Set up a model without outliers. How different are the results?
Additional Exercises
For the extra credit you don’t have to do the exercises below. These are optional!
Draw a residuals vs fitted plot. Use either plot() or ggplot(). Is the homoscedasticity assumption satisfied?
Using bptest() from the lmtest library, test it formally. Interpret the results.
Draw a QQ plot using plot(). Is the normality assumption satisfied?
Test it formally using the Shapiro-Wilk test (shapiro.test()). Interpret the results.
Using vif() from the car library, check if there is multicollinearity.
Briefly summarize the diagnostics. What are the problems with the model? How substantive are they?
Check List
I know how to interpret the residuals vs fitted plot, and I know how to test for homoscedasticity formally using bptest()!
I know what Cook’s distance is, and I know that there are various formulas to calculate the threshold
I am not afraid of using qqPlot to get a sense of whether residuals are distributed normally
I know what the Shapiro-Wilk test is, and I will use it to test whether my residuals are distributed normally
I know what the variance inflation factor is, and I will use vif() to search for variables with values greater than 10