Model Diagnostics

Week 8

Published

February 27, 2025

Before we start

  • Any questions?

  • We’re almost done with the class! What do you think we should recap? Feel free to leave any thoughts in this form after the class. When thinking about what we have covered, don’t forget that each lab has a “Check List” at the end.

  • To get the extra points for the exercises, you only need to complete 5 of them.

Download script

Download data

Review of the previous week


Load tidyverse library

library(tidyverse)

Load the Comparative Political Data Set (cpds.xlsx).
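Since the file is an Excel workbook, read.csv() won’t work here. A minimal sketch, assuming the file sits in a data/ folder (the exact path is an assumption):

# read_excel() from the readxl package handles .xlsx files
library(readxl)
cpds = read_excel("data/cpds.xlsx")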

Let’s attempt to explain Total expenditure on health as a percentage of GDP (health_pmp). Regress it against Total labour force as a percentage of population (labfopar) and Total reported union members (grossu). Set up a model using lm().

Using plot(), present the residuals vs fitted graph. Is it homoscedastic?
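A sketch covering both steps above, with a hypothetical object name (model_cpds):

# Baseline model: health spending regressed on labour force share and union membership
model_cpds = lm(health_pmp ~ labfopar + grossu, cpds)
summary(model_cpds)

plot(model_cpds, 1)  # residuals vs fitted; look for a funnel or fan shape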

Add country and year fixed effects. Don’t forget about the variables’ class!
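One way to add them, assuming the CPDS columns are named country and year. Wrapping year in factor() is the class issue the hint refers to: otherwise year enters as a single numeric trend.

# factor() turns year into a set of dummies instead of a linear term
model_fe = lm(health_pmp ~ labfopar + grossu + factor(country) + factor(year), cpds)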

Now, plot the residuals vs fitted. How did it change? Is it homoscedastic?

Using bptest() from the lmtest library, test your assumption. Interpret the result.
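Continuing the sketch (model_fe is the hypothetical fixed-effects model from above):

plot(model_fe, 1)   # compare with the previous residuals vs fitted plot

# H0 of the Breusch-Pagan test is homoscedasticity;
# a small p-value suggests heteroscedasticity
library(lmtest)
bptest(model_fe)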

Using lm_robust() from the estimatr library, run the same model with HC2 standard errors. Did the result change?

model_se = ...
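One possible way to fill in the stub, assuming the fixed-effects specification sketched above:

# HC2 heteroscedasticity-consistent standard errors
library(estimatr)
model_se = lm_robust(health_pmp ~ labfopar + grossu + factor(country) + factor(year),
                     data = cpds, se_type = "HC2")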

Run the same model, but cluster the standard errors by country.

model_cl = ...
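Under the same assumptions, clustering by country:

# clusters = country computes cluster-robust standard errors at the country level
model_cl = lm_robust(health_pmp ~ labfopar + grossu + factor(country) + factor(year),
                     data = cpds, clusters = country)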

Using modelsummary(), present three models: with fixed effects, with fixed effects and robust standard errors, and with fixed effects and robust standard errors clustered by country. Briefly compare them.
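A sketch of the comparison table, using the hypothetical object names from above:

library(modelsummary)
modelsummary(list("FE"             = model_fe,
                  "FE + HC2"       = model_se,
                  "FE + Clustered" = model_cl),
             stars = TRUE)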

Agenda

  • Refreshing homoscedasticity assumption

  • Discuss Influential Points and outliers

  • Check for Normality of Errors

  • Test models for Multicollinearity

Homoscedasticity

Today we are working with the European Social Survey. You can access their documentation here. I have pre-processed the data for you; you can find how I did it here. We are interested in explaining why people trust or distrust each other.

We are going to use the following variables:

  • pplfair: Most people try to take advantage of you, or try to be fair (0: Most people try to take advantage of me; 10: Most people try to be fair)

  • trstplt: Trust in politicians (0: no trust; 10: complete trust)

  • trstprt: Trust in political parties (0: no trust; 10: complete trust)

  • edulvlb: Education level

Load the dataset.

ess = read.csv("data/ess.csv")

Let’s wrangle the data first. To make the variables easier to work with, rename them.

ess = ess %>%
  rename(trust = pplfair,
         trust_politicians = trstplt,
         trust_parties = trstprt,
         education = edulvlb)

head(ess)
  cntry    trust trust_politicians trust_parties   rlgdgr    happy education
1    AT 6.385696          3.773545      3.723961 4.612917 7.781570  1.243568
2    BE 6.076585          4.172808      3.942565 4.145283 7.778545  1.642675
3    CH 6.431571          5.540824      5.292646 4.482909 8.154348  1.498912
4    CY 4.428363          2.681481      2.497041 6.756598 6.969118  1.469173
5    DE 6.177006          4.004165      3.974069 3.922599 7.762929  1.435993
6    ES 5.578890          2.784035      2.776864 4.104518 7.847991  1.438520

Set up the model. Analyze the regression output.

model_trust = lm(trust ~ trust_politicians + education + trust_parties, ess)
summary(model_trust)

Call:
lm(formula = trust ~ trust_politicians + education + trust_parties, 
    data = ess)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.75756 -0.26177 -0.07697  0.29635  1.03112 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)        1.01778    0.88503   1.150    0.264  
trust_politicians  0.04732    0.66492   0.071    0.944  
education          1.97227    0.73235   2.693    0.014 *
trust_parties      0.45800    0.66839   0.685    0.501  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4472 on 20 degrees of freedom
Multiple R-squared:  0.749, Adjusted R-squared:  0.7114 
F-statistic:  19.9 on 3 and 20 DF,  p-value: 3.22e-06

Let’s plot the residuals vs fitted graph. Is it homoscedastic?

plot(model_trust, 1)

Alternatively, feel free to use ggplot() for this purpose.

ggplot(model_trust, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_smooth(se = FALSE)

Let’s test our assumption. All good! Refer to Lab 7 if you are interested in how to approach fixing the heteroscedasticity problem.

library(lmtest)
bptest(model_trust)

    studentized Breusch-Pagan test

data:  model_trust
BP = 3.8039, df = 3, p-value = 0.2834

Influential Points

Influential points can drive our results significantly! Let’s check whether we have any in our small dataset. Here we are going to use Cook’s Distance. Generally, there are various threshold suggestions, but in this class we stick to the following formula:

\[ \text{cutoff} = \frac{4}{N - k - 1} \] where \(N\) is the number of observations and \(k\) is the number of covariates. You don’t have to memorize this: the denominator is simply the residual degrees of freedom.

Calculate the cutoff

cutoff = 4 / (model_trust$df.residual)
cutoff
[1] 0.2

Calculate Cook’s distance for each observation

cooks.distance(model_trust)
           1            2            3            4            5            6 
0.2719435794 0.0145865721 0.0367187308 0.2316452893 0.0064423418 0.0166924949 
           7            8            9           10           11           12 
0.0148949821 0.2684953279 0.0513101291 0.0090977971 0.1002006909 0.0527893685 
          13           14           15           16           17           18 
0.0053112368 0.0341848255 0.0593545749 0.0015258481 0.0096398014 0.0393548089 
          19           20           21           22           23           24 
0.0397060110 0.0001259869 0.0126482655 0.0012011773 0.0041391820 0.0795677570 

Visualize what we’ve got. With the given cutoff, there are three influential observations: 1, 4, and 8.

plot(model_trust, 4)
abline(h = cutoff, lty = 2)

Alternatively, we can use ggplot().

ggplot(model_trust, aes(seq_along(.cooksd), .cooksd)) +
  geom_col() +
  geom_hline(yintercept = cutoff)

Set up models with and without the influential points. Our goal is to understand their impact: do they drive the results? What has changed? General advice: stick to the more conservative model.

library(modelsummary)

model_trust_wo = lm(trust ~ trust_politicians + education + trust_parties, ess[-c(1, 4, 8),])

models = list(
  "Full Model" = model_trust,
  "Without Outliers" = model_trust_wo)

modelsummary(models, 
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE", 
             stars = T)
                    Full Model   Without Outliers
(Intercept)         1.018        0.125
                    (0.885)      (0.677)
trust_politicians   0.047        -0.234
                    (0.665)      (0.610)
education           1.972*       2.885***
                    (0.732)      (0.580)
trust_parties       0.458        0.617
                    (0.668)      (0.628)
Num.Obs.            24           21
R2                  0.749        0.871
R2 Adj.             0.711        0.849
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Normality of Errors

Another important prerequisite is the normal distribution of errors (i.e., normality). Once again, we can use plot(). First, visualize the model without outliers.

plot(model_trust_wo, 2)

Now, with outliers. Compare the two graphs.

plot(model_trust, 2)

How different are they? We need formal tests. How would you interpret the Shapiro-Wilk test below?

shapiro.test(model_trust$residuals)

    Shapiro-Wilk normality test

data:  model_trust$residuals
W = 0.96829, p-value = 0.6249
shapiro.test(model_trust_wo$residuals)

    Shapiro-Wilk normality test

data:  model_trust_wo$residuals
W = 0.94018, p-value = 0.2197

Of course, you can visualize this with ggplot(), too. More details on visualizing base R diagnostic plots in ggplot are available here.

ggplot(model_trust) +
  stat_qq(aes(sample = .stdresid)) +
  geom_abline()

OK, what should we do if the residuals are not normally distributed? Transform the DV and/or the IVs, much like we did in Lab 5. A slightly more formal way of doing this is available here.
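For instance, a hypothetical log transformation of the DV (a sketch only; trust is strictly positive here, so log() is safe):

# Illustration: log-transform the DV and re-check the Q-Q plot
model_trust_log = lm(log(trust) ~ trust_politicians + education + trust_parties, ess)
plot(model_trust_log, 2)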

Let’s present the summary diagnostics.

par(mfrow = c(2, 2))
plot(model_trust)

Check the other available diagnostic plots below.

?plot.lm

Multicollinearity

Finally, multicollinearity. We do not want perfectly collinear variables. Let’s check this using ggpairs() from the GGally library.

library(GGally)
ggpairs(ess, columns = c("trust", "trust_politicians", "trust_parties", "education"))

Apparently, trust in politicians and trust in political parties capture overlapping concepts: their \(\rho = 0.99\). More formally, we can calculate the variance inflation factor using the vif() function from the car library. The rule of thumb: if a VIF is greater than 10, we have a multicollinearity problem.

library(car)
vif(model_trust)
trust_politicians         education     trust_parties 
        49.546656          1.503323         49.792787 

Let’s exclude trust in political parties (trust_parties). Model specification is crucial!

model_trust_excl = lm(trust ~ trust_politicians + education, ess)


modelsummary(list("Model Base" = model_trust,
                  "Model Without Multicollinearity" = model_trust_excl),
             stars = T,
             gof_omit = "AIC|BIC|Log.Lik.|F|RMSE")
                    Model Base   Model Without Multicollinearity
(Intercept)         1.018        0.983
                    (0.885)      (0.872)
trust_politicians   0.047        0.496***
                    (0.665)      (0.114)
education           1.972*       2.010*
                    (0.732)      (0.721)
trust_parties       0.458
                    (0.668)
Num.Obs.            24           24
R2                  0.749        0.743
R2 Adj.             0.711        0.719
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

OK, we excluded one variable. What’s next? Go over the diagnostics again with the corrected model!
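For example, a quick re-run of today’s checks on the reduced model (a sketch; model_trust_excl comes from the chunk above):

par(mfrow = c(2, 2))
plot(model_trust_excl)                    # all four base R diagnostic plots
bptest(model_trust_excl)                  # homoscedasticity
shapiro.test(model_trust_excl$residuals)  # normality of residuals
vif(model_trust_excl)                     # multicollinearity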

Exercises

Let’s explore other variables’ effects. First, rename the variable rlgdgr (How religious are you) to religion.

Solution

YOUR SOLUTION HERE

Using ggplot(), draw the distribution of the religion variable.

Solution

YOUR SOLUTION HERE

Set up a linear model: predict trust by trust_politicians, education and religion. Briefly describe the results: which variables are statistically significant and what is the direction of association (negative/positive).

Solution

YOUR SOLUTION HERE

Draw Cook’s distance. Use the same threshold formula as we used in the lab. Are there any influential points?

Solution

YOUR SOLUTION HERE

Set up a model without outliers. How different are the results?

Solution

YOUR SOLUTION HERE

Additional Exercises

You don’t have to do the exercises below for the extra credit. These are optional!

Draw a residuals vs fitted plot. Use either plot() or ggplot(). Is the homoscedasticity assumption satisfied?

Solution

YOUR SOLUTION HERE

Using bptest() from the lmtest library, test it formally. Interpret the results.

Solution

YOUR SOLUTION HERE

Draw a Q-Q plot using plot(). Is the normality assumption satisfied?

Solution

YOUR SOLUTION HERE

Test it formally using Shapiro-Wilk test (shapiro.test()). Interpret the results.

Solution

YOUR SOLUTION HERE

Using vif() from the car library, check whether there is multicollinearity.

Solution

YOUR SOLUTION HERE

Briefly summarize the diagnostics. What are the problems with the model? How substantive are they?

Solution

YOUR SOLUTION HERE

Check List

I know how to interpret the residuals vs fitted plot, and I know how to test for homoscedasticity formally using bptest()!

I know what Cook’s distance is, and I know that there are various formulas for calculating the threshold!

I am not afraid of using a Q-Q plot to get a sense of whether the residuals are normally distributed!

I know what the Shapiro-Wilk test is, and I will use it to check whether my residuals are normally distributed!

I know what the variance inflation factor is, and I will use vif() to search for variables with values greater than 10!