GLMs and Quarter Review

Week 9

Published

March 6, 2025

Before we start

Optional Review of the previous week

Review

Load tidyverse library

library(tidyverse)

Load the European Social Value data (ess.csv).

Rename the variables:

pplfair to trust
trstplt to trust_politicians
rlgdgr to religion

Run the following model. DV: happy, IVs: trust, trust_politicians and religion. Present the summary().

Using bptest() from the lmtest library, test if the model is homoscedastic. Interpret the results of Breusch-Pagan test.

Draw residuals vs fitted plot. Does it match the BP test results? You can use plot(..., which = 1).

Calculate the cutoff for the Cook’s distance using the formula: \(\frac{4}{N-k-1}\). Plot the Cook’s distance using plot() and add a cutoff line using abline(). Are there any influential points?

Set up a model without influential points. How different the estimates are? Present a simple modelsummary() comparison.

model_no = lm(happy ~ trust + trust_politicians + religion, ess[-c(...),])

library(...)
modelsummary(list(...),
             stars = T)

Draw a Q-Q plot using plot(..., which = 2) for the full model. Are the residuals distributed normally?

Using shapiro.test() test the normality of the residuals in the model. Interpret the results.

What else we have to check? How would you do that? No code needed but feel free to rely on the previous lab for this question.

Agenda

Recap of what we did
And a short introduction to GLM alongside

Recap

Today we are working with Transitional Justice Evaluation Tools dataset. You can explore their codebook here. Let’s load the data.

tjet = read.csv("data/tjet.csv")

Print out the first observations

...(tjet)

We are going to use the following variables:

country and year
dtr democratic transition
tcs_operated truth commission operated
trials_domestic count of domestic trials per country and start year
region United Nations region

Subset the above variables. You can do this with select()

tjet_sub = ... %>%
  ...(country, year, dtr, tcs_operated, trials_domestic, region)

Leave more recent observations. Say, starting year 1975.

tjet_sub = tjet_sub %>%
  ...(year >= 1975)

First, let’s explore the distribution of trials_domestic variable. Draw a histogram.

...

Now, draw a boxplot of trials_domestic by region.

ggplot(tjet_sub) +
  ...(aes(x = region, y = ...))

Now, present the descriptive statistics for the same variable (trials_domestic)

...

Now, let’s do a slightly more detailed subgroup analysis. Calculate the average and median trials_domestic for all the regions. Save that to a new object. Why is the average so different from the median?

trials_region = tjet_sub %>%
  ...(region) %>%
  ...(average_trials = ...(trials_domestic))

trials_region

Now, let’s the number of transitions (dtr). Why are there so many transitions?

transitions_region = ...

transitions_region

Let’s visualize various statistics! Combine trials_region and transitions_region using left_join(). Draw a scatteplot. Let transitions be on the X axis, and average_trials on the Y axis. Add geom_text() with label argument set to region, use vjust = 1.5.

trials_region %>%
  ...
  ggplot(...(x = ..., y = average_trials)) +
    geom_point(size = 4) +
    geom_text(aes(label = region), vjust = 1.5)

Using ggpairs() visualize the following columns: "trials_domestic", "region", "tcs_operated", "dtr". What do you notice?

library(...)
...

Set up a multiple linear regerssion. Predict democratic transition dtr by trials_domestic and tcs_operated. What type of model did we create based on the independent variables (IVs)? Present the results.

model = ...
summary(model)

Load ggeffects library

library(ggeffects)

Visualize the model with terms trials_domestic and tcs_operated.

ggpredict(model, terms = c("trials_domestic", "tcs_operated")) %>%
  plot()

Check the normality of residuals.

...(model, which = 2)

Set the linear probability model, but include country and year fixed effects.

model_lpm = lm(dtr ~ trials_domestic + tcs_operated + ..., tjet_sub)

Check the normality, again. Use Q-Q plot.

plot(model_lpm, 2)

Now, let’s check if the residuals are homoscedastic. Use which = 1 for plot(). What’s that? Can we fix that? Or even, should we?

plot(model_lpm, 1)

Logistic Regression

As you can see, one of the assumptions is violated. You might try to solve it using SEs - but again, these are not panacea. A slightly more common approach to deal with binary dependent variable is logistic regression. Let’s briefly try it out!

model_glm = glm(dtr ~ trials_domestic + tcs_operated, tjet_sub, family = binomial(link = "logit")) 
summary(model_glm)

Now, visualize how it looks like. Usually, we would expect to see slightly more “S” shaped curve. But still - good!

ggpredict(model_glm, term = c("trials_domestic", "tcs_operated")) %>%
  plot(show_data = T)

Compare to linear probability model without country and year fixed effects.

ggpredict(model, term = c("trials_domestic", "tcs_operated")) %>%
  plot(show_data = T)

Present both: linear probability model without fixed effects and logistic regression. The problem, though, that Logit is way harder to interpret! Stick to marginal effects, which we have covered in lab 5!

library(modelsummary)

modelsummary(list("LPM" = model,
                  "Logit" = model_glm),
             stars = T)

A final note, in most cases you would expect to see logistic regression as something like

glm(am ~ hp + wt, data = mtcars, family = binomial) %>%
  ggpredict(terms = c("hp")) %>%
  plot(show_data = T)

What’s next?

Congrats with finalizing the class! No exercises this time.

Helpful resource for R
Check the main page for additional resources
Learning to code is like learning a new language: if you’re interested, keep practicing!
I highly recommend taking POLI_SCI 406-0 Quantitative Causal Inference

What we have covered in coding

Library	Functions	Description
tidyverse	`filter()`, `mutate()`, `ggplot()`	data wrangling and visualization
modelsummary	`modelsummary()`	present good looking tables
broom	`tidy()`	extract additional information from the models
ggeffects	`ggpredict()`	calculate and visualize marginal effects
GGally	`ggpairs()`	extension to ggplot
lmtest	`bptest()`	additional statistical tests for diagnostics
estimatr	`lm_robust()`	regression with robust and clustered standard errors
car	`vif()`	additional statistical tests for diagnostics

Datasets we have used

Dataset	Description	Link
V-Dem	Measures democracy worldwide	V-Dem
World Happiness Report	Annual happiness report	World Happiness Report
Who Governs	Dataset on political elites	Who Governs
SIPRI	Data on military operations	SIPRI
Comparative Political Data Set	A dataset covering political institutions	CPDS
European Social Survey	Survey measuring attitudes across Europe	ESS
Transitional Justice Evaluation Tools Dataset	Dataset assessing transitional justice	TJET Dataset

Check list

I know that every lab has check list below! And I will use it to navigate what we have learned

R doesn’t scare me anymore

I have developed the intuition behind application of quantitative methods