library(tidyverse)GLMs and Quarter Review
Week 9
Before we start
Optional Review of the previous week
Load tidyverse library
Load the European Social Value data (ess.csv).
Rename the variables:
pplfairtotrusttrstplttotrust_politiciansrlgdgrtoreligion
Run the following model. DV: happy, IVs: trust, trust_politicians and religion. Present the summary().
Using bptest() from the lmtest library, test if the model is homoscedastic. Interpret the results of Breusch-Pagan test.
Draw residuals vs fitted plot. Does it match the BP test results? You can use plot(..., which = 1).
Calculate the cutoff for the Cook’s distance using the formula: \(\frac{4}{N-k-1}\). Plot the Cook’s distance using plot() and add a cutoff line using abline(). Are there any influential points?
Set up a model without influential points. How different the estimates are? Present a simple modelsummary() comparison.
model_no = lm(happy ~ trust + trust_politicians + religion, ess[-c(...),])
library(...)
modelsummary(list(...),
stars = T)Draw a Q-Q plot using plot(..., which = 2) for the full model. Are the residuals distributed normally?
Using shapiro.test() test the normality of the residuals in the model. Interpret the results.
What else we have to check? How would you do that? No code needed but feel free to rely on the previous lab for this question.
Agenda
Recap of what we did
And a short introduction to GLM alongside
Recap
Today we are working with Transitional Justice Evaluation Tools dataset. You can explore their codebook here. Let’s load the data.
tjet = read.csv("data/tjet.csv")Print out the first observations
...(tjet)We are going to use the following variables:
countryandyeardtrdemocratic transitiontcs_operatedtruth commission operatedtrials_domesticcount of domestic trials per country and start yearregionUnited Nations region
Subset the above variables. You can do this with select()
tjet_sub = ... %>%
...(country, year, dtr, tcs_operated, trials_domestic, region) Leave more recent observations. Say, starting year 1975.
tjet_sub = tjet_sub %>%
...(year >= 1975)First, let’s explore the distribution of trials_domestic variable. Draw a histogram.
...Now, draw a boxplot of trials_domestic by region.
ggplot(tjet_sub) +
...(aes(x = region, y = ...))Now, present the descriptive statistics for the same variable (trials_domestic)
...Now, let’s do a slightly more detailed subgroup analysis. Calculate the average and median trials_domestic for all the regions. Save that to a new object. Why is the average so different from the median?
trials_region = tjet_sub %>%
...(region) %>%
...(average_trials = ...(trials_domestic))
trials_regionNow, let’s the number of transitions (dtr). Why are there so many transitions?
transitions_region = ...
transitions_regionLet’s visualize various statistics! Combine trials_region and transitions_region using left_join(). Draw a scatteplot. Let transitions be on the X axis, and average_trials on the Y axis. Add geom_text() with label argument set to region, use vjust = 1.5.
trials_region %>%
...
ggplot(...(x = ..., y = average_trials)) +
geom_point(size = 4) +
geom_text(aes(label = region), vjust = 1.5) Using ggpairs() visualize the following columns: "trials_domestic", "region", "tcs_operated", "dtr". What do you notice?
library(...)
...Set up a multiple linear regerssion. Predict democratic transition dtr by trials_domestic and tcs_operated. What type of model did we create based on the independent variables (IVs)? Present the results.
model = ...
summary(model)Load ggeffects library
library(ggeffects)Visualize the model with terms trials_domestic and tcs_operated.
ggpredict(model, terms = c("trials_domestic", "tcs_operated")) %>%
plot()Check the normality of residuals.
...(model, which = 2)Set the linear probability model, but include country and year fixed effects.
model_lpm = lm(dtr ~ trials_domestic + tcs_operated + ..., tjet_sub) Check the normality, again. Use Q-Q plot.
plot(model_lpm, 2)Now, let’s check if the residuals are homoscedastic. Use which = 1 for plot(). What’s that? Can we fix that? Or even, should we?
plot(model_lpm, 1)Logistic Regression
As you can see, one of the assumptions is violated. You might try to solve it using SEs - but again, these are not panacea. A slightly more common approach to deal with binary dependent variable is logistic regression. Let’s briefly try it out!
model_glm = glm(dtr ~ trials_domestic + tcs_operated, tjet_sub, family = binomial(link = "logit"))
summary(model_glm)Now, visualize how it looks like. Usually, we would expect to see slightly more “S” shaped curve. But still - good!
ggpredict(model_glm, term = c("trials_domestic", "tcs_operated")) %>%
plot(show_data = T)Compare to linear probability model without country and year fixed effects.
ggpredict(model, term = c("trials_domestic", "tcs_operated")) %>%
plot(show_data = T)Present both: linear probability model without fixed effects and logistic regression. The problem, though, that Logit is way harder to interpret! Stick to marginal effects, which we have covered in lab 5!
library(modelsummary)
modelsummary(list("LPM" = model,
"Logit" = model_glm),
stars = T)A final note, in most cases you would expect to see logistic regression as something like
glm(am ~ hp + wt, data = mtcars, family = binomial) %>%
ggpredict(terms = c("hp")) %>%
plot(show_data = T)What’s next?
Congrats with finalizing the class! No exercises this time.
Check the main page for additional resources
Learning to code is like learning a new language: if you’re interested, keep practicing!
I highly recommend taking POLI_SCI 406-0 Quantitative Causal Inference
What we have covered in coding
| Library | Functions | Description |
|---|---|---|
| tidyverse | filter(), mutate(), ggplot() |
data wrangling and visualization |
| modelsummary | modelsummary() |
present good looking tables |
| broom | tidy() |
extract additional information from the models |
| ggeffects | ggpredict() |
calculate and visualize marginal effects |
| GGally | ggpairs() |
extension to ggplot |
| lmtest | bptest() |
additional statistical tests for diagnostics |
| estimatr | lm_robust() |
regression with robust and clustered standard errors |
| car | vif() |
additional statistical tests for diagnostics |
Datasets we have used
| Dataset | Description | Link |
|---|---|---|
| V-Dem | Measures democracy worldwide | V-Dem |
| World Happiness Report | Annual happiness report | World Happiness Report |
| Who Governs | Dataset on political elites | Who Governs |
| SIPRI | Data on military operations | SIPRI |
| Comparative Political Data Set | A dataset covering political institutions | CPDS |
| European Social Survey | Survey measuring attitudes across Europe | ESS |
| Transitional Justice Evaluation Tools Dataset | Dataset assessing transitional justice | TJET Dataset |
Check list
I know that every lab has check list below! And I will use it to navigate what we have learned
R doesn’t scare me anymore
I have developed the intuition behind application of quantitative methods