library(tidyverse)
GLMs and Quarter Review
Week 9
Before we start
Optional Review of the previous week
Load tidyverse
library
Load the European Social Value data (ess.csv
).
Rename the variables:
pplfair
totrust
trstplt
totrust_politicians
rlgdgr
toreligion
Run the following model. DV: happy
, IVs: trust
, trust_politicians
and religion
. Present the summary()
.
Using bptest()
from the lmtest
library, test if the model is homoscedastic. Interpret the results of Breusch-Pagan test.
Draw residuals vs fitted plot. Does it match the BP test results? You can use plot(..., which = 1)
.
Calculate the cutoff for the Cook’s distance using the formula: \(\frac{4}{N-k-1}\). Plot the Cook’s distance using plot()
and add a cutoff line using abline()
. Are there any influential points?
Set up a model without influential points. How different the estimates are? Present a simple modelsummary()
comparison.
= lm(happy ~ trust + trust_politicians + religion, ess[-c(...),])
model_no
library(...)
modelsummary(list(...),
stars = T)
Draw a Q-Q plot using plot(..., which = 2)
for the full model. Are the residuals distributed normally?
Using shapiro.test()
test the normality of the residuals
in the model. Interpret the results.
What else we have to check? How would you do that? No code needed but feel free to rely on the previous lab for this question.
Agenda
Recap of what we did
And a short introduction to GLM alongside
Recap
Today we are working with Transitional Justice Evaluation Tools dataset. You can explore their codebook here. Let’s load the data.
= read.csv("data/tjet.csv") tjet
Print out the first observations
...(tjet)
We are going to use the following variables:
country
andyear
dtr
democratic transitiontcs_operated
truth commission operatedtrials_domestic
count of domestic trials per country and start yearregion
United Nations region
Subset the above variables. You can do this with select()
= ... %>%
tjet_sub ...(country, year, dtr, tcs_operated, trials_domestic, region)
Leave more recent observations. Say, starting year 1975.
= tjet_sub %>%
tjet_sub ...(year >= 1975)
First, let’s explore the distribution of trials_domestic
variable. Draw a histogram.
...
Now, draw a boxplot of trials_domestic
by region
.
ggplot(tjet_sub) +
...(aes(x = region, y = ...))
Now, present the descriptive statistics for the same variable (trials_domestic
)
...
Now, let’s do a slightly more detailed subgroup analysis. Calculate the average and median trials_domestic
for all the regions. Save that to a new object. Why is the average so different from the median?
= tjet_sub %>%
trials_region ...(region) %>%
...(average_trials = ...(trials_domestic))
trials_region
Now, let’s the number of transitions (dtr
). Why are there so many transitions?
= ...
transitions_region
transitions_region
Let’s visualize various statistics! Combine trials_region
and transitions_region
using left_join()
. Draw a scatteplot. Let transitions
be on the X axis, and average_trials
on the Y axis. Add geom_text()
with label
argument set to region
, use vjust = 1.5
.
%>%
trials_region
...ggplot(...(x = ..., y = average_trials)) +
geom_point(size = 4) +
geom_text(aes(label = region), vjust = 1.5)
Using ggpairs()
visualize the following columns: "trials_domestic", "region", "tcs_operated", "dtr"
. What do you notice?
library(...)
...
Set up a multiple linear regerssion. Predict democratic transition dtr
by trials_domestic
and tcs_operated
. What type of model did we create based on the independent variables (IVs)? Present the results.
= ...
model summary(model)
Load ggeffects
library
library(ggeffects)
Visualize the model with terms trials_domestic
and tcs_operated
.
ggpredict(model, terms = c("trials_domestic", "tcs_operated")) %>%
plot()
Check the normality of residuals.
...(model, which = 2)
Set the linear probability model, but include country and year fixed effects.
= lm(dtr ~ trials_domestic + tcs_operated + ..., tjet_sub) model_lpm
Check the normality, again. Use Q-Q plot.
plot(model_lpm, 2)
Now, let’s check if the residuals are homoscedastic. Use which = 1
for plot()
. What’s that? Can we fix that? Or even, should we?
plot(model_lpm, 1)
Logistic Regression
As you can see, one of the assumptions is violated. You might try to solve it using SEs - but again, these are not panacea. A slightly more common approach to deal with binary dependent variable is logistic regression. Let’s briefly try it out!
= glm(dtr ~ trials_domestic + tcs_operated, tjet_sub, family = binomial(link = "logit"))
model_glm summary(model_glm)
Now, visualize how it looks like. Usually, we would expect to see slightly more “S” shaped curve. But still - good!
ggpredict(model_glm, term = c("trials_domestic", "tcs_operated")) %>%
plot(show_data = T)
Compare to linear probability model without country and year fixed effects.
ggpredict(model, term = c("trials_domestic", "tcs_operated")) %>%
plot(show_data = T)
Present both: linear probability model without fixed effects and logistic regression. The problem, though, that Logit is way harder to interpret! Stick to marginal effects, which we have covered in lab 5!
library(modelsummary)
modelsummary(list("LPM" = model,
"Logit" = model_glm),
stars = T)
A final note, in most cases you would expect to see logistic regression as something like
glm(am ~ hp + wt, data = mtcars, family = binomial) %>%
ggpredict(terms = c("hp")) %>%
plot(show_data = T)
What’s next?
Congrats with finalizing the class! No exercises this time.
Check the main page for additional resources
Learning to code is like learning a new language: if you’re interested, keep practicing!
I highly recommend taking POLI_SCI 406-0 Quantitative Causal Inference
What we have covered in coding
Library | Functions | Description |
---|---|---|
tidyverse | filter() , mutate() , ggplot() |
data wrangling and visualization |
modelsummary | modelsummary() |
present good looking tables |
broom | tidy() |
extract additional information from the models |
ggeffects | ggpredict() |
calculate and visualize marginal effects |
GGally | ggpairs() |
extension to ggplot |
lmtest | bptest() |
additional statistical tests for diagnostics |
estimatr | lm_robust() |
regression with robust and clustered standard errors |
car | vif() |
additional statistical tests for diagnostics |
Datasets we have used
Dataset | Description | Link |
---|---|---|
V-Dem | Measures democracy worldwide | V-Dem |
World Happiness Report | Annual happiness report | World Happiness Report |
Who Governs | Dataset on political elites | Who Governs |
SIPRI | Data on military operations | SIPRI |
Comparative Political Data Set | A dataset covering political institutions | CPDS |
European Social Survey | Survey measuring attitudes across Europe | ESS |
Transitional Justice Evaluation Tools Dataset | Dataset assessing transitional justice | TJET Dataset |
Check list
I know that every lab has check list below! And I will use it to navigate what we have learned
R doesn’t scare me anymore
I have developed the intuition behind application of quantitative methods