Logistic Regression
Week 8

library(tidyverse)
library(marginaleffects)
library(GGally)
sipri = read.csv("data/sipri.csv")
Before we start
- Any questions?
Load the tidyverse, marginaleffects, and GGally libraries and the sipri dataset.
Using ggpairs() from the GGally library, visualize the relationship between the Import and Regime variables.
...
Set up a linear regression model. Predict Import by Regime and save the model to the model1 object. Visualize the model with plot_predictions() from marginaleffects.
model1 = ...
plot_predictions(model1,
by = ...)
Present summary() of the model. What is the reference category in this case?
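One possible way to fill in the blanks above and produce the requested summary (a sketch, not the only solution):

model1 = lm(Import ~ Regime, data = sipri)
# Visualize predicted Import for each regime type
plot_predictions(model1, by = "Regime")
# The reference category is the Regime level missing from the coefficient list
summary(model1)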
...
Using relevel(), set the reference category to "Minimally Democratic".
sipri$Regime = as.factor(sipri$Regime)
sipri$Regime = relevel(sipri$Regime, ref = "Minimally Democratic")
Set up the model again and present the summary. Did anything change?
model2 = lm(Import ~ Regime, sipri)
...
Load the modelsummary library.
library(modelsummary)
Present both models side-by-side. Present only the p.value statistic, and set the stars significance level to 0.05. Remove the following model goodness-of-fit statistics: AIC|BIC|Log.Lik|F|RMSE.
modelsummary(list("First Model" = model1,
"Second Model" = model2),
... ,
gof_omit = ... ,
coef_rename = c("(Intercept)",
"Regime: Democratic",
"Regime: Electoral Authoritarian",
"Regime: Minimally Democratic",
"Regime: Autocratic"),
statistic = ...)
Finally, clear the environment.
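For reference, one way to fill in the blanks above and then clear the environment (a sketch; the single-star cutoff is one possible choice):

modelsummary(list("First Model" = model1,
                  "Second Model" = model2),
             stars = c("*" = 0.05),
             gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
             statistic = "p.value",
             coef_rename = c("(Intercept)",
                             "Regime: Democratic",
                             "Regime: Electoral Authoritarian",
                             "Regime: Minimally Democratic",
                             "Regime: Autocratic"))
# Remove all objects from the environment
rm(list = ls())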
Agenda
Overview of Logistic regression
Practicing marginal effects
Data Exploration
Today we are working with the Transitional Justice Evaluation Tools dataset. You can explore their codebook here. Let's load the data.
tjet = read.csv("data/tjet.csv")
We are going to use the following variables:
- country and year
- dtr: democratic transition
- tcs_operated: truth commission operated
- trials_domestic: count of domestic trials per country and start year
- region: United Nations region
Subset the above variables with select()
tjet_sub = tjet %>%
select(country, year, dtr, tcs_operated, trials_domestic, region)
Keep only the more recent observations, say from 1975 onwards.
tjet_sub = tjet_sub %>%
filter(year >= 1975)
Using ggpairs(), visualize the following columns: "trials_domestic", "region", "tcs_operated", "dtr". What do you notice?
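A possible call, using the tjet_sub object created above (a sketch):

ggpairs(tjet_sub, columns = c("trials_domestic", "region", "tcs_operated", "dtr"))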
Model Building
Set up a multiple linear regression. Predict democratic transition (dtr) by trials_domestic and tcs_operated. What type of model did we create, given the dependent variable? Present the results.
model_lpm = lm(dtr ~ trials_domestic + tcs_operated, tjet_sub)
summary(model_lpm)
Call:
lm(formula = dtr ~ trials_domestic + tcs_operated, data = tjet_sub)
Residuals:
Min 1Q Median 3Q Max
-1.3824 -0.2841 -0.2841 0.5062 0.7159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.284102 0.005401 52.60 <2e-16 ***
trials_domestic 0.029956 0.002141 13.99 <2e-16 ***
tcs_operated 0.420752 0.014190 29.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4474 on 8345 degrees of freedom
(585 observations deleted due to missingness)
Multiple R-squared: 0.1395, Adjusted R-squared: 0.1393
F-statistic: 676.6 on 2 and 8345 DF, p-value: < 2.2e-16
Let’s draw it. Try geom_jitter() instead of geom_point().
plot_predictions(model_lpm,
condition = c("trials_domestic", "tcs_operated")) +
geom_point(data = tjet_sub,
aes(x = trials_domestic, y = dtr),
alpha = 0.4)
Ignoring unknown labels:
• linetype : "tcs_operated"
Warning: Removed 585 rows containing missing values or values outside the scale range
(`geom_point()`).
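Following the geom_jitter() suggestion, a jittered variant of the same plot might look like this (a sketch; the jitter width and height are arbitrary choices):

plot_predictions(model_lpm,
                 condition = c("trials_domestic", "tcs_operated")) +
  geom_jitter(data = tjet_sub,
              aes(x = trials_domestic, y = dtr),
              width = 0.3, height = 0.05,
              alpha = 0.4)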
Finally, let's try out logistic regression, which is made specifically for binary outcomes.
model_glm = glm(dtr ~ trials_domestic + tcs_operated, data = tjet_sub, family = binomial(link = "logit"))
summary(model_glm)
Call:
glm(formula = dtr ~ trials_domestic + tcs_operated, family = binomial(link = "logit"),
data = tjet_sub)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.04739 0.02864 -36.58 <2e-16 ***
trials_domestic 0.37528 0.02186 17.16 <2e-16 ***
tcs_operated 1.83853 0.07402 24.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10981.8 on 8347 degrees of freedom
Residual deviance: 9614.7 on 8345 degrees of freedom
(585 observations deleted due to missingness)
AIC: 9620.7
Number of Fisher Scoring iterations: 5
Model Presentation
Now, visualize how it looks. Usually, we would expect a slightly more "S"-shaped curve, but this still looks good!
plot_predictions(model_glm,
condition = c("trials_domestic", "tcs_operated")) +
geom_point(data = tjet_sub,
aes(x = trials_domestic, y = dtr),
alpha = 0.4) +
labs(x = "Number of Domestic Trials",
y = "Democratic Transition",
color = "Trurth Commission",
fill = "Trurth Commission")Ignoring unknown labels:
• linetype : "tcs_operated"
Warning: Removed 585 rows containing missing values or values outside the scale range
(`geom_point()`).
Present both the linear probability model and the logistic regression. The problem, though, is that the logit is much harder to interpret! You can stick to marginal effects and add another column that reports the Average Marginal Effect (i.e., the average of the individual marginal effects).
library(modelsummary)
# Calculate the Average Marginal Effect
ame = avg_slopes(model_glm)
# Present the Table
modelsummary(list("LPM" = model_lpm,
"Logit" = model_glm,
"Logit (AME)" = ame),
stars = T,
gof_omit = "BIC|Log.Lik.|F|RMSE",
coef_rename = c("(Intercept)",
"Domestic Trials",
"Truth Commission"))Bootstrapped Standard Errors for Logit*
Bootstrapped standard errors for logistic regression work in the same way as for linear regression. However, in general, bootstrapping makes no assumptions about the distribution of the estimates in logistic regression.
The code here is a bit harder, but you have seen it with OLS before. What has changed? Let's briefly discuss what's going on, and feel free to ask questions about this. Pay attention to:
Comparison of the Standard Errors
#| cache: true
library(rsample)
library(broom)
set.seed(123)
# Draw 100 bootstrap resamples of the data
fit_boot = bootstraps(data = tjet_sub, times = 100)$splits %>%
  # Refit the logistic regression on each resample
  map(~glm(dtr ~ trials_domestic + tcs_operated, data = analysis(.), family = binomial(link = "logit"))) %>%
  # Tidy the coefficients from each fit and stack them
  map(tidy) %>%
  bind_rows() %>%
  # The bootstrap SE is the SD of each coefficient across resamples
  group_by(term) %>%
  summarize(std.error = sd(estimate))
fit_boot
# A tibble: 3 × 2
term std.error
<chr> <dbl>
1 (Intercept) 0.0315
2 tcs_operated 0.0719
3 trials_domestic 0.0288
Check List
Optional Exercises
Imagine you are interested in what influences the creation of a Truth Commission. The hypothesis is that involvement by international institutions increases the likelihood of a Commission being created. Let’s test this hypothesis. Subset the following variables:
- country and year
- tcs_operated: truth commission operated
- ICC_arrest_warrant: number of arrest warrants issued by the ICC
- uninv: ongoing UN investigations, count
- region: United Nations region
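One possible subset (a sketch; tjet_tc is an assumed object name):

tjet_tc = tjet %>%
  select(country, year, tcs_operated, ICC_arrest_warrant, uninv, region)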
Explain tcs_operated by the number of arrest warrants issued by the ICC (ICC_arrest_warrant) and the number of ongoing UN investigations in the country (uninv). Set up the linear probability model.
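A possible linear probability model (a sketch; model_lpm_tc is an assumed name and tjet_tc is the subset from above):

model_lpm_tc = lm(tcs_operated ~ ICC_arrest_warrant + uninv, data = tjet_tc)
summary(model_lpm_tc)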
Now, set up the logistic regression model.
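A possible logistic regression (a sketch; model_glm_tc is an assumed name):

model_glm_tc = glm(tcs_operated ~ ICC_arrest_warrant + uninv,
                   data = tjet_tc, family = binomial(link = "logit"))
summary(model_glm_tc)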
Present both models with a plot_predictions() graph. Which model makes more sense to you?
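Possible plots for both models (a sketch, reusing the assumed object names from above):

plot_predictions(model_lpm_tc, condition = c("ICC_arrest_warrant", "uninv"))
plot_predictions(model_glm_tc, condition = c("ICC_arrest_warrant", "uninv"))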
Present a table with both models using modelsummary(). Don't forget to present the average marginal effects for the logistic regression.
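A possible table, mirroring the earlier modelsummary() call (a sketch; ame_tc is an assumed name):

ame_tc = avg_slopes(model_glm_tc)
modelsummary(list("LPM" = model_lpm_tc,
                  "Logit" = model_glm_tc,
                  "Logit (AME)" = ame_tc),
             stars = TRUE,
             gof_omit = "BIC|Log.Lik.|F|RMSE")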
Create a new logistic regression model by including the United Nations region (region) variable as fixed effects. Present only the logistic regression models. How did it change the model? Pay attention to the AIC: generally, the model with the lower AIC is considered a better fit.
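One possible fixed-effects specification and comparison (a sketch; model_glm_fe is an assumed name):

model_glm_fe = glm(tcs_operated ~ ICC_arrest_warrant + uninv + region,
                   data = tjet_tc, family = binomial(link = "logit"))
# Keep AIC in the table so the two logistic models can be compared
modelsummary(list("Logit" = model_glm_tc,
                  "Logit + Region FE" = model_glm_fe),
             stars = TRUE,
             gof_omit = "BIC|Log.Lik.|F|RMSE")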