Logistic Regression
Week 8

library(tidyverse)
library(marginaleffects)
library(GGally)
sipri = read.csv("data/sipri.csv")
Before we start
- Any questions?
Load the tidyverse, marginaleffects, and GGally libraries and the sipri dataset.
Using ggpairs() from the GGally library, visualize the relationship between the Import and Regime variables.
...
Set up a linear regression model. Predict Import by Regime and save the model to the model1 object. Visualize the model with plot_predictions() from marginaleffects.
model1 = ...
plot_predictions(model1,
by = ...)
Present summary() of the model. What is the reference category in this case?
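One possible way to fill in the blanks above and produce the requested summary (a sketch, not the only solution):

model1 = lm(Import ~ Regime, data = sipri)
# Visualize predicted Import for each regime type
plot_predictions(model1, by = "Regime")
# The reference category is the Regime level missing from the coefficient list
summary(model1)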
...
Using relevel(), set the reference category to "Minimally Democratic".
sipri$Regime = as.factor(sipri$Regime)
sipri$Regime = relevel(sipri$Regime, ref = "Minimally Democratic")
Set up the model again and present the summary. Did anything change?
model2 = lm(Import ~ Regime, sipri)
...
Load the modelsummary library.
library(modelsummary)
Present both models side-by-side. Present only the p.value statistic, and set the stars significance level to 0.05. Remove the following model goodness-of-fit statistics: AIC|BIC|Log.Lik|F|RMSE.
modelsummary(list("First Model" = model1,
"Second Model" = model2),
... ,
gof_omit = ... ,
coef_rename = c("(Intercept)",
"Regime: Democratic",
"Regime: Electoral Authoritarian",
"Regime: Minimally Democratic",
"Regime: Autocratic"),
statistic = ...)
Finally, clear the environment.
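For reference, one way to fill in the blanks above and then clear the environment (a sketch; the single-star cutoff is one possible choice):

modelsummary(list("First Model" = model1,
                  "Second Model" = model2),
             stars = c("*" = 0.05),
             gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
             statistic = "p.value",
             coef_rename = c("(Intercept)",
                             "Regime: Democratic",
                             "Regime: Electoral Authoritarian",
                             "Regime: Minimally Democratic",
                             "Regime: Autocratic"))
# Remove all objects from the environment
rm(list = ls())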
Agenda
Overview of Logistic regression
Practicing marginal effects
Data Exploration
Today we are working with the Transitional Justice Evaluation Tools dataset. You can explore their codebook here. Let's load the data.
tjet = read.csv("data/tjet.csv")
We are going to use the following variables:
- country and year
- dtr: democratic transition
- tcs_operated: truth commission operated
- trials_domestic: count of domestic trials per country and start year
- region: United Nations region
Subset the above variables with select()
tjet_sub = tjet %>%
select(country, year, dtr, tcs_operated, trials_domestic, region)
Keep only the more recent observations, say from 1975 onwards.
tjet_sub = tjet_sub %>%
filter(year >= 1975)
Using ggpairs(), visualize the following columns: "trials_domestic", "region", "tcs_operated", "dtr". What do you notice?
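A possible call, using the tjet_sub object created above (a sketch):

ggpairs(tjet_sub, columns = c("trials_domestic", "region", "tcs_operated", "dtr"))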
Model Building
Set up a multiple linear regression. Predict democratic transition (dtr) by trials_domestic and tcs_operated. What type of model did we create, given the dependent variable? Present the results.
model_lpm = lm(dtr ~ trials_domestic + tcs_operated, tjet_sub)
summary(model_lpm)
Call:
lm(formula = dtr ~ trials_domestic + tcs_operated, data = tjet_sub)
Residuals:
Min 1Q Median 3Q Max
-1.3824 -0.2841 -0.2841 0.5062 0.7159
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.284102 0.005401 52.60 <2e-16 ***
trials_domestic 0.029956 0.002141 13.99 <2e-16 ***
tcs_operated 0.420752 0.014190 29.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4474 on 8345 degrees of freedom
(585 observations deleted due to missingness)
Multiple R-squared: 0.1395, Adjusted R-squared: 0.1393
F-statistic: 676.6 on 2 and 8345 DF, p-value: < 2.2e-16
Let’s draw it. Try geom_jitter() instead of geom_point().
plot_predictions(model_lpm,
condition = c("trials_domestic", "tcs_operated")) +
geom_point(data = tjet_sub,
aes(x = trials_domestic, y = dtr),
alpha = 0.4)
Ignoring unknown labels:
• linetype : "tcs_operated"
Warning: Removed 585 rows containing missing values or values outside the scale range
(`geom_point()`).
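Following the geom_jitter() suggestion, a jittered variant of the same plot might look like this (a sketch; the jitter width and height are arbitrary choices):

plot_predictions(model_lpm,
                 condition = c("trials_domestic", "tcs_operated")) +
  geom_jitter(data = tjet_sub,
              aes(x = trials_domestic, y = dtr),
              width = 0.3, height = 0.05,
              alpha = 0.4)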
Finally, let's try out logistic regression, which is made specifically for binary outcomes.
model_glm = glm(dtr ~ trials_domestic + tcs_operated, data = tjet_sub, family = binomial(link = "logit"))
summary(model_glm)
Call:
glm(formula = dtr ~ trials_domestic + tcs_operated, family = binomial(link = "logit"),
data = tjet_sub)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.04739 0.02864 -36.58 <2e-16 ***
trials_domestic 0.37528 0.02186 17.16 <2e-16 ***
tcs_operated 1.83853 0.07402 24.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10981.8 on 8347 degrees of freedom
Residual deviance: 9614.7 on 8345 degrees of freedom
(585 observations deleted due to missingness)
AIC: 9620.7
Number of Fisher Scoring iterations: 5
Model Presentation
Now, visualize how it looks. Usually, we would expect a slightly more "S"-shaped curve, but this still looks good!
plot_predictions(model_glm,
condition = c("trials_domestic", "tcs_operated")) +
geom_point(data = tjet_sub,
aes(x = trials_domestic, y = dtr),
alpha = 0.4) +
labs(x = "Number of Domestic Trials",
y = "Democratic Transition",
color = "Trurth Commission",
fill = "Trurth Commission")Ignoring unknown labels:
• linetype : "tcs_operated"
Warning: Removed 585 rows containing missing values or values outside the scale range
(`geom_point()`).
Present both the linear probability model and the logistic regression. The problem, though, is that the logit is much harder to interpret! You can stick to marginal effects and add another column that reports the Average Marginal Effect (i.e., the average of the individual marginal effects).
library(modelsummary)
# Calculate the Average Marginal Effect
ame = avg_slopes(model_glm)
# Present the Table
modelsummary(list("LPM" = model_lpm,
"Logit" = model_glm,
"Logit (AME)" = ame),
stars = T,
gof_omit = "BIC|Log.Lik.|F|RMSE",
coef_rename = c("(Intercept)",
"Domestic Trials",
"Truth Commission"))Bootstrapped Standard Errors for Logit*
Bootstrapped standard errors for logistic regression work in the same way as for linear regression. However, in general, bootstrapping makes no assumptions about the distribution of the estimates in logistic regression.
The code here is a bit harder, but you have seen it with OLS before. What has changed? Let's briefly discuss what's going on, and feel free to ask questions about this. Pay attention to:
Comparison of the Standard Errors
#| cache: true
library(rsample)
library(broom)
set.seed(123)
# Draw 100 bootstrap resamples of the data
fit_boot = bootstraps(data = tjet_sub, times = 100)$splits %>%
  # Refit the logistic regression on each resample
  map(~glm(dtr ~ trials_domestic + tcs_operated, data = analysis(.), family = binomial(link = "logit"))) %>%
  # Tidy the coefficients from each fit and stack them
  map(tidy) %>%
  bind_rows() %>%
  # The bootstrap SE is the SD of each coefficient across resamples
  group_by(term) %>%
  summarize(std.error = sd(estimate))
fit_boot
# A tibble: 3 × 2
term std.error
<chr> <dbl>
1 (Intercept) 0.0315
2 tcs_operated 0.0719
3 trials_domestic 0.0288
Check List
Optional Exercises
Imagine you are interested in what influences the creation of a Truth Commission. The hypothesis is that involvement by international institutions increases the likelihood of a Commission being created. Let’s test this hypothesis. Subset the following variables:
- country and year
- tcs_operated: truth commission operated
- ICC_arrest_warrant: number of arrest warrants issued by the ICC
- uninv: ongoing UN investigations, count
- region: United Nations region
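One possible subset (a sketch; tjet_tc is an assumed object name):

tjet_tc = tjet %>%
  select(country, year, tcs_operated, ICC_arrest_warrant, uninv, region)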
Explain tcs_operated by the number of arrest warrants issued by the ICC (ICC_arrest_warrant) and the number of ongoing UN investigations in the country (uninv). Set up the linear probability model.
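A possible linear probability model (a sketch; model_lpm_tc is an assumed name and tjet_tc is the subset from above):

model_lpm_tc = lm(tcs_operated ~ ICC_arrest_warrant + uninv, data = tjet_tc)
summary(model_lpm_tc)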
Now, set up the logistic regression model.
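A possible logistic regression (a sketch; model_glm_tc is an assumed name):

model_glm_tc = glm(tcs_operated ~ ICC_arrest_warrant + uninv,
                   data = tjet_tc, family = binomial(link = "logit"))
summary(model_glm_tc)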
Present both models with a plot_predictions() graph. Which model makes more sense to you?
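Possible plots for both models (a sketch, reusing the assumed object names from above):

plot_predictions(model_lpm_tc, condition = c("ICC_arrest_warrant", "uninv"))
plot_predictions(model_glm_tc, condition = c("ICC_arrest_warrant", "uninv"))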
Present a table with both models using modelsummary(). Don't forget to present the average marginal effects for the logistic regression.
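A possible table, mirroring the earlier modelsummary() call (a sketch; ame_tc is an assumed name):

ame_tc = avg_slopes(model_glm_tc)
modelsummary(list("LPM" = model_lpm_tc,
                  "Logit" = model_glm_tc,
                  "Logit (AME)" = ame_tc),
             stars = TRUE,
             gof_omit = "BIC|Log.Lik.|F|RMSE")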
Create a new logistic regression model by including the United Nations region (region) variable as fixed effects. Present only the logistic regression models. How did it change the model? Pay attention to the AIC: generally, the model with the lower AIC is considered a better fit.
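One possible fixed-effects specification and comparison (a sketch; model_glm_fe is an assumed name):

model_glm_fe = glm(tcs_operated ~ ICC_arrest_warrant + uninv + region,
                   data = tjet_tc, family = binomial(link = "logit"))
# Keep AIC in the table so the two logistic models can be compared
modelsummary(list("Logit" = model_glm_tc,
                  "Logit + Region FE" = model_glm_fe),
             stars = TRUE,
             gof_omit = "BIC|Log.Lik.|F|RMSE")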