Quarter Review

Week 8

Published

May 22, 2025

Before we start

  • Questions regarding Final Paper or the class?

  • Next week no section, but office hours instead (Scott Hall 110)

Agenda

  • Reviewing what we have learned

  • Model Interpretation and Diagnostics

  • Quarto formatting

Quarter Review

View slides in new window

Set Up


Download script


Today we are working with Comparative Political Dataset. We have used it in Week 2, and if you don’t have the dataset, download it here

It consists of political and institutional country-level data. Take a look on their codebook.

Today we are working with the following variables.

  • prefisc_gini - Gini index. What is it?

  • eu - member states of the European Union identification

  • openc - Openness of the economy (trade as % of GDP)

  • poco - post-communist countries post-communist countries identification

Let’s load our library.

library(tidyverse)

Now, let’s load the data.

library(readxl)
cpds = read_excel("data/cpds.xlsx")

Exploratory Analysis

Let’s do a similar data analysis as before. What do you notice?

library(GGally)

ggpairs(cpds, columns = c("prefisc_gini", "eu", "openc", "poco"))

Let’s change misclassified variables.

cpds$eu = as.factor(cpds$eu)
cpds$poco = as.factor(cpds$poco)

Multiple Linear Regression with Fixed Effects

Let’s explain inequality using the multiple linear regression. We measure (i.e., operationalize) inequality with the GINI index. Our main explanatory variable is EU membership. In addition, we control for economic openness (openc), and add a fixed effect for country’s communist past (poco).

model_mlr = lm(prefisc_gini ~ eu + openc + poco, cpds)
summary(model_mlr)

Call:
lm(formula = prefisc_gini ~ eu + openc + poco, data = cpds)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3329  -2.9815   0.2101   3.0609  12.8156 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.465498   0.339522 119.184  < 2e-16 ***
eu1          2.264796   0.397369   5.699 1.97e-08 ***
openc       -0.007549   0.002793  -2.703  0.00709 ** 
poco1       -0.248216   0.504688  -0.492  0.62304    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.03 on 545 degrees of freedom
  (1317 observations deleted due to missingness)
Multiple R-squared:  0.05716,   Adjusted R-squared:  0.05197 
F-statistic: 11.01 on 3 and 545 DF,  p-value: 4.969e-07

A Gini index of 0 represents perfect equality, where everyone has the same income. Conversely, a Gini index of 100 represents perfect inequality. Let me provide an example: one percentage point increase in trade as % of GDP (openc) decreases GINI index by 0.008.

Qustions

What is the interpretation for the main explanatory variable, i.e. being part of the EU?

Using ggpredict() function from ggeffects library, visualize the fixed effect: plot the openc and poco terms. Does it make sense why the previous communist experience is not statistically significant?

Multiple Linear Regression with Interaction effect

Now, let’s set up an interaction effect. While it’s most useful for causal inference to interact the main explanatory variable with a moderator (i.e. other type of variable), for illustration purposes, we’ll instead interact economic openness (openc) with communist experience (poco).

model_int = lm(prefisc_gini ~ eu + openc * poco, cpds)
summary(model_int)

Call:
lm(formula = prefisc_gini ~ eu + openc * poco, data = cpds)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2249  -2.8294   0.0792   2.8096  12.5204 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.183549   0.338196 118.817  < 2e-16 ***
eu1          2.369795   0.390323   6.071 2.38e-09 ***
openc       -0.005083   0.002787  -1.824   0.0687 .  
poco1        5.850890   1.373454   4.260 2.41e-05 ***
openc:poco1 -0.056715   0.011914  -4.761 2.48e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.952 on 544 degrees of freedom
  (1317 observations deleted due to missingness)
Multiple R-squared:  0.09487,   Adjusted R-squared:  0.08821 
F-statistic: 14.25 on 4 and 544 DF,  p-value: 4.508e-11

Let’s visualize the interaction model. We observe that openness of the economy is relevant only for post communist countries. Any ideas why?

library(ggeffects)

ggpredict(model_int, terms = c("openc", "poco")) |>
  plot()

To interpret this effect, we are relying on marginal effects. Say, what is the effect of being part of the EU:

when having average openness of the economy and having the communist experience vs when having average openness of the economy and not having the communist experience? The marginal effect of having communist experience vs not having is \(42.89 - 42.10 = 0.79\).

ggpredict(model_int, terms = c("eu [1]", "poco [0,1]"))
# Predicted values of prefisc_gini

poco: 0

eu | Predicted |       95% CI
-----------------------------
1  |     42.10 | 41.66, 42.54

poco: 1

eu | Predicted |       95% CI
-----------------------------
1  |     42.89 | 41.87, 43.92

Adjusted for:
* openc = 89.16

Presenting the models

And, as usual, you can use modelsummary() to present the output of the regression.

library(modelsummary)
 modelsummary(list("Fixed Effects model" = model_mlr,
                    "Interaction model" = model_int),
                   title = "Explaining Inequality (OLS)",  
                   stars = TRUE,
                   gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
                   coef_rename = c("(Intercept)", 
                             "EU", 
                             "Openness of the Economy", 
                             "Communsit Experience",
                             "Openness of the Economy × Communsit Experience"))
Explaining Inequality (OLS)
Fixed Effects model Interaction model
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 40.465*** 40.184***
(0.340) (0.338)
EU 2.265*** 2.370***
(0.397) (0.390)
Openness of the Economy -0.008** -0.005+
(0.003) (0.003)
Communsit Experience -0.248 5.851***
(0.505) (1.373)
Openness of the Economy × Communsit Experience -0.057***
(0.012)
Num.Obs. 549 549
R2 0.057 0.095
R2 Adj. 0.052 0.088

Diagnostics

There is no expectation that you master this part. But as there were request to expand, here is a brief recap of diagnostics. As a base, let’s take ggfortify and lmtest library diagnostics tools. We’ll focus only on technical side of diagnostics.

library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.3
library(lmtest)

autoplot(model_int)

Linearity Assumption

We expect that the relationship between our dependent variable and set of independent variables would be linear.

autoplot(model_int)[1]

More formally, we can use reset() test from lmtest library. What does the p-value below indicates?

reset(model_int)

    RESET test

data:  model_int
RESET = 2.4443, df1 = 2, df2 = 542, p-value = 0.08774

For fixing, consider variable transformations. You can use Box-Cox transformations, more on them here. Alternatively, use another model (some suggestions are at the end of the script).

Normality

We want residuals to be distributed normally.

autoplot(model_int)[2]

Formally, you can use any normality test. E.g., shapiro.test(). What does it say?

shapiro.test(model_int$residuals)

    Shapiro-Wilk normality test

data:  model_int$residuals
W = 0.99679, p-value = 0.3493

For fixing, consider transforming the variables, and think through potential omitted variables. Alternatively, use another model.

Non Constant Variance

The variance of the residuals should be constant (i.e. be homoscedastic).

autoplot(model_int)[3]

For formal test, you can use bptest() from lmtest library. What does p-value say?

bptest(model_int)

    studentized Breusch-Pagan test

data:  model_int
BP = 18.038, df = 4, p-value = 0.001213

For fixing, consider introducing robust standard errors. You can use lm_robust() function from the estimatr package.

Influential cases

We don’t want one or several observations to be driving the whole result. Thus, we need to be cautios with influential cases.

autoplot(model_int)[4]

For the formal evaluation, you can use Cook’s Distance. If you have observation that are above the threshold (usually, \(\frac{4}{N}\)), consider excluding them and study separately. For example, in our case it is

4 / model_int$df.residual
[1] 0.007352941

Then, we can check it easily on the plot.

plot(model_int, 4)

Multicollinearity

Finally, let’s address multicollinearity. We want to avoid perfect or near-perfect linear relationships among the independent variables. The Variance Inflation Factor (VIF) helps detect this issue. If multicollinearity is present, consider removing one of the variables with a high VIF. However, it’s not uncommon for interaction terms and polynomial transformations to have high VIF scores (and that’s ok).

library(car)
vif(model_int)
        eu      openc       poco openc:poco 
  1.166160   1.212807   7.819099   8.008637 

What’s next?

Quarto formatting

If you are interested, learning how to write papers in R and Quarto is quite easy. You can render your documents to PDF. Here are a couple of guides:

Alternative Models

What if your outcome is not continuous or my diagnostics fails completely? Then here are the starting points that would help you to think of other than linear relations

Outcome type Model Functions
Count data Poisson glm() with poisson link
Binary Binomial glm() with binomial link
Categorical Multinomial multinom() from nnet library

Discussion Section Review

Libraries and Functions covered

Library Functions Description
tidyverse filter(), mutate(), ggplot() data wrangling and visualization
modelsummary modelsummary() present good looking tables
ggeffects ggpredict() calculate and visualize marginal effects
GGally ggpairs(), ggcoef() extension to ggplot
ggfortify autoplot() extension to ggplot for diagnostics
lmtest bptest() statistical tests for diagnostics
car vif() additional statistical tests for diagnostics

Datasets covered

Dataset Description Link
V-Dem Measures democracy worldwide V-Dem
World Happiness Report Annual happiness report World Happiness Report
Who Governs Dataset on political elites Who Governs
SIPRI Data on military operations SIPRI
Comparative Political Data Set A dataset covering political institutions CPDS

Check List