Quarter Review

Week 8

Published

May 22, 2025

Before we start

Questions regarding Final Paper or the class?
Next week no section, but office hours instead (Scott Hall 110)

Agenda

Reviewing what we have learned
Model Interpretation and Diagnostics
Quarto formatting

Quarter Review

View slides in new window

Set Up

Download script

Today we are working with Comparative Political Dataset. We have used it in Week 2, and if you don’t have the dataset, download it here

It consists of political and institutional country-level data. Take a look on their codebook.

Today we are working with the following variables.

prefisc_gini - Gini index. What is it?
eu - member states of the European Union identification
openc - Openness of the economy (trade as % of GDP)
poco - post-communist countries post-communist countries identification

Let’s load our library.

library(tidyverse)

Now, let’s load the data.

library(readxl)
cpds = read_excel("data/cpds.xlsx")

Exploratory Analysis

Let’s do a similar data analysis as before. What do you notice?

library(GGally)

ggpairs(cpds, columns = c("prefisc_gini", "eu", "openc", "poco"))

Let’s change misclassified variables.

cpds$eu = as.factor(cpds$eu)
cpds$poco = as.factor(cpds$poco)

Multiple Linear Regression with Fixed Effects

Let’s explain inequality using the multiple linear regression. We measure (i.e., operationalize) inequality with the GINI index. Our main explanatory variable is EU membership. In addition, we control for economic openness (openc), and add a fixed effect for country’s communist past (poco).

model_mlr = lm(prefisc_gini ~ eu + openc + poco, cpds)
summary(model_mlr)


Call:
lm(formula = prefisc_gini ~ eu + openc + poco, data = cpds)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.3329  -2.9815   0.2101   3.0609  12.8156 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.465498   0.339522 119.184  < 2e-16 ***
eu1          2.264796   0.397369   5.699 1.97e-08 ***
openc       -0.007549   0.002793  -2.703  0.00709 ** 
poco1       -0.248216   0.504688  -0.492  0.62304    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.03 on 545 degrees of freedom
  (1317 observations deleted due to missingness)
Multiple R-squared:  0.05716,   Adjusted R-squared:  0.05197 
F-statistic: 11.01 on 3 and 545 DF,  p-value: 4.969e-07

A Gini index of 0 represents perfect equality, where everyone has the same income. Conversely, a Gini index of 100 represents perfect inequality. Let me provide an example: one percentage point increase in trade as % of GDP (openc) decreases GINI index by 0.008.

Qustions

What is the interpretation for the main explanatory variable, i.e. being part of the EU?

Using ggpredict() function from ggeffects library, visualize the fixed effect: plot the openc and poco terms. Does it make sense why the previous communist experience is not statistically significant?

Multiple Linear Regression with Interaction effect

Now, let’s set up an interaction effect. While it’s most useful for causal inference to interact the main explanatory variable with a moderator (i.e. other type of variable), for illustration purposes, we’ll instead interact economic openness (openc) with communist experience (poco).

model_int = lm(prefisc_gini ~ eu + openc * poco, cpds)
summary(model_int)


Call:
lm(formula = prefisc_gini ~ eu + openc * poco, data = cpds)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2249  -2.8294   0.0792   2.8096  12.5204 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.183549   0.338196 118.817  < 2e-16 ***
eu1          2.369795   0.390323   6.071 2.38e-09 ***
openc       -0.005083   0.002787  -1.824   0.0687 .  
poco1        5.850890   1.373454   4.260 2.41e-05 ***
openc:poco1 -0.056715   0.011914  -4.761 2.48e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.952 on 544 degrees of freedom
  (1317 observations deleted due to missingness)
Multiple R-squared:  0.09487,   Adjusted R-squared:  0.08821 
F-statistic: 14.25 on 4 and 544 DF,  p-value: 4.508e-11

Let’s visualize the interaction model. We observe that openness of the economy is relevant only for post communist countries. Any ideas why?

library(ggeffects)

ggpredict(model_int, terms = c("openc", "poco")) |>
  plot()

To interpret this effect, we are relying on marginal effects. Say, what is the effect of being part of the EU:

when having average openness of the economy and having the communist experience vs when having average openness of the economy and not having the communist experience? The marginal effect of having communist experience vs not having is \(42.89 - 42.10 = 0.79\).

ggpredict(model_int, terms = c("eu [1]", "poco [0,1]"))

# Predicted values of prefisc_gini

poco: 0

eu | Predicted |       95% CI
-----------------------------
1  |     42.10 | 41.66, 42.54

poco: 1

eu | Predicted |       95% CI
-----------------------------
1  |     42.89 | 41.87, 43.92

Adjusted for:
* openc = 89.16

Presenting the models

And, as usual, you can use modelsummary() to present the output of the regression.

library(modelsummary)
 modelsummary(list("Fixed Effects model" = model_mlr,
                    "Interaction model" = model_int),
                   title = "Explaining Inequality (OLS)",  
                   stars = TRUE,
                   gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
                   coef_rename = c("(Intercept)", 
                             "EU", 
                             "Openness of the Economy", 
                             "Communsit Experience",
                             "Openness of the Economy × Communsit Experience"))

Explaining Inequality (OLS)
	Fixed Effects model	Interaction model
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001
(Intercept)	40.465***	40.184***
	(0.340)	(0.338)
EU	2.265***	2.370***
	(0.397)	(0.390)
Openness of the Economy	-0.008**	-0.005+
	(0.003)	(0.003)
Communsit Experience	-0.248	5.851***
	(0.505)	(1.373)
Openness of the Economy × Communsit Experience		-0.057***
		(0.012)
Num.Obs.	549	549
R2	0.057	0.095
R2 Adj.	0.052	0.088

Diagnostics

There is no expectation that you master this part. But as there were request to expand, here is a brief recap of diagnostics. As a base, let’s take ggfortify and lmtest library diagnostics tools. We’ll focus only on technical side of diagnostics.

library(ggfortify)

Warning: package 'ggfortify' was built under R version 4.4.3

library(lmtest)

autoplot(model_int)

Linearity Assumption

We expect that the relationship between our dependent variable and set of independent variables would be linear.

autoplot(model_int)[1]

More formally, we can use reset() test from lmtest library. What does the p-value below indicates?

reset(model_int)


    RESET test

data:  model_int
RESET = 2.4443, df1 = 2, df2 = 542, p-value = 0.08774

For fixing, consider variable transformations. You can use Box-Cox transformations, more on them here. Alternatively, use another model (some suggestions are at the end of the script).

Normality

We want residuals to be distributed normally.

autoplot(model_int)[2]

Formally, you can use any normality test. E.g., shapiro.test(). What does it say?

shapiro.test(model_int$residuals)


    Shapiro-Wilk normality test

data:  model_int$residuals
W = 0.99679, p-value = 0.3493

For fixing, consider transforming the variables, and think through potential omitted variables. Alternatively, use another model.

Non Constant Variance

The variance of the residuals should be constant (i.e. be homoscedastic).

autoplot(model_int)[3]

For formal test, you can use bptest() from lmtest library. What does p-value say?

bptest(model_int)


    studentized Breusch-Pagan test

data:  model_int
BP = 18.038, df = 4, p-value = 0.001213

For fixing, consider introducing robust standard errors. You can use lm_robust() function from the estimatr package.

Influential cases

We don’t want one or several observations to be driving the whole result. Thus, we need to be cautios with influential cases.

autoplot(model_int)[4]

For the formal evaluation, you can use Cook’s Distance. If you have observation that are above the threshold (usually, \(\frac{4}{N}\)), consider excluding them and study separately. For example, in our case it is

4 / model_int$df.residual

[1] 0.007352941

Then, we can check it easily on the plot.

plot(model_int, 4)

Multicollinearity

Finally, let’s address multicollinearity. We want to avoid perfect or near-perfect linear relationships among the independent variables. The Variance Inflation Factor (VIF) helps detect this issue. If multicollinearity is present, consider removing one of the variables with a high VIF. However, it’s not uncommon for interaction terms and polynomial transformations to have high VIF scores (and that’s ok).

library(car)
vif(model_int)

        eu      openc       poco openc:poco 
  1.166160   1.212807   7.819099   8.008637

What’s next?

Quarto formatting

If you are interested, learning how to write papers in R and Quarto is quite easy. You can render your documents to PDF. Here are a couple of guides:

Alternative Models

What if your outcome is not continuous or my diagnostics fails completely? Then here are the starting points that would help you to think of other than linear relations

Outcome type	Model	Functions
Count data	Poisson	`glm()` with poisson link
Binary	Binomial	`glm()` with binomial link
Categorical	Multinomial	`multinom()` from `nnet` library

Discussion Section Review

Libraries and Functions covered

Library	Functions	Description
tidyverse	`filter()`, `mutate()`, `ggplot()`	data wrangling and visualization
modelsummary	`modelsummary()`	present good looking tables
ggeffects	`ggpredict()`	calculate and visualize marginal effects
GGally	`ggpairs()`, `ggcoef()`	extension to ggplot
ggfortify	`autoplot()`	extension to ggplot for diagnostics
lmtest	`bptest()`	statistical tests for diagnostics
car	`vif()`	additional statistical tests for diagnostics

Datasets covered

Dataset	Description	Link
V-Dem	Measures democracy worldwide	V-Dem
World Happiness Report	Annual happiness report	World Happiness Report
Who Governs	Dataset on political elites	Who Governs
SIPRI	Data on military operations	SIPRI
Comparative Political Data Set	A dataset covering political institutions	CPDS

Check List

I know that every lab has check list below! And I will use it to navigate what we have learned
R doesn’t scare me anymore
I have developed the intuition behind application of quantitative methods