library(tidyverse)
Quarter Review
Week 8
Before we start
Questions regarding Final Paper or the class?
Next week no section, but office hours instead (Scott Hall 110)
Agenda
Reviewing what we have learned
Model Interpretation and Diagnostics
Quarto formatting
Quarter Review
Set Up
Today we are working with Comparative Political Dataset. We have used it in Week 2, and if you don’t have the dataset, download it here
It consists of political and institutional country-level data. Take a look on their codebook.
Today we are working with the following variables.
prefisc_gini
- Gini index. What is it?eu
- member states of the European Union identificationopenc
- Openness of the economy (trade as % of GDP)poco
- post-communist countries post-communist countries identification
Let’s load our library.
Now, let’s load the data.
library(readxl)
= read_excel("data/cpds.xlsx") cpds
Exploratory Analysis
Let’s do a similar data analysis as before. What do you notice?
library(GGally)
ggpairs(cpds, columns = c("prefisc_gini", "eu", "openc", "poco"))
Let’s change misclassified variables.
$eu = as.factor(cpds$eu)
cpds$poco = as.factor(cpds$poco) cpds
Multiple Linear Regression with Fixed Effects
Let’s explain inequality using the multiple linear regression. We measure (i.e., operationalize) inequality with the GINI index. Our main explanatory variable is EU membership. In addition, we control for economic openness (openc
), and add a fixed effect for country’s communist past (poco
).
= lm(prefisc_gini ~ eu + openc + poco, cpds)
model_mlr summary(model_mlr)
Call:
lm(formula = prefisc_gini ~ eu + openc + poco, data = cpds)
Residuals:
Min 1Q Median 3Q Max
-10.3329 -2.9815 0.2101 3.0609 12.8156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.465498 0.339522 119.184 < 2e-16 ***
eu1 2.264796 0.397369 5.699 1.97e-08 ***
openc -0.007549 0.002793 -2.703 0.00709 **
poco1 -0.248216 0.504688 -0.492 0.62304
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.03 on 545 degrees of freedom
(1317 observations deleted due to missingness)
Multiple R-squared: 0.05716, Adjusted R-squared: 0.05197
F-statistic: 11.01 on 3 and 545 DF, p-value: 4.969e-07
A Gini index of 0 represents perfect equality, where everyone has the same income. Conversely, a Gini index of 100 represents perfect inequality. Let me provide an example: one percentage point increase in trade as % of GDP (openc
) decreases GINI index by 0.008.
Multiple Linear Regression with Interaction effect
Now, let’s set up an interaction effect. While it’s most useful for causal inference to interact the main explanatory variable with a moderator (i.e. other type of variable), for illustration purposes, we’ll instead interact economic openness (openc
) with communist experience (poco
).
= lm(prefisc_gini ~ eu + openc * poco, cpds)
model_int summary(model_int)
Call:
lm(formula = prefisc_gini ~ eu + openc * poco, data = cpds)
Residuals:
Min 1Q Median 3Q Max
-10.2249 -2.8294 0.0792 2.8096 12.5204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.183549 0.338196 118.817 < 2e-16 ***
eu1 2.369795 0.390323 6.071 2.38e-09 ***
openc -0.005083 0.002787 -1.824 0.0687 .
poco1 5.850890 1.373454 4.260 2.41e-05 ***
openc:poco1 -0.056715 0.011914 -4.761 2.48e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.952 on 544 degrees of freedom
(1317 observations deleted due to missingness)
Multiple R-squared: 0.09487, Adjusted R-squared: 0.08821
F-statistic: 14.25 on 4 and 544 DF, p-value: 4.508e-11
Let’s visualize the interaction model. We observe that openness of the economy is relevant only for post communist countries. Any ideas why?
library(ggeffects)
ggpredict(model_int, terms = c("openc", "poco")) |>
plot()
To interpret this effect, we are relying on marginal effects. Say, what is the effect of being part of the EU:
when having average openness of the economy and having the communist experience vs when having average openness of the economy and not having the communist experience? The marginal effect of having communist experience vs not having is \(42.89 - 42.10 = 0.79\).
ggpredict(model_int, terms = c("eu [1]", "poco [0,1]"))
# Predicted values of prefisc_gini
poco: 0
eu | Predicted | 95% CI
-----------------------------
1 | 42.10 | 41.66, 42.54
poco: 1
eu | Predicted | 95% CI
-----------------------------
1 | 42.89 | 41.87, 43.92
Adjusted for:
* openc = 89.16
Presenting the models
And, as usual, you can use modelsummary()
to present the output of the regression.
library(modelsummary)
modelsummary(list("Fixed Effects model" = model_mlr,
"Interaction model" = model_int),
title = "Explaining Inequality (OLS)",
stars = TRUE,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
coef_rename = c("(Intercept)",
"EU",
"Openness of the Economy",
"Communsit Experience",
"Openness of the Economy × Communsit Experience"))
Fixed Effects model | Interaction model | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
(Intercept) | 40.465*** | 40.184*** |
(0.340) | (0.338) | |
EU | 2.265*** | 2.370*** |
(0.397) | (0.390) | |
Openness of the Economy | -0.008** | -0.005+ |
(0.003) | (0.003) | |
Communsit Experience | -0.248 | 5.851*** |
(0.505) | (1.373) | |
Openness of the Economy × Communsit Experience | -0.057*** | |
(0.012) | ||
Num.Obs. | 549 | 549 |
R2 | 0.057 | 0.095 |
R2 Adj. | 0.052 | 0.088 |
Diagnostics
There is no expectation that you master this part. But as there were request to expand, here is a brief recap of diagnostics. As a base, let’s take ggfortify
and lmtest
library diagnostics tools. We’ll focus only on technical side of diagnostics.
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.3
library(lmtest)
autoplot(model_int)
Linearity Assumption
We expect that the relationship between our dependent variable and set of independent variables would be linear.
autoplot(model_int)[1]
More formally, we can use reset()
test from lmtest
library. What does the p-value below indicates?
reset(model_int)
RESET test
data: model_int
RESET = 2.4443, df1 = 2, df2 = 542, p-value = 0.08774
For fixing, consider variable transformations. You can use Box-Cox transformations, more on them here. Alternatively, use another model (some suggestions are at the end of the script).
Normality
We want residuals to be distributed normally.
autoplot(model_int)[2]
Formally, you can use any normality test. E.g., shapiro.test()
. What does it say?
shapiro.test(model_int$residuals)
Shapiro-Wilk normality test
data: model_int$residuals
W = 0.99679, p-value = 0.3493
For fixing, consider transforming the variables, and think through potential omitted variables. Alternatively, use another model.
Non Constant Variance
The variance of the residuals should be constant (i.e. be homoscedastic).
autoplot(model_int)[3]
For formal test, you can use bptest()
from lmtest
library. What does p-value say?
bptest(model_int)
studentized Breusch-Pagan test
data: model_int
BP = 18.038, df = 4, p-value = 0.001213
For fixing, consider introducing robust standard errors. You can use lm_robust()
function from the estimatr
package.
Influential cases
We don’t want one or several observations to be driving the whole result. Thus, we need to be cautios with influential cases.
autoplot(model_int)[4]
For the formal evaluation, you can use Cook’s Distance. If you have observation that are above the threshold (usually, \(\frac{4}{N}\)), consider excluding them and study separately. For example, in our case it is
4 / model_int$df.residual
[1] 0.007352941
Then, we can check it easily on the plot.
plot(model_int, 4)
Multicollinearity
Finally, let’s address multicollinearity. We want to avoid perfect or near-perfect linear relationships among the independent variables. The Variance Inflation Factor (VIF) helps detect this issue. If multicollinearity is present, consider removing one of the variables with a high VIF. However, it’s not uncommon for interaction terms and polynomial transformations to have high VIF scores (and that’s ok).
library(car)
vif(model_int)
eu openc poco openc:poco
1.166160 1.212807 7.819099 8.008637
What’s next?
Quarto formatting
If you are interested, learning how to write papers in R and Quarto is quite easy. You can render your documents to PDF. Here are a couple of guides:
Alternative Models
What if your outcome is not continuous or my diagnostics fails completely? Then here are the starting points that would help you to think of other than linear relations
Outcome type | Model | Functions |
---|---|---|
Count data | Poisson | glm() with poisson link |
Binary | Binomial | glm() with binomial link |
Categorical | Multinomial | multinom() from nnet library |
Discussion Section Review
Libraries and Functions covered
Library | Functions | Description |
---|---|---|
tidyverse | filter() , mutate() , ggplot() |
data wrangling and visualization |
modelsummary | modelsummary() |
present good looking tables |
ggeffects | ggpredict() |
calculate and visualize marginal effects |
GGally | ggpairs() , ggcoef() |
extension to ggplot |
ggfortify | autoplot() |
extension to ggplot for diagnostics |
lmtest | bptest() |
statistical tests for diagnostics |
car | vif() |
additional statistical tests for diagnostics |
Datasets covered
Dataset | Description | Link |
---|---|---|
V-Dem | Measures democracy worldwide | V-Dem |
World Happiness Report | Annual happiness report | World Happiness Report |
Who Governs | Dataset on political elites | Who Governs |
SIPRI | Data on military operations | SIPRI |
Comparative Political Data Set | A dataset covering political institutions | CPDS |