---
title: "Quarter Review"
subtitle: "Week 8"
date: 2025-05-22
format:
html:
embed-resources: true
toc: true
---
# Before we start
- Questions regarding Final Paper or the class?
- Next week no section, but office hours instead (Scott Hall 110)
# Agenda
- Reviewing what we have learned
- Model Interpretation and Diagnostics
- Quarto formatting
# Quarter Review
[{{< fa arrow-up-right-from-square >}} View slides in new window](https://artur-baranov.github.io/nu-ps312-ds/slides/ps312-s_08.html){.btn .btn-primary role="button" target="_blank"}
# Set Up
Download script
Today we are working with Comparative Political Dataset. We have used it in Week 2, and if you don't have the dataset, download it [here](https://artur-baranov.github.io/nu-ps312-ds/data/cpds.xlsx)
It consists of political and institutional country-level data. Take a look on their [codebook](https://cpds-data.org/wp-content/uploads/2024/11/codebook_cpds.pdf).
Today we are working with the following variables.
- `prefisc_gini` - Gini index. What is it?
- `eu` - member states of the European Union identification
- `openc` - Openness of the economy (trade as % of GDP)
- `poco` - post-communist countries post-communist countries identification
Let's load our library.
```{r}
#| message: false
library(tidyverse)
```
Now, let's load the data.
```{r}
library(readxl)
cpds = read_excel("data/cpds.xlsx")
```
## Exploratory Analysis
Let's do a similar data analysis as before. What do you notice?
```{r}
#| warning: false
#| message: false
library(GGally)
ggpairs(cpds, columns = c("prefisc_gini", "eu", "openc", "poco"))
```
Let's change misclassified variables.
```{r}
cpds$eu = as.factor(cpds$eu)
cpds$poco = as.factor(cpds$poco)
```
# Multiple Linear Regression with Fixed Effects
Let's explain inequality using the multiple linear regression. We measure (i.e., operationalize) inequality with the GINI index. Our main explanatory variable is EU membership. In addition, we control for economic openness (`openc`), and add a fixed effect for country's communist past (`poco`).
```{r}
model_mlr = lm(prefisc_gini ~ eu + openc + poco, cpds)
summary(model_mlr)
```
A Gini index of 0 represents perfect equality, where everyone has the same income. Conversely, a Gini index of 100 represents perfect inequality. Let me provide an example: one percentage point increase in trade as % of GDP (`openc`) decreases GINI index by 0.008.
::: {.callout-note icon="false"}
## Qustions
What is the interpretation for the main explanatory variable, i.e. being part of the EU?
Using `ggpredict()` function from `ggeffects` library, visualize the fixed effect: plot the `openc` and `poco` terms. Does it make sense why the previous communist experience is not statistically significant?
:::
# Multiple Linear Regression with Interaction effect
Now, let’s set up an interaction effect. While it’s most useful for causal inference to interact the main explanatory variable with a *moderator* (i.e. other type of variable), for illustration purposes, we’ll instead interact economic openness (`openc`) with communist experience (`poco`).
```{r}
model_int = lm(prefisc_gini ~ eu + openc * poco, cpds)
summary(model_int)
```
Let's visualize the interaction model. We observe that openness of the economy is relevant only for post communist countries. Any ideas why?
```{r}
library(ggeffects)
ggpredict(model_int, terms = c("openc", "poco")) |>
plot()
```
To interpret this effect, we are relying on marginal effects. Say, what is the effect of being part of the EU:
when having average openness of the economy and **having the communist experience** vs **when having average openness of the economy and not having the communist experience**? The marginal effect of having communist experience vs not having is $42.89 - 42.10 = 0.79$.
```{r}
ggpredict(model_int, terms = c("eu [1]", "poco [0,1]"))
```
# Presenting the models
And, as usual, you can use `modelsummary()` to present the output of the regression.
```{r}
library(modelsummary)
modelsummary(list("Fixed Effects model" = model_mlr,
"Interaction model" = model_int),
title = "Explaining Inequality (OLS)",
stars = TRUE,
gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
coef_rename = c("(Intercept)",
"EU",
"Openness of the Economy",
"Communsit Experience",
"Openness of the Economy × Communsit Experience"))
```
# Diagnostics
There is no expectation that you master this part. But as there were request to expand, here is a brief recap of diagnostics. As a base, let's take `ggfortify` and `lmtest` library diagnostics tools. We'll focus only on technical side of diagnostics.
```{r}
#| message: false
library(ggfortify)
library(lmtest)
autoplot(model_int)
```
## Linearity Assumption
We expect that the relationship between our dependent variable and set of independent variables would be linear.
```{r}
autoplot(model_int)[1]
```
More formally, we can use `reset()` test from `lmtest` library. What does the p-value below indicates?
```{r}
reset(model_int)
```
For fixing, consider variable transformations. You can use Box-Cox transformations, more on them [here](https://www.r-bloggers.com/2022/10/box-cox-transformation-in-r/). Alternatively, use another model (some suggestions are at the end of the script).
## Normality
We want residuals to be distributed normally.
```{r}
autoplot(model_int)[2]
```
Formally, you can use any normality test. E.g., `shapiro.test()`. What does it say?
```{r}
shapiro.test(model_int$residuals)
```
For fixing, consider transforming the variables, and think through potential omitted variables. Alternatively, use another model.
## Non Constant Variance
The variance of the residuals should be constant (i.e. be homoscedastic).
```{r}
autoplot(model_int)[3]
```
For formal test, you can use `bptest()` from `lmtest` library. What does p-value say?
```{r}
bptest(model_int)
```
For fixing, consider introducing robust standard errors. You can use `lm_robust()` function from the `estimatr` package.
## Influential cases
We don't want one or several observations to be driving the whole result. Thus, we need to be cautios with influential cases.
```{r}
autoplot(model_int)[4]
```
For the formal evaluation, you can use Cook's Distance. If you have observation that are above the threshold (usually, $\frac{4}{N}$), consider excluding them and study separately. For example, in our case it is
```{r}
4 / model_int$df.residual
```
Then, we can check it easily on the plot.
```{r}
plot(model_int, 4)
```
## Multicollinearity
Finally, let’s address multicollinearity. We want to avoid perfect or near-perfect linear relationships among the independent variables. The Variance Inflation Factor (VIF) helps detect this issue. If multicollinearity is present, consider removing one of the variables with a high VIF. However, it’s not uncommon for interaction terms and polynomial transformations to have high VIF scores (and that's ok).
```{r}
#| message: false
library(car)
vif(model_int)
```
# What's next?
## Quarto formatting
If you are interested, learning how to write papers in R and Quarto is quite easy. You can render your documents to PDF. Here are a couple of guides:
- [Typst](https://quarto.org/docs/output-formats/typst.html)
- [Markdown](https://quarto.org/docs/authoring/markdown-basics.html)
## Alternative Models
What if your outcome is not continuous or my diagnostics fails completely? Then here are the starting points that would help you to think of other than linear relations
| Outcome type | Model | Functions |
|--------------|-------------|----------------------------------|
| Count data | Poisson | `glm()` with poisson link |
| Binary | Binomial | `glm()` with binomial link |
| Categorical | Multinomial | `multinom()` from `nnet` library |
# Discussion Section Review
## Libraries and Functions covered
| Library | Functions | Description |
|--------------|------------------------------------|------------------------------------------------------|
| tidyverse | `filter()`, `mutate()`, `ggplot()` | data wrangling and visualization |
| modelsummary | `modelsummary()` | present good looking tables |
| ggeffects | `ggpredict()` | calculate and visualize marginal effects |
| GGally | `ggpairs()`, `ggcoef()` | extension to ggplot |
| ggfortify | `autoplot()` | extension to ggplot for diagnostics |
| lmtest | `bptest()` | statistical tests for diagnostics |
| car | `vif()` | additional statistical tests for diagnostics |
## Datasets covered
| Dataset | Description | Link |
|---------------------------------------------------|--------------------------------------------|----------------------------------------------------------|
| **V-Dem** | Measures democracy worldwide | [V-Dem](https://www.v-dem.net/) |
| **World Happiness Report** | Annual happiness report | [World Happiness Report](https://worldhappiness.report/) |
| **Who Governs** | Dataset on political elites | [Who Governs](https://doi.org/10.1017/S0003055420000490) |
| **SIPRI** | Data on military operations | [SIPRI](https://www.sipri.org/databases) |
| **Comparative Political Data Set** | A dataset covering political institutions | [CPDS](https://www.cpds-data.org/) |
|
# Check List
- [ ] I know that every lab has check list below! And I will use it to navigate what we have learned
- [ ] R doesn't scare me anymore
- [ ] I have developed the intuition behind application of quantitative methods