---
title: "Regression Extensions"
subtitle: "Week 6"
date: 2025-05-08
format: 
  html:
    embed-resources: true
toc: true
---

# Before we start

-   Questions regarding Lab Assignment or the class?

<br><center>
<a href="https://artur-baranov.github.io/nu-ps312-ds/scripts/ps312-d_06.qmd" class="btn btn-primary" role="button" download="ps312-d_06.qmd" style="width:200px" target="_blank">Download script</a>
</center>
<br>
<center>
<a href="https://artur-baranov.github.io/nu-ps312-ds/data/WhoGov.csv" class="btn btn-primary" role="button" download="WhoGov.csv" style="width:200px" target="_blank">Download data</a>
</center><br>

# Agenda

-   Using dummy variables in the OLS

-   Introducing fixed effects and interactions

-   Attempting to fix problems identified during diagnostics

# Set Up

Today we are working with WhoGov dataset. The data provides information on elites in various countries. As usual, I recommend taking a look at their [codebook](https://politicscentre.nuffield.ox.ac.uk/media/4117/whogov_codebook.pdf).

```{r}
#| message: false

library(tidyverse)

whogov = read.csv("data/WhoGov.csv")
```

First of all, these are the following variables we are going to work with today:

-   `country_name` is a country name

-   `n_individuals` number of unique persons in the cabinet

-   `leaderexperience_continuous` the number of years the person has been leader of the country in total.

-   `leader_party` party of the leader

Start with exploring the distribution of number of unique persons in the cabinet (`n_individuals`)

```{r}
#| message: false

ggplot(whogov) +
  geom_histogram(aes(x = n_individuals)) 
```

# Dummy Variables

Let’s examine whether a country's leader being independent from a political party is associated with having more or fewer members in their cabinet. First, let's create a dummy variable indicating if a leader is independent or non-independent. You can use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.

```{r}
whogov = whogov %>%
  mutate(indep = ifelse(leader_party == "independent", "Independent", "Partisan"))
```

Now, build a simple model and explore the effect. On average, being a non-independent (i.e. partisan) leader is associated with having 2.28 more members in their cabinet compared to independent leaders.

```{r}
lm(n_individuals ~ indep, whogov) %>%
  summary()
```

What if we want to know the effect relative to Partisan leader? Let's `relevel()` the variable!

```{r}
#| eval: false

whogov$indep = relevel(whogov$indep, ref = "Partisan")
```

Oops! This is why classes of data are important. Fix it!

```{r}
whogov$indep = as.factor(whogov$indep)
```

Now we can relevel the variable

```{r}
whogov$indep = relevel(whogov$indep, ref = "Partisan")
```

Compare the models. Does the result sound reasonable? Pretty much. This is simply an inverse. But things get way more interesting if a categorical variable has more than 2 levels. You will see this later on. 

```{r}
lm(n_individuals ~ indep, whogov) %>%
  summary()
```

## Fixed Effects

Let’s explore how leader’s tenure is associated with the number of individuals in the government. We start with the simple linear regression. 

```{r}
lm(n_individuals ~ leaderexperience_continuous, whogov) %>%
  summary()
```

::: {.callout-tip icon="false"}
## Coding Task

Take a moment and draw a scatterplot for n_individuals and leaderexperience_continuous. Add a regression line to the plot.

```{r}

```


:::

Now, let’s add a categorical variable, `indep`, to the model. By doing so, we assume that the association between the leader’s tenure and the number of individuals in the government differs depending on whether the leader is independent or partisan.

Practically, this could be done in multiple ways. First, let’s discuss introduction of **fixed effects** to our model. 

```{r}
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov) 
summary(model_fe)
```

We will use `ggeffects` library for visualization of regression with the fixed effects. This is sort of an addition to `ggplot2` library from `tidyverse.` Don’t forget to install it using `install.packages()`!

```{r}
library(ggeffects)
```

Then, visualize the result. What can we see?

```{r}
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
  plot()
```

Let’s customize the plot. It should be relatively straightforward given we know `ggplot` functions. Details for the customization of `plot()` function can be found on [ggeffects](https://strengejacke.github.io/ggeffects/reference/plot.html) website.

```{r}
ggpredict(model_fe, terms= c("leaderexperience_continuous", "indep")) %>%
  plot(show_ci = F) +
  labs(title = "Fixed Effects Regression",
       x = "Tenure of a Leader",
       y = "Number of Individuals in a Cabinet",
       color = "Leader's Status") +
  theme_bw()
```

Some common fixed effects include:

-   Country/Region/State

-   Individual leaders/Parties

-   Year/Time

-   Policy presence or absence

By introducing fixed effects, we are able to control for unobserved confounders that vary across the units (not within!). 

## Interactions

Often dummy variables are used to introduce an interaction term in the model. We will explore the association between `Perceptions_of_corruption` and number of people in the cabinet (`n_individuals`) depending on the independence of the party leader.

The task isn’t trivial as now we planning to use data from two datasets, Let’s subset those.

Load the World Happiness Report we used in the [First Meeting](https://artur-baranov.github.io/nu-ps312-ds/data/WHR.csv).

```{r}
whr = read.csv("data/WHR.csv")
```

Let's prepare the data

```{r}
whr_subset = whr %>%
  select(Country_name, Perceptions_of_corruption)

whogov_subset = whogov %>%
  filter(year == 2021) %>%
  select(country_name, n_individuals, indep)
```

Now, we are merging them.

```{r}
whr_whogov = whr_subset %>%
  left_join(whogov_subset, by = c("Country_name" = "country_name")) 
```

Check the result

```{r}
head(whr_whogov)
```

Now, to interact variables we need to use asterisk `*`, i.e. multiplication.

```{r}
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
```

Let’s plot the result.

```{r}
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
  plot(show_ci = FALSE) +
  labs(title = "Regression with Interaction Term",
       x = "Number of Individuals in a Cabinet",
       y = "Perception of Corruption",
       color = "Leader's Status") +
  theme_bw()
```

And you can easily simulate the data (i.e., calculate the marginal effect) using `ggpredict()`. For example,

```{r}
ggpredict(model_in, terms= c("n_individuals [12]", "indep [Independent]"))
```

# Diagnostics

Let's first identify if we have a problem. Load the `ggfortify` library to draw the diagnostic plots.

```{r}
#| warning: false

library(ggfortify)
```

And let's see the interaction model. What stands out?

```{r}
autoplot(model_in)
```

Let's visualize that by the independence/partisanship of the leader. Any patterns?

```{r}
autoplot(model_in, colour = "indep")
```

::: {.callout-tip icon="false"}
## Coding Task

Let's try to fix that.

-   Add `system_category` from `WhoGov` dataset as a control variable

-   Apply a quadratic transformation to the dependent variable

-   Draw diagnostics plots. What have changed?

:::


# Check List

<input type="checkbox"/> I know what dummy variables are

<input type="checkbox"/> I know why I might need fixed effects

<input type="checkbox"/> I have an intuition regarding implementation of interaction effects

<input type="checkbox"/> I know how to visualize complex regressions using `ggeffects()`.