---
title: "OLS Regression"
subtitle: "Week 5"
date: 2025-05-01
format: 
  html:
    embed-resources: true
toc: true
---

# Before we start

-   Questions regarding Lab Assignment or the class?

<br><center>
<a href="https://artur-baranov.github.io/nu-ps312-ds/scripts/ps312-d_05.qmd" class="btn btn-primary" role="button" download="ps312-d_05.qmd" style="width:200px" target="_blank">Download script</a>
</center>
<br>
<center>
<a href="https://artur-baranov.github.io/nu-ps312-ds/data/appelloyle_jprdata.dta" class="btn btn-primary" role="button" download="appelloyle_jprdata.dta" style="width:200px" target="_blank">Download data</a>
</center><br>


# Agenda

-   Replicating a study

-   Running the regression

-   Thinking about controls

-   Starting diagnostics

# Set up

Today we are working with data from a published academic paper.[^1] Let's replicate their study and think about the underlying causal relation. 

[^1]: Appel, B.J. and Loyle, C.E., 2012. The economic benefits of justice: Post-conflict justice and foreign direct investment. *Journal of Peace Research*, 49(5), pp.685-699.

::: {.callout-note icon="false"}
## Qustion

Here is the summary of the paper. 

Appel and Loyle argue that post-conflict justice institutions act as credible signals of future stability, reducing investor uncertainty and thereby increasing foreign direct investment (FDI) in post-conflict states. They show empirically that post-conflict countries implementing restorative justice mechanisms receive significantly more FDI compared to those that do not, even after accounting for economic, political, and conflict-related factors.

Extract the following information:

-   **Dependent variable (outcome)**:

-   **Independent variable (explanatory variable)**:

-   **Mechanism (why IV has influence on DV)**: 

-   **Control variables (potential confounders)**: 

:::

First, load the library

```{r}
#| warning: false

library(dagitty)
```

Let's fill in the directed acyclic graph (DAG) below.

```{r}
#| eval: false

dag_appel_loyle = dagitty('dag {
                          IV -> DV
                          control -> IV
                          control -> DV
                          ... -> ...
                          }')
plot(dag_appel_loyle)
```

# Simple Linear Regression

Let's load the data first. We need `foreign` library to load Stata `.dta` files.

```{r}
library(foreign)
appel_loyle_raw = read.dta("data/appelloyle_jprdata.dta")
```

Now, let's load the `tidyverse` library for data wrangling, and start exploring the dataset.

```{r}
#| message: false

library(tidyverse)

colnames(appel_loyle_raw)
```

Doesn't seem straightforward, right? To get the sense of what's going on you can go to the original replication data folder of the study ([here](https://www.prio.org/journals/jpr/replicationdata)). But, to simplify the task, this script will guide you.

Our dependent variable is `v3Mdiff` and independent is `truthvictim`. Let's rename that for the sake of clarity, to received foreign direct investments (fdi) and post-conflict justice (pcj) respectively.

```{r}
appel_loyle = appel_loyle_raw %>%
  mutate(fdi = v3Mdiff,
         pcj = truthvictim)
```

Conceptually, post-conflict justice is operationalized as presence of truth commissions and reparations programs after the conflict.

First, explore the foreign direct investment. What do you notice?

```{r}
#| message: false

ggplot(appel_loyle) +
  geom_histogram(aes(x = fdi)) +
  theme_bw() +
  labs(x = "Received Foreign Direct Investment",
       y = "Count")
```

Now, let's see the distribution of the independent variable. How to fix the graph? When the graph is fixed, add the following line of code `scale_x_discrete(labels = c("Absent", "Present"))` to the graph.

```{r}
#| message: false

ggplot(appel_loyle) +
  geom_histogram(aes(x = pcj)) +
  theme_light() +
  labs(x = "Post-Conflict Institutions",
       y = "Count")
```

Now, compare the distributions. 

```{r}
#| message: false

ggplot(appel_loyle) +
  geom_boxplot(aes(x = as.factor(pcj), y = fdi)) +
  labs(x = "Post-Conflict Institutions",
       y = "Received Foreign Direct Investment") +
  scale_x_discrete(labels = c("Absent", "Present"))
```

Finally, let's set up a simple linear regression. What does it mean?

```{r}
slr_model = lm(fdi ~ pcj, appel_loyle)
summary(slr_model)
```

# Multiple Linear Regression

## Economic Controls

Now, let's think through control that authors use. First, focus on several economic control variables:

-   Economic development (`fv8`) measured as GDP per capita

-   Economic size (`fv10`) measured as GDP

-   Female Life expectancy (`v64mean`)

Thus, this is how the authors model the real world. Does it raise any problems?

```{r}
econ_appel_loyle = dagitty('dag {
                          PCJ -> FDI
                          EconomicDvelopment -> PCJ
                          EconomicDvelopment -> FDI
                          EconomicSize -> FDI
                          EconomicSize -> PCJ
                          FLifeexpectancy -> FDI
                          FLifeexpectancy -> PCJ
                          }')
plot(econ_appel_loyle)
```

Again, let's rename the variables for the sake of clarity

```{r}
appel_loyle = appel_loyle %>%
  rename(econ_development = fv8,
         econ_size = fv10,
         flife_exp = v64mean)
```

Now, let's set up the multiple linear model! Do the results hold when we add controls?

```{r}
econ_model = lm(fdi ~ pcj + econ_development + econ_size + flife_exp, appel_loyle)
summary(econ_model)
```

## Political Controls

Now, let's add political institutions. 

-   Domestic political constraints (`fv27`)

-   Political regime (`polity2`)

::: {.callout-tip icon="false"}
## Coding Task

First, add them to the DAG

```{r}
#| eval: false

dagitty('dag {
              PCJ -> FDI
              EconomicDvelopment -> PCJ
              EconomicDvelopment -> FDI
              EconomicSize -> FDI
              EconomicSize -> PCJ
              FLifeexpectancy -> FDI
              FLifeexpectancy -> PCJ
              ...
              ...
                          }') %>%
plot()
```

Now, rename the variables, `polity2` to `regime` and `fv27` to `constraints`.

```{r}
#| eval: false

appel_loyle = appel_loyle ...
  ...(regime = ...,
      ... = fv27)
```

Extend the previous model by adding political institutions. Present `summary()`. Is the main independent variable statistically significant?

```{r}
#| eval: false

pol_model = lm(..., appel_loyle)
...
```

**Additional task**

Install and load library `texreg`, then run the `screenreg()` function on your model. In your free time feel free to explore `htmlreg()` for HTML and `texreg()` for PDF

:::

# Model Diagnostics

Let's take their full model, and run a quick diagnostic. The code below uses non-amended dataset, thus the names of the variables are not changed. Feel free to play around this after the section. 

Most importantly, the PCJ is statistically significant.

```{r}
full_model = lm(v3Mdiff ~ truthvictim + fv8 + fv10 + fv11 + fv34 + fv27 + victory_lag + cw_duration_lag + damage + peace_agreement_lag + coldwar + polity2 + xratf + labor + v64mean, appel_loyle_raw)

summary(full_model)
```

Let's load a `ggfortify` package for the diagnostics

```{r}
#| warning: false

library(ggfortify)
```

We'll get back to diagnostics in the upcoming weeks: but as a beginning, and most importantly, these patterns highlight possible problems with the model. Quite likely there are some outliers that might drive the result. If you can wait to get into a detail in these plots, feel free to use this [tutorial](https://library.virginia.edu/data/articles/diagnostic-plots). 

-   In your free time, try applying a log transformation to the dependent variable in the `full_model` and compare the diagnostic plots

```{r}
autoplot(full_model)
```

# Some Tips

-   When thinking about the statistical relationship, pay attention to the mechanism and causal pathway: how and why IV influences DV?

-   Visualize the distributions! Highlight potential problems in the preliminary stage of data analysis

# Check List

<input type="checkbox"/> I know how to explore variables, and plot their relationships

<input type="checkbox"/> I am not afraid of multiple linear regression, and can easily interpret the coefficients and statistical significance

<input type="checkbox"/> I am able to link my theoretical framework to empirically operationalized variables

<input type="checkbox"/> I am gradually developing intuition behind model diagnostics