--- title: "OLS Regression" subtitle: "Week 5" date: 2025-05-01 format: html: embed-resources: true toc: true --- # Before we start - Questions regarding Lab Assignment or the class?
Download script

Download data

# Agenda - Replicating a study - Running the regression - Thinking about controls - Starting diagnostics # Set up Today we are working with data from a published academic paper.[^1] Let's replicate their study and think about the underlying causal relation. [^1]: Appel, B.J. and Loyle, C.E., 2012. The economic benefits of justice: Post-conflict justice and foreign direct investment. *Journal of Peace Research*, 49(5), pp.685-699. ::: {.callout-note icon="false"} ## Qustion Here is the summary of the paper. Appel and Loyle argue that post-conflict justice institutions act as credible signals of future stability, reducing investor uncertainty and thereby increasing foreign direct investment (FDI) in post-conflict states. They show empirically that post-conflict countries implementing restorative justice mechanisms receive significantly more FDI compared to those that do not, even after accounting for economic, political, and conflict-related factors. Extract the following information: - **Dependent variable (outcome)**: - **Independent variable (explanatory variable)**: - **Mechanism (why IV has influence on DV)**: - **Control variables (potential confounders)**: ::: First, load the library ```{r} #| warning: false library(dagitty) ``` Let's fill in the directed acyclic graph (DAG) below. ```{r} #| eval: false dag_appel_loyle = dagitty('dag { IV -> DV control -> IV control -> DV ... -> ... }') plot(dag_appel_loyle) ``` # Simple Linear Regression Let's load the data first. We need `foreign` library to load Stata `.dta` files. ```{r} library(foreign) appel_loyle_raw = read.dta("data/appelloyle_jprdata.dta") ``` Now, let's load the `tidyverse` library for data wrangling, and start exploring the dataset. ```{r} #| message: false library(tidyverse) colnames(appel_loyle_raw) ``` Doesn't seem straightforward, right? To get the sense of what's going on you can go to the original replication data folder of the study ([here](https://www.prio.org/journals/jpr/replicationdata)). But, to simplify the task, this script will guide you. Our dependent variable is `v3Mdiff` and independent is `truthvictim`. Let's rename that for the sake of clarity, to received foreign direct investments (fdi) and post-conflict justice (pcj) respectively. ```{r} appel_loyle = appel_loyle_raw %>% mutate(fdi = v3Mdiff, pcj = truthvictim) ``` Conceptually, post-conflict justice is operationalized as presence of truth commissions and reparations programs after the conflict. First, explore the foreign direct investment. What do you notice? ```{r} #| message: false ggplot(appel_loyle) + geom_histogram(aes(x = fdi)) + theme_bw() + labs(x = "Received Foreign Direct Investment", y = "Count") ``` Now, let's see the distribution of the independent variable. How to fix the graph? When the graph is fixed, add the following line of code `scale_x_discrete(labels = c("Absent", "Present"))` to the graph. ```{r} #| message: false ggplot(appel_loyle) + geom_histogram(aes(x = pcj)) + theme_light() + labs(x = "Post-Conflict Institutions", y = "Count") ``` Now, compare the distributions. ```{r} #| message: false ggplot(appel_loyle) + geom_boxplot(aes(x = as.factor(pcj), y = fdi)) + labs(x = "Post-Conflict Institutions", y = "Received Foreign Direct Investment") + scale_x_discrete(labels = c("Absent", "Present")) ``` Finally, let's set up a simple linear regression. What does it mean? ```{r} slr_model = lm(fdi ~ pcj, appel_loyle) summary(slr_model) ``` # Multiple Linear Regression ## Economic Controls Now, let's think through control that authors use. First, focus on several economic control variables: - Economic development (`fv8`) measured as GDP per capita - Economic size (`fv10`) measured as GDP - Female Life expectancy (`v64mean`) Thus, this is how the authors model the real world. Does it raise any problems? ```{r} econ_appel_loyle = dagitty('dag { PCJ -> FDI EconomicDvelopment -> PCJ EconomicDvelopment -> FDI EconomicSize -> FDI EconomicSize -> PCJ FLifeexpectancy -> FDI FLifeexpectancy -> PCJ }') plot(econ_appel_loyle) ``` Again, let's rename the variables for the sake of clarity ```{r} appel_loyle = appel_loyle %>% rename(econ_development = fv8, econ_size = fv10, flife_exp = v64mean) ``` Now, let's set up the multiple linear model! Do the results hold when we add controls? ```{r} econ_model = lm(fdi ~ pcj + econ_development + econ_size + flife_exp, appel_loyle) summary(econ_model) ``` ## Political Controls Now, let's add political institutions. - Domestic political constraints (`fv27`) - Political regime (`polity2`) ::: {.callout-tip icon="false"} ## Coding Task First, add them to the DAG ```{r} #| eval: false dagitty('dag { PCJ -> FDI EconomicDvelopment -> PCJ EconomicDvelopment -> FDI EconomicSize -> FDI EconomicSize -> PCJ FLifeexpectancy -> FDI FLifeexpectancy -> PCJ ... ... }') %>% plot() ``` Now, rename the variables, `polity2` to `regime` and `fv27` to `constraints`. ```{r} #| eval: false appel_loyle = appel_loyle ... ...(regime = ..., ... = fv27) ``` Extend the previous model by adding political institutions. Present `summary()`. Is the main independent variable statistically significant? ```{r} #| eval: false pol_model = lm(..., appel_loyle) ... ``` **Additional task** Install and load library `texreg`, then run the `screenreg()` function on your model. In your free time feel free to explore `htmlreg()` for HTML and `texreg()` for PDF ::: # Model Diagnostics Let's take their full model, and run a quick diagnostic. The code below uses non-amended dataset, thus the names of the variables are not changed. Feel free to play around this after the section. Most importantly, the PCJ is statistically significant. ```{r} full_model = lm(v3Mdiff ~ truthvictim + fv8 + fv10 + fv11 + fv34 + fv27 + victory_lag + cw_duration_lag + damage + peace_agreement_lag + coldwar + polity2 + xratf + labor + v64mean, appel_loyle_raw) summary(full_model) ``` Let's load a `ggfortify` package for the diagnostics ```{r} #| warning: false library(ggfortify) ``` We'll get back to diagnostics in the upcoming weeks: but as a beginning, and most importantly, these patterns highlight possible problems with the model. Quite likely there are some outliers that might drive the result. If you can wait to get into a detail in these plots, feel free to use this [tutorial](https://library.virginia.edu/data/articles/diagnostic-plots). - In your free time, try applying a log transformation to the dependent variable in the `full_model` and compare the diagnostic plots ```{r} autoplot(full_model) ``` # Some Tips - When thinking about the statistical relationship, pay attention to the mechanism and causal pathway: how and why IV influences DV? - Visualize the distributions! Highlight potential problems in the preliminary stage of data analysis # Check List I know how to explore variables, and plot their relationships I am not afraid of multiple linear regression, and can easily interpret the coefficients and statistical significance I am able to link my theoretical framework to empirically operationalized variables I am gradually developing intuition behind model diagnostics