OLS Regression

Week 5

Published

May 1, 2025

Before we start

Questions regarding Lab Assignment or the class?

Download script
Download data

Agenda

Replicating a study
Running the regression
Thinking about controls
Starting diagnostics

Set up

Today we are working with data from a published academic paper.¹ Let’s replicate their study and think about the underlying causal relation.

Qustion

Here is the summary of the paper.

Appel and Loyle argue that post-conflict justice institutions act as credible signals of future stability, reducing investor uncertainty and thereby increasing foreign direct investment (FDI) in post-conflict states. They show empirically that post-conflict countries implementing restorative justice mechanisms receive significantly more FDI compared to those that do not, even after accounting for economic, political, and conflict-related factors.

Extract the following information:

Dependent variable (outcome):
Independent variable (explanatory variable):
Mechanism (why IV has influence on DV):
Control variables (potential confounders):

First, load the library

library(dagitty)

Let’s fill in the directed acyclic graph (DAG) below.

dag_appel_loyle = dagitty('dag {
                          IV -> DV
                          control -> IV
                          control -> DV
                          ... -> ...
                          }')
plot(dag_appel_loyle)

Simple Linear Regression

Let’s load the data first. We need foreign library to load Stata .dta files.

library(foreign)
appel_loyle_raw = read.dta("data/appelloyle_jprdata.dta")

Now, let’s load the tidyverse library for data wrangling, and start exploring the dataset.

library(tidyverse)

colnames(appel_loyle_raw)

 [1] "testnewid_lag"       "ccode"               "id"                 
 [4] "damage"              "victory_lag"         "peace_agreement_lag"
 [7] "issue_territory_lag" "cw_duration_lag"     "fv8"                
[10] "fv10"                "fv11"                "fv27"               
[13] "fv34"                "v2diff"              "v3Mdiff"            
[16] "coldwar"             "xratf"               "labor"              
[19] "physint"             "worker"              "polity2"            
[22] "v60mean"             "v63mean"             "v64mean"            
[25] "truthvictim"

Doesn’t seem straightforward, right? To get the sense of what’s going on you can go to the original replication data folder of the study (here). But, to simplify the task, this script will guide you.

Our dependent variable is v3Mdiff and independent is truthvictim. Let’s rename that for the sake of clarity, to received foreign direct investments (fdi) and post-conflict justice (pcj) respectively.

appel_loyle = appel_loyle_raw %>%
  mutate(fdi = v3Mdiff,
         pcj = truthvictim)

Conceptually, post-conflict justice is operationalized as presence of truth commissions and reparations programs after the conflict.

First, explore the foreign direct investment. What do you notice?

ggplot(appel_loyle) +
  geom_histogram(aes(x = fdi)) +
  theme_bw() +
  labs(x = "Received Foreign Direct Investment",
       y = "Count")

Now, let’s see the distribution of the independent variable. How to fix the graph? When the graph is fixed, add the following line of code scale_x_discrete(labels = c("Absent", "Present")) to the graph.

ggplot(appel_loyle) +
  geom_histogram(aes(x = pcj)) +
  theme_light() +
  labs(x = "Post-Conflict Institutions",
       y = "Count")

Now, compare the distributions.

ggplot(appel_loyle) +
  geom_boxplot(aes(x = as.factor(pcj), y = fdi)) +
  labs(x = "Post-Conflict Institutions",
       y = "Received Foreign Direct Investment") +
  scale_x_discrete(labels = c("Absent", "Present"))

Finally, let’s set up a simple linear regression. What does it mean?

slr_model = lm(fdi ~ pcj, appel_loyle)
summary(slr_model)


Call:
lm(formula = fdi ~ pcj, data = appel_loyle)

Residuals:
    Min      1Q  Median      3Q     Max 
-4047.4  -459.1  -418.7  -136.9 22648.3 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)    425.0      323.9   1.312   0.1927  
pcj           1763.5      744.0   2.370   0.0198 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2842 on 93 degrees of freedom
Multiple R-squared:  0.05696,   Adjusted R-squared:  0.04682 
F-statistic: 5.618 on 1 and 93 DF,  p-value: 0.01985

Multiple Linear Regression

Economic Controls

Now, let’s think through control that authors use. First, focus on several economic control variables:

Economic development (fv8) measured as GDP per capita
Economic size (fv10) measured as GDP
Female Life expectancy (v64mean)

Thus, this is how the authors model the real world. Does it raise any problems?

econ_appel_loyle = dagitty('dag {
                          PCJ -> FDI
                          EconomicDvelopment -> PCJ
                          EconomicDvelopment -> FDI
                          EconomicSize -> FDI
                          EconomicSize -> PCJ
                          FLifeexpectancy -> FDI
                          FLifeexpectancy -> PCJ
                          }')
plot(econ_appel_loyle)

Plot coordinates for graph not supplied! Generating coordinates, see ?coordinates for how to set your own.

Again, let’s rename the variables for the sake of clarity

appel_loyle = appel_loyle %>%
  rename(econ_development = fv8,
         econ_size = fv10,
         flife_exp = v64mean)

Now, let’s set up the multiple linear model! Do the results hold when we add controls?

econ_model = lm(fdi ~ pcj + econ_development + econ_size + flife_exp, appel_loyle)
summary(econ_model)


Call:
lm(formula = fdi ~ pcj + econ_development + econ_size + flife_exp, 
    data = appel_loyle)

Residuals:
    Min      1Q  Median      3Q     Max 
-7348.9  -579.7   151.2   304.2 16072.0 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -7.637e+02  1.711e+03  -0.446   0.6563    
pcj               1.544e+03  6.440e+02   2.397   0.0186 *  
econ_development  2.753e-03  1.295e-01   0.021   0.9831    
econ_size         7.561e-09  1.533e-09   4.933 3.68e-06 ***
flife_exp         9.793e+00  3.118e+01   0.314   0.7542    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2438 on 90 degrees of freedom
Multiple R-squared:  0.3283,    Adjusted R-squared:  0.2985 
F-statistic:    11 on 4 and 90 DF,  p-value: 2.63e-07

Political Controls

Now, let’s add political institutions.

Domestic political constraints (fv27)
Political regime (polity2)

Coding Task

First, add them to the DAG

dagitty('dag {
              PCJ -> FDI
              EconomicDvelopment -> PCJ
              EconomicDvelopment -> FDI
              EconomicSize -> FDI
              EconomicSize -> PCJ
              FLifeexpectancy -> FDI
              FLifeexpectancy -> PCJ
              ...
              ...
                          }') %>%
plot()

Now, rename the variables, polity2 to regime and fv27 to constraints.

appel_loyle = appel_loyle ...
  ...(regime = ...,
      ... = fv27)

Extend the previous model by adding political institutions. Present summary(). Is the main independent variable statistically significant?

pol_model = lm(..., appel_loyle)
...

Additional task

Install and load library texreg, then run the screenreg() function on your model. In your free time feel free to explore htmlreg() for HTML and texreg() for PDF

Model Diagnostics

Let’s take their full model, and run a quick diagnostic. The code below uses non-amended dataset, thus the names of the variables are not changed. Feel free to play around this after the section.

Most importantly, the PCJ is statistically significant.

full_model = lm(v3Mdiff ~ truthvictim + fv8 + fv10 + fv11 + fv34 + fv27 + victory_lag + cw_duration_lag + damage + peace_agreement_lag + coldwar + polity2 + xratf + labor + v64mean, appel_loyle_raw)

summary(full_model)


Call:
lm(formula = v3Mdiff ~ truthvictim + fv8 + fv10 + fv11 + fv34 + 
    fv27 + victory_lag + cw_duration_lag + damage + peace_agreement_lag + 
    coldwar + polity2 + xratf + labor + v64mean, data = appel_loyle_raw)

Residuals:
    Min      1Q  Median      3Q     Max 
-5308.4  -715.8   188.0   670.0 11747.4 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -1.278e+03  2.852e+03  -0.448  0.65525    
truthvictim          1.960e+03  7.030e+02   2.788  0.00663 ** 
fv8                 -1.110e-01  1.329e-01  -0.835  0.40625    
fv10                 1.095e-08  1.738e-09   6.301 1.56e-08 ***
fv11                 3.740e+01  2.324e+01   1.609  0.11152    
fv34                 1.988e+02  2.016e+02   0.986  0.32701    
fv27                 2.558e+03  1.460e+03   1.753  0.08357 .  
victory_lag         -3.397e+01  6.507e+02  -0.052  0.95850    
cw_duration_lag      8.111e-01  3.554e+01   0.023  0.98185    
damage               2.838e+01  1.024e+01   2.771  0.00697 ** 
peace_agreement_lag -1.215e+03  7.938e+02  -1.531  0.12983    
coldwar              8.153e+01  6.541e+02   0.125  0.90112    
polity2             -9.017e+01  5.631e+01  -1.601  0.11330    
xratf               -4.252e+01  1.389e+01  -3.061  0.00301 ** 
labor                9.844e+00  2.553e+01   0.386  0.70083    
v64mean              3.475e+00  3.299e+01   0.105  0.91638    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2213 on 79 degrees of freedom
Multiple R-squared:  0.5141,    Adjusted R-squared:  0.4219 
F-statistic: 5.573 on 15 and 79 DF,  p-value: 1.472e-07

Let’s load a ggfortify package for the diagnostics

library(ggfortify)

We’ll get back to diagnostics in the upcoming weeks: but as a beginning, and most importantly, these patterns highlight possible problems with the model. Quite likely there are some outliers that might drive the result. If you can wait to get into a detail in these plots, feel free to use this tutorial.

In your free time, try applying a log transformation to the dependent variable in the full_model and compare the diagnostic plots

autoplot(full_model)

Some Tips

When thinking about the statistical relationship, pay attention to the mechanism and causal pathway: how and why IV influences DV?
Visualize the distributions! Highlight potential problems in the preliminary stage of data analysis

Check List

I know how to explore variables, and plot their relationships

I am not afraid of multiple linear regression, and can easily interpret the coefficients and statistical significance

I am able to link my theoretical framework to empirically operationalized variables

I am gradually developing intuition behind model diagnostics

Footnotes

Appel, B.J. and Loyle, C.E., 2012. The economic benefits of justice: Post-conflict justice and foreign direct investment. Journal of Peace Research, 49(5), pp.685-699.↩︎