library(dagitty)
OLS Regression
Week 5
Before we start
- Questions regarding Lab Assignment or the class?
Agenda
Replicating a study
Running the regression
Thinking about controls
Starting diagnostics
Set up
Today we are working with data from a published academic paper.1 Let’s replicate their study and think about the underlying causal relation.
First, load the library
Let’s fill in the directed acyclic graph (DAG) below.
= dagitty('dag {
dag_appel_loyle IV -> DV
control -> IV
control -> DV
... -> ...
}')
plot(dag_appel_loyle)
Simple Linear Regression
Let’s load the data first. We need foreign
library to load Stata .dta
files.
library(foreign)
= read.dta("data/appelloyle_jprdata.dta") appel_loyle_raw
Now, let’s load the tidyverse
library for data wrangling, and start exploring the dataset.
library(tidyverse)
colnames(appel_loyle_raw)
[1] "testnewid_lag" "ccode" "id"
[4] "damage" "victory_lag" "peace_agreement_lag"
[7] "issue_territory_lag" "cw_duration_lag" "fv8"
[10] "fv10" "fv11" "fv27"
[13] "fv34" "v2diff" "v3Mdiff"
[16] "coldwar" "xratf" "labor"
[19] "physint" "worker" "polity2"
[22] "v60mean" "v63mean" "v64mean"
[25] "truthvictim"
Doesn’t seem straightforward, right? To get the sense of what’s going on you can go to the original replication data folder of the study (here). But, to simplify the task, this script will guide you.
Our dependent variable is v3Mdiff
and independent is truthvictim
. Let’s rename that for the sake of clarity, to received foreign direct investments (fdi) and post-conflict justice (pcj) respectively.
= appel_loyle_raw %>%
appel_loyle mutate(fdi = v3Mdiff,
pcj = truthvictim)
Conceptually, post-conflict justice is operationalized as presence of truth commissions and reparations programs after the conflict.
First, explore the foreign direct investment. What do you notice?
ggplot(appel_loyle) +
geom_histogram(aes(x = fdi)) +
theme_bw() +
labs(x = "Received Foreign Direct Investment",
y = "Count")
Now, let’s see the distribution of the independent variable. How to fix the graph? When the graph is fixed, add the following line of code scale_x_discrete(labels = c("Absent", "Present"))
to the graph.
ggplot(appel_loyle) +
geom_histogram(aes(x = pcj)) +
theme_light() +
labs(x = "Post-Conflict Institutions",
y = "Count")
Now, compare the distributions.
ggplot(appel_loyle) +
geom_boxplot(aes(x = as.factor(pcj), y = fdi)) +
labs(x = "Post-Conflict Institutions",
y = "Received Foreign Direct Investment") +
scale_x_discrete(labels = c("Absent", "Present"))
Finally, let’s set up a simple linear regression. What does it mean?
= lm(fdi ~ pcj, appel_loyle)
slr_model summary(slr_model)
Call:
lm(formula = fdi ~ pcj, data = appel_loyle)
Residuals:
Min 1Q Median 3Q Max
-4047.4 -459.1 -418.7 -136.9 22648.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 425.0 323.9 1.312 0.1927
pcj 1763.5 744.0 2.370 0.0198 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2842 on 93 degrees of freedom
Multiple R-squared: 0.05696, Adjusted R-squared: 0.04682
F-statistic: 5.618 on 1 and 93 DF, p-value: 0.01985
Multiple Linear Regression
Economic Controls
Now, let’s think through control that authors use. First, focus on several economic control variables:
Economic development (
fv8
) measured as GDP per capitaEconomic size (
fv10
) measured as GDPFemale Life expectancy (
v64mean
)
Thus, this is how the authors model the real world. Does it raise any problems?
= dagitty('dag {
econ_appel_loyle PCJ -> FDI
EconomicDvelopment -> PCJ
EconomicDvelopment -> FDI
EconomicSize -> FDI
EconomicSize -> PCJ
FLifeexpectancy -> FDI
FLifeexpectancy -> PCJ
}')
plot(econ_appel_loyle)
Plot coordinates for graph not supplied! Generating coordinates, see ?coordinates for how to set your own.
Again, let’s rename the variables for the sake of clarity
= appel_loyle %>%
appel_loyle rename(econ_development = fv8,
econ_size = fv10,
flife_exp = v64mean)
Now, let’s set up the multiple linear model! Do the results hold when we add controls?
= lm(fdi ~ pcj + econ_development + econ_size + flife_exp, appel_loyle)
econ_model summary(econ_model)
Call:
lm(formula = fdi ~ pcj + econ_development + econ_size + flife_exp,
data = appel_loyle)
Residuals:
Min 1Q Median 3Q Max
-7348.9 -579.7 151.2 304.2 16072.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.637e+02 1.711e+03 -0.446 0.6563
pcj 1.544e+03 6.440e+02 2.397 0.0186 *
econ_development 2.753e-03 1.295e-01 0.021 0.9831
econ_size 7.561e-09 1.533e-09 4.933 3.68e-06 ***
flife_exp 9.793e+00 3.118e+01 0.314 0.7542
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2438 on 90 degrees of freedom
Multiple R-squared: 0.3283, Adjusted R-squared: 0.2985
F-statistic: 11 on 4 and 90 DF, p-value: 2.63e-07
Political Controls
Now, let’s add political institutions.
Domestic political constraints (
fv27
)Political regime (
polity2
)
Model Diagnostics
Let’s take their full model, and run a quick diagnostic. The code below uses non-amended dataset, thus the names of the variables are not changed. Feel free to play around this after the section.
Most importantly, the PCJ is statistically significant.
= lm(v3Mdiff ~ truthvictim + fv8 + fv10 + fv11 + fv34 + fv27 + victory_lag + cw_duration_lag + damage + peace_agreement_lag + coldwar + polity2 + xratf + labor + v64mean, appel_loyle_raw)
full_model
summary(full_model)
Call:
lm(formula = v3Mdiff ~ truthvictim + fv8 + fv10 + fv11 + fv34 +
fv27 + victory_lag + cw_duration_lag + damage + peace_agreement_lag +
coldwar + polity2 + xratf + labor + v64mean, data = appel_loyle_raw)
Residuals:
Min 1Q Median 3Q Max
-5308.4 -715.8 188.0 670.0 11747.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.278e+03 2.852e+03 -0.448 0.65525
truthvictim 1.960e+03 7.030e+02 2.788 0.00663 **
fv8 -1.110e-01 1.329e-01 -0.835 0.40625
fv10 1.095e-08 1.738e-09 6.301 1.56e-08 ***
fv11 3.740e+01 2.324e+01 1.609 0.11152
fv34 1.988e+02 2.016e+02 0.986 0.32701
fv27 2.558e+03 1.460e+03 1.753 0.08357 .
victory_lag -3.397e+01 6.507e+02 -0.052 0.95850
cw_duration_lag 8.111e-01 3.554e+01 0.023 0.98185
damage 2.838e+01 1.024e+01 2.771 0.00697 **
peace_agreement_lag -1.215e+03 7.938e+02 -1.531 0.12983
coldwar 8.153e+01 6.541e+02 0.125 0.90112
polity2 -9.017e+01 5.631e+01 -1.601 0.11330
xratf -4.252e+01 1.389e+01 -3.061 0.00301 **
labor 9.844e+00 2.553e+01 0.386 0.70083
v64mean 3.475e+00 3.299e+01 0.105 0.91638
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2213 on 79 degrees of freedom
Multiple R-squared: 0.5141, Adjusted R-squared: 0.4219
F-statistic: 5.573 on 15 and 79 DF, p-value: 1.472e-07
Let’s load a ggfortify
package for the diagnostics
library(ggfortify)
We’ll get back to diagnostics in the upcoming weeks: but as a beginning, and most importantly, these patterns highlight possible problems with the model. Quite likely there are some outliers that might drive the result. If you can wait to get into a detail in these plots, feel free to use this tutorial.
- In your free time, try applying a log transformation to the dependent variable in the
full_model
and compare the diagnostic plots
autoplot(full_model)
Some Tips
When thinking about the statistical relationship, pay attention to the mechanism and causal pathway: how and why IV influences DV?
Visualize the distributions! Highlight potential problems in the preliminary stage of data analysis
Check List
I know how to explore variables, and plot their relationships
I am not afraid of multiple linear regression, and can easily interpret the coefficients and statistical significance
I am able to link my theoretical framework to empirically operationalized variables
I am gradually developing intuition behind model diagnostics
Footnotes
Appel, B.J. and Loyle, C.E., 2012. The economic benefits of justice: Post-conflict justice and foreign direct investment. Journal of Peace Research, 49(5), pp.685-699.↩︎