library(tidyverse)
library(patchwork)Regression
Week 7
Before we start
- Any questions?
If you have checked out the GitHub practice .qmd, then this would be quite familiar. Though, with the generated data! Let’s compare Histograms and Boxplots.
First of all, load tidyverse and patchwork libraries (intall the latter, if haven’t done so yet!).
Let’s generate the data. Say, coming from normal distribution. What is rnorm? Generate 1000 observaitons with \(\mu = 0.5\) and \(\sigma = 0.15\).
set.seed(123)
data = data.frame(loyalty = rnorm(...))Plot histogram first, then add a geom_density() layer. Customize the plot. Add limits for X axis on the graph from 0 to 1 using xlim(), change theme to theme_minimal(). Make sure to indicate median.
gr_histogram = data %>%
ggplot() +
...(aes(x = loyalty, y = ..density..), fill = "lightgray", color = "gray") +
...(aes(x = loyalty)) +
...
...
labs(x = "Loyalty",
y = "Density") +
geom_vline(xintercept = median(data$loyalty), size = 1)
gr_histogramNow, repeat the same process with boxplot. Make loyalty appear on the Y axis. Add coord_flip().
gr_boxplot = data %>%
ggplot(aes(x = "", y = loyalty)) +
geom_boxplot(fill = "lightgray") +
...
theme_minimal() +
labs(x = "", y = "Loyalty") +
ylim(0,1)
gr_boxplotFinally, compare them together. We can use patchwork for this. Compare the graphs.
gr_histogram / gr_boxplotAgenda
Practice joins (again!)
Build OLS Regression Models (with extensions!)
Presenting Regression Results
Exploring the Data
We are working with V-Dem and SIPRI Arms Transfers Database. It contains information on all transfers of major conventional arms. The variables are:
Recipientof armsYearof the transferImportof armsRegimea V-Dem variable for political regime
sipri = read.csv("data/sipri.csv")Now, subset some variables from V-Dem. We are choosing the following variables:
country_nameyearof the coded datae_gdpGDP of a countrye_miintecoArmed conflict, internationale_miintercArmed conflict, internal
load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))
vdem_variables = vdem %>%
select(country_name, year, e_gdp, e_miinteco, e_miinterc)Note the syntax below. We are joining two dataframes by two variables: Recipient and Year, but in the V-Dem data those have different name or spelling.
sipri_vdem = sipri %>%
left_join(vdem_variables, by = c("Recipient" = "country_name",
"Year" = "year"))
head(sipri_vdem) Recipient Year Import Regime e_gdp e_miinteco e_miinterc
1 India 1950 141 Autocratic 42170.12 0 0
2 India 1951 277 Electoral Authoritarian 42010.81 0 0
3 India 1952 104 Minimally Democratic 43033.76 0 0
4 India 1953 430 Minimally Democratic 44503.00 0 0
5 India 1954 265 Minimally Democratic 45960.55 0 0
6 India 1955 350 Minimally Democratic 47305.30 1 1
For our convenience, rename the variables in the newly created dataframe.
sipri_vdem = sipri_vdem %>%
rename(GDP = e_gdp,
International_conflict = e_miinteco,
Internal_conflict = e_miinterc)
head(sipri_vdem) Recipient Year Import Regime GDP International_conflict
1 India 1950 141 Autocratic 42170.12 0
2 India 1951 277 Electoral Authoritarian 42010.81 0
3 India 1952 104 Minimally Democratic 43033.76 0
4 India 1953 430 Minimally Democratic 44503.00 0
5 India 1954 265 Minimally Democratic 45960.55 0
6 India 1955 350 Minimally Democratic 47305.30 1
Internal_conflict
1 0
2 0
3 0
4 0
5 0
6 1
Explore the GDP variable. What can you notice? Take the log() of GDP directly in the plot. Did it get better?
ggplot(sipri_vdem) +
geom_boxplot(aes(y = GDP))Warning: Removed 436 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Therefore, let’s create a new variable Log_GDP. This is a very typical operation. We call it transformations.
sipri_vdem = sipri_vdem %>%
mutate(Log_GDP = log(GDP))To explore multiple variables at once, it is useful to plot them in pairs plot. There’s a library GGally which is based on ggplot2, and it’s quite straightforward. Be careful with the wrong class identification! Can you notice anything?
library(GGally)
sipri_vdem %>%
ggpairs(columns = c("Import", "Regime", "Log_GDP", "International_conflict", "Internal_conflict"))Don’t forget to fix the classes of the variables!
sipri_vdem$International_conflict = as.factor(sipri_vdem$International_conflict)
sipri_vdem$Regime = as.factor(sipri_vdem$Regime)Model Building
Let’s set up a basic model. We are interested in explaining the Import of arms. Is it related to the economic capacity of the state? We can use proxy Log_GDP. We can interpret and plot it quite easily, right? But does the intercept make sense?
model_basic = lm(Import ~ Log_GDP, sipri_vdem)
summary(model_basic)
Call:
lm(formula = Import ~ Log_GDP, data = sipri_vdem)
Residuals:
Min 1Q Median 3Q Max
-676.3 -222.9 -87.5 66.0 5291.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -718.034 29.032 -24.73 <2e-16 ***
Log_GDP 108.547 3.168 34.26 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 457.9 on 6439 degrees of freedom
(436 observations deleted due to missingness)
Multiple R-squared: 0.1542, Adjusted R-squared: 0.1541
F-statistic: 1174 on 1 and 6439 DF, p-value: < 2.2e-16
However, simple models are rare to see in academic articles. Let’s run a Multiple Linear Regression with fixed effects. Add International_conflict and Regime to the model. You can interpret these coefficients easily, too.
model_fe = lm(Import ~ Log_GDP + International_conflict + Regime, sipri_vdem)
summary(model_fe)
Call:
lm(formula = Import ~ Log_GDP + International_conflict + Regime,
data = sipri_vdem)
Residuals:
Min 1Q Median 3Q Max
-991.3 -217.9 -82.6 81.4 5169.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -783.055 36.852 -21.248 < 2e-16 ***
Log_GDP 126.364 4.468 28.281 < 2e-16 ***
International_conflict1 206.923 26.470 7.817 6.72e-15 ***
RegimeDemocratic -95.328 17.191 -5.545 3.11e-08 ***
RegimeElectoral Authoritarian -151.434 23.328 -6.491 9.45e-11 ***
RegimeMinimally Democratic -216.183 50.820 -4.254 2.15e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 454.6 on 4347 degrees of freedom
(2524 observations deleted due to missingness)
Multiple R-squared: 0.1995, Adjusted R-squared: 0.1986
F-statistic: 216.6 on 5 and 4347 DF, p-value: < 2.2e-16
To convey a better idea what’s going on, you can plot these three independent variables. We can use marginaleffects library for this.
library(marginaleffects)
model_fe %>%
plot_predictions(condition = c("Log_GDP", "International_conflict", "Regime"))Finally, let’s set up another model. This time with an interaction effect. Again, to get a better idea what’s going on, you can plot the three independent variables against the dependent variable. But the limitation is it’s hard to include more variables into the graph.
However, good news is that it is still a ggplot object! So you can customize it in a same manner.
model_int = lm(Import ~ Log_GDP * International_conflict + Regime, sipri_vdem)
model_int %>%
plot_predictions(condition = c("Log_GDP", "International_conflict", "Regime")) +
labs(x = "Log GDP",
y = "Import",
title = "Interaction Model",
color = "International Conflict",
fill = "International Conflict")Presentation of Results with Tables
The most common way to present the resutls of the regression is in tables. We have practiced it over the last two quarters. By now, the expectation is you know how to present models in a “publishable” way. For details and examples see details.
Don’t forget to:
Include p-values and \(\alpha\) (critical values for stars display)!
Or include confidence intervals for \(\beta\) (
statisitc = "conf.int")If you have a categorical variable, mention the reference category
\(R^2\) for simple linear regression and \(R^2_a\) for multiple linear regression
Include number of observations
Be concise (unless it’s made for appendix)! You can simply highlight you have fixed effects instead of including the coefficient.
Add the following argument: output = "table.html" if you want to save the table.
library(modelsummary)
publishable_table = modelsummary(list("Base model" = model_basic,
"Fixed Effects model" = model_fe,
"Interaction model" = model_int),
title = "Arms Import Models",
stars = c("*" = 0.05), # or TRUE for default settings
gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
coef_rename = c("(Intercept)",
"Log GDP",
"International Conflict",
"Regime: Democratic",
"Regime: Electoral Authoritarian",
"Regime: Minimally Democratic",
"Log GDP × International Conflict"),
notes = "Regime reference category: Autocratic",
statistic = c("p.value"))
publishable_table| Base model | Fixed Effects model | Interaction model | |
|---|---|---|---|
| * p < 0.05 | |||
| Regime reference category: Autocratic | |||
| (Intercept) | -718.034* | -783.055* | -747.804* |
| (<0.001) | (<0.001) | (<0.001) | |
| Log GDP | 108.547* | 126.364* | 122.263* |
| (<0.001) | (<0.001) | (<0.001) | |
| International Conflict | 206.923* | -375.723* | |
| (<0.001) | (0.013) | ||
| Regime: Democratic | -95.328* | -95.161* | |
| (<0.001) | (<0.001) | ||
| Regime: Electoral Authoritarian | -151.434* | -151.842* | |
| (<0.001) | (<0.001) | ||
| Regime: Minimally Democratic | -216.183* | -218.460* | |
| (<0.001) | (<0.001) | ||
| Log GDP × International Conflict | 63.002* | ||
| (<0.001) | |||
| Num.Obs. | 6441 | 4353 | 4353 |
| R2 | 0.154 | 0.199 | 0.202 |
| R2 Adj. | 0.154 | 0.199 | 0.201 |
When you have fixed effects, it may be a good practice to eliminate fixed effects coefficients. For example,
modelsummary(list("Base model" = model_basic,
"Fixed Effects model" = model_fe,
"Interaction model" = model_int),
title = "Arms Import Models",
stars = c("*" = 0.05),
gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
coef_omit = "Regime", # remove the coefficient you do not need to show
coef_rename = c("(Intercept)",
"Log GDP",
"International Conflict",
"Log GDP × International Conflict"),
add_rows = data.frame(term = "Fixed effects",
base = "-",
fe = "+",
int = "+"))| Base model | Fixed Effects model | Interaction model | |
|---|---|---|---|
| * p < 0.05 | |||
| (Intercept) | -718.034* | -783.055* | -747.804* |
| (29.032) | (36.852) | (37.877) | |
| Log GDP | 108.547* | 126.364* | 122.263* |
| (3.168) | (4.468) | (4.582) | |
| International Conflict | 206.923* | -375.723* | |
| (26.470) | (151.132) | ||
| Log GDP × International Conflict | 63.002* | ||
| (16.090) | |||
| Num.Obs. | 6441 | 4353 | 4353 |
| R2 | 0.154 | 0.199 | 0.202 |
| R2 Adj. | 0.154 | 0.199 | 0.201 |
| Fixed effects | - | + | + |
Check List
Optional Exercises
Let’s start with extracting the variables for the analysis. Select country_name, year, and Life expectancy (e_pelifeex) from vdem data.
Now, using left_join(), combine two dataframes: sipri and vdem_extraction.
In the newly created sipri_expectancy dataframe, rename the e_pelifeex to Life_expectancy
Draw a ggpairs() graph (library is GGally). Can you discern any patterns (e.g., intersection of regime vs life expectancy or import vs life expectancy)?
Set up a fixed effects model. Regress Import against Life_expectancy and Regime
Present a summary table
Present a modelsummary() table. Make it as publishable as possible. Don’t forget to include the reference category for the Regime variable.