Regression

Week 7

Published

October 31, 2025

Before we start

Any questions?

Review

If you have checked out the GitHub practice .qmd, then this would be quite familiar. Though, with the generated data! Let’s compare Histograms and Boxplots.

First of all, load tidyverse and patchwork libraries (intall the latter, if haven’t done so yet!).

library(tidyverse)
library(patchwork)

Let’s generate the data. Say, coming from normal distribution. What is rnorm? Generate 1000 observaitons with \(\mu = 0.5\) and \(\sigma = 0.15\).

set.seed(123)
data = data.frame(loyalty = rnorm(...))

Plot histogram first, then add a geom_density() layer. Customize the plot. Add limits for X axis on the graph from 0 to 1 using xlim(), change theme to theme_minimal(). Make sure to indicate median.

gr_histogram = data %>%
  ggplot() +
    ...(aes(x = loyalty, y = ..density..), fill = "lightgray", color = "gray") +
    ...(aes(x = loyalty)) +
    ...
    ...
    labs(x = "Loyalty",
         y = "Density") +
   geom_vline(xintercept = median(data$loyalty), size = 1)

gr_histogram

Now, repeat the same process with boxplot. Make loyalty appear on the Y axis. Add coord_flip().

gr_boxplot = data %>%
  ggplot(aes(x = "", y = loyalty)) +
    geom_boxplot(fill = "lightgray") +
    ...
    theme_minimal() +
    labs(x = "", y = "Loyalty") +
    ylim(0,1)

gr_boxplot

Finally, compare them together. We can use patchwork for this. Compare the graphs.

gr_histogram / gr_boxplot

Agenda

Practice joins (again!)
Build OLS Regression Models (with extensions!)
Presenting Regression Results

Exploring the Data

We are working with V-Dem and SIPRI Arms Transfers Database. It contains information on all transfers of major conventional arms. The variables are:

Recipient of arms
Year of the transfer
Import of arms
Regime a V-Dem variable for political regime

sipri = read.csv("data/sipri.csv")

Now, subset some variables from V-Dem. We are choosing the following variables:

country_name
year of the coded data
e_gdp GDP of a country
e_miinteco Armed conflict, international
e_miinterc Armed conflict, internal

load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))
vdem_variables = vdem %>%
  select(country_name, year, e_gdp, e_miinteco, e_miinterc)

Note the syntax below. We are joining two dataframes by two variables: Recipient and Year, but in the V-Dem data those have different name or spelling.

sipri_vdem = sipri %>%
  left_join(vdem_variables, by = c("Recipient" = "country_name", 
                                   "Year" = "year"))

head(sipri_vdem)

  Recipient Year Import                  Regime    e_gdp e_miinteco e_miinterc
1     India 1950    141              Autocratic 42170.12          0          0
2     India 1951    277 Electoral Authoritarian 42010.81          0          0
3     India 1952    104    Minimally Democratic 43033.76          0          0
4     India 1953    430    Minimally Democratic 44503.00          0          0
5     India 1954    265    Minimally Democratic 45960.55          0          0
6     India 1955    350    Minimally Democratic 47305.30          1          1

For our convenience, rename the variables in the newly created dataframe.

sipri_vdem = sipri_vdem %>%
  rename(GDP = e_gdp,
         International_conflict = e_miinteco,
         Internal_conflict = e_miinterc)

head(sipri_vdem)

  Recipient Year Import                  Regime      GDP International_conflict
1     India 1950    141              Autocratic 42170.12                      0
2     India 1951    277 Electoral Authoritarian 42010.81                      0
3     India 1952    104    Minimally Democratic 43033.76                      0
4     India 1953    430    Minimally Democratic 44503.00                      0
5     India 1954    265    Minimally Democratic 45960.55                      0
6     India 1955    350    Minimally Democratic 47305.30                      1
  Internal_conflict
1                 0
2                 0
3                 0
4                 0
5                 0
6                 1

Explore the GDP variable. What can you notice? Take the log() of GDP directly in the plot. Did it get better?

ggplot(sipri_vdem) +
  geom_boxplot(aes(y = GDP))

Warning: Removed 436 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Therefore, let’s create a new variable Log_GDP. This is a very typical operation. We call it transformations.

sipri_vdem = sipri_vdem %>%
  mutate(Log_GDP = log(GDP))

To explore multiple variables at once, it is useful to plot them in pairs plot. There’s a library GGally which is based on ggplot2, and it’s quite straightforward. Be careful with the wrong class identification! Can you notice anything?

library(GGally)

sipri_vdem %>%
  ggpairs(columns = c("Import", "Regime", "Log_GDP", "International_conflict", "Internal_conflict"))

Don’t forget to fix the classes of the variables!

sipri_vdem$International_conflict = as.factor(sipri_vdem$International_conflict)
sipri_vdem$Regime = as.factor(sipri_vdem$Regime)

Model Building

Let’s set up a basic model. We are interested in explaining the Import of arms. Is it related to the economic capacity of the state? We can use proxy Log_GDP. We can interpret and plot it quite easily, right? But does the intercept make sense?

model_basic = lm(Import ~ Log_GDP, sipri_vdem)
summary(model_basic)


Call:
lm(formula = Import ~ Log_GDP, data = sipri_vdem)

Residuals:
   Min     1Q Median     3Q    Max 
-676.3 -222.9  -87.5   66.0 5291.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -718.034     29.032  -24.73   <2e-16 ***
Log_GDP      108.547      3.168   34.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 457.9 on 6439 degrees of freedom
  (436 observations deleted due to missingness)
Multiple R-squared:  0.1542,    Adjusted R-squared:  0.1541 
F-statistic:  1174 on 1 and 6439 DF,  p-value: < 2.2e-16

However, simple models are rare to see in academic articles. Let’s run a Multiple Linear Regression with fixed effects. Add International_conflict and Regime to the model. You can interpret these coefficients easily, too.

model_fe = lm(Import ~ Log_GDP + International_conflict + Regime, sipri_vdem)
summary(model_fe)


Call:
lm(formula = Import ~ Log_GDP + International_conflict + Regime, 
    data = sipri_vdem)

Residuals:
   Min     1Q Median     3Q    Max 
-991.3 -217.9  -82.6   81.4 5169.6 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -783.055     36.852 -21.248  < 2e-16 ***
Log_GDP                        126.364      4.468  28.281  < 2e-16 ***
International_conflict1        206.923     26.470   7.817 6.72e-15 ***
RegimeDemocratic               -95.328     17.191  -5.545 3.11e-08 ***
RegimeElectoral Authoritarian -151.434     23.328  -6.491 9.45e-11 ***
RegimeMinimally Democratic    -216.183     50.820  -4.254 2.15e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 454.6 on 4347 degrees of freedom
  (2524 observations deleted due to missingness)
Multiple R-squared:  0.1995,    Adjusted R-squared:  0.1986 
F-statistic: 216.6 on 5 and 4347 DF,  p-value: < 2.2e-16

To convey a better idea what’s going on, you can plot these three independent variables. We can use marginaleffects library for this.

library(marginaleffects)

model_fe %>%
  plot_predictions(condition  = c("Log_GDP", "International_conflict", "Regime"))

Finally, let’s set up another model. This time with an interaction effect. Again, to get a better idea what’s going on, you can plot the three independent variables against the dependent variable. But the limitation is it’s hard to include more variables into the graph.

However, good news is that it is still a ggplot object! So you can customize it in a same manner.

model_int = lm(Import ~ Log_GDP * International_conflict + Regime, sipri_vdem)

model_int %>%
    plot_predictions(condition  = c("Log_GDP", "International_conflict", "Regime")) +
    labs(x = "Log GDP",
         y = "Import",
         title = "Interaction Model",
         color = "International Conflict",
         fill = "International Conflict")

Presentation of Results with Tables

The most common way to present the resutls of the regression is in tables. We have practiced it over the last two quarters. By now, the expectation is you know how to present models in a “publishable” way. For details and examples see details.

Don’t forget to:

Include p-values and \(\alpha\) (critical values for stars display)!
Or include confidence intervals for \(\beta\) (statisitc = "conf.int")
If you have a categorical variable, mention the reference category
\(R^2\) for simple linear regression and \(R^2_a\) for multiple linear regression
Include number of observations
Be concise (unless it’s made for appendix)! You can simply highlight you have fixed effects instead of including the coefficient.

Add the following argument: output = "table.html" if you want to save the table.

library(modelsummary)
publishable_table = modelsummary(list("Base model" = model_basic,
                    "Fixed Effects model" = model_fe,
                    "Interaction model" = model_int),
                   title = "Arms Import Models",  
                   stars = c("*" = 0.05), # or TRUE for default settings
                   gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
                   coef_rename = c("(Intercept)", 
                             "Log GDP", 
                             "International Conflict", 
                             "Regime: Democratic",
                             "Regime: Electoral Authoritarian",
                             "Regime: Minimally Democratic",
                             "Log GDP × International Conflict"),
                   notes = "Regime reference category: Autocratic",
                   statistic = c("p.value"))

publishable_table

Arms Import Models
	Base model	Fixed Effects model	Interaction model
* p < 0.05
Regime reference category: Autocratic
(Intercept)	-718.034*	-783.055*	-747.804*
	(<0.001)	(<0.001)	(<0.001)
Log GDP	108.547*	126.364*	122.263*
	(<0.001)	(<0.001)	(<0.001)
International Conflict		206.923*	-375.723*
		(<0.001)	(0.013)
Regime: Democratic		-95.328*	-95.161*
		(<0.001)	(<0.001)
Regime: Electoral Authoritarian		-151.434*	-151.842*
		(<0.001)	(<0.001)
Regime: Minimally Democratic		-216.183*	-218.460*
		(<0.001)	(<0.001)
Log GDP × International Conflict			63.002*
			(<0.001)
Num.Obs.	6441	4353	4353
R2	0.154	0.199	0.202
R2 Adj.	0.154	0.199	0.201

When you have fixed effects, it may be a good practice to eliminate fixed effects coefficients. For example,

modelsummary(list("Base model" = model_basic,
                    "Fixed Effects model" = model_fe,
                    "Interaction model" = model_int),
                   title = "Arms Import Models",  
                   stars = c("*" = 0.05), 
                   gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
                   coef_omit = "Regime", # remove the coefficient you do not need to show
                   coef_rename = c("(Intercept)", 
                             "Log GDP", 
                             "International Conflict", 
                             "Log GDP × International Conflict"),
                  add_rows = data.frame(term = "Fixed effects",
                                   base = "-",
                                   fe = "+",
                                   int = "+"))

Arms Import Models
	Base model	Fixed Effects model	Interaction model
* p < 0.05
(Intercept)	-718.034*	-783.055*	-747.804*
	(29.032)	(36.852)	(37.877)
Log GDP	108.547*	126.364*	122.263*
	(3.168)	(4.468)	(4.582)
International Conflict		206.923*	-375.723*
		(26.470)	(151.132)
Log GDP × International Conflict			63.002*
			(16.090)
Num.Obs.	6441	4353	4353
R2	0.154	0.199	0.202
R2 Adj.	0.154	0.199	0.201
Fixed effects	-	+	+

Check List

I am confident in my data merging skills
I know how to use GGally to analyze pairs of variables
I can plot complex models with marginaleffects library
I know how to use modelsummary() to present publishable tables
I have an intuition about predicted values and marginal effects
I have an intuition about how to transform variables

Optional Exercises

Let’s start with extracting the variables for the analysis. Select country_name, year, and Life expectancy (e_pelifeex) from vdem data.

Solution

vdem_extraction = vdem %>%
  ...

Now, using left_join(), combine two dataframes: sipri and vdem_extraction.

Solution

sipri_expectancy = sipri %>%
  ...(..., by = c("Recipient" = "country_name", 
                                     "Year" = ...))

In the newly created sipri_expectancy dataframe, rename the e_pelifeex to Life_expectancy

Solution

sipri_expectancy = ... %>%
  ...(Life_expectancy = ...)

Draw a ggpairs() graph (library is GGally). Can you discern any patterns (e.g., intersection of regime vs life expectancy or import vs life expectancy)?

Solution

YOUR SOLUTION HERE

Set up a fixed effects model. Regress Import against Life_expectancy and Regime

Solution

life_expectancy_model_fe = lm(...)

Present a summary table

Solution

YOUR SOLUTION HERE

Present a modelsummary() table. Make it as publishable as possible. Don’t forget to include the reference category for the Regime variable.

Solution

YOUR SOLUTION HERE