Presenting Results of Regression

Week 5

Published

February 6, 2025

Before we start

  • Any questions?
Download script

Download data

Review of the previous week

Review

Start with loading the tidyverse library and the Who Governs data.

library(tidyverse)

whogov = read.csv("data/WhoGov.csv")

Calculate confidence intervals of n_total for each system_category.

Visualize the calculated confidence intervals.

Set up a linear model. Regress n_total against system_category.

Using ggeffects library, plot the predicted values of the model.

Review of the homework

The ggeffects library is incredibly helpful! And quite simple. You can use it for plotting the categorical data. One thing to note, make sure it’s a factor()! Check it out below.

library(ggeffects)

int = lm(alertness ~ coffee * time, data = coffee_data)

ggpredict(int, terms = c("time", "coffee")) %>%
  plot(connect_lines = TRUE) 

Agenda

  • Discuss libraries in development. Some cutting-edge packages are not as straightforward to install!

  • Practice joins

  • Discuss how to present the regressions results and ease the interpretation of complex models

Devtools in R

Sometimes, it is helpful to utilize versions of packages that are under development. Those are impossible to install directly, but you can download them frob GitHub. To simplify this process, you need special package called devtools.

install.packages("devtools")

Now, install the vdemdata library. This way we’ll be able to load the most current V-Dem dataset directly to the R.

devtools::install_github("vdeminstitute/vdemdata")

Let’s test it. We see the dataset is here! But for the future, this is the way to install packages that are not released yet.

library(tidyverse)
library(vdemdata)
vdem %>%
  select(country_name, year, histname, v2x_polyarchy) %>%
  head()
  country_name year                 histname v2x_polyarchy
1       Mexico 1789 Viceroyalty of New Spain         0.028
2       Mexico 1790 Viceroyalty of New Spain         0.028
3       Mexico 1791 Viceroyalty of New Spain         0.028
4       Mexico 1792 Viceroyalty of New Spain         0.028
5       Mexico 1793 Viceroyalty of New Spain         0.028
6       Mexico 1794 Viceroyalty of New Spain         0.028

Merging Datasets and Exploring the Data

We are working with SIPRI Arms Transfers Database. It contains information on all transfers of major conventional arms. The variables are:

  • Recipient of arms

  • Year of the transfer

  • Import of arms

  • Regime a V-Dem variable for political regime

sipri = read.csv("data/sipri.csv")

Now, subset some variables from V-Dem. We are choosing the following variables:

  • country_name

  • year of the coded data

  • e_gdp GDP of a country

  • e_miinteco Armed conflict, international

  • e_miinterc Armed conflict, internal

vdem_variables = vdem %>%
  select(country_name, year, e_gdp, e_miinteco, e_miinterc)

Note the syntax below. We are joining two dataframes by two variables: Recipient and Year, but in the V-Dem data those have different name or spelling.

sipri_vdem = sipri %>%
  left_join(vdem_variables, by = c("Recipient" = "country_name", 
                                   "Year" = "year"))

head(sipri_vdem)
  Recipient Year Import                  Regime    e_gdp e_miinteco e_miinterc
1     India 1950    141              Autocratic 42170.12          0          0
2     India 1951    277 Electoral Authoritarian 42010.81          0          0
3     India 1952    104    Minimally Democratic 43033.76          0          0
4     India 1953    430    Minimally Democratic 44503.00          0          0
5     India 1954    265    Minimally Democratic 45960.55          0          0
6     India 1955    350    Minimally Democratic 47305.30          1          1

For our convenience, rename the variables in the newly created dataframe.

sipri_vdem = sipri_vdem %>%
  rename(GDP = e_gdp,
         International_conflict = e_miinteco,
         Internal_conflict = e_miinterc)

head(sipri_vdem)
  Recipient Year Import                  Regime      GDP International_conflict
1     India 1950    141              Autocratic 42170.12                      0
2     India 1951    277 Electoral Authoritarian 42010.81                      0
3     India 1952    104    Minimally Democratic 43033.76                      0
4     India 1953    430    Minimally Democratic 44503.00                      0
5     India 1954    265    Minimally Democratic 45960.55                      0
6     India 1955    350    Minimally Democratic 47305.30                      1
  Internal_conflict
1                 0
2                 0
3                 0
4                 0
5                 0
6                 1

Explore the GDP variable. Does it need a transformation?

ggplot(sipri_vdem) +
  geom_histogram(aes(x = GDP))

It might be the case. But to double-check our assumption, let’s draw a boxplot. Take the log() of GDP directly in the plot. Did it get better?

ggplot(sipri_vdem) +
  geom_boxplot(aes(y = GDP))
Warning: Removed 436 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Therefore, let’s create a new variable Log_GDP.

sipri_vdem = sipri_vdem %>%
  mutate(Log_GDP = log(GDP))

To explore multiple variables at once, it is useful to plot them in pairs plot. There’s a library GGally which is based on ggplot2, and it’s quite straightforward. Be careful with the wrong class identification! Can you notice anything?

library(GGally)

sipri_vdem %>%
  ggpairs(columns = c("Import", "Regime", "Log_GDP", "International_conflict", "Internal_conflict"))

Don’t forget to fix the classes of the variables!

sipri_vdem$International_conflict = as.factor(sipri_vdem$International_conflict)
sipri_vdem$Regime = as.factor(sipri_vdem$Regime)

Model Building

Let’s set up a basic model. We are interested in explaining the Import of arms. Is it related to the economic capacity of the state? We can use proxy Log_GDP. We can interpret and plot it quite easily, right? But does the intercept make sense?

model_basic = lm(Import ~ Log_GDP, sipri_vdem)
summary(model_basic)

Call:
lm(formula = Import ~ Log_GDP, data = sipri_vdem)

Residuals:
   Min     1Q Median     3Q    Max 
-676.3 -222.9  -87.5   66.0 5291.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -718.034     29.032  -24.73   <2e-16 ***
Log_GDP      108.547      3.168   34.26   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 457.9 on 6439 degrees of freedom
  (436 observations deleted due to missingness)
Multiple R-squared:  0.1542,    Adjusted R-squared:  0.1541 
F-statistic:  1174 on 1 and 6439 DF,  p-value: < 2.2e-16

However, simple models are rare to see in academic articles. Let’s run a Multiple Linear Regression with fixed effects. Add International_conflict and Regime to the model. You can interpret these coefficients easily, too.

model_fe = lm(Import ~ Log_GDP + International_conflict + Regime, sipri_vdem)
summary(model_fe)

Call:
lm(formula = Import ~ Log_GDP + International_conflict + Regime, 
    data = sipri_vdem)

Residuals:
   Min     1Q Median     3Q    Max 
-991.3 -217.9  -82.6   81.4 5169.6 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -783.055     36.852 -21.248  < 2e-16 ***
Log_GDP                        126.364      4.468  28.281  < 2e-16 ***
International_conflict1        206.923     26.470   7.817 6.72e-15 ***
RegimeDemocratic               -95.328     17.191  -5.545 3.11e-08 ***
RegimeElectoral Authoritarian -151.434     23.328  -6.491 9.45e-11 ***
RegimeMinimally Democratic    -216.183     50.820  -4.254 2.15e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 454.6 on 4347 degrees of freedom
  (2524 observations deleted due to missingness)
Multiple R-squared:  0.1995,    Adjusted R-squared:  0.1986 
F-statistic: 216.6 on 5 and 4347 DF,  p-value: < 2.2e-16

To convey a better idea what’s going on, you can plot these three independent variables. But what if there are more?

library(ggeffects)

ggpredict(model_fe, terms = c("Log_GDP", "International_conflict", "Regime")) %>%
  plot()

Finally, let’s set up another model. This time with an interaction effect.

model_int = lm(Import ~ Log_GDP * International_conflict + Regime, sipri_vdem)
summary(model_int)

Call:
lm(formula = Import ~ Log_GDP * International_conflict + Regime, 
    data = sipri_vdem)

Residuals:
    Min      1Q  Median      3Q     Max 
-1183.9  -214.6   -83.1    77.6  5177.3 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -747.804     37.877 -19.743  < 2e-16 ***
Log_GDP                          122.263      4.582  26.682  < 2e-16 ***
International_conflict1         -375.723    151.132  -2.486    0.013 *  
RegimeDemocratic                 -95.161     17.162  -5.545 3.12e-08 ***
RegimeElectoral Authoritarian   -151.842     23.290  -6.520 7.85e-11 ***
RegimeMinimally Democratic      -218.460     50.740  -4.305 1.70e-05 ***
Log_GDP:International_conflict1   63.002     16.090   3.916 9.16e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 453.8 on 4346 degrees of freedom
  (2524 observations deleted due to missingness)
Multiple R-squared:  0.2023,    Adjusted R-squared:  0.2012 
F-statistic: 183.7 on 6 and 4346 DF,  p-value: < 2.2e-16

Again, to get a better idea what’s going on, you can plot the three independent variables against the dependent variable. But the limitation remains the same, it’s hard to include more variables into the graph.

ggpredict(model_int, terms = c("Log_GDP", "International_conflict", "Regime")) %>%
  plot() 

Presentation of Results

Tables

The most common way to present the resutls of the regression is in tables. We have practiced it over the last two quarters. By now, the expectation is you know how to present models in a “publishable” way. For details and examples see details.

Don’t forget to:

  • Include p-values and \(\alpha\) (critical values for stars display)!

  • Include standad errors or confidence intervals for \(\beta\)

  • If you have a categorical variable, mention the reference category

  • \(R^2\) for simple linear regression and \(R^2_a\) for multiple linear regression

  • Include number of observations

Add the following argument: output = "table.html".

library(modelsummary)
publishable_table = modelsummary(list("Base model" = model_basic,
                    "Fixed Effects model" = model_fe,
                    "Interaction model" = model_int),
                   title = "Arms Import Models",  
                   stars = TRUE,
                   gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
                   coef_rename = c("(Intercept)", 
                             "Log GDP", 
                             "International Conflict", 
                             "Regime: Democratic",
                             "Regime: Electoral Authoritarian",
                             "Regime: Minimally Democratic",
                             "Log GDP × International Conflict"),
                   notes = "Regime reference category: Autocratic")

publishable_table
Arms Import Models
Base model Fixed Effects model Interaction model
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Regime reference category: Autocratic
(Intercept) -718.034*** -783.055*** -747.804***
(29.032) (36.852) (37.877)
Log GDP 108.547*** 126.364*** 122.263***
(3.168) (4.468) (4.582)
International Conflict 206.923*** -375.723*
(26.470) (151.132)
Regime: Democratic -95.328*** -95.161***
(17.191) (17.162)
Regime: Electoral Authoritarian -151.434*** -151.842***
(23.328) (23.290)
Regime: Minimally Democratic -216.183*** -218.460***
(50.820) (50.740)
Log GDP × International Conflict 63.002***
(16.090)
Num.Obs. 6441 4353 4353
R2 0.154 0.199 0.202
R2 Adj. 0.154 0.199 0.201

The more complex the model, the harder it is to interpret the results. Let’s write out our models. You can often see these in articles. By plugging the coefficients we or our readers can calculate the predicted value for any GDP value, regime type and presence or absence of international conflict.

The basic model is easy, right?

\[ \text{Arms Import} = \beta_0 + \beta_1(\text{LogGDP}) \] Now, fixed effects. It gets slightly trickier.

\[ \text{Arms Import} = \beta_0 + \beta_1(\text{LogGDP}) + \beta_2(\text{International Conflict}) + \beta_3(\text{Regime}) \]

And finally, interaction model. You can try to interpret the interaction, but we need to know the reference categories. Which is a bit of a hustle.

\[ \text{Arms Import} = \beta_0 + \beta_1(\text{LogGDP}) + \beta_2(\text{International Conflict}) + \beta_3(\text{Regime}) + \beta_4(\text{LogGDP} \times \text{International Conflict}) \]

Statistical Significance and Magnitude of \(\beta\)

You, probably, remember this from the problem sets. For yourself and reader it’s easier to compare effects visually.

To extract additional information automatically from the model, we use tidy() function from broom library. Take a moment to compare to “native” output.

library(broom)
summary(model_int)

Call:
lm(formula = Import ~ Log_GDP * International_conflict + Regime, 
    data = sipri_vdem)

Residuals:
    Min      1Q  Median      3Q     Max 
-1183.9  -214.6   -83.1    77.6  5177.3 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -747.804     37.877 -19.743  < 2e-16 ***
Log_GDP                          122.263      4.582  26.682  < 2e-16 ***
International_conflict1         -375.723    151.132  -2.486    0.013 *  
RegimeDemocratic                 -95.161     17.162  -5.545 3.12e-08 ***
RegimeElectoral Authoritarian   -151.842     23.290  -6.520 7.85e-11 ***
RegimeMinimally Democratic      -218.460     50.740  -4.305 1.70e-05 ***
Log_GDP:International_conflict1   63.002     16.090   3.916 9.16e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 453.8 on 4346 degrees of freedom
  (2524 observations deleted due to missingness)
Multiple R-squared:  0.2023,    Adjusted R-squared:  0.2012 
F-statistic: 183.7 on 6 and 4346 DF,  p-value: < 2.2e-16

This pretty much the dataframe with the regression output. Thus, it simplifies our work for visualization purposes! (and similar functionality is available in modelsummary library with get_estimates() command)

model_int_output = tidy(model_int, conf.int = TRUE, conf.level = 0.95)
model_int_output
# A tibble: 7 × 7
  term                 estimate std.error statistic   p.value conf.low conf.high
  <chr>                   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)            -748.      37.9     -19.7  3.69e- 83   -822.     -674. 
2 Log_GDP                 122.       4.58     26.7  2.19e-145    113.      131. 
3 International_confl…   -376.     151.       -2.49 1.30e-  2   -672.      -79.4
4 RegimeDemocratic        -95.2     17.2      -5.54 3.12e-  8   -129.      -61.5
5 RegimeElectoral Aut…   -152.      23.3      -6.52 7.85e- 11   -198.     -106. 
6 RegimeMinimally Dem…   -218.      50.7      -4.31 1.70e-  5   -318.     -119. 
7 Log_GDP:Internation…     63.0     16.1       3.92 9.16e-  5     31.5      94.5

Plotting the effects is great to provide intuition about the significance and the magnitude of the effects. But! Don’t forget about the scales! Are you comparing comparable? Scaling may be helpful: how can you compare one unit increase in Log_GDP to “one unit increase” in International_conflict?

Moreover, you cannot interpret the effect of International_conflict or Log_GDP without accounting for the interaction term. In our case, this plot is not publishible and is quite misleading. It is more suitable for fixed effects and simple additive models. However, don’t forget about the scale!

ggplot(model_int_output) +
  geom_linerange(aes(x = term,
                    y = estimate,
                    ymin = conf.low,
                    ymax = conf.high)) +
  geom_point(aes(x = term,
                 y = estimate)) +
  geom_hline(yintercept = 0, lty=2, color="gray") +
  coord_flip()

Predicted values

Presenting predicted values provides the reader with intuition about what happens when variables take specific values. For example, we take the most typical GDP, and calculate the effects for two categorical variables: International_conflict and Regime.

ggpredict(model_int, terms = c("International_conflict", "Regime")) 
# Predicted values of Import

Regime: Autocratic

International_conflict | Predicted |         95% CI
---------------------------------------------------
0                      |    308.57 | 288.87, 328.27
1                      |    477.20 | 423.80, 530.59

Regime: Democratic

International_conflict | Predicted |         95% CI
---------------------------------------------------
0                      |    213.41 | 187.75, 239.07
1                      |    382.03 | 322.22, 441.85

Regime: Electoral Authoritarian

International_conflict | Predicted |         95% CI
---------------------------------------------------
0                      |    156.73 | 114.75, 198.70
1                      |    325.35 | 258.22, 392.48

Regime: Minimally Democratic

International_conflict | Predicted |         95% CI
---------------------------------------------------
0                      |     90.11 |  -7.60, 187.82
1                      |    258.74 | 146.76, 370.71

Adjusted for:
* Log_GDP = 8.64

Let’s make it look publishable. We wrangle the data a bit. By now, the code shouldn’t stress you out. If it does, then run it step-by-step to get a sense of what’s going on!

model_int_groupped = ggpredict(model_int, terms = c("International_conflict", "Regime")) %>%
  as.data.frame() %>%
  rename(International_conflict = x,
         Regime = group,
         Import = predicted) 

model_int_groupped
  International_conflict    Import std.error   conf.low conf.high
1                      0 308.56877  10.04954 288.866542  328.2710
2                      0 213.40769  13.08769 187.749148  239.0662
3                      0 156.72724  21.40898 114.754718  198.6998
4                      0  90.10883  49.84073  -7.604411  187.8221
5                      1 477.19551  27.23333 423.804293  530.5867
6                      1 382.03443  30.51067 322.217952  441.8509
7                      1 325.35398  34.24112 258.223927  392.4840
8                      1 258.73557  57.11487 146.761296  370.7098
                   Regime
1              Autocratic
2              Democratic
3 Electoral Authoritarian
4    Minimally Democratic
5              Autocratic
6              Democratic
7 Electoral Authoritarian
8    Minimally Democratic

Finally, present the marginal effects using tinytable.1 The code below is quite messy, in your free time you can explore it in a more detail. For now, I want this code to be available to you. Add this line of code in the end: save_tt("predicted_values.html") to save the table on your machine.

library(tinytable)

model_int_groupped %>%
  group_by(International_conflict, Regime) %>%
  summarize(Import = paste0(round(Import, 3), "\n", "[", round(conf.low, 3), "; ", round(conf.high, 3), "]")) %>%
  pivot_wider(names_from = Regime,
              values_from = Import) %>%
  mutate(International_conflict = ifelse(International_conflict == 1, "Present", "Absent")) %>%
  rename(`International Conflict` = International_conflict) %>% 
  tt(note = "Adjusted for average log of GDP (=8.64)") %>%
  group_tt(j = list("Regime" = 2:5)) 
Regime
International Conflict Autocratic Democratic Electoral Authoritarian Minimally Democratic
Adjusted for average log of GDP (=8.64)
Absent 308.569 [288.867; 328.271] 213.408 [187.749; 239.066] 156.727 [114.755; 198.7] 90.109 [-7.604; 187.822]
Present 477.196 [423.804; 530.587] 382.034 [322.218; 441.851] 325.354 [258.224; 392.484] 258.736 [146.761; 370.71]

Marginal Effects

Calculate the “typical” observation (or simply the average observation) across all independent variables.

typical_obs = sipri_vdem %>%
      summarize(Log_GDP = mean(Log_GDP, na.rm = T),
                International_conflict = table(International_conflict) %>% which.max() %>% names(),
                Regime = table(Regime) %>% which.max() %>% names())

typical_obs
   Log_GDP International_conflict     Regime
1 8.984952                      0 Autocratic

This would be our baseline. We can calculate the predicted value for the typical observation.

ggpredict(model_int, typical_obs)
# Predicted values of Import

Log_GDP | Predicted |         95% CI
------------------------------------
   8.98 |    350.72 | 329.90, 371.54

However, the most interesting is to know how the DV change if the IV changes (holding everything else constant). Basically, we are trying to get a simple interpretation, as it is with the simple additive OLS. You can easily calculate the average marginal effect with one unit change in some X with ggeffects library. Moreover, you can illustrate how the predicted values change in comparison to the most typical observation.

marginal_effects = list(Log_GDP = c(8.984952, 8, 8.5, 9.5, 10),
                        International_conflict = c("0"),
                        Regime = c("Autocratic", "Democratic"))


ggpredict(model_int, marginal_effects) 
# Predicted values of Import

Regime: Autocratic

Log_GDP | Predicted |         95% CI
------------------------------------
   8.00 |    230.30 | 211.46, 249.13
   8.50 |    291.43 | 272.06, 310.80
   8.98 |    350.72 | 329.90, 371.54
   9.50 |    413.69 | 390.51, 436.87
  10.00 |    474.82 | 448.77, 500.88

Regime: Democratic

Log_GDP | Predicted |         95% CI
------------------------------------
   8.00 |    135.14 | 106.89, 163.38
   8.50 |    196.27 | 170.13, 222.41
   8.98 |    255.56 | 230.85, 280.27
   9.50 |    318.53 | 294.54, 342.51
  10.00 |    379.66 | 355.54, 403.78

Exercises

Let’s start with extracting the variables for the analysis. Select country_name, year, and Life expectancy (e_pelifeex) from vdem data.

Solution
vdem_extraction = vdem %>%
  ...

Now, using left_join(), combine two dataframes: sipri and vdem_extraction.

Solution
sipri_expectancy = sipri %>%
  ...(..., by = c("Recipient" = "country_name", 
                                     "Year" = ...))

In the newly created sipri_expectancy dataframe, rename the e_pelifeex to Life_expectancy

Solution
sipri_expectancy = ... %>%
  ...(Life_expectancy = ...)

Draw a ggpairs() graph (library is GGally). Can you discern any patterns (e.g., intersection of regime vs life expectancy or import vs life expectancy)?

Solution

YOUR SOLUTION HERE

Set up a fixed effects model. Regress Import against Life_expectancy and Regime

Solution
life_expectancy_model_fe = lm(...)

Present a summary table

Solution

YOUR SOLUTION HERE

Present a modelsummary() table. Make it as publishable as possible. Don’t forget to include the reference category for the Regime variable.

Solution

YOUR SOLUTION HERE

Using tidy(), extract technical information from the model, save it to model_data dataframe. Don’t forget to extract 95% confidence intervals.

Solution

YOUR SOLUTION HERE

Using the information in model_data, plot the effects of the model. You can use either geom_linerange() or geom_errorbar()

Solution
ggplot(...) +
  ...(aes(x = term,
                    y = estimate,
                    ymin = conf.low,
                    ymax = ...)) +
  geom_point(aes(x = ...,
                 y = ...)) +
  geom_hline(yintercept = 0, lty=2, color="gray") +
  coord_flip()

Predict the value for the following observation: Regime = "Democratic" and Life_expectancy = 62.

Solution

YOUR SOLUTION HERE

Present the marginal effects for the following groups: Regime = c("Democratic", "Autocratic") and Life_expectancy = c(40, 50, 60). Who is importing more arms?

Solution

YOUR SOLUTION HERE

Check List

I know how why we need devtools, and I’m happy to use cutting edge libraries!

I know how to use GGally to analyze pairs of variables

I know basics of data merging (with left_join()!)

I can plot complex models with ggeffects library

I know how to save modelsummary() output to my machine

I am not afraid of using math mode and \(\LaTeX\)

I can easily extract additional model information using broom library and the tidy() function

I know how to plot confidence intervals

I have an intuition about predicted values and marginal effects

Footnotes

  1. Great example of research presenting the results in a similar manner can be found in Reuter, O.J. and Szakonyi, D., 2015. Online social media and political awareness in authoritarian regimes. British Journal of Political Science, 45(1), pp.29-51.↩︎