Review and Confidence Intervals
Week 4
Before we start
Congrats on passing the first quiz! How are you feeling?
Grades for the psets are out.
Review of the previous week
Start with loading the tidyverse library and the World Happiness Report data.
library(tidyverse)
whr = read.csv("data/WHR.csv")
How is the Continent associated with happiness (Ladder_score)? Set up the linear model and present the regression output.
Load the ggeffects library.
Using ggpredict(), plot the model. Hint: ggpredict(model) %>% plot()
Using relevel(), transform the Continent variable so that North America becomes the reference category. Don’t forget that the variable should be of class factor!
Run the model again, present the output.
Draw the plot to gain a better understanding of what’s going on. What has changed? Why does the result look different? (A minimal sketch of the first steps appears right after this list.)
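A minimal sketch of the first steps, assuming Continent and Ladder_score are columns in whr (the releveling and plotting follow the same pattern as the fixed-effects example later in this script):
model = lm(Ladder_score ~ Continent, whr) # Continent enters as a set of dummies
summary(model)                            # regression output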
Review of the homework
Don’t load data multiple times
Don’t leave install.packages() in the script when you render it
Don’t forget to use stars = TRUE for modelsummary()
\(\LaTeX\) issues
The integration of R, Python, Markdown, \(\LaTeX\), and other useful languages is incredibly useful. But it comes at a price: researchers should be careful. Any \(\LaTeX\) code should go within two dollar signs ($). For example, an inline formula looks like this: $ Y = 10 $, which produces the following result: \(Y = 10\). Alternatively, you can use double dollar signs to start a display “chunk” for \(\LaTeX\). For example:
\[ Y = \beta_0 + \beta_1X + u \]
Useful functions
Sometimes you might want to visualize mathematical functions using geom_function(). In the HW, you were asked to plot an OLS regression against the true data generating process. For this purpose, the example below is super useful (although geom_abline() is perfect for this particular task, too).
set.seed(123)
example_data = data.frame(x = rnorm(100,
mean = 2,
sd = 15),
y = rnorm(100,
mean = 2,
sd = 15) * rnorm(100))
ggplot(example_data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(aes(color = "Fitted"), method = "lm", se = FALSE) +
geom_function(aes(color = "Known"), fun = function(x){1 + 0.5 * x}, linewidth = 1.2)
Sometimes it’s useful to visualize, say, polynomials! What does \(x^3\) look like?
ggplot() +
geom_function(fun = function(x){x^3}) +
geom_vline(xintercept = 0) +
xlim(-1, 1)
Agenda
Go over hard concepts: fixed effects, interactions and transformations
Refreshing confidence intervals
Right-hand side extensions
Today we are again working with the WhoGov dataset. As usual, take a look at the codebook.
These are the variables we are going to work with:
country_name is the country name
n_individuals is the number of unique persons in the cabinet
leaderexperience_continuous is the number of years the person has been leader of the country in total
leader_party is the party of the leader
whogov = read.csv("data/WhoGov.csv")
Fixed Effects
When we introduce a dummy variable into the model, we are quite often interested in controlling for unobserved heterogeneity by allowing each category to have its own intercept. This is referred to as fixed effects.
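In equation form, this is a sketch of the idea: each category \(c\) gets its own intercept \(\alpha_c\), while the slope is shared across categories:
\[ Y_i = \alpha_{c(i)} + \beta_1 X_i + u_i \]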
First, let’s create a dummy variable indicating whether a leader is Independent or Partisan. You could use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.
whogov = whogov %>%
mutate(indep = ifelse(leader_party == "independent", "Independent", "Partisan"))
Don’t forget about the class of the variable!
whogov$indep = as.factor(whogov$indep)
Take a moment to think about what unobserved heterogeneity we can control for by including whether a leader is independent or partisan in the model.
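Before fitting the model, a quick optional sanity check (levels() and table() are base R):
levels(whogov$indep)                  # factor levels; the first one is the reference category
table(whogov$indep, useNA = "ifany")  # counts per category, including missing values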
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.98188 0.30295 69.26 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepPartisan 3.02909 0.29988 10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
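Reading the output: each additional year of leader tenure is associated with about 0.36 more individuals in the cabinet, and Partisan leaders have on average about 3.03 more cabinet members than Independent ones (the reference category). The intercept, 20.98, is the predicted cabinet size for an Independent leader with zero years of tenure; for a Partisan leader it is 20.98 + 3.03 ≈ 24.01.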
Let’s relevel to get the results for Independent candidates.
whogov$indep = relevel(whogov$indep, ref = "Partisan")
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.01096 0.16605 144.60 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepIndependent -3.02909 0.29988 -10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
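Note what changed and what didn’t: the slope and the fit statistics are identical, the dummy coefficient flipped its sign, and the new intercept, 24.01, is exactly the old intercept plus the Partisan effect (20.98 + 3.03). Releveling changes the baseline, not the model.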
To understand what’s going on in the model we might want to visualize the result. Load the ggeffects library.
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
plot() +
labs(title = "Fixed Effects Regression",
x = "Tenure of a Leader",
y = "Number of Individuals in a Cabinet",
color = "Leader's Status") +
theme_bw()
Interactions
Often dummy variables are used to introduce an interaction term into the model. We will explore the association between Perceptions_of_corruption and the number of people in the cabinet (n_individuals), depending on the independence of the party leader.
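In equation form (a sketch, with \(D\) standing for the Independent dummy):
\[ Y = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \times D) + u \]
The slope of \(X\) is \(\beta_1\) for the reference group and \(\beta_1 + \beta_3\) when \(D = 1\).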
The task isn’t trivial, as we are now planning to use data from two datasets. Let’s subset those.
whr_subset = whr %>%
select(Country_name, Perceptions_of_corruption)
whogov_subset = whogov %>%
filter(year == 2021) %>%
select(country_name, n_individuals, indep)
Now, we are merging them. Once again, there is a great resource for joins (check “the {dplyr} way”)!
whr_whogov = whr_subset %>%
left_join(whogov_subset, by = c("Country_name" = "country_name"))
Check the result of the left_join().
head(whr_whogov)
  Country_name Perceptions_of_corruption n_individuals indep
1 Finland 0.182 23 Partisan
2 Denmark 0.196 24 Partisan
3 Iceland 0.668 15 Partisan
4 Israel 0.708 33 Partisan
5 Netherlands 0.379 18 Partisan
6 Sweden 0.202 26 Partisan
Now, let’s interact the variables.
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep,
data = whr_whogov)
Residuals:
Min 1Q Median 3Q Max
-0.54000 -0.05737 0.04300 0.10864 0.24382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575192 0.061646 9.331 5.59e-16 ***
n_individuals 0.004818 0.002151 2.240 0.0269 *
indepIndependent 0.386220 0.159570 2.420 0.0170 *
n_individuals:indepIndependent -0.010703 0.005501 -1.946 0.0540 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.08324, Adjusted R-squared: 0.06088
F-statistic: 3.723 on 3 and 123 DF, p-value: 0.01326
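From the coefficients: the slope of n_individuals is 0.0048 for Partisan leaders (the reference category) and 0.0048 - 0.0107 ≈ -0.0059 for Independent leaders, so the sign of the association flips between the two groups.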
Let’s plot the result.
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
plot(show_ci = FALSE) +
labs(title = "Regression with Interaction Term",
x = "Number of Individuals in a Cabinet",
y = "Perception of Corruption",
color = "Leader's Status") +
theme_bw()
I guess after solving the pset and the quiz you realize why interpreting interactions is hard. You can easily simulate the result (i.e., calculate predicted values at chosen covariate values) using ggpredict(). For example:
ggpredict(model_in, terms = c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted | 95% CI
--------------------------------------
12 | 0.89 | 0.71, 1.07
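You can reproduce this prediction by hand from the regression coefficients (coef() is base R; the names match the summary output above):
b = coef(model_in)
# intercept + Independent shift + (baseline slope + interaction) * 12
b["(Intercept)"] + b["indepIndependent"] + (b["n_individuals"] + b["n_individuals:indepIndependent"]) * 12
This gives 0.5752 + 0.3862 + (0.0048 - 0.0107) * 12 ≈ 0.89, matching ggpredict().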
IV Transformations
One of the assumptions of OLS is that there is a linear relationship between the dependent and independent variables. However, this is quite often not the case. Let’s reverse the log transformation applied in the World Happiness Report data. See below. Does it look linear?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
labs(x = "GDP per capita")It doesn’t. It’s hard to describe this relationship in a linear manner. Natural log would explain this better, right?
ggplot(whr, aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
geom_point() +
geom_function(fun = function(x){log(x) - 4}) +
labs(x = "GDP per capita")
This is why we use the natural logarithm to transform GDP per capita. The transformation reveals a linear relationship between the two variables, allowing us to capture non-linear patterns in a linear framework when using OLS regression. Another commonly used transformation is the quadratic (\(x^2\)), which serves the same purpose of addressing non-linear relationships. We call the latter polynomials.
ggplot(whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm")Confidence Intervals
The intuition is straightforward: it’s a range within which the true value lies with a given probability, typically 95%.
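For a sample mean, the 95% confidence interval is
\[ \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}} \]
where \(s\) is the sample standard deviation, \(n\) is the sample size, and 1.96 is the 97.5th percentile of the standard normal distribution. This is exactly the formula used in the code further below.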
For example, we simulate data from two different distributions (although both are normal). Distribution X has the true \(\mu\) equal to 0, whereas distribution Y has \(\mu = 2\). Let’s plot them! Are these distributions different?
set.seed(123)
x = rnorm(100,
mean = 0,
sd = 1)
set.seed(111)
y = rnorm(100,
mean = 2,
sd = 2)
ggplot() +
geom_histogram(aes(x = x, fill = "Distribution X"), alpha = 0.5, bins = 10) +
geom_histogram(aes(x = y, fill = "Distribution Y"), alpha = 0.5, bins = 10) +
labs(fill = "")Hypothesis test
Don’t make inferences based solely on the plots!
t.test(x, y)
Welch Two Sample t-test
data: x and y
t = -8.0922, df = 133.82, p-value = 3.15e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.344424 -1.423488
sample estimates:
mean of x mean of y
0.09040591 1.97436188
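The 95 percent confidence interval for the difference in means, [-2.34, -1.42], does not contain zero, and the p-value is far below 0.05, so we reject the null hypothesis that the two means are equal.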
Now, let’s visualize their confidence intervals. First, we simply create a dataframe for our convenience.
x_distribution = data.frame(value = x,
distribution = "X distribution")
y_distribution = data.frame(value = y,
distribution = "Y distribution")
xy_distributions = rbind(x_distribution, y_distribution)
head(xy_distributions)
        value distribution
1 -0.56047565 X distribution
2 -0.23017749 X distribution
3 1.55870831 X distribution
4 0.07050839 X distribution
5 0.12928774 X distribution
6 1.71506499 X distribution
Then, let’s calculate statistics. You should spend some time looking through this code and match it with the slides!
ci_xy = xy_distributions %>%
group_by(distribution) %>%
summarize(
mean_value = mean(value, na.rm = TRUE),
lower = mean_value - 1.96 * sd(value, na.rm = TRUE) / sqrt(n()),
upper = mean_value + 1.96 * sd(value, na.rm = TRUE) / sqrt(n()))
ci_xy
# A tibble: 2 × 4
distribution mean_value lower upper
<chr> <dbl> <dbl> <dbl>
1 X distribution 0.0904 -0.0885 0.269
2 Y distribution 1.97 1.55 2.39
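Note that Y’s interval is wider than X’s: with \(n = 100\) the half-width is \(1.96 \, s / \sqrt{n} \approx 0.196 \, s\), and Y was simulated with twice the standard deviation of X.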
Finally, visualize the confidence intervals:
ggplot(ci_xy) +
geom_linerange(aes(x = distribution,
ymin = lower,
ymax = upper),
linewidth = 2) +
geom_point(aes(x = distribution,
y = mean_value),
size = 5)
Exercises
We are working with the SIPRI Arms Transfers Database. It contains information on all transfers of major conventional arms.
First, load it.
Draw a histogram of the Import variable.
Using table() check the Regime variable.
Check the class() of the Regime variable. Is it correct? Explain. Correct if needed.
Using filter(), keep only the Autocratic Regime type and save it to an autocracies_import dataframe.
Using filter(), keep only the Democratic Regime type and save it to a democracies_import dataframe.
Using rbind(), bind the rows of autocracies_import and democracies_import dataframes. Save it to a new dataframe autdem_import.
Draw a boxplot (geom_boxplot()), where X is Regime, and Y is Import. What can you see? Are these groups different?
Now, let’s calculate a confidence interval. First, group_by() the Regime type, and summarize() the average Import in democracies and autocracies, along with the lower and upper bounds of the confidence interval. Assume the variables are normally distributed.
Visualize the confidence intervals using geom_linerange() and geom_point(). Are these distributions different?
Finally, to double check the conclusion, run the t.test(). What is your final say? Are arms imports in democracies and autocracies different? Why?
Check List
I know how to introduce fixed effects, interactions and polynomials to the linear model
I know how to simulate the results of fixed effects and interactions
I know what the confidence interval is
I know how to test whether two distributions are different