Review and Confidence Intervals
Week 4
Before we start
Congrats on passing the first quiz! How are you feeling?
Grades for the psets are out.
Review of the previous week
Start with loading the tidyverse library and the World Happiness Report data.
library(tidyverse)
whr = read.csv("data/WHR.csv")
How is the Continent associated with happiness (Ladder_score)? Set up the linear model and present the regression output.
Load the ggeffects library.
Using ggpredict(), plot the model. Hint: ggpredict(model) %>% plot()
Using relevel(), transform the Continent variable so that North America becomes the reference category. Don’t forget that the variable should be of class factor!
Run the model again, present the output.
Draw the plot to gain a better understanding of what’s going on. What has changed? Why does the result look different? (A minimal sketch of the first steps appears right after this list.)
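A minimal sketch of the first steps, assuming Continent and Ladder_score are columns in whr (the releveling and plotting follow the same pattern as the fixed-effects example later in this script):
model = lm(Ladder_score ~ Continent, whr) # Continent enters as a set of dummies
summary(model)                            # regression output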
Review of the homework
Don’t load data multiple times
Don’t leave install.packages() in the script when you render it
Don’t forget to use stars = TRUE for modelsummary()
\(\LaTeX\) issues
The integration of R, Python, Markdown, \(\LaTeX\), and other useful languages is incredibly useful. But it comes at a price: researchers should be careful. Any \(\LaTeX\) code should go within two dollar signs ($). For example, an inline formula looks like this: $ Y = 10 $, which produces the following result: \(Y = 10\). Alternatively, you can use double dollar signs to start a display “chunk” for \(\LaTeX\). For example:
\[ Y = \beta_0 + \beta_1X + u \]
Useful functions
Sometimes you might want to visualize mathematical functions using geom_function(). In the HW, you were asked to plot an OLS regression against the true data generating process. For this purpose, the example below is super useful (although geom_abline() is perfect for this particular task, too).
set.seed(123)
example_data = data.frame(x = rnorm(100,
mean = 2,
sd = 15),
y = rnorm(100,
mean = 2,
sd = 15) * rnorm(100))
ggplot(example_data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(aes(color = "Fitted"), method = "lm", se = FALSE) +
geom_function(aes(color = "Known"), fun = function(x){1 + 0.5 * x}, linewidth = 1.2)
Sometimes it’s useful to visualize, say, polynomials! What does \(x^3\) look like?
ggplot() +
geom_function(fun = function(x){x^3}) +
geom_vline(xintercept = 0) +
xlim(-1, 1)
Agenda
Go over hard concepts: fixed effects, interactions and transformations
Refreshing confidence intervals
Right-hand side extensions
Today we are again working with the WhoGov dataset. As usual, take a look at the codebook.
These are the variables we are going to work with:
country_name is the country name
n_individuals is the number of unique persons in the cabinet
leaderexperience_continuous is the number of years the person has been leader of the country in total
leader_party is the party of the leader
whogov = read.csv("data/WhoGov.csv")
Fixed Effects
When we introduce a dummy variable into the model, we are quite often interested in controlling for unobserved heterogeneity by allowing each category to have its own intercept. This is referred to as fixed effects.
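In equation form, this is a sketch of the idea: each category \(c\) gets its own intercept \(\alpha_c\), while the slope is shared across categories:
\[ Y_i = \alpha_{c(i)} + \beta_1 X_i + u_i \]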
First, let’s create a dummy variable indicating whether a leader is Independent or Partisan. You could use 1 or 0 instead, but to make it more readable here we stick to more transparent labels.
whogov = whogov %>%
mutate(indep = ifelse(leader_party == "independent", "Independent", "Partisan"))
Don’t forget about the class of the variable!
whogov$indep = as.factor(whogov$indep)
Take a moment to think about what unobserved heterogeneity we can control for by including whether a leader is independent or partisan in the model.
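Before fitting the model, a quick optional sanity check (levels() and table() are base R):
levels(whogov$indep)                  # factor levels; the first one is the reference category
table(whogov$indep, useNA = "ifany")  # counts per category, including missing values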
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.98188 0.30295 69.26 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepPartisan 3.02909 0.29988 10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
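Reading the output: each additional year of leader tenure is associated with about 0.36 more individuals in the cabinet, and Partisan leaders have on average about 3.03 more cabinet members than Independent ones (the reference category). The intercept, 20.98, is the predicted cabinet size for an Independent leader with zero years of tenure; for a Partisan leader it is 20.98 + 3.03 ≈ 24.01.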
Let’s relevel to get the results for Independent candidates.
whogov$indep = relevel(whogov$indep, ref = "Partisan")
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.01096 0.16605 144.60 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepIndependent -3.02909 0.29988 -10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
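Note what changed and what didn’t: the slope and the fit statistics are identical, the dummy coefficient flipped its sign, and the new intercept, 24.01, is exactly the old intercept plus the Partisan effect (20.98 + 3.03). Releveling changes the baseline, not the model.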
To understand what’s going on in the model we might want to visualize the result. Load the ggeffects library.
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
plot() +
labs(title = "Fixed Effects Regression",
x = "Tenure of a Leader",
y = "Number of Individuals in a Cabinet",
color = "Leader's Status") +
theme_bw()
Interactions
Often dummy variables are used to introduce an interaction term into the model. We will explore the association between Perceptions_of_corruption and the number of people in the cabinet (n_individuals), depending on the independence of the party leader.
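In equation form (a sketch, with \(D\) standing for the Independent dummy):
\[ Y = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \times D) + u \]
The slope of \(X\) is \(\beta_1\) for the reference group and \(\beta_1 + \beta_3\) when \(D = 1\).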
The task isn’t trivial, as we are now planning to use data from two datasets. Let’s subset those.
whr_subset = whr %>%
select(Country_name, Perceptions_of_corruption)
whogov_subset = whogov %>%
filter(year == 2021) %>%
select(country_name, n_individuals, indep)
Now, we are merging them. Once again, there is a great resource for joins (check “the {dplyr} way”)!
whr_whogov = whr_subset %>%
left_join(whogov_subset, by = c("Country_name" = "country_name"))
Check the result of the left_join().
head(whr_whogov)
  Country_name Perceptions_of_corruption n_individuals indep
1 Finland 0.182 23 Partisan
2 Denmark 0.196 24 Partisan
3 Iceland 0.668 15 Partisan
4 Israel 0.708 33 Partisan
5 Netherlands 0.379 18 Partisan
6 Sweden 0.202 26 Partisan
Now, let’s interact the variables.
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep,
data = whr_whogov)
Residuals:
Min 1Q Median 3Q Max
-0.54000 -0.05737 0.04300 0.10864 0.24382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575192 0.061646 9.331 5.59e-16 ***
n_individuals 0.004818 0.002151 2.240 0.0269 *
indepIndependent 0.386220 0.159570 2.420 0.0170 *
n_individuals:indepIndependent -0.010703 0.005501 -1.946 0.0540 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.08324, Adjusted R-squared: 0.06088
F-statistic: 3.723 on 3 and 123 DF, p-value: 0.01326
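From the coefficients: the slope of n_individuals is 0.0048 for Partisan leaders (the reference category) and 0.0048 - 0.0107 ≈ -0.0059 for Independent leaders, so the sign of the association flips between the two groups.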
Let’s plot the result.
ggpredict(model_in, terms= c("n_individuals", "indep")) %>%
plot(show_ci = FALSE) +
labs(title = "Regression with Interaction Term",
x = "Number of Individuals in a Cabinet",
y = "Perception of Corruption",
color = "Leader's Status") +
theme_bw()
I guess after solving the pset and the quiz you realize why interpreting interactions is hard. You can easily simulate the result (i.e., calculate predicted values at chosen covariate values) using ggpredict(). For example:
ggpredict(model_in, terms = c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted | 95% CI
--------------------------------------
12 | 0.89 | 0.71, 1.07
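You can reproduce this prediction by hand from the regression coefficients (coef() is base R; the names match the summary output above):
b = coef(model_in)
# intercept + Independent shift + (baseline slope + interaction) * 12
b["(Intercept)"] + b["indepIndependent"] + (b["n_individuals"] + b["n_individuals:indepIndependent"]) * 12
This gives 0.5752 + 0.3862 + (0.0048 - 0.0107) * 12 ≈ 0.89, matching ggpredict().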
IV Transformations
One of the assumptions of OLS is that there is a linear relationship between the dependent and independent variables. However, this is quite often not the case. Let’s reverse the log transformation applied in the World Happiness Report data. See below. Does it look linear?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
labs(x = "GDP per capita")It doesn’t. It’s hard to describe this relationship in a linear manner. Natural log would explain this better, right?
ggplot(whr, aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
geom_point() +
geom_function(fun = function(x){log(x) - 4}) +
labs(x = "GDP per capita")
This is why we use the natural logarithm to transform GDP per capita. The transformation reveals a linear relationship between the two variables, allowing us to capture non-linear patterns in a linear framework when using OLS regression. Another commonly used transformation is the quadratic (\(x^2\)), which serves the same purpose of addressing non-linear relationships. We call the latter polynomials.
ggplot(whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm")Confidence Intervals
The intuition is straightforward: it’s a range within which the true value lies with a given probability, typically 95%.
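For a sample mean, the 95% confidence interval is
\[ \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}} \]
where \(s\) is the sample standard deviation, \(n\) is the sample size, and 1.96 is the 97.5th percentile of the standard normal distribution. This is exactly the formula used in the code further below.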
For example, we simulate data from two different distributions (although both are normal). Distribution X has the true \(\mu\) equal to 0, whereas distribution Y has \(\mu = 2\). Let’s plot them! Are these distributions different?
set.seed(123)
x = rnorm(100,
mean = 0,
sd = 1)
set.seed(111)
y = rnorm(100,
mean = 2,
sd = 2)
ggplot() +
geom_histogram(aes(x = x, fill = "Distribution X"), alpha = 0.5, bins = 10) +
geom_histogram(aes(x = y, fill = "Distribution Y"), alpha = 0.5, bins = 10) +
labs(fill = "")Hypothesis test
Don’t make inferences based solely on the plots!
t.test(x, y)
Welch Two Sample t-test
data: x and y
t = -8.0922, df = 133.82, p-value = 3.15e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.344424 -1.423488
sample estimates:
mean of x mean of y
0.09040591 1.97436188
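The 95 percent confidence interval for the difference in means, [-2.34, -1.42], does not contain zero, and the p-value is far below 0.05, so we reject the null hypothesis that the two means are equal.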
Now, let’s visualize their confidence intervals. First, we simply create a dataframe for our convenience.
x_distribution = data.frame(value = x,
distribution = "X distribution")
y_distribution = data.frame(value = y,
distribution = "Y distribution")
xy_distributions = rbind(x_distribution, y_distribution)
head(xy_distributions)
        value distribution
1 -0.56047565 X distribution
2 -0.23017749 X distribution
3 1.55870831 X distribution
4 0.07050839 X distribution
5 0.12928774 X distribution
6 1.71506499 X distribution
Then, let’s calculate statistics. You should spend some time looking through this code and match it with the slides!
ci_xy = xy_distributions %>%
group_by(distribution) %>%
summarize(
mean_value = mean(value, na.rm = TRUE),
lower = mean_value - 1.96 * sd(value, na.rm = TRUE) / sqrt(n()),
upper = mean_value + 1.96 * sd(value, na.rm = TRUE) / sqrt(n()))
ci_xy
# A tibble: 2 × 4
distribution mean_value lower upper
<chr> <dbl> <dbl> <dbl>
1 X distribution 0.0904 -0.0885 0.269
2 Y distribution 1.97 1.55 2.39
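Note that Y’s interval is wider than X’s: with \(n = 100\) the half-width is \(1.96 \, s / \sqrt{n} \approx 0.196 \, s\), and Y was simulated with twice the standard deviation of X.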
Finally, visualize the confidence intervals:
ggplot(ci_xy) +
geom_linerange(aes(x = distribution,
ymin = lower,
ymax = upper),
linewidth = 2) +
geom_point(aes(x = distribution,
y = mean_value),
size = 5)
Exercises
We are working with the SIPRI Arms Transfers Database. It contains information on all transfers of major conventional arms.
First, load it.
Draw a histogram of the Import variable.
Using table() check the Regime variable.
Check the class() of the Regime variable. Is it correct? Explain. Correct if needed.
Using filter(), keep only the Autocratic Regime type and save it to an autocracies_import dataframe.
Using filter(), keep only the Democratic Regime type and save it to a democracies_import dataframe.
Using rbind(), bind the rows of autocracies_import and democracies_import dataframes. Save it to a new dataframe autdem_import.
Draw a boxplot (geom_boxplot()), where X is Regime, and Y is Import. What can you see? Are these groups different?
Now, let’s calculate a confidence interval. First, group_by() the Regime type, and summarize() the average Import in democracies and autocracies, along with the lower and upper bounds of the confidence interval. Assume the variables are normally distributed.
Visualize the confidence intervals using geom_linerange() and geom_point(). Are these distributions different?
Finally, to double check the conclusion, run the t.test(). What is your final say? Are arms imports in democracies and autocracies different? Why?
Check List
I know how to introduce fixed effects, interactions and polynomials to the linear model
I know how to simulate the results of fixed effects and interactions
I know what the confidence interval is
I know how to test whether two distributions are different