library(tidyverse)
whr = read.csv("data/WHR.csv")
Review and Confidence Intervals
Week 4
Before we start
Congrats on getting through the first quiz! How are you feeling?
Grades for the psets are out.
Review of the previous week
Start with loading the tidyverse library and the World Happiness Report data.
How is the Continent associated with happiness (Ladder_score)? Set up the linear model and present the regression output.
...
Load the ggeffects library.
...
Using ggpredict(), plot the model.
...(model) %>%
...()
Using relevel(), transform the Continent variable so that North America becomes the reference category. Don’t forget that the variable should be of class factor!
...
Run the model again, present the output.
...
Draw the plot to gain a better understanding of what’s going on. What has changed? Why does the result look different?
...
Review of the homework
Don’t load data multiple times.
Don’t leave install.packages() in your script when you render it.
Don’t forget to use stars = TRUE for modelsummary() (see the example below).
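For reference, a minimal sketch of such a call, assuming a fitted lm object named model (a hypothetical name here):
library(modelsummary)
modelsummary(model, stars = TRUE)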
\(\LaTeX\) issues
The integration of R, Python, Markdown, \(\LaTeX\), and other useful languages is incredibly useful. But it comes at a price: you have to be careful with the syntax. Any inline \(\LaTeX\) code should go between two $ dollar signs $. For example, an inline formula looks like this: $ Y = 10 $, which produces the following result: \(Y = 10\). Alternatively, you can use double dollar signs to start a display “chunk” of \(\LaTeX\). For example:
\[ Y = \beta_0 + \beta_1X + u \]
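The display equation above is produced by the following source in your document (a sketch of what goes in the .qmd file):
$$ Y = \beta_0 + \beta_1X + u $$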
Useful functions
Sometimes you might want to visualize mathematical functions using geom_function(). In the HW, you were asked to plot an OLS regression against the true data generating process. For this purpose, the example below is super useful (although geom_abline() is perfect for this particular task, too; see the sketch after the plot).
set.seed(123)
example_data = data.frame(x = rnorm(100, mean = 2, sd = 15),
                          y = rnorm(100, mean = 2, sd = 15) * rnorm(100))
ggplot(example_data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(aes(color = "Fitted"), method = "lm", se = F) +
geom_function(aes(color = "Known"), fun = function(x){1 + 0.5 * x}, size = 1.2)
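Since the known function here is a straight line, the geom_abline() alternative mentioned above is just as short. A sketch under the same setup (with a fixed colour, since geom_abline() ignores a colour mapping when intercept and slope are passed as arguments):

ggplot(example_data, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(aes(color = "Fitted"), method = "lm", se = F) +
  # the known DGP as intercept and slope, drawn in a fixed colour
  geom_abline(intercept = 1, slope = 0.5, color = "red", size = 1.2)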
Sometimes it’s useful to visualize, say, polynomials! What does \(x^2\) look like?
ggplot() +
geom_function(fun = function(x){x^2}) +
geom_vline(xintercept = 0) +
xlim(-1, 1)
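And if you’re curious about \(x^3\) (an odd polynomial), just swap the function:

ggplot() +
  geom_function(fun = function(x){x^3}) +
  geom_vline(xintercept = 0) +
  xlim(-1, 1)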
Agenda
Go over hard concepts: fixed effects, interactions and transformations
Refreshing confidence intervals
Right hand side extensions
Today we are again working with the WhoGov dataset. As usual, take a look at the codebook.
These are the variables we are going to work with:
country_name: the country name
n_individuals: the number of unique persons in the cabinet
leaderexperience_continuous: the number of years the person has been the leader of the country in total
leader_party: the party of the leader
whogov = read.csv("data/WhoGov.csv")
Fixed Effects
When we introduce a dummy variable into the model, we are quite often interested in controlling for unobserved heterogeneity by allowing each category to have its own intercept. This is referred to as fixed effects.
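In equation form, a sketch with a single dummy \(D_i\):
\[ Y_i = \beta_0 + \beta_1 X_i + \gamma D_i + u_i \]
The intercept is \(\beta_0\) when \(D_i = 0\) and \(\beta_0 + \gamma\) when \(D_i = 1\): the two groups get parallel lines with their own intercepts.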
First, let’s create a dummy variable indicating whether a leader is Independent or Partisan. You could use 1 or 0 instead, but to make it more readable we stick to more transparent labels here.
whogov = whogov %>%
  mutate(indep = ifelse(leader_party == "independent", "Independent", "Partisan"))
Don’t forget about the class of the variable!
whogov$indep = as.factor(whogov$indep)
Take a moment to think about what unobserved heterogeneity we can control for by including whether a leader is independent or partisan in the model.
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.98188 0.30295 69.26 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepPartisan 3.02909 0.29988 10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
Let’s relevel to get the results for Independent leaders.
whogov$indep = relevel(whogov$indep, ref = "Partisan")
model_fe = lm(n_individuals ~ leaderexperience_continuous + indep, whogov)
summary(model_fe)
Call:
lm(formula = n_individuals ~ leaderexperience_continuous + indep,
data = whogov)
Residuals:
Min 1Q Median 3Q Max
-29.437 -7.078 -1.796 5.377 108.632
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.01096 0.16605 144.60 <2e-16 ***
leaderexperience_continuous 0.35700 0.01617 22.08 <2e-16 ***
indepIndependent -3.02909 0.29988 -10.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 9126 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.05649, Adjusted R-squared: 0.05628
F-statistic: 273.2 on 2 and 9126 DF, p-value: < 2.2e-16
Note that releveling changed only the reference category: the new intercept equals the old intercept plus the old dummy coefficient (20.98 + 3.03 = 24.01), and the dummy coefficient simply flipped its sign. To understand what’s going on in the model, we might want to visualize the result. Load the ggeffects library.
library(ggeffects)
Then, visualize the result. What can we see?
ggpredict(model_fe, terms = c("leaderexperience_continuous", "indep")) %>%
plot() +
labs(title = "Fixed Effects Regression",
x = "Tenure of a Leader",
y = "Number of Individuals in a Cabinet",
color = "Leader's Status") +
theme_bw()
Interactions
Often dummy variables are used to introduce an interaction term into the model. We will explore the association between Perceptions_of_corruption and the number of people in the cabinet (n_individuals), depending on the independence of the party leader.
The task isn’t trivial, as we are now planning to use data from two datasets. Let’s subset them.
whr_subset = whr %>%
  select(Country_name, Perceptions_of_corruption)

whogov_subset = whogov %>%
  filter(year == 2021) %>%
  select(country_name, n_individuals, indep)
Now, we are merging them. Once again, there is a great resource for joins (check “the {dplyr} way”)!
whr_whogov = whr_subset %>%
  left_join(whogov_subset, by = c("Country_name" = "country_name"))
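Country names don’t always match perfectly across datasets. As a quick diagnostic (a sketch, not required here), anti_join() returns the WHR rows that found no partner:

# Rows of whr_subset with no matching country in whogov_subset
anti_join(whr_subset, whogov_subset, by = c("Country_name" = "country_name"))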
Check the result of the left_join()
head(whr_whogov)
Country_name Perceptions_of_corruption n_individuals indep
1 Finland 0.182 23 Partisan
2 Denmark 0.196 24 Partisan
3 Iceland 0.668 15 Partisan
4 Israel 0.708 33 Partisan
5 Netherlands 0.379 18 Partisan
6 Sweden 0.202 26 Partisan
Now, let’s interact the variables.
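In equation form, a sketch of the model we are about to fit (writing Cabinet for n_individuals and Indep for the dummy):
\[ \text{Corruption}_i = \beta_0 + \beta_1 \text{Cabinet}_i + \beta_2 \text{Indep}_i + \beta_3 (\text{Cabinet}_i \times \text{Indep}_i) + u_i \]
The slope on cabinet size is \(\beta_1\) for partisan leaders and \(\beta_1 + \beta_3\) for independent ones, so each group gets its own intercept and its own slope.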
model_in = lm(Perceptions_of_corruption ~ n_individuals * indep, whr_whogov)
summary(model_in)
Call:
lm(formula = Perceptions_of_corruption ~ n_individuals * indep,
data = whr_whogov)
Residuals:
Min 1Q Median 3Q Max
-0.54000 -0.05737 0.04300 0.10864 0.24382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575192 0.061646 9.331 5.59e-16 ***
n_individuals 0.004818 0.002151 2.240 0.0269 *
indepIndependent 0.386220 0.159570 2.420 0.0170 *
n_individuals:indepIndependent -0.010703 0.005501 -1.946 0.0540 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1737 on 123 degrees of freedom
(10 observations deleted due to missingness)
Multiple R-squared: 0.08324, Adjusted R-squared: 0.06088
F-statistic: 3.723 on 3 and 123 DF, p-value: 0.01326
Let’s plot the result.
ggpredict(model_in, terms = c("n_individuals", "indep")) %>%
plot(show_ci = FALSE) +
labs(title = "Regression with Interaction Term",
x = "Number of Individuals in a Cabinet",
y = "Perception of Corruption",
color = "Leader's Status") +
theme_bw()
I guess after solving the pset and the quiz you realize why interpreting interactions is hard. You can easily simulate the results (i.e., calculate the marginal effect) using ggpredict(). For example,
ggpredict(model_in, terms = c("n_individuals [12]", "indep [Independent]"))
# Predicted values of Perceptions_of_corruption
n_individuals | Predicted | 95% CI
--------------------------------------
12 | 0.89 | 0.71, 1.07
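You can verify this number by hand from the regression table: for independent leaders, both the intercept and the slope are shifted by the dummy and the interaction term. A quick sanity check:

# Predicted corruption for an independent leader with a 12-person cabinet
b = coef(model_in)
unname(b["(Intercept)"] + 12 * b["n_individuals"] +
         b["indepIndependent"] + 12 * b["n_individuals:indepIndependent"])
# 0.575192 + 12 * 0.004818 + 0.386220 - 12 * 0.010703 = 0.89, matching ggpredict()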
IV Transformations
One of the assumptions of OLS is that there is a linear relationship between the dependent and independent variables. However, this is quite often not the case. Let’s reverse the log transformation applied to GDP per capita in the World Happiness Report data. See below. Does the relationship look linear?
ggplot(whr) +
geom_point(aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
labs(x = "GDP per capita")
It doesn’t. It’s hard to describe this relationship in a linear manner. A natural log would describe it better, right?
ggplot(whr, aes(x = exp(Logged_GDP_per_capita), y = Ladder_score)) +
geom_point() +
geom_function(fun = function(x){log(x) - 4}) +
labs(x = "GDP per capita")
This is why we use the natural logarithm to transform GDP per capita. The transformation reveals a linear relationship between the two variables, allowing us to capture non-linear patterns in a linear framework when using OLS regression. Another commonly used transformation is the quadratic (\(x^2\)), which serves the same purpose of addressing non-linear relationships. We call the latter polynomials.
ggplot(whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm")
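As an aside, a quadratic term can go straight into an lm() formula via I(). A minimal sketch (illustrative only, not a model we interpret here):

# I() makes ^ act as arithmetic rather than formula syntax
model_sq = lm(Ladder_score ~ Logged_GDP_per_capita + I(Logged_GDP_per_capita^2), whr)
summary(model_sq)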
Confidence Intervals
The intuition is straightforward: a confidence interval is a range within which the true value lies with a given probability, typically 95%.
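For a normally distributed variable, the 95% confidence interval we will compute below is
\[ \bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}} \]
where \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, and \(n\) the sample size.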
For example, let’s simulate data from two different distributions (although both are normal). Distribution X has a true \(\mu\) equal to 0, whereas distribution Y has \(\mu = 2\). Let’s plot them! Are these distributions different?
set.seed(123)
x = rnorm(100, mean = 0, sd = 1)
set.seed(111)
y = rnorm(100, mean = 2, sd = 2)
ggplot() +
geom_histogram(aes(x = x, fill = "Distribution X"), alpha = 0.5, bins = 10) +
geom_histogram(aes(x = y, fill = "Distribution Y"), alpha = 0.5, bins = 10) +
labs(fill = "")
Hypothesis test
Don’t make inferences based solely on the plots!
t.test(x, y)
Welch Two Sample t-test
data: x and y
t = -8.0922, df = 133.82, p-value = 3.15e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.344424 -1.423488
sample estimates:
mean of x mean of y
0.09040591 1.97436188
The tiny p-value and a 95% confidence interval for the difference that excludes zero confirm that the means differ. Now, let’s visualize their confidence intervals. First, we simply create a dataframe for our convenience.
x_distribution = data.frame(value = x,
                            distribution = "X distribution")
y_distribution = data.frame(value = y,
                            distribution = "Y distribution")
xy_distributions = rbind(x_distribution, y_distribution)

head(xy_distributions)
value distribution
1 -0.56047565 X distribution
2 -0.23017749 X distribution
3 1.55870831 X distribution
4 0.07050839 X distribution
5 0.12928774 X distribution
6 1.71506499 X distribution
Then, let’s calculate the statistics. You should spend some time looking through this code and matching it with the slides!
ci_xy = xy_distributions %>%
  group_by(distribution) %>%
  summarize(
    mean_value = mean(value, na.rm = TRUE),
    lower = mean_value - 1.96 * sd(value, na.rm = TRUE) / sqrt(n()),
    upper = mean_value + 1.96 * sd(value, na.rm = TRUE) / sqrt(n()))
ci_xy
# A tibble: 2 × 4
distribution mean_value lower upper
<chr> <dbl> <dbl> <dbl>
1 X distribution 0.0904 -0.0885 0.269
2 Y distribution 1.97 1.55 2.39
Finally, visualize the confidence intervals.
ggplot(ci_xy) +
geom_linerange(aes(x = distribution,
ymin = lower,
ymax = upper),
size = 2) +
geom_point(aes(x = distribution,
y = mean_value),
size = 5)
Exercises
We are working with the SIPRI Arms Transfers Database. It contains information on all transfers of major conventional arms.
First, load it.
Draw a histogram of the Import variable.
Using table(), check the Regime variable.
Check the class() of the Regime variable. Is it correct? Explain. Correct it if needed.
Using filter(), keep only the Autocratic type of Regime and save it to an autocracies_import dataframe.
Using filter(), keep only the Democratic type of Regime and save it to a democracies_import dataframe.
Using rbind(), bind the rows of the autocracies_import and democracies_import dataframes. Save the result to a new dataframe, autdem_import.
Draw a boxplot (geom_boxplot()), where X is Regime and Y is Import. What can you see? Are these groups different?
Now, let’s calculate a confidence interval. First, group_by() the Regime type, and summarize() the average Import in democracies and autocracies, along with the lower and upper bounds of the confidence interval. Assume the variables are normally distributed.
Visualize the confidence intervals using geom_linerange() and geom_point(). Are these distributions different?
Finally, to double-check the conclusion, run t.test(). What is your final say? Are arms imports in democracies and autocracies different? Why?
Check List
I know how to introduce fixed effects, interactions and polynomials to the linear model
I know how to simulate the results of fixed effects and interactions
I know what the confidence interval is
I know how to test whether two distributions are different