Merging Data and Statistical Relationships

Week 5

Published

October 17, 2025

Before we start

  • Any questions?
Download script

Download data


Review

First, load the tidyverse library

library(tidyverse)

Load the WhoGov dataset.

whogov = read.csv("data/WhoGov.csv")

First of all, data wrangling (take a look on week 2 script).

Create a variable for independent and partisan leaders using case_when().

whogov = whogov %>%
  mutate(indep = ...(leader_party ... "independent" ... "Independent",
                     leader_party != ... "Partisan"))

Check the distribution of the newly created variable using table()

...

Now, visualizations. Draw a histogram of average_total. Add an average. Make sure to name X axis: “Average Tenure of People in cabinet”.

...

Now, draw a boxplot. X axis is indep, and Y axis is average_total.

whogov_boxplot = ...

whogov_boxplot

Add a facet_wrap(~ system_category). And add coord_flip(). How is it different from facet_grid() we used last time?

whogov_boxplot ...

Agenda

  • Learning how to merge data

  • Comparing distributions, statistically

Data Exploration

Today we are working with World Happiness report and V-Dem. Here are respective codebooks: WHR and V-Dem.

Let’s start with WHS.

  • Country_name is the name of the country

  • Ladder_score is the happiness score

Load the data.

whr = read.csv("data/WHR.csv")

Now, load the V-Dem data.

  • country_name is the name of the country

  • year is the year of observation

  • e_v2x_api_3C is Additive polyarchy index ordinal (what is polyarchy?)

I have filtered out only the variables we need to work with today.

load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData")) 

vdem = vdem %>%
  select(country_name, 
         year, 
         regime = e_v2x_api_3C)

Imagine, we are asked: are democracies happier than non-democracies?

Joining Data

We can start exploring this relationship with our skilleset! But we need to add one more: joining or merging the data.

To merge the data, we need to choose a variable that is common to both datasets. In our case, it’s the country names. Often, you would also have a year variable—but since our WHR report data is for the year 2022, let’s subset the V-Dem data to include only observations from 2022. Once again, why are we using==?

vdem_2022 = vdem %>%
  filter(year == 2022)

Now, let’s merge the data using left_join(). Pay attention to the syntax, especially the by part.

whr_vdem = whr %>%
  left_join(vdem_2022, by = c("Country_name" = "country_name")) 

Check the result.

whr_vdem = whr_vdem %>%
  select(Country_name, Ladder_score, regime)

whr_vdem %>%
  head()
  Country_name Ladder_score regime
1      Finland        7.804      1
2      Denmark        7.586      1
3      Iceland        7.530      1
4       Israel        7.473      1
5  Netherlands        7.403      1
6       Sweden        7.395      1

This is what we did, feel free to explore the tutorial.

Comparing Distributions

Comparing two groups is a bit easier, as things quickly get messier when you have more. Let’s start with that. Create a binary variable in the dataset that would indicate if a country is a Democracy or Autocracy. You know how to use this case_when() function already, right?

whr_vdem = whr_vdem %>%
  mutate(regime_categorical = case_when(regime != 1 ~ "Autocratic", 
                                        regime == 1 ~ "Democratic"))

Let’s start exploring patterns. What are the top 5 happiest countries? Are these democracies?

whr_vdem %>%
  slice_max(Ladder_score, n = 5)
  Country_name Ladder_score regime regime_categorical
1      Finland        7.804      1         Democratic
2      Denmark        7.586      1         Democratic
3      Iceland        7.530      1         Democratic
4       Israel        7.473      1         Democratic
5  Netherlands        7.403      1         Democratic

What are the least happy countries?

whr_vdem %>%
  slice_min(Ladder_score, n = 5)
      Country_name Ladder_score regime regime_categorical
1      Afghanistan        1.859    0.0         Autocratic
2          Lebanon        2.392    0.5         Autocratic
3     Sierra Leone        3.138    0.5         Autocratic
4         Zimbabwe        3.204     NA               <NA>
5 Congo (Kinshasa)        3.207     NA               <NA>

And what is in the middle of the distribution?

whr_vdem %>%
  slice(50:55)
  Country_name Ladder_score regime regime_categorical
1  El Salvador        6.122    0.5         Autocratic
2      Hungary        6.041    0.5         Autocratic
3    Argentina        6.024    1.0         Democratic
4     Honduras        6.023    0.5         Autocratic
5   Uzbekistan        6.014    0.0         Autocratic
6     Malaysia        6.012    0.5         Autocratic

Remove position = "identity" to get the sense of what it stands for. Without this, it would simply draw a histogram, stacking observations on top of each other! Finally, a new argument is alpha. It stands for the transparency of the element (bar).

ggplot(whr_vdem %>%
         drop_na(regime_categorical)) +
  geom_histogram(aes(x = Ladder_score, 
                     fill = regime_categorical), 
                 position = "identity",
                 alpha = 0.6)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Now, let’s calculate averages for Democracies and Autocracies.

dem_avg = whr_vdem %>% 
  filter(regime_categorical == "Democratic") %>% # leave only Democratic regimes
  summarize(mean = mean(Ladder_score))           # calculate the average 

aut_avg = whr_vdem %>% 
  filter(regime_categorical == "Autocratic") %>%
  summarize(mean = mean(Ladder_score))

And add the averages to the plot. So, are these groups different?

ggplot(whr_vdem %>%
         drop_na(regime_categorical)) +
  geom_histogram(aes(x = Ladder_score, fill = regime_categorical), 
                 position = "identity",
                 alpha = 0.6) +
  geom_vline(xintercept = dem_avg$mean, color = "blue") +
  geom_vline(xintercept = aut_avg$mean, color = "red")
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Formally, we can compare it using statistical tests. What does p-value indicate?

t.test(whr_vdem$Ladder_score ~ whr_vdem$regime_categorical)

    Welch Two Sample t-test

data:  whr_vdem$Ladder_score by whr_vdem$regime_categorical
t = -6.7424, df = 105.46, p-value = 8.575e-10
alternative hypothesis: true difference in means between group Autocratic and group Democratic is not equal to 0
95 percent confidence interval:
 -1.5392830 -0.8396989
sample estimates:
mean in group Autocratic mean in group Democratic 
                5.074309                 6.263800 

Check List

Optional Exercises

First of all, recreate whr_vdem dataframe to have Continent variable in it. Use left_join() to create a dataset. Select the following variables: Country_name, Ladder_score, Continent

Solution
whr_vdem = ... %>% 
  ...() %>%
  select(Country_name, Ladder_score, Continent)

In whr_vdem dataframe create a variable Continent_binary using mutate() and case_when() on Continent to divide observation on North America and Other.

Solution
whr_vdem = whr_vdem ...
  ...(Continent_binary = case_when(Continent == "North America" ... =  "...",
                                      Continent != "..." ~ "Other"))

Visualize the distributions of happiness score (Ladder_score). Is it helpful?

Solution
ggplot() +
  geom_histogram(data = whr_vdem, aes(x = ... , ... = Continent_binary), 
                 position = ...,
                 alpha = 0.6)

Calculate and add a a median for “Other” continents to the plot. Customize the plot, add theme_bw() and a title!

Solution
other_avg = whr_vdem %>% 
  filter(Continent == "...") ...
  ...(median = median(Ladder_score))

ggplot() +
  ...(..., aes(x = Ladder_score, fill = Continent_binary), 
                 position = "identity",
                 alpha = ...) +
  geom_vline(... = )

Perform a t.test() to understand if the average Happiness between North America and Other continents are statistically different. Interpret the result

Solution
...