Statistical Tests

Week 4

Published

April 24, 2025

Before we start

  • Questions regarding Lab Assignment or the class?

Download script

Download data


Agenda

  • clearing environment
iris = iris   # embedded dataset
b = 2

rm(b)
  • Wrangling data

  • Joining datasets

  • Visualizing relationships

  • Conducting Statistical Tests

Exploring Dataset

Today we are working with the subset of V-Dem dataset. For your reference, their codebook is available here. Pay attention to .RDS file format.

library(tidyverse)

vdem = readRDS("data/vdem_regimes.RDS")

Let’s check the variables included in the data. What does e_v2x_api stand for? Check the codebook!

colnames(vdem)
[1] "country_name" "year"         "e_v2x_api_3C"

To simplify the analysis, let’s rename the variable to regime, this would help us expedite the process of working with the data!

vdem = vdem %>%
  rename(regime = e_v2x_api_3C)

Now, let’s change the regime so it’s easier to comprehend. Let’s discuss syntax for a second, pay special attention to == and !=!. To make the comparison more comprehendable, let’s pretend there are only autocracies and democracies.

vdem = vdem %>%
  mutate(regime_categorical = case_when(regime != 1 ~ "Autocratic", 
                                        regime == 1 ~ "Democratic"))

Now, make sure everything is labelled correctly! Attention to useNA = "always".

table(vdem$regime_categorical, useNA = "always")

Autocratic Democratic       <NA> 
     20738       5837       1159 

Now, let’s draw a histogram.

vdem %>%
  ggplot() +
  geom_histogram(aes(x = regime))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1159 rows containing non-finite outside the scale range
(`stat_bin()`).

Now, let’s draw the same variable, but in categorical type of data. Try adding stat = count to the previous histogram, too. What does it change? Don’t forget to refresh previous script in mind (link)! Basically, we get the same information from the table.

vdem %>%
  ggplot() +
  geom_histogram(aes(x = regime_categorical), stat = "count")
Warning in geom_histogram(aes(x = regime_categorical), stat = "count"):
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`

Joining data

Imagine, we are asked: are democracies happier than non-democracies?

We can approach answering that with the skills we have! But let’s develop a new one: merging datasets.

First, load the World Happiness Report data. I hope you remember the dataset! But just in case, a short reminder on variables:

  • Country_name is the name of the country

  • Ladder_score is the happiness score

whr = read.csv("data/WHR.csv")

To merge the data, we need to choose a variable that is common to both datasets. In our case, it’s the country names. Often, you would also have a year variable—but since our WHR report data is for the year 2022, let’s subset the V-Dem data to include only observations from 2022. Once again, why are we using ==?

vdem_2022 = vdem %>%
  filter(year == 2022)

Now, let’s merge the data using left_join(). Pay attention to the syntax, especially by part.

whr_vdem = whr %>%
  left_join(vdem_2022, by = c("Country_name" = "country_name")) 

Check the result

whr_vdem = whr_vdem %>%
  select(Country_name, Ladder_score, regime, regime_categorical)

whr_vdem %>%
  head()
  Country_name Ladder_score regime regime_categorical
1      Finland        7.804      1         Democratic
2      Denmark        7.586      1         Democratic
3      Iceland        7.530      1         Democratic
4       Israel        7.473      1         Democratic
5  Netherlands        7.403      1         Democratic
6       Sweden        7.395      1         Democratic

This is what we did, feel free to explore the tutorial.

Analyzing missing data

Now, we likely got a missing values as a result of the merger. The problem might arise due to the different spellings or absence of observation for specific year-country in one of the datasets.

Start with is.na() – it checks if there are any missing values. And now, we can easily count if there are any missing values. Everything is here! Makes sense as we used left_join() to attach variables to WHR data.

is.na(whr_vdem$Ladder_score) %>%
  sum()
[1] 0

However, the potential missingness might be in the “right” part of the merge, namely the V-Dem data. Let’s check it, too.

is.na(whr_vdem$regime_categorical) %>%
  sum()
[1] 22

Why this might be the case? Let’s explore. Using mutate() create an indicator variable for the missing values first. Then, using filter() let’s leave only missing values. And finally, using select() let’s leave only country names. What these observations have in common?

missing_data = whr_vdem %>%
  mutate(missing = is.na(regime_categorical)) %>% # create indicator variable for missing data
  filter(missing == TRUE) %>%                     # leave only rows with missing regime type
  select(Country_name)                            # choose the Country_name column

missing_data %>%
  head()
               Country_name
1             United States
2  Taiwan Province of China
3                Kazakhstan
4                   Algeria
5 Hong Kong S.A.R. of China
6       Congo (Brazzaville)

Let’s see how these are spelled in the V-Dem data. For example, let’s take how “United States” can be spelled. Using str_view() we can identify if the pattern exists in vdem_2022 dataframe.

str_view(vdem_2022$country_name, "United States")
[16] │ <United States> of America

Now, let’s see why there is no Kazakhstan. It is definitely missing! Probably, it was not yet ranked by V-Dem.

vdem_2022 %>%
  filter(country_name == "Kazakhstan")
  country_name year regime regime_categorical
1   Kazakhstan 2022     NA               <NA>

All in all, these issues related to the most frequent reasons why data might be missing after merging:

  • “Key” variables are spelled differently

  • The data is not available in one of the datasets

We’ll stick to removing NAs from the dataset using na.omit(). Check out tips in the end of the script to dig into other solutions to the problem of missing values.

whr_vdem = whr_vdem %>% 
  na.omit()

Visualizing Relationships

Remember, we are asked: are democracies happier than non-democracies? Given the type of data we are working with today, we’ll have categories (regime types) and continuous data (happiness score).

Let’s visualize distribution for Ladder_score in whr_vdem using geom_histogram(). Do you remember how to do this?

Coding Task
ggplot(whr_vdem) ...
  ...

Ok! A lot of things are going on. First of all, let’s discuss syntax a bit. Remove position = "identity" to get the sense of what it stands for. Without this, it would simply draw a histogram, stacking observations on top of each other! Finally, a new argument is alpha. It stands for the transparency of the element (bar).

ggplot() +
  geom_histogram(data = whr_vdem, aes(x = Ladder_score, fill = regime_categorical), 
                 position = "identity",
                 alpha = 0.6)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, let’s calculate averages for Democracies and Autocracies.

dem_avg = whr_vdem %>% 
  filter(regime_categorical == "Democratic") %>% # leave only Democratic regimes
  summarize(mean = mean(Ladder_score))           # calculate the average 

aut_avg = whr_vdem %>% 
  filter(regime_categorical == "Autocratic") %>%
  summarize(mean = mean(Ladder_score))

And add the averages to the plot.

ggplot() +
  geom_histogram(data = whr_vdem, aes(x = Ladder_score, fill = regime_categorical), 
                 position = "identity",
                 alpha = 0.6) +
  geom_vline(xintercept = dem_avg$mean, color = "blue") +
  geom_vline(xintercept = aut_avg$mean, color = "red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Formally, we can compare it using statistical tests. For example, a Student’s t-test. What does p-value indicate?

t.test(whr_vdem$Ladder_score ~ whr_vdem$regime_categorical)

    Welch Two Sample t-test

data:  whr_vdem$Ladder_score by whr_vdem$regime_categorical
t = -6.7424, df = 105.46, p-value = 8.575e-10
alternative hypothesis: true difference in means between group Autocratic and group Democratic is not equal to 0
95 percent confidence interval:
 -1.5392830 -0.8396989
sample estimates:
mean in group Autocratic mean in group Democratic 
                5.074309                 6.263800 

Does it mean that democracies are happier? Is it a causal relation?

Additional Practice

Here are some exercises for you to practice what we did today. Investing some time in this would definitely pay off in the long term!

Coding Task

In whr_vdem dataframe create a variable Continent_binary using mutate() and case_when() on Continent to divide observation on North America and Other.

whr_vdem = whr_vdem ...
  ...(Continent_binary = case_when(Continent == "North America" ... =  "...",
                                      Continent != "..." ~ "Other"))

Visualize the distributions. Is it helpful?

ggplot() +
  geom_histogram(data = whr_vdem, aes(x = ... , ... = Continent_binary), 
                 position = ...,
                 alpha = 0.6)

Calculate and add a a median for “Other” continents to the plot. Customize the plot, add theme_bw() and a title!

other_avg = whr_vdem %>% 
  filter(Continent == "...") ...
  ...(median = median(Ladder_score))

ggplot() +
  ...(..., aes(x = Ladder_score, fill = Continent_binary), 
                 position = "identity",
                 alpha = ...) +
  geom_vline(... = )

Perform a t.test() to understand if the average Happiness between North America and Other continents are statistically different. Interpret the result

Some Tips

What to do with the missing data?

  • Ignore – a preferable option for the purposes of this class

  • Try out right_join() or full_join(), if applicable.

  • If dealing with numeric data, you can insert an average for missing values

  • Impute. A bit challenging! But something to study beyond this class

Check List

I know how clear my environment, and I will do that occasionally

I know how to use mutate() and case_when() to recode my variables

I know how to identify missing values

I have the intuition behind merging the data, and left_join() helps me doing complex stuff

I know how to visualize multiple distributions

I can easily compare distributions using t.test() and interpret the result