= iris # embedded dataset
iris = 2
b
rm(b)
Statistical Tests
Week 4
Before we start
- Questions regarding Lab Assignment or the class?
Agenda
- clearing environment
Wrangling data
Joining datasets
Visualizing relationships
Conducting Statistical Tests
Exploring Dataset
Today we are working with the subset of V-Dem dataset. For your reference, their codebook is available here. Pay attention to .RDS
file format.
library(tidyverse)
= readRDS("data/vdem_regimes.RDS") vdem
Let’s check the variables included in the data. What does e_v2x_api
stand for? Check the codebook!
colnames(vdem)
[1] "country_name" "year" "e_v2x_api_3C"
To simplify the analysis, let’s rename the variable to regime
, this would help us expedite the process of working with the data!
= vdem %>%
vdem rename(regime = e_v2x_api_3C)
Now, let’s change the regime
so it’s easier to comprehend. Let’s discuss syntax for a second, pay special attention to ==
and !=
!. To make the comparison more comprehendable, let’s pretend there are only autocracies and democracies.
= vdem %>%
vdem mutate(regime_categorical = case_when(regime != 1 ~ "Autocratic",
== 1 ~ "Democratic")) regime
Now, make sure everything is labelled correctly! Attention to useNA = "always"
.
table(vdem$regime_categorical, useNA = "always")
Autocratic Democratic <NA>
20738 5837 1159
Now, let’s draw a histogram.
%>%
vdem ggplot() +
geom_histogram(aes(x = regime))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1159 rows containing non-finite outside the scale range
(`stat_bin()`).
Now, let’s draw the same variable, but in categorical type of data. Try adding stat = count
to the previous histogram, too. What does it change? Don’t forget to refresh previous script in mind (link)! Basically, we get the same information from the table.
%>%
vdem ggplot() +
geom_histogram(aes(x = regime_categorical), stat = "count")
Warning in geom_histogram(aes(x = regime_categorical), stat = "count"):
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
Joining data
Imagine, we are asked: are democracies happier than non-democracies?
We can approach answering that with the skills we have! But let’s develop a new one: merging datasets.
First, load the World Happiness Report data. I hope you remember the dataset! But just in case, a short reminder on variables:
Country_name
is the name of the countryLadder_score
is the happiness score
= read.csv("data/WHR.csv") whr
To merge the data, we need to choose a variable that is common to both datasets. In our case, it’s the country names. Often, you would also have a year variable—but since our WHR report data is for the year 2022, let’s subset the V-Dem data to include only observations from 2022. Once again, why are we using ==
?
= vdem %>%
vdem_2022 filter(year == 2022)
Now, let’s merge the data using left_join()
. Pay attention to the syntax, especially by
part.
= whr %>%
whr_vdem left_join(vdem_2022, by = c("Country_name" = "country_name"))
Check the result
= whr_vdem %>%
whr_vdem select(Country_name, Ladder_score, regime, regime_categorical)
%>%
whr_vdem head()
Country_name Ladder_score regime regime_categorical
1 Finland 7.804 1 Democratic
2 Denmark 7.586 1 Democratic
3 Iceland 7.530 1 Democratic
4 Israel 7.473 1 Democratic
5 Netherlands 7.403 1 Democratic
6 Sweden 7.395 1 Democratic
This is what we did, feel free to explore the tutorial.
Analyzing missing data
Now, we likely got a missing values as a result of the merger. The problem might arise due to the different spellings or absence of observation for specific year-country in one of the datasets.
Start with is.na()
– it checks if there are any missing values. And now, we can easily count if there are any missing values. Everything is here! Makes sense as we used left_join()
to attach variables to WHR data.
is.na(whr_vdem$Ladder_score) %>%
sum()
[1] 0
However, the potential missingness might be in the “right” part of the merge, namely the V-Dem data. Let’s check it, too.
is.na(whr_vdem$regime_categorical) %>%
sum()
[1] 22
Why this might be the case? Let’s explore. Using mutate()
create an indicator variable for the missing values first. Then, using filter()
let’s leave only missing values. And finally, using select()
let’s leave only country names. What these observations have in common?
= whr_vdem %>%
missing_data mutate(missing = is.na(regime_categorical)) %>% # create indicator variable for missing data
filter(missing == TRUE) %>% # leave only rows with missing regime type
select(Country_name) # choose the Country_name column
%>%
missing_data head()
Country_name
1 United States
2 Taiwan Province of China
3 Kazakhstan
4 Algeria
5 Hong Kong S.A.R. of China
6 Congo (Brazzaville)
Let’s see how these are spelled in the V-Dem data. For example, let’s take how “United States” can be spelled. Using str_view()
we can identify if the pattern exists in vdem_2022
dataframe.
str_view(vdem_2022$country_name, "United States")
[16] │ <United States> of America
Now, let’s see why there is no Kazakhstan. It is definitely missing! Probably, it was not yet ranked by V-Dem.
%>%
vdem_2022 filter(country_name == "Kazakhstan")
country_name year regime regime_categorical
1 Kazakhstan 2022 NA <NA>
All in all, these issues related to the most frequent reasons why data might be missing after merging:
“Key” variables are spelled differently
The data is not available in one of the datasets
We’ll stick to removing NAs from the dataset using na.omit()
. Check out tips in the end of the script to dig into other solutions to the problem of missing values.
= whr_vdem %>%
whr_vdem na.omit()
Visualizing Relationships
Remember, we are asked: are democracies happier than non-democracies? Given the type of data we are working with today, we’ll have categories (regime types) and continuous data (happiness score).
Let’s visualize distribution for Ladder_score
in whr_vdem
using geom_histogram()
. Do you remember how to do this?
Ok! A lot of things are going on. First of all, let’s discuss syntax a bit. Remove position = "identity"
to get the sense of what it stands for. Without this, it would simply draw a histogram, stacking observations on top of each other! Finally, a new argument is alpha
. It stands for the transparency of the element (bar).
ggplot() +
geom_histogram(data = whr_vdem, aes(x = Ladder_score, fill = regime_categorical),
position = "identity",
alpha = 0.6)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, let’s calculate averages for Democracies and Autocracies.
= whr_vdem %>%
dem_avg filter(regime_categorical == "Democratic") %>% # leave only Democratic regimes
summarize(mean = mean(Ladder_score)) # calculate the average
= whr_vdem %>%
aut_avg filter(regime_categorical == "Autocratic") %>%
summarize(mean = mean(Ladder_score))
And add the averages to the plot.
ggplot() +
geom_histogram(data = whr_vdem, aes(x = Ladder_score, fill = regime_categorical),
position = "identity",
alpha = 0.6) +
geom_vline(xintercept = dem_avg$mean, color = "blue") +
geom_vline(xintercept = aut_avg$mean, color = "red")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Formally, we can compare it using statistical tests. For example, a Student’s t-test. What does p-value indicate?
t.test(whr_vdem$Ladder_score ~ whr_vdem$regime_categorical)
Welch Two Sample t-test
data: whr_vdem$Ladder_score by whr_vdem$regime_categorical
t = -6.7424, df = 105.46, p-value = 8.575e-10
alternative hypothesis: true difference in means between group Autocratic and group Democratic is not equal to 0
95 percent confidence interval:
-1.5392830 -0.8396989
sample estimates:
mean in group Autocratic mean in group Democratic
5.074309 6.263800
Does it mean that democracies are happier? Is it a causal relation?
Additional Practice
Here are some exercises for you to practice what we did today. Investing some time in this would definitely pay off in the long term!
Some Tips
What to do with the missing data?
Ignore – a preferable option for the purposes of this class
Try out
right_join()
orfull_join()
, if applicable.If dealing with numeric data, you can insert an average for missing values
Impute. A bit challenging! But something to study beyond this class
Check List
I know how clear my environment, and I will do that occasionally
I know how to use mutate()
and case_when()
to recode my variables
I know how to identify missing values
I have the intuition behind merging the data, and left_join()
helps me doing complex stuff
I know how to visualize multiple distributions
I can easily compare distributions using t.test()
and interpret the result