library(tidyverse)Merging Data and Statistical Relationships
Week 5
Before we start
- Any questions?
First, load the tidyverse library
Load the WhoGov dataset.
whogov = read.csv("data/WhoGov.csv")First of all, data wrangling (take a look on week 2 script).
Create a variable for independent and partisan leaders using case_when().
whogov = whogov %>%
mutate(indep = ...(leader_party ... "independent" ... "Independent",
leader_party != ... "Partisan"))Check the distribution of the newly created variable using table()
...Now, visualizations. Draw a histogram of average_total. Add an average. Make sure to name X axis: “Average Tenure of People in cabinet”.
...Now, draw a boxplot. X axis is indep, and Y axis is average_total.
whogov_boxplot = ...
whogov_boxplotAdd a facet_wrap(~ system_category). And add coord_flip(). How is it different from facet_grid() we used last time?
whogov_boxplot ...Agenda
Learning how to merge data
Comparing distributions, statistically
Data Exploration
Today we are working with World Happiness report and V-Dem. Here are respective codebooks: WHR and V-Dem.
Let’s start with WHS.
Country_nameis the name of the countryLadder_scoreis the happiness score
Load the data.
whr = read.csv("data/WHR.csv")Now, load the V-Dem data.
country_nameis the name of the countryyearis the year of observatione_v2x_api_3Cis Additive polyarchy index ordinal (what is polyarchy?)
I have filtered out only the variables we need to work with today.
load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))
vdem = vdem %>%
select(country_name,
year,
regime = e_v2x_api_3C)Imagine, we are asked: are democracies happier than non-democracies?
Joining Data
We can start exploring this relationship with our skilleset! But we need to add one more: joining or merging the data.
To merge the data, we need to choose a variable that is common to both datasets. In our case, it’s the country names. Often, you would also have a year variable—but since our WHR report data is for the year 2022, let’s subset the V-Dem data to include only observations from 2022. Once again, why are we using==?
vdem_2022 = vdem %>%
filter(year == 2022)Now, let’s merge the data using left_join(). Pay attention to the syntax, especially the by part.
whr_vdem = whr %>%
left_join(vdem_2022, by = c("Country_name" = "country_name")) Check the result.
whr_vdem = whr_vdem %>%
select(Country_name, Ladder_score, regime)
whr_vdem %>%
head() Country_name Ladder_score regime
1 Finland 7.804 1
2 Denmark 7.586 1
3 Iceland 7.530 1
4 Israel 7.473 1
5 Netherlands 7.403 1
6 Sweden 7.395 1
This is what we did, feel free to explore the tutorial.
Comparing Distributions
Comparing two groups is a bit easier, as things quickly get messier when you have more. Let’s start with that. Create a binary variable in the dataset that would indicate if a country is a Democracy or Autocracy. You know how to use this case_when() function already, right?
whr_vdem = whr_vdem %>%
mutate(regime_categorical = case_when(regime != 1 ~ "Autocratic",
regime == 1 ~ "Democratic"))Let’s start exploring patterns. What are the top 5 happiest countries? Are these democracies?
whr_vdem %>%
slice_max(Ladder_score, n = 5) Country_name Ladder_score regime regime_categorical
1 Finland 7.804 1 Democratic
2 Denmark 7.586 1 Democratic
3 Iceland 7.530 1 Democratic
4 Israel 7.473 1 Democratic
5 Netherlands 7.403 1 Democratic
What are the least happy countries?
whr_vdem %>%
slice_min(Ladder_score, n = 5) Country_name Ladder_score regime regime_categorical
1 Afghanistan 1.859 0.0 Autocratic
2 Lebanon 2.392 0.5 Autocratic
3 Sierra Leone 3.138 0.5 Autocratic
4 Zimbabwe 3.204 NA <NA>
5 Congo (Kinshasa) 3.207 NA <NA>
And what is in the middle of the distribution?
whr_vdem %>%
slice(50:55) Country_name Ladder_score regime regime_categorical
1 El Salvador 6.122 0.5 Autocratic
2 Hungary 6.041 0.5 Autocratic
3 Argentina 6.024 1.0 Democratic
4 Honduras 6.023 0.5 Autocratic
5 Uzbekistan 6.014 0.0 Autocratic
6 Malaysia 6.012 0.5 Autocratic
Remove position = "identity" to get the sense of what it stands for. Without this, it would simply draw a histogram, stacking observations on top of each other! Finally, a new argument is alpha. It stands for the transparency of the element (bar).
ggplot(whr_vdem %>%
drop_na(regime_categorical)) +
geom_histogram(aes(x = Ladder_score,
fill = regime_categorical),
position = "identity",
alpha = 0.6)`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Now, let’s calculate averages for Democracies and Autocracies.
dem_avg = whr_vdem %>%
filter(regime_categorical == "Democratic") %>% # leave only Democratic regimes
summarize(mean = mean(Ladder_score)) # calculate the average
aut_avg = whr_vdem %>%
filter(regime_categorical == "Autocratic") %>%
summarize(mean = mean(Ladder_score))And add the averages to the plot. So, are these groups different?
ggplot(whr_vdem %>%
drop_na(regime_categorical)) +
geom_histogram(aes(x = Ladder_score, fill = regime_categorical),
position = "identity",
alpha = 0.6) +
geom_vline(xintercept = dem_avg$mean, color = "blue") +
geom_vline(xintercept = aut_avg$mean, color = "red")`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Formally, we can compare it using statistical tests. What does p-value indicate?
t.test(whr_vdem$Ladder_score ~ whr_vdem$regime_categorical)
Welch Two Sample t-test
data: whr_vdem$Ladder_score by whr_vdem$regime_categorical
t = -6.7424, df = 105.46, p-value = 8.575e-10
alternative hypothesis: true difference in means between group Autocratic and group Democratic is not equal to 0
95 percent confidence interval:
-1.5392830 -0.8396989
sample estimates:
mean in group Autocratic mean in group Democratic
5.074309 6.263800
Check List
Optional Exercises
First of all, recreate whr_vdem dataframe to have Continent variable in it. Use left_join() to create a dataset. Select the following variables: Country_name, Ladder_score, Continent
In whr_vdem dataframe create a variable Continent_binary using mutate() and case_when() on Continent to divide observation on North America and Other.
Visualize the distributions of happiness score (Ladder_score). Is it helpful?
Calculate and add a a median for “Other” continents to the plot. Customize the plot, add theme_bw() and a title!
Perform a t.test() to understand if the average Happiness between North America and Other continents are statistically different. Interpret the result