library(tidyverse)
library(kableExtra)
Lab 8 Extra
ESS Data Wrangling
Introduction
First of all, I’m glad you opened this link! Feel free to copy the code for your purposes.
Not all data comes in a neat and clean way. Here I’ll show you how I prepared the data for Lab 8.
Turning to Business
As usual, let’s load the tidyverse library for data wrangling and kableExtra for better display of the results in the HTML document.
Now, let’s load the raw dataset.
df = read.csv("data_raw/ESS11-subset.csv")
Here is what it looks like:
df %>%
  head() %>%
  kable()
name | essround | edition | proddate | idno | cntry | dweight | pspwght | pweight | anweight | pplfair | trstplt | trstprt | happy | rlgdgr | edulvlb | prob | stratum | psu |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ESS11e02 | 11 | 2 | 20.11.2024 | 50014 | AT | 1.1851145 | 0.3928906 | 0.3309145 | 0.1300132 | 5 | 5 | 5 | 8 | 5 | 322 | 0.0005786 | 107 | 317 |
ESS11e02 | 11 | 2 | 20.11.2024 | 50030 | AT | 0.6098981 | 0.3251533 | 0.3309145 | 0.1075980 | 0 | 1 | 0 | 9 | 0 | 423 | 0.0011244 | 69 | 128 |
ESS11e02 | 11 | 2 | 20.11.2024 | 50057 | AT | 1.3923296 | 4.0000234 | 0.3309145 | 1.3236659 | 9 | 4 | 4 | 9 | 8 | 610 | 0.0004925 | 18 | 418 |
ESS11e02 | 11 | 2 | 20.11.2024 | 50106 | AT | 0.5560615 | 0.1762276 | 0.3309145 | 0.0583163 | 6 | 3 | 3 | 7 | 6 | 422 | 0.0012333 | 101 | 295 |
ESS11e02 | 11 | 2 | 20.11.2024 | 50145 | AT | 0.7227953 | 1.0609399 | 0.3309145 | 0.3510804 | 3 | 5 | 5 | 9 | 1 | 322 | 0.0009488 | 115 | 344 |
ESS11e02 | 11 | 2 | 20.11.2024 | 50158 | AT | 0.9926053 | 1.3928125 | 0.3309145 | 0.4609019 | 8 | 5 | 5 | 8 | 3 | 313 | 0.0006909 | 7 | 373 |
You can see a lot of variables there, and many rows for each country. Since this is survey data, the ESS interviews many people in each participating European country, and each row is one respondent. In the Lab we use data aggregated to the country level, so we need to calculate the average for each country.
Take a look at their codebook here. Check how the “pplfair - Most people try to take advantage of you, or try to be fair” variable is coded:
Value | Category |
---|---|
0 | Most people try to take advantage of me |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 6 |
7 | 7 |
8 | 8 |
9 | 9 |
10 | Most people try to be fair |
77 | Refusal* |
88 | Don’t know* |
99 | No answer* |
Thus, we need to recode the values 77, 88 and 99 as missing. Otherwise they inflate our aggregated values, which could even exceed 10, although the scale only runs from 0 to 10. Let’s see:
df %>%
  ggplot(aes(x = pplfair)) +
  geom_histogram()
Let’s record the average without correcting the data.
avg_raw = mean(df$pplfair)
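To see why the raw average is suspect, here is a small sketch on a made-up vector (pplfair_toy is hypothetical, not ESS data). It counts the answers that fall outside the 0 to 10 scale and shows how they pull the mean upwards.

```r
# Made-up answers: four valid responses plus the three special codes
pplfair_toy <- c(0, 5, 10, 77, 88, 99, 7)

# How many answers lie outside the 0-10 scale?
n_outside <- sum(pplfair_toy > 10)
n_outside

# The special codes drag the mean far above the top of the scale
mean(pplfair_toy)

# Mean of the valid answers only, for comparison
mean(pplfair_toy[pplfair_toy <= 10])
```

In the real data the special codes are much rarer than in this toy vector, so the distortion is smaller, but it works in the same direction.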
Below you can see how I recode the variables using case_when() from tidyverse. I rely on the documentation of the European Social Survey, as you should too! The odd-looking NA_real_ simply sets the value to NA (a missing value of numeric type). The syntax is not very straightforward, so take some time to understand what’s going on!
df = df %>%
  # Set the special codes 77, 88 and 99 to NA; keep all other values as they are
  mutate(pplfair = case_when(pplfair == 99 ~ NA_real_,
                             pplfair == 88 ~ NA_real_,
                             pplfair == 77 ~ NA_real_,
                             TRUE ~ pplfair),
         trstplt = case_when(trstplt == 99 ~ NA_real_,
                             trstplt == 88 ~ NA_real_,
                             trstplt == 77 ~ NA_real_,
                             TRUE ~ trstplt),
         trstprt = case_when(trstprt == 99 ~ NA_real_,
                             trstprt == 88 ~ NA_real_,
                             trstprt == 77 ~ NA_real_,
                             TRUE ~ trstprt),
         rlgdgr = case_when(rlgdgr == 99 ~ NA_real_,
                            rlgdgr == 88 ~ NA_real_,
                            rlgdgr == 77 ~ NA_real_,
                            TRUE ~ rlgdgr),
         happy = case_when(happy == 99 ~ NA_real_,
                           happy == 88 ~ NA_real_,
                           happy == 77 ~ NA_real_,
                           TRUE ~ happy),
         # Collapse the detailed education codes into four levels; anything else becomes NA
         edulvlb = case_when(edulvlb < 610 ~ 1,
                             edulvlb == 610 | edulvlb == 620 ~ 2,
                             edulvlb == 710 | edulvlb == 720 ~ 3,
                             edulvlb == 800 ~ 4,
                             TRUE ~ NA_real_))
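The same three codes are recoded for five variables above, so the step can also be written more compactly. The following is only a sketch of an alternative, not the code used to prepare the Lab data: it assumes dplyr 1.0.0 or newer (for across()), and the toy data frame and the missing_codes vector are made up for illustration.

```r
library(dplyr)

# Made-up data standing in for the ESS subset
toy <- data.frame(
  pplfair = c(5, 77, 10, 88),
  happy   = c(8, 9, 99, 7)
)

# Special codes to treat as missing (hypothetical helper vector)
missing_codes <- c(77, 88, 99)

# Apply the same recoding rule to several columns at once
toy_clean <- toy %>%
  mutate(across(c(pplfair, happy),
                ~ if_else(.x %in% missing_codes, NA_real_, as.numeric(.x))))

toy_clean
```

The explicit case_when() version is longer, but it makes every rule visible, which is helpful while you are still learning the syntax.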
Now, let’s calculate the average. Don’t forget, now we have NAs, so we need to use na.rm = TRUE; otherwise mean() returns NA instead of a number.
avg_corr = mean(df$pplfair, na.rm = TRUE)
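A minimal sketch of what na.rm does, using a made-up vector x rather than the ESS data:

```r
# Made-up vector with two missing values
x <- c(5, NA, 10, NA)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(x)
mean_with_na            # NA

# With na.rm = TRUE, the NAs are dropped before averaging
mean_without_na <- mean(x, na.rm = TRUE)
mean_without_na         # 7.5
```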
Now, compare. The difference can be quite meaningful!
data.frame(`Raw` = avg_raw, `Recoded` = avg_corr) %>%
kable()
Raw | Recoded |
---|---|
6.131985 | 5.708889 |
Now, let’s group the data by country to aggregate variables to the country level.
df_groupped = df %>%
  group_by(cntry) %>%
  # One row per country: the mean of each variable, ignoring NAs
  summarize(pplfair = mean(pplfair, na.rm = TRUE),
            trstplt = mean(trstplt, na.rm = TRUE),
            trstprt = mean(trstprt, na.rm = TRUE),
            rlgdgr = mean(rlgdgr, na.rm = TRUE),
            happy = mean(happy, na.rm = TRUE),
            edulvlb = mean(edulvlb, na.rm = TRUE))
head(df_groupped) %>%
kable()
cntry | pplfair | trstplt | trstprt | rlgdgr | happy | edulvlb |
---|---|---|---|---|---|---|
AT | 6.385696 | 3.773544 | 3.723961 | 4.612917 | 7.781570 | 1.243568 |
BE | 6.076585 | 4.172808 | 3.942565 | 4.145283 | 7.778544 | 1.642675 |
CH | 6.431571 | 5.540824 | 5.292646 | 4.482909 | 8.154348 | 1.498912 |
CY | 4.428363 | 2.681481 | 2.497041 | 6.756598 | 6.969118 | 1.469173 |
DE | 6.177006 | 4.004165 | 3.974069 | 3.922599 | 7.762929 | 1.435993 |
ES | 5.578890 | 2.784035 | 2.776864 | 4.104518 | 7.847991 | 1.438520 |
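If you want to check the grouping logic on something small, here is a sketch with a made-up data frame (toy and toy_grouped are hypothetical names). It also uses across() to avoid repeating mean() for every column, which assumes dplyr 1.0.0 or newer.

```r
library(dplyr)

# Made-up respondents from two countries, with some missing answers
toy <- data.frame(
  cntry   = c("AT", "AT", "BE", "BE"),
  pplfair = c(4, NA, 6, 8),
  happy   = c(8, 6, NA, 9)
)

# One row per country, averaging each variable and ignoring NAs
toy_grouped <- toy %>%
  group_by(cntry) %>%
  summarize(across(c(pplfair, happy), ~ mean(.x, na.rm = TRUE)))

toy_grouped
```

Each country ends up with a single row, and the NA answers are left out of that country's average.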
Finally, save the data. This is the data we have been working with in Lab 8.
write.csv(df_groupped, "data/ess.csv", row.names = FALSE)