Lab 8 Extra
ESS Data Wrangling
Introduction
First of all, I’m glad you opened this link! Feel free to copy the code for your own purposes.
Not all data comes neat and clean. Here I’ll show you how I prepared the data for Lab 8.
Turning to Business
As usual, let’s load the tidyverse library for data wrangling and kableExtra for nicer display of the results in the HTML document.
library(tidyverse)
library(kableExtra)
Now, let’s load the raw dataset.
df = read.csv("data_raw/ESS11-subset.csv")
Here is how it looks:
df %>%
  head() %>%
  kable()

| name | essround | edition | proddate | idno | cntry | dweight | pspwght | pweight | anweight | pplfair | trstplt | trstprt | happy | rlgdgr | edulvlb | prob | stratum | psu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ESS11e02 | 11 | 2 | 20.11.2024 | 50014 | AT | 1.1851145 | 0.3928906 | 0.3309145 | 0.1300132 | 5 | 5 | 5 | 8 | 5 | 322 | 0.0005786 | 107 | 317 |
| ESS11e02 | 11 | 2 | 20.11.2024 | 50030 | AT | 0.6098981 | 0.3251533 | 0.3309145 | 0.1075980 | 0 | 1 | 0 | 9 | 0 | 423 | 0.0011244 | 69 | 128 |
| ESS11e02 | 11 | 2 | 20.11.2024 | 50057 | AT | 1.3923296 | 4.0000234 | 0.3309145 | 1.3236659 | 9 | 4 | 4 | 9 | 8 | 610 | 0.0004925 | 18 | 418 |
| ESS11e02 | 11 | 2 | 20.11.2024 | 50106 | AT | 0.5560615 | 0.1762276 | 0.3309145 | 0.0583163 | 6 | 3 | 3 | 7 | 6 | 422 | 0.0012333 | 101 | 295 |
| ESS11e02 | 11 | 2 | 20.11.2024 | 50145 | AT | 0.7227953 | 1.0609399 | 0.3309145 | 0.3510804 | 3 | 5 | 5 | 9 | 1 | 322 | 0.0009488 | 115 | 344 |
| ESS11e02 | 11 | 2 | 20.11.2024 | 50158 | AT | 0.9926053 | 1.3928125 | 0.3309145 | 0.4609019 | 8 | 5 | 5 | 8 | 3 | 313 | 0.0006909 | 7 | 373 |
You will notice a lot of variables there, and many rows for each country. Because this is survey data, the ESS interviews many respondents in each participating European country (not only EU members: Switzerland, coded CH, is included too). In the Lab we use data aggregated to the country level, so we need to calculate country averages.
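A quick way to confirm that each country contributes many rows is to count respondents per country. The sketch below uses a small made-up data frame (`toy` and `n_respondents` are illustrative names, not part of the ESS file); the same `count()` call should work on `df`.

```r
library(dplyr)

# Made-up rows standing in for the ESS subset (assumption: not real data)
toy <- data.frame(
  cntry   = c("AT", "AT", "BE", "AT", "BE"),
  pplfair = c(5, 0, 9, 6, 3)
)

# One row per country with the number of respondents
per_country <- toy %>%
  count(cntry, name = "n_respondents")

per_country
```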
Take a look at their codebook here. Check how the variable “pplfair - Most people try to take advantage of you, or try to be fair” is coded:
| Value | Category |
|---|---|
| 0 | Most people try to take advantage of me |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
| 10 | Most people try to be fair |
| 77 | Refusal* |
| 88 | Don’t know* |
| 99 | No answer* |
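To get a feel for how much the special codes can distort a mean, here is a minimal sketch on a made-up vector (the values and the names `answers` and `special_codes` are assumptions for illustration only):

```r
# Made-up answers on the 0-10 scale, plus the three special codes
answers <- c(0, 5, 10, 77, 88, 99, 7)
special_codes <- c(77, 88, 99)

# How many answers are special codes rather than real scale values?
n_special <- sum(answers %in% special_codes)

# Mean with the codes treated as real answers vs. valid answers only
mean_with_codes <- mean(answers)
mean_valid_only <- mean(answers[!answers %in% special_codes])

c(n_special = n_special,
  with_codes = mean_with_codes,
  valid_only = mean_valid_only)
```

In this toy case the mean that includes the codes lands far above 10, while the mean of the valid answers stays inside the scale.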
Thus, we need to recode the values 77, 88 and 99 as missing; otherwise they are treated as real answers and pull the averages upward, possibly beyond 10, the maximum of the scale. Let’s see:
df %>%
  ggplot(aes(x = pplfair)) +
  geom_histogram()
Let’s record the average without correcting the data.
avg_raw = mean(df$pplfair)
You can see below how I recode the variables using case_when() from tidyverse. I rely on the documentation of the European Social Survey, as you should too! The odd-looking NA_real_ is simply a missing value (NA) of numeric type, so the matching values become NA. The syntax is not entirely straightforward, so take some time to understand what’s going on!
df = df %>%
  mutate(
    # 77, 88 and 99 are special codes (refusal, don't know, no answer)
    pplfair = case_when(pplfair == 99 ~ NA_real_,
                        pplfair == 88 ~ NA_real_,
                        pplfair == 77 ~ NA_real_,
                        TRUE ~ pplfair),
    trstplt = case_when(trstplt == 99 ~ NA_real_,
                        trstplt == 88 ~ NA_real_,
                        trstplt == 77 ~ NA_real_,
                        TRUE ~ trstplt),
    trstprt = case_when(trstprt == 99 ~ NA_real_,
                        trstprt == 88 ~ NA_real_,
                        trstprt == 77 ~ NA_real_,
                        TRUE ~ trstprt),
    rlgdgr = case_when(rlgdgr == 99 ~ NA_real_,
                       rlgdgr == 88 ~ NA_real_,
                       rlgdgr == 77 ~ NA_real_,
                       TRUE ~ rlgdgr),
    happy = case_when(happy == 99 ~ NA_real_,
                      happy == 88 ~ NA_real_,
                      happy == 77 ~ NA_real_,
                      TRUE ~ happy),
    # collapse the detailed education codes into four levels;
    # any other value becomes NA
    edulvlb = case_when(edulvlb < 610 ~ 1,
                        edulvlb == 610 | edulvlb == 620 ~ 2,
                        edulvlb == 710 | edulvlb == 720 ~ 3,
                        edulvlb == 800 ~ 4,
                        TRUE ~ NA_real_))
Now, let’s calculate the average. Don’t forget that we now have NAs, so we need na.rm = TRUE; otherwise mean() returns NA.
avg_corr = mean(df$pplfair, na.rm = TRUE)
Now, compare the two. The difference is quite meaningful!
data.frame(`Raw` = avg_raw, `Recoded` = avg_corr) %>%
  kable()

| Raw | Recoded |
|---|---|
| 6.131985 | 5.708889 |
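The recode above repeats the same three conditions for every 0-10 variable. A more compact alternative, shown here only as a sketch on made-up data, applies one rule to several columns with across(). The names `toy`, `toy_clean` and `missing_codes` are illustrative, and base R's replace() is swapped in for case_when(); this is not how the Lab data was actually prepared.

```r
library(dplyr)

# Made-up data standing in for the ESS subset (assumption: not real data)
toy <- data.frame(
  pplfair = c(5, 77, 8, 99),
  happy   = c(9, 88, 7, 6)
)

# Special codes to be treated as missing
missing_codes <- c(77, 88, 99)

# Apply the same recode to every listed column
toy_clean <- toy %>%
  mutate(across(c(pplfair, happy),
                ~ replace(.x, .x %in% missing_codes, NA)))

# Without na.rm = TRUE the mean of a column containing NA is NA
mean(toy_clean$pplfair)
mean(toy_clean$pplfair, na.rm = TRUE)
```

The last two lines also illustrate why na.rm = TRUE is needed once the special codes have been turned into NA.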
Now, let’s group the data by country to aggregate variables to the country level.
df_groupped = df %>%
  group_by(cntry) %>%
  summarize(pplfair = mean(pplfair, na.rm = TRUE),
            trstplt = mean(trstplt, na.rm = TRUE),
            trstprt = mean(trstprt, na.rm = TRUE),
            rlgdgr = mean(rlgdgr, na.rm = TRUE),
            happy = mean(happy, na.rm = TRUE),
            edulvlb = mean(edulvlb, na.rm = TRUE))
head(df_groupped) %>%
  kable()

| cntry | pplfair | trstplt | trstprt | rlgdgr | happy | edulvlb |
|---|---|---|---|---|---|---|
| AT | 6.385696 | 3.773544 | 3.723961 | 4.612917 | 7.781570 | 1.243568 |
| BE | 6.076585 | 4.172808 | 3.942565 | 4.145283 | 7.778544 | 1.642675 |
| CH | 6.431571 | 5.540824 | 5.292646 | 4.482909 | 8.154348 | 1.498912 |
| CY | 4.428363 | 2.681481 | 2.497041 | 6.756598 | 6.969118 | 1.469173 |
| DE | 6.177006 | 4.004165 | 3.974069 | 3.922599 | 7.762929 | 1.435993 |
| ES | 5.578890 | 2.784035 | 2.776864 | 4.104518 | 7.847991 | 1.438520 |
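The country-level aggregation can also be written with across(), so that the mean is specified once for all columns. The following is a hedged sketch on made-up data (`toy` and `toy_grouped` are illustrative names), not the code used to produce the table above.

```r
library(dplyr)

# Made-up respondent-level data with some missing answers (assumption)
toy <- data.frame(
  cntry   = c("AT", "AT", "BE", "BE"),
  pplfair = c(6, NA, 4, 8),
  happy   = c(8, 7, NA, 9)
)

# Average every listed variable within each country, ignoring NAs
toy_grouped <- toy %>%
  group_by(cntry) %>%
  summarize(across(c(pplfair, happy), ~ mean(.x, na.rm = TRUE)))

toy_grouped
```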
Finally, save the data. This is the data we have been working with in Lab 8.
write.csv(df_groupped, "data/ess.csv", row.names = FALSE)