Lab 8 Extra

ESS Data Wrangling

Introduction

First of all, I’m glad you opened this link! Feel free to copy the code for your purposes.

Not all data comes in a neat and clean way. Here I’ll show you how I prepared the data for Lab 8.

Turning to Business

As usual, let’s load the tidyverse library for data wrangling and kableExtra for a nicer display of the results in the HTML document.

library(tidyverse)
library(kableExtra)

Now, let’s load the raw dataset.

df = read.csv("data_raw/ESS11-subset.csv")

Here is what it looks like:

df %>%
  head() %>%
  kable()
name essround edition proddate idno cntry dweight pspwght pweight anweight pplfair trstplt trstprt happy rlgdgr edulvlb prob stratum psu
ESS11e02 11 2 20.11.2024 50014 AT 1.1851145 0.3928906 0.3309145 0.1300132 5 5 5 8 5 322 0.0005786 107 317
ESS11e02 11 2 20.11.2024 50030 AT 0.6098981 0.3251533 0.3309145 0.1075980 0 1 0 9 0 423 0.0011244 69 128
ESS11e02 11 2 20.11.2024 50057 AT 1.3923296 4.0000234 0.3309145 1.3236659 9 4 4 9 8 610 0.0004925 18 418
ESS11e02 11 2 20.11.2024 50106 AT 0.5560615 0.1762276 0.3309145 0.0583163 6 3 3 7 6 422 0.0012333 101 295
ESS11e02 11 2 20.11.2024 50145 AT 0.7227953 1.0609399 0.3309145 0.3510804 3 5 5 9 1 322 0.0009488 115 344
ESS11e02 11 2 20.11.2024 50158 AT 0.9926053 1.3928125 0.3309145 0.4609019 8 5 5 8 3 313 0.0006909 7 373

You can see a lot of variables there, and many rows for the same country. Because this is survey data, the ESS interviews many people in each participating European country (not only EU members: Switzerland, CH, is in the data too). In the Lab we use data aggregated to the country level, so we need to calculate country averages.

Take a look at their codebook here. Check how the “pplfair - Most people try to take advantage of you, or try to be fair” variable is coded:

Value Category
0 Most people try to take advantage of me
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 Most people try to be fair
77 Refusal*
88 Don’t know*
99 No answer*

Thus, we need to recode the values 77, 88 and 99 to something else. Otherwise R treats them as real answers, which pulls the averages upward and can even push them beyond the 0-10 range of the scale. Let’s see:

df %>%
  ggplot(aes(x = pplfair)) +
  geom_histogram() 

Let’s record the average without correcting the data.

avg_raw = mean(df$pplfair)
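To get a feeling for how much a single missing-value code can distort an average, here is a tiny sketch with made-up answers. The vector answers and the name missing_codes are my own illustration, not part of the ESS data.

```r
# Four made-up answers on the 0-10 scale; 88 stands for "Don't know"
answers = c(4, 6, 5, 88)
missing_codes = c(77, 88, 99)

# How many answers are actually missing-value codes?
sum(answers %in% missing_codes)              # 1

# The naive mean treats 88 as a real answer
mean(answers)                                # 25.75

# Dropping the coded value gives a mean that fits the scale
mean(answers[!answers %in% missing_codes])   # 5
```

One coded value among four answers is enough to move the mean far outside the scale.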

You can see below how I recode the variables using case_when() from tidyverse. I rely on the documentation of the European Social Survey, as you should too! The odd-looking NA_real_ is simply NA of the numeric type, so the recoded columns stay numeric. The syntax is not entirely straightforward, so take some time to understand what’s going on!
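If you are curious what NA_real_ actually is, you can compare it with a plain NA in the console. A small sketch (the vector x is made up for illustration):

```r
# A plain NA is logical, NA_real_ is its numeric (double) version
typeof(NA)        # "logical"
typeof(NA_real_)  # "double"

# Both count as missing values
is.na(NA_real_)   # TRUE

# A numeric vector stays numeric when it contains NA_real_
x = c(5, 8, NA_real_)
typeof(x)         # "double"
```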

# 77, 88 and 99 are missing-value codes in the 0-10 items, so they become NA
df = df %>%
  mutate(pplfair = case_when(pplfair == 99 ~ NA_real_,
                             pplfair == 88 ~ NA_real_,
                             pplfair == 77 ~ NA_real_,
                             TRUE ~ pplfair),
         trstplt = case_when(trstplt == 99 ~ NA_real_,
                             trstplt == 88 ~ NA_real_,
                             trstplt == 77 ~ NA_real_,
                             TRUE ~ trstplt),
         trstprt = case_when(trstprt == 99 ~ NA_real_,
                             trstprt == 88 ~ NA_real_,
                             trstprt == 77 ~ NA_real_,
                             TRUE ~ trstprt),
         rlgdgr = case_when(rlgdgr == 99 ~ NA_real_,
                            rlgdgr == 88 ~ NA_real_,
                            rlgdgr == 77 ~ NA_real_,
                            TRUE ~ rlgdgr),
         happy = case_when(happy == 99 ~ NA_real_,
                           happy == 88 ~ NA_real_,
                           happy == 77 ~ NA_real_,
                           TRUE ~ happy),
         # Collapse the detailed education codes into four ordered levels;
         # any other code (including the missing-value codes) becomes NA
         edulvlb = case_when(edulvlb < 610 ~ 1,
                             edulvlb == 610 | edulvlb == 620 ~ 2,
                             edulvlb == 710 | edulvlb == 720 ~ 3,
                             edulvlb == 800 ~ 4,
                             TRUE ~ NA_real_))
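The five 0-10 items all follow the same rule, so the recoding could also be written once and applied to several columns with across(). This is only a sketch of an alternative, not the code used to prepare the Lab data: the toy tibble and the name missing_codes are made up for illustration.

```r
library(dplyr)

# Toy data standing in for the ESS subset (values are invented)
toy = tibble(
  pplfair = c(5, 77, 8, 99),
  happy   = c(9, 10, 88, 7)
)

# Codes that mean "Refusal", "Don't know" and "No answer"
missing_codes = c(77, 88, 99)

# Apply the same replacement to every listed column
toy_clean = toy %>%
  mutate(across(c(pplfair, happy),
                ~ replace(.x, .x %in% missing_codes, NA)))

toy_clean
```

The result is the same kind of recoding as with case_when(), just with less repetition. The long version is easier to read while you are still learning the syntax, so pick whichever you find clearer.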

Now, let’s calculate the average. Don’t forget that we now have NAs, so we need na.rm = TRUE. Without it, mean() does not throw an error but simply returns NA.

avg_corr = mean(df$pplfair, na.rm = TRUE)
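You can check what na.rm does on a tiny made-up vector (the name answers is just for illustration):

```r
# Three made-up answers, one of them missing
answers = c(5, NA, 7)

mean(answers)                 # NA: one missing value makes the whole mean missing
mean(answers, na.rm = TRUE)   # 6: the NA is dropped before averaging
```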

Now, compare the two. The difference is far from negligible!

data.frame(`Raw` = avg_raw, `Recoded` = avg_corr) %>%
  kable()
Raw Recoded
6.131985 5.708889

Now, let’s group the data by country to aggregate variables to the country level.

df_groupped = df %>%
  group_by(cntry) %>%
  summarize(pplfair = mean(pplfair, na.rm = TRUE),
            trstplt = mean(trstplt, na.rm = TRUE),
            trstprt = mean(trstprt, na.rm = TRUE),
            rlgdgr = mean(rlgdgr, na.rm = TRUE),
            happy = mean(happy, na.rm = TRUE),
            edulvlb = mean(edulvlb, na.rm = TRUE))

head(df_groupped) %>%
  kable()
cntry pplfair trstplt trstprt rlgdgr happy edulvlb
AT 6.385696 3.773544 3.723961 4.612917 7.781570 1.243568
BE 6.076585 4.172808 3.942565 4.145283 7.778544 1.642675
CH 6.431571 5.540824 5.292646 4.482909 8.154348 1.498912
CY 4.428363 2.681481 2.497041 6.756598 6.969118 1.469173
DE 6.177006 4.004165 3.974069 3.922599 7.762929 1.435993
ES 5.578890 2.784035 2.776864 4.104518 7.847991 1.438520
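Since every variable is summarized in the same way, the aggregation could also be written with across(). Again, this is a sketch on invented toy data, not the code that produced the table above.

```r
library(dplyr)

# Toy respondent-level data (values are invented)
toy = tibble(
  cntry   = c("AT", "AT", "BE", "BE"),
  pplfair = c(6, NA, 4, 8),
  happy   = c(8, 7, NA, 9)
)

# One mean per country for every listed column, ignoring NAs
toy_grouped = toy %>%
  group_by(cntry) %>%
  summarize(across(c(pplfair, happy), ~ mean(.x, na.rm = TRUE)))

toy_grouped
```

Here AT gets pplfair = 6 and happy = 7.5, and BE gets pplfair = 6 and happy = 9, because the NAs are dropped within each country before averaging.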

Finally, save the data. This is the data we have been working with in Lab 8.

write.csv(df_groupped, "data/ess.csv", row.names = FALSE)
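If you wonder why row.names is switched off, here is a small round trip through temporary files. The toy data frame and the file names are made up for illustration.

```r
# Toy country-level data (values are invented)
toy = data.frame(cntry = c("AT", "BE"), pplfair = c(6.4, 6.1))

with_rownames = tempfile(fileext = ".csv")
without_rownames = tempfile(fileext = ".csv")

# Default behaviour: row numbers are written as an extra, unnamed column
write.csv(toy, with_rownames)
# With row.names = FALSE only the real columns are written
write.csv(toy, without_rownames, row.names = FALSE)

names(read.csv(with_rownames))     # "X" "cntry" "pplfair"
names(read.csv(without_rownames))  # "cntry" "pplfair"
```

Without row.names = FALSE you get a meaningless column X every time you read the file back in.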