Lab 8 Extra

ESS Data Wrangling

Introduction

First of all, I’m glad you opened this link! Feel free to copy the code for your purposes.

Not all data arrives neat and clean. Here I’ll show you how I prepared the data for Lab 8.

Turning to Business

As usual, let’s load the tidyverse library for data wrangling and kableExtra for a nicer display of the results in the HTML document.

library(tidyverse)
library(kableExtra)

Now, let’s load the raw dataset.

df = read.csv("data_raw/ESS11-subset.csv")

Here is what it looks like:

df %>%
  head() %>%
  kable()

name	essround	edition	proddate	idno	cntry	dweight	pspwght	pweight	anweight	pplfair	trstplt	trstprt	happy	rlgdgr	edulvlb	prob	stratum	psu
ESS11e02	11	2	20.11.2024	50014	AT	1.1851145	0.3928906	0.3309145	0.1300132	5	5	5	8	5	322	0.0005786	107	317
ESS11e02	11	2	20.11.2024	50030	AT	0.6098981	0.3251533	0.3309145	0.1075980	0	1	0	9	0	423	0.0011244	69	128
ESS11e02	11	2	20.11.2024	50057	AT	1.3923296	4.0000234	0.3309145	1.3236659	9	4	4	9	8	610	0.0004925	18	418
ESS11e02	11	2	20.11.2024	50106	AT	0.5560615	0.1762276	0.3309145	0.0583163	6	3	3	7	6	422	0.0012333	101	295
ESS11e02	11	2	20.11.2024	50145	AT	0.7227953	1.0609399	0.3309145	0.3510804	3	5	5	9	1	322	0.0009488	115	344
ESS11e02	11	2	20.11.2024	50158	AT	0.9926053	1.3928125	0.3309145	0.4609019	8	5	5	8	3	313	0.0006909	7	373

You will notice a lot of variables there, and that each country appears in many rows. Because this is survey data, each row is one respondent, and ESS interviews many people in each participating European country (not only European Union members: Switzerland, CH, is in the data too). In the Lab we use data aggregated to the country level, so we need to calculate the average of each variable per country.

Take a look at their codebook here. Check how the “pplfair - Most people try to take advantage of you, or try to be fair” variable is coded:

Value	Category
0	Most people try to take advantage of me
1	1
2	2
3	3
4	4
5	5
6	6
7	7
8	8
9	9
10	Most people try to be fair
77	Refusal*
88	Don’t know*
99	No answer*

Thus, we need to recode the values 77, 88 and 99 as missing; otherwise they are treated as real answers and inflate the averages of a scale that only runs from 0 to 10. Let’s see:

df %>%
  ggplot(aes(x = pplfair)) +
  geom_histogram()
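
Alongside the histogram, a simple count can make the special codes visible. The following is a minimal sketch on a made-up toy column (the data frame toy and its values are hypothetical, not ESS data); the same two steps could be applied to df$pplfair.

```r
library(dplyr)

# Toy stand-in for the ESS column (hypothetical values, not real data)
toy <- data.frame(pplfair = c(5, 0, 9, 77, 88, 10, 99, 6))

# Tabulate every distinct value so that codes above 10 stand out
toy %>%
  count(pplfair)

# Number of responses that fall outside the 0 to 10 scale
n_special <- sum(toy$pplfair > 10)
n_special
```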

Let’s record the average without correcting the data.

avg_raw = mean(df$pplfair)

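Below I recode the variables using case_when() from tidyverse, following the documentation of the European Social Survey, as you should do too! The odd-looking NA_real_ is simply the numeric version of NA: it turns the listed codes into missing values while keeping the column numeric. Inside each case_when(), the conditions are checked in order, and the final TRUE ~ ... line is the fallback for every value that did not match a condition above it. Note that edulvlb is handled differently: its detailed codes are collapsed into four broad levels, and every other value becomes NA. The syntax is not entirely straightforward, so take some time to understand what’s going on!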

df = df %>%
  mutate(pplfair = case_when(pplfair == 99 ~ NA_real_,
                             pplfair == 88 ~ NA_real_,
                             pplfair == 77 ~ NA_real_,
                             TRUE ~ pplfair),
         trstplt = case_when(trstplt == 99 ~ NA_real_,
                             trstplt == 88 ~ NA_real_,
                             trstplt == 77 ~ NA_real_,
                             TRUE ~ trstplt),
         trstprt = case_when(trstprt == 99 ~ NA_real_,
                             trstprt == 88 ~ NA_real_,
                             trstprt == 77 ~ NA_real_,
                             TRUE ~ trstprt),
         rlgdgr = case_when(rlgdgr == 99 ~ NA_real_,
                            rlgdgr == 88 ~ NA_real_,
                            rlgdgr == 77 ~ NA_real_,
                            TRUE ~ rlgdgr),
         happy = case_when(happy == 99 ~ NA_real_,
                           happy == 88 ~ NA_real_,
                           happy == 77 ~ NA_real_,
                           TRUE ~ happy),
         edulvlb = case_when(edulvlb < 610 ~ 1,
                             edulvlb == 610 | edulvlb == 620 ~ 2,
                             edulvlb == 710 | edulvlb == 720 ~ 3,
                             edulvlb == 800 ~ 4,
                             TRUE ~ NA_real_))
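
The five 0 to 10 items all use the same three missing codes, so the repetition above could be shortened with across(). This is only a sketch of an alternative, not the code used to prepare the Lab data: the toy data frame and the missing_codes vector are hypothetical names introduced for illustration, and edulvlb would still need its own recoding because its codes differ.

```r
library(dplyr)

# Toy stand-in with two of the 0 to 10 items (hypothetical values)
toy <- data.frame(
  pplfair = c(5, 77, 9, 88),
  happy   = c(8, 99, 7, 10)
)

# Codes the codebook marks as Refusal / Don't know / No answer
missing_codes <- c(77, 88, 99)

# Apply the same replacement to every listed column
toy_clean <- toy %>%
  mutate(across(c(pplfair, happy),
                ~ replace(.x, .x %in% missing_codes, NA)))

toy_clean
```

Here replace() is base R and sets the matching positions to NA without changing the column type, which sidesteps the need for NA_real_.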

Now, let’s calculate the average. Don’t forget that we now have NAs, so we need na.rm = TRUE; otherwise mean() simply returns NA.

avg_corr = mean(df$pplfair, na.rm = TRUE)
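
To see what na.rm does, here is a tiny sketch on a made-up vector (x is a hypothetical example, not ESS data). With a missing value present, mean() gives NA unless the NAs are dropped first.

```r
# Toy vector with one missing value
x <- c(5, 8, NA, 3)

# Without na.rm the result is NA
mean_with_na <- mean(x)
mean_with_na

# With na.rm = TRUE the NA is dropped and the mean of 5, 8, 3 is returned
mean_without_na <- mean(x, na.rm = TRUE)
mean_without_na
```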

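Now, compare the two averages. The difference can be substantial: here the uncorrected codes push the mean up by about 0.42 points on a scale from 0 to 10.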

data.frame(`Raw` = avg_raw, `Recoded` = avg_corr) %>%
  kable()

Raw	Recoded
6.131985	5.708889

Now, let’s group the data by country to aggregate variables to the country level.

df_groupped = df %>%
  group_by(cntry) %>%
  summarize(pplfair = mean(pplfair, na.rm = TRUE),
            trstplt = mean(trstplt, na.rm = TRUE),
            trstprt = mean(trstprt, na.rm = TRUE),
            rlgdgr = mean(rlgdgr, na.rm = TRUE),
            happy = mean(happy, na.rm = TRUE),
            edulvlb = mean(edulvlb, na.rm = TRUE))

head(df_groupped) %>%
  kable()

cntry	pplfair	trstplt	trstprt	rlgdgr	happy	edulvlb
AT	6.385696	3.773544	3.723961	4.612917	7.781570	1.243568
BE	6.076585	4.172808	3.942565	4.145283	7.778544	1.642675
CH	6.431571	5.540824	5.292646	4.482909	8.154348	1.498912
CY	4.428363	2.681481	2.497041	6.756598	6.969118	1.469173
DE	6.177006	4.004165	3.974069	3.922599	7.762929	1.435993
ES	5.578890	2.784035	2.776864	4.104518	7.847991	1.438520
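
Since every column is summarized the same way, the aggregation could also be written with across(). The sketch below uses a made-up toy data frame (hypothetical values and names) to show the pattern; it is an alternative way of writing the step, not the code that produced the table above.

```r
library(dplyr)

# Toy respondent-level data: two respondents per country (hypothetical values)
toy <- data.frame(
  cntry   = c("AT", "AT", "BE", "BE"),
  pplfair = c(6, NA, 4, 8),
  happy   = c(8, 7, NA, 9)
)

# One mean per country for each listed column, ignoring NAs
toy_grouped <- toy %>%
  group_by(cntry) %>%
  summarize(across(c(pplfair, happy), ~ mean(.x, na.rm = TRUE)))

toy_grouped
```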

Finally, save the data. This is the dataset we have been working with in Lab 8.

write.csv(df_groupped, "data/ess.csv", row.names = FALSE)
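
As a quick sanity check, a saved file can be read back to confirm that it has the expected columns. The sketch below writes a made-up toy table to a temporary file (toy, path and check are hypothetical names) so it does not touch the Lab data.

```r
# Toy country-level table (hypothetical values)
toy <- data.frame(cntry = c("AT", "BE"), pplfair = c(6.39, 6.08))

# Write to a temporary file; row.names = FALSE avoids saving
# an extra column of row numbers
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read it back and inspect the structure
check <- read.csv(path)
names(check)
nrow(check)
```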