library(readxl)
cpds = read_excel("data/cpds.xlsx")
Exploring Data
Week 2
Before we start
No Thursday discussion section next week
But! Script will be published on the website. Go through it! Data classes will be discussed there!
Questions regarding Lab Assignment or the class?
Quick Recap
Substantive
What is a Causal Relation?
What is a Confounder?
Independent Variable vs Dependent Variable?
Control Variable?
Coding
What is a Chunk?
What is a CSV file? How is it different from an Excel file?
Agenda
Continuing to adapt to R and RStudio
Exploring Data
Tracking Missingness
Markdown and Quarto
This whole website was built using R, Markdown and Quarto. Let’s take a quick look at these tools.
In RStudio, you can use Markdown language to format text.
For example, this is bold text and this is italic text. And, of course, you can insert images. It’s pretty easy, and after the class you can take a look at some tutorials.
You can do many more things. Here the visual editor in RStudio might be helpful. Markdown is also used in several note-taking apps, e.g. Obsidian or Notion. Feel free to use your Markdown knowledge in your studies.
Generally, what we’ve done so far can be described by the image below. We have used R (the “engine”) and RStudio (the “car”). In RStudio we have Quarto, which is the document you are working with right now. We can do a lot of things right away, e.g. render our output to a Word document, PDF or HTML.
Finding Data
Let’s explore the Comparative Political Dataset. It consists of political and institutional country-level data. Take a look at its codebook.
Today we are working with the following variables.
year - year variable
country - country variable
prefisc_gini - Gini index. What is it?
eu - member states of the European Union identification
openc - Openness of the economy (trade as % of GDP)
poco - post-communist countries identification
If you don’t have the readxl library installed, install it using install.packages("readxl"). Run it only once!
Load the tidyverse library
library(tidyverse)
Exploring data
First of all, let’s subset the variables we have outlined for the ease of working with data.
cpds_subset = cpds %>%
  select(year, country, prefisc_gini, eu, openc, poco)
What does the data look like? Using head(), let’s print the first rows to get a sense of it. What is NA?
head(cpds_subset)
# A tibble: 6 × 6
year country prefisc_gini eu openc poco
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1960 Australia NA 0 27.4 0
2 1961 Australia NA 0 26.6 0
3 1962 Australia NA 0 26.8 0
4 1963 Australia NA 0 28.7 0
5 1964 Australia NA 0 28.5 0
6 1965 Australia NA 0 28.1 0
Explore the distribution of Gini below. What can we observe? Pay attention to the aes() argument.
ggplot(cpds_subset) +
  geom_histogram(aes(x = prefisc_gini))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1292 rows containing non-finite outside the scale range
(`stat_bin()`).
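The message above suggests choosing the bin width explicitly. Here is a minimal sketch of how that could look. The data frame toy_gini and its values are made up for illustration (they are not from the CPDS), and binwidth = 2 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical toy data standing in for cpds_subset; the values are made up
toy_gini = data.frame(
  prefisc_gini = c(35.2, 38.1, 40.5, 41.0, 43.7, 45.9, 47.3, NA)
)

# binwidth = 2 means every bar covers 2 Gini points (an arbitrary choice)
p = ggplot(toy_gini) +
  geom_histogram(aes(x = prefisc_gini), binwidth = 2)

p
```

Once binwidth is set, the `stat_bin()` message about `bins = 30` should no longer appear. The warning about removed rows stays, because the NA is still in the data.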
What is the average Gini coefficient? Pay attention to the na.rm = TRUE argument.
mean(cpds_subset$prefisc_gini, na.rm = TRUE)
[1] 41.36394
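To see why na.rm = TRUE matters, here is a small sketch with a made-up vector (the numbers are illustrative, not taken from the CPDS).

```r
# Made-up values, not from the CPDS
x = c(40, 42, NA, 44)

# Without na.rm, a single missing value makes the whole result NA
mean(x)

# With na.rm = TRUE, the NA is dropped first: (40 + 42 + 44) / 3 = 42
mean(x, na.rm = TRUE)
```

Many summary functions in R, such as sum(), median() and sd(), have the same argument.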
Let’s add this information to the plot and customize it along the way. Pay attention to the theme_bw() and labs() functions. You can explore ggplot themes here.
ggplot(cpds_subset) +
  geom_histogram(aes(x = prefisc_gini)) +
  geom_vline(xintercept = mean(cpds_subset$prefisc_gini, na.rm = TRUE), color = "red") +
  theme_bw() +
  labs(x = "Gini Coefficient",
       y = "Count",
       title = "Distribution of Gini Coefficient")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1292 rows containing non-finite outside the scale range
(`stat_bin()`).
Let’s explore the distribution by group, for example EU versus non-EU countries. Use the eu variable for this and geom_boxplot(). But wow! We didn’t get the group comparison. Any ideas why?
ggplot(cpds_subset) +
  geom_boxplot(aes(y = prefisc_gini, x = eu))
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Warning: Removed 6 rows containing missing values or values outside the scale range
(`stat_boxplot()`).
Warning: Removed 1286 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Let’s correct the class of the variable. We’ll discuss classes in more detail next week. Fantastic! Are these groups different? Add drop_na(eu) to remove the NA category from the graph.
cpds_subset %>%
  mutate(eu = as.factor(eu)) %>%
  ggplot() +
  geom_boxplot(aes(y = prefisc_gini, x = eu))
Warning: Removed 1292 rows containing non-finite outside the scale range
(`stat_boxplot()`).
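Here is a small sketch of what the class change and drop_na(eu) do, using a made-up tibble. The names toy and toy_clean and all values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; the values are made up
toy = tibble(
  eu = c(0, 1, 1, NA, 0),
  prefisc_gini = c(41.2, 38.5, NA, 40.1, 44.0)
)

# eu is stored as a number, so ggplot treats it as a continuous axis
class(toy$eu)

toy_clean = toy %>%
  drop_na(eu) %>%              # drops only the rows where eu is NA
  mutate(eu = as.factor(eu))   # now a categorical variable

# The factor has two groups, "0" and "1"
levels(toy_clean$eu)

# Four rows remain: the row with a missing prefisc_gini is kept
nrow(toy_clean)
```

Note that drop_na(eu) looks only at the eu column. Rows with a missing Gini value stay in the data, which is why the boxplot would still warn about removed rows.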
Exploring missing values
Quite often there are missing values in the data. First of all, let’s understand how big the problem is. Why are there so many missing values?
is.na(cpds_subset$prefisc_gini) %>%
sum()
[1] 1292
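A raw count is easier to judge next to the share of missing values. Here is a minimal sketch with a made-up vector (not CPDS data). It relies on the fact that TRUE counts as 1 and FALSE as 0.

```r
# Made-up values, not from the CPDS
x = c(NA, 39.5, NA, 41.0, 42.3)

# Number of missing values: 2
sum(is.na(x))

# Share of missing values: 2 out of 5, i.e. 0.4
mean(is.na(x))
```

The same idea applied to the real data would be mean(is.na(cpds_subset$prefisc_gini)).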
Let’s create a variable indicating if the values are missing or not.
cpds_subset = cpds_subset %>%
  mutate(gini_na = is.na(prefisc_gini))
Now, check how missingness changes over the years. Let’s wrangle the data to count the number of missing/non-missing values per year.
missing_years = cpds_subset %>%
  group_by(year, gini_na) %>%
  count()
missing_years %>%
  head()
# A tibble: 6 × 3
# Groups: year, gini_na [6]
year gini_na n
<dbl> <lgl> <int>
1 1960 TRUE 21
2 1961 TRUE 21
3 1962 TRUE 21
4 1963 FALSE 1
5 1963 TRUE 20
6 1964 FALSE 1
Finally, let’s plot it using geom_col(), which is quite similar to geom_histogram(). Take a moment to compare them. Which years have more missing values, and which have fewer?
missing_years %>%
  ggplot() +
  geom_col(aes(x = year, y = n, fill = gini_na), position = "dodge") +
  labs(fill = "Missing",
       x = "Year",
       y = "Count")
Substantively, it is clear that there are some problems with the data that we have to account for: the older the data, the worse the record of the Gini coefficient.
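One way to compare years on the same scale is the share of missing values per year rather than the raw counts. Here is a minimal sketch on a made-up tibble. The names toy and share_missing and the values are hypothetical, and only the gini_na column mirrors the variable created earlier.

```r
library(dplyr)

# Hypothetical toy data; the values are made up
toy = tibble(
  year = c(1960, 1960, 1960, 2000, 2000, 2000),
  gini_na = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# The mean of a logical column is the share of TRUE values
share_missing = toy %>%
  group_by(year) %>%
  summarize(share_na = mean(gini_na))

share_missing
```

In this toy example two of three values are missing in 1960 and one of three in 2000, so the share drops from about 0.67 to about 0.33.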
Some Tips
QoG and V-Dem were covered in the Lecture – take some time to go through this data for your project
Additionally, take a look at this list of datasets
Sometimes we start with a question and then search for the data. However, sometimes it’s the opposite: there’s data available, and we ask, ‘What can I use it for?’
Merging dataframes is not trivial; we will cover it in the future. But if you need it right now for your project, check this tutorial
Check List
I understand how I can load .csv and .xlsx in R, and if I see some other unusual file extension, it will not scare me
I know how to proceed with exploratory analysis: drawing graphs is fun and useful
I know that there might be missing values, and I will keep this in mind when exploring the relationships between variables
I know what a histogram and a boxplot are. I get how we can visually compare distributions