Exploring Data

Week 2

Published

April 10, 2025

Before we start

  • No Thursday discussion section next week

  • But! The script will be published on the website. Go through it! Data classes will be discussed there!

  • Any questions about the lab assignment or the class?




Quick Recap

Substantive

  • What is a Causal Relation?

  • What is a Confounder?

  • Independent Variable vs Dependent Variable?

  • Control Variable?

Coding

  • What is a Chunk?

  • What is a CSV file? How is it different from an Excel file?

Agenda

  • Continuing to adapt to R and RStudio

  • Exploring Data

  • Tracking Missingness

Markdown and Quarto

This whole website was built using R, Markdown and Quarto. Let’s take a quick look at these tools.

In RStudio, you can use Markdown to format text.

For example, this is **bold text** and this is *italic text*. And, of course, you can insert images. It’s pretty easy, and after the class you can take a look at some tutorials.

Northwestern Logo

You can do many more things. The visual editor in RStudio can help with this. Markdown is also used in several note-taking apps, e.g. Obsidian or Notion. Feel free to use your Markdown knowledge for your studies.

Generally, what we’ve done so far can be described by the image below. We have used R (the “engine”) and RStudio (the “car”). In RStudio we work with Quarto, which is the format of the document you are working with right now. From it we can render our output to a Word document, PDF or HTML.

R software

Finding Data

Let’s explore the Comparative Political Dataset. It consists of political and institutional country-level data. Take a look at its codebook.

Today we are working with the following variables.

  • year - year variable

  • country - country variable

  • prefisc_gini - Gini index. What is it?

  • eu - indicator for member states of the European Union

  • openc - Openness of the economy (trade as % of GDP)

  • poco - indicator for post-communist countries

If you don’t have the readxl library installed, install it with install.packages("readxl"). Run it only once!

library(readxl)

cpds = read_excel("data/cpds.xlsx")
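For comparison with the Excel file above, here is a minimal sketch of loading a .csv file. The toy data and the temporary file are made up for illustration, and the readr package (part of the tidyverse) is assumed to be installed.

```r
library(readr)

# Hypothetical toy data, written to a temporary file so the example runs on its own
toy <- data.frame(country = c("Australia", "Austria"), year = c(1960, 1960))
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

# A CSV file is plain text: one row per line, values separated by commas
readLines(path)

# read_csv() plays the same role for .csv files that read_excel() plays for .xlsx
toy_csv <- read_csv(path, show_col_types = FALSE)
toy_csv
```

Unlike an .xlsx file, the .csv file can be opened in any text editor, which is why readLines() can show its raw content.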

Load the tidyverse library

library(tidyverse)

Exploring data

First, let’s select the variables listed above so the data is easier to work with.

cpds_subset = cpds %>%
  select(year, country, prefisc_gini, eu, openc, poco) 
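If the %>% pipe is still unfamiliar, the sketch below may help. It uses a small made-up tibble (not the CPDS data) to show that piping a data frame into select() gives the same result as passing it as the first argument.

```r
library(dplyr)

# Hypothetical toy data with one column we do not need
toy <- tibble(
  year    = c(1960, 1961),
  country = c("Australia", "Australia"),
  extra   = c("a", "b")
)

# With the pipe: "take toy, then select two columns"
piped <- toy %>%
  select(year, country)

# Without the pipe: the data frame is the first argument of select()
nested <- select(toy, year, country)

identical(piped, nested)  # TRUE
```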

What does the data look like? Let’s use head() to print the first rows and get a sense of it. What is NA?

head(cpds_subset)
# A tibble: 6 × 6
   year country   prefisc_gini    eu openc  poco
  <dbl> <chr>            <dbl> <dbl> <dbl> <dbl>
1  1960 Australia           NA     0  27.4     0
2  1961 Australia           NA     0  26.6     0
3  1962 Australia           NA     0  26.8     0
4  1963 Australia           NA     0  28.7     0
5  1964 Australia           NA     0  28.5     0
6  1965 Australia           NA     0  28.1     0
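As a hint for the NA question, here is a minimal sketch with a made-up vector. NA marks a value that is missing, so R cannot compare it to anything; is.na() is the way to detect it.

```r
# Hypothetical toy vector with one missing value
x <- c(27.4, NA, 26.8)

# NA means "value unknown", so comparisons with it are unknown too
NA > 1         # NA
x == NA        # NA NA NA, not what we want

# is.na() is the right tool to find missing values
is.na(x)       # FALSE  TRUE FALSE
sum(is.na(x))  # 1
```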

Explore the distribution of Gini below. What can we observe? Pay attention to the aes() call, which maps a variable to the x axis.

ggplot(cpds_subset) +
  geom_histogram(aes(x = prefisc_gini)) 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1292 rows containing non-finite outside the scale range
(`stat_bin()`).
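The message and the warning above are not errors. The message suggests choosing binwidth yourself, and the warning reports how many rows were dropped because the variable is NA. The sketch below, on made-up data, sets a bin width and drops the NA rows silently with na.rm = TRUE; the bin width of 3 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical toy data: six observed values and one missing value
toy <- data.frame(gini = c(35, 38, 41, 41, 44, 47, NA))

# binwidth replaces the default of 30 bins; na.rm = TRUE silences the warning
p <- ggplot(toy) +
  geom_histogram(aes(x = gini), binwidth = 3, na.rm = TRUE)
p

# The bars together cover only the non-missing rows
sum(layer_data(p)$count)  # 6
```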

What is the average Gini coefficient? Pay attention to the na.rm = TRUE argument.

mean(cpds_subset$prefisc_gini, na.rm = TRUE)
[1] 41.36394
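To see why na.rm = TRUE matters, here is a minimal sketch with a made-up vector. A single NA makes the whole mean unknown unless we ask R to remove missing values first.

```r
# Hypothetical toy values, one of them missing
gini <- c(40, NA, 44)

mean(gini)                # NA, because one value is unknown
mean(gini, na.rm = TRUE)  # 42, the mean of 40 and 44
```

Note that na.rm = TRUE changes what the number describes: it is the average of the observed values only.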

Let’s add this information to the plot and customize it along the way. Pay attention to the theme_bw() and labs() functions. You can explore ggplot themes here.

ggplot(cpds_subset) +
  geom_histogram(aes(x = prefisc_gini))  +
  geom_vline(xintercept = mean(cpds_subset$prefisc_gini, na.rm = TRUE), color = "red") +
  theme_bw() +
  labs(x = "Gini Coefficient",
       y = "Count",
       title = "Distribution of Gini Coefficient")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1292 rows containing non-finite outside the scale range
(`stat_bin()`).

Let’s explore the distribution by groups. For example, compare EU countries to non-EU countries. Use the eu variable and geom_boxplot() for this. But wow! We didn’t get the group comparison. Any ideas why?

ggplot(cpds_subset) +
  geom_boxplot(aes(y = prefisc_gini, x = eu)) 
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Warning: Removed 6 rows containing missing values or values outside the scale range
(`stat_boxplot()`).
Warning: Removed 1286 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Let’s correct the class of the variable. We’ll discuss classes in more detail next week. Fantastic! Are these groups different? Add drop_na(eu) to remove the NA category from the graph.

cpds_subset %>%
  mutate(eu = as.factor(eu)) %>%
  ggplot() +
  geom_boxplot(aes(y = prefisc_gini, x = eu)) 
Warning: Removed 1292 rows containing non-finite outside the scale range
(`stat_boxplot()`).
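Here is a sketch of the two fixes together on made-up data: checking the class of the grouping variable, converting it to a factor, and dropping rows where it is missing with drop_na(). The toy values are hypothetical; drop_na() comes from the tidyr package, which is loaded with the tidyverse.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data: eu is stored as a number and has one missing value
toy <- tibble(
  eu   = c(0, 0, 1, 1, NA),
  gini = c(40, 42, 45, 47, 43)
)

class(toy$eu)  # "numeric", so ggplot treats it as a continuous axis

toy_clean <- toy %>%
  drop_na(eu) %>%               # remove rows where eu is missing
  mutate(eu = as.factor(eu))    # turn 0/1 into two categories

class(toy_clean$eu)   # "factor"
levels(toy_clean$eu)  # "0" "1"

# Now there is one box per group and no NA category
ggplot(toy_clean) +
  geom_boxplot(aes(x = eu, y = gini))
```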

Coding Task

Imagine you were asked the following question: Does a communist past lead to a more open economy?

Let’s explore these variables:

  • openc - Openness of the economy (trade as % of GDP)

  • poco - indicator for post-communist countries

They are already in cpds_subset. Plot the distribution of the openc variable using geom_histogram().

ggplot(...) +
  ...(aes(x = openc)) 

Add the average of openc to the plot using geom_vline().

ggplot(cpds_subset) +
  geom_histogram(...(x = ...)) +
  ...(xintercept = mean(cpds_subset$openc, na.rm = TRUE))

Compare post-communist countries to non-post-communist countries (poco) in terms of the openness of the economy (openc). Use geom_boxplot(), and don’t forget to make sure the class of the variable is the right one!

Insert a chunk, add labels and customize the plot.

Did we address the question posed at the beginning? Did we approach it descriptively, predictively, or causally? Take a moment to think through that and write down your thoughts.

Exploring missing values

Quite often there are missing values in the data. First, let’s see how big the problem is. Why are there this many missing values?

is.na(cpds_subset$prefisc_gini) %>%
  sum()
[1] 1292
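A count alone is hard to judge without knowing how many rows there are. As a minimal sketch on a made-up vector, mean(is.na(...)) gives the share of missing values, because TRUE counts as 1 and FALSE as 0.

```r
# Hypothetical toy vector: 3 of 5 values are missing
gini <- c(NA, NA, 41.2, NA, 39.8)

sum(is.na(gini))   # 3, the number of missing values
mean(is.na(gini))  # 0.6, the share of missing values
```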

Let’s create a variable indicating if the values are missing or not.

cpds_subset = cpds_subset %>%
  mutate(gini_na = is.na(prefisc_gini))

Now, check how missingness changes over the years. Let’s wrangle the data to count the number of missing/non-missing values per year.

missing_years = cpds_subset %>%
  group_by(year, gini_na) %>%
  count() 

missing_years %>%
  head()
# A tibble: 6 × 3
# Groups:   year, gini_na [6]
   year gini_na     n
  <dbl> <lgl>   <int>
1  1960 TRUE       21
2  1961 TRUE       21
3  1962 TRUE       21
4  1963 FALSE       1
5  1963 TRUE       20
6  1964 FALSE       1
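An alternative way to summarize the same information is one row per year with the number and share of missing values. This is only a sketch on made-up data; the column names n_missing and share_missing are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two countries observed in two years
toy <- tibble(
  year = c(1960, 1960, 1961, 1961),
  gini = c(NA, NA, NA, 40)
)

by_year <- toy %>%
  group_by(year) %>%
  summarise(
    n_missing     = sum(is.na(gini)),   # how many values are missing
    share_missing = mean(is.na(gini))   # what fraction is missing
  )

by_year
```

In this toy example every value is missing in 1960 and half of them in 1961.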

Finally, let’s plot it using geom_col(), which is quite similar to geom_histogram(). Take a moment to compare them. Which years have more missing values, and which have fewer?

missing_years %>%
  ggplot() +
  geom_col(aes(x = year, y = n, fill = gini_na), position = "dodge") +
  labs(fill = "Missing",
       x = "Year",
       y = "Count")
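To make the comparison between the two geoms concrete, here is a sketch on made-up data. geom_histogram() counts the rows itself, while geom_col() draws bar heights that we have already computed, which is why we needed count() first.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per observation
toy <- tibble(year = c(2000, 2000, 2001))

# geom_histogram() bins the variable and counts the rows in each bin
p_hist <- ggplot(toy) +
  geom_histogram(aes(x = year), binwidth = 1)

# geom_col() needs the heights in the data, so we count first
toy_counts <- toy %>%
  count(year)

p_col <- ggplot(toy_counts) +
  geom_col(aes(x = year, y = n))

# Both plots end up with bars of the same heights
layer_data(p_hist)$count
layer_data(p_col)$y
```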

Substantively, it is clear that there is a problem with the data that we have to account for: the older the observations, the more often the Gini coefficient is missing.

Some Tips

  • QoG and V-Dem were covered in the Lecture – take some time to go through this data for your project

  • Additionally, take a look at this list of datasets

  • Sometimes we start with a question and then search for the data. However, sometimes it’s the opposite: there’s data available, and we ask, ‘What can I use it for?’

  • Merging data frames is not trivial, and we will cover it in the future. But if you need it right now for your project, check this tutorial
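If you do need to merge before we cover it, the sketch below shows one common approach, assuming both data frames share country and year columns. The data frames and their values are made up for illustration; left_join() is from dplyr.

```r
library(dplyr)

# Hypothetical toy data frames that share the keys country and year
gini_toy <- tibble(
  country = c("Australia", "Austria"),
  year    = c(2000, 2000),
  gini    = c(45, 44)
)

trade_toy <- tibble(
  country = c("Australia", "Belgium"),
  year    = c(2000, 2000),
  openc   = c(40, 140)
)

# Keep every row of gini_toy and attach openc where country and year match
merged <- left_join(gini_toy, trade_toy, by = c("country", "year"))
merged
```

Austria has no match in trade_toy, so its openc is NA after the merge. This is one reason merging is not trivial: it can create new missing values.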

Check List

I understand how I can load .csv and .xlsx files in R, and if I see some other unusual file extension, it will not scare me

I know how to proceed with exploratory analysis: drawing graphs is fun and useful

I know that there might be missing values, and I will keep this in mind when exploring the relationships between variables

I know what a histogram and a boxplot are. I get how we can visually compare distributions