--- title: "Exploring Data" subtitle: "Week 2" date: 2025-04-10 format: html: embed-resources: true toc: true --- # Before we start - No Thursday discussion section next week - But! Script will be published on the website. Go through it! Data classes will be discussed there! - Questions regarding Lab Assignment or the class?
Download script

Download data

# Quick Recap ## Substantive - What is a **Causal Relation**? - What is **Confounder**? - Independent Variable vs Dependent Variable? - Control Variable? ## Coding - What is a Chunk? - What is a CSV file? How is it different from Excel file? # Agenda - Continuing to adapt to R and RStudio - Exploring Data - Tracking Missingness # Markdown and Quarto This whole website was built using R, Markdown and Quarto. Let's quickly overview these languages In RStudio, you can use Markdown language to format text. For example, **this is bold text** and *this is italic text*. And, of course, you can insert images. It's pretty easy, and after the class you can take a look at some [tutorials](https://www.markdownguide.org/basic-syntax/). ![Northwestern Logo](https://www.northwestern.edu/brand/images/wordmark/wordmark-vert.jpg) You can do many-many more different things. In this regard, visual editor in RStudio might be helpful. Markdown is also used in several note taking apps, e.g. [Obsidian](https://obsidian.md) or [Notion](https://www.notion.so). Feel free to utilize your Markdown knowledge for your studies. Generally, what we've done so far can be described by the image below. We have used R ("engine") and RStudio ("car"). In Rstudio we have Quarto, which is this document you are working with right now. We can do a lot of things right away -- e.g., render our output to a Word document, PDF or HTML. ![R software](https://artur-baranov.github.io/images/r_rstudio_quarto_output.png) # Finding Data Let’s explore Comparative Political Dataset. It consists of political and institutional country-level data. Take a look on their [codebook](https://cpds-data.org/wp-content/uploads/2024/11/codebook_cpds.pdf). Today we are working with the following variables. - `year` - year variable - `country` - country variable - `prefisc_gini` - Gini index. What is it? - `eu` - member states of the European Union identification - `openc` - Openness of the economy (trade as % of GDP) - `poco` - post-communist countries post-communist countries identification If you don't have `readxl` library installed, do it using `install.packages()`. Run it only once! ```{r} library(readxl) cpds = read_excel("data/cpds.xlsx") ``` Load the `tidyverse` library ```{r} #| message: false library(tidyverse) ``` # Exploring data First of all, let's subset the variables we have outlined for the ease of working with data. ```{r} cpds_subset = cpds %>% select(year, country, prefisc_gini, eu, openc, poco) ``` How does the data look like? Using `head()` let's present first rows to get the sense. What is NA? ```{r} head(cpds_subset) ``` Explore the distribution of Gini below. What can we observe? Pay attention to `aes()` argument. ```{r} ggplot(cpds_subset) + geom_histogram(aes(x = prefisc_gini)) ``` What is an average Gini coefficient? Pay attention to the `na.rm = TRUE` argument. ```{r} mean(cpds_subset$prefisc_gini, na.rm = TRUE) ``` Let's include this information on the plot, customizing it in the meantime. Pay attention to `theme_bw()` and `labs()` functions. You can explore ggplot themes [here](https://ggplot2.tidyverse.org/reference/ggtheme.html). ```{r} ggplot(cpds_subset) + geom_histogram(aes(x = prefisc_gini)) + geom_vline(xintercept = mean(cpds_subset$prefisc_gini, na.rm = TRUE), color = "red") + theme_bw() + labs(x = "Gini Coefficient", y = "Count", title = "Distribution of Gini Coefficient") ``` Let's explore the distribution by groups. For example, EU countries to non-EU countries. Use `eu` variable for this and `geom_boxplot()`. But wow! We didn't get the group comparison, any ideas why? ```{r} ggplot(cpds_subset) + geom_boxplot(aes(y = prefisc_gini, x = eu)) ``` Let's correct the class of variables. We'll discuss the classes more in a detail next week*. Fantastic! Are these groups different? Add `drop_na(eu)` to remove the NA category on the graph. ```{r} cpds_subset %>% mutate(eu = as.factor(eu)) %>% ggplot() + geom_boxplot(aes(y = prefisc_gini, x = eu)) ``` ::: {.callout-tip icon="false"} ## Coding Task Imagine, you were asked the following question. Does a communist past lead to a more open economy? Let's explore these variables: - `openc` - Openness of the economy (trade as % of GDP) - `poco` - post-communist countries post-communist countries identification They are already in `cpds_subset`. Draw a distribution of `openc` variable using `geom_histogram()`. ```{r} #| eval: false ggplot(...) + ...(aes(x = openc)) ``` Add an average of `openc` to the plot using `geom_vline()` ```{r} #| eval: false ggplot(cpds_subset) + geom_histogram(...(x = ...)) + ...(xintercept = mean(cpds_subset$openc)) ``` Compare post-communist countries to non post-communist countries (`poco`) in terms of the openness of the economy (`openc`). Use `geom_boxplot()`, and don't forget to make sure the class of the variable is the right one! ```{r} ``` Insert a chunk, add labels and cutsomize the plot. ... Did we address the question posed at the beginning? Did we approach it descriptively, predictively, or causally? Take a moment to think through that and write down your thoughts. ::: # Exploring missing values Quite often there are missing values in the data. Let's, first of all, understand how big of the problem is. Why are there this many missing values? ```{r} is.na(cpds_subset$prefisc_gini) %>% sum() ``` Let's create a variable indicating if the values are missing or not. ```{r} cpds_subset = cpds_subset %>% mutate(gini_na = is.na(prefisc_gini)) ``` Now, check the dynamics in years. Let's wrangle the data to count the number of missing/non-missing values per year. ```{r} missing_years = cpds_subset %>% group_by(year, gini_na) %>% count() missing_years %>% head() ``` Finally, let's plot it using `geom_col()` - which is quite similar to `geom_histogram()`. Take a moment to compare it. Which years have more missing values, and which have fewer? ```{r} missing_years %>% ggplot() + geom_col(aes(x = year, y = n, fill = gini_na), position = "dodge") + labs(fill = "Missing", x = "Year", y = "Count") ``` Substantively, it is clear that there are some problems with the data we have to account for: the older the data, the worse is the record track of Gini Coefficient. # Some Tips - QoG and V-Dem were covered in the Lecture -- take some time to go through this data for your project - Additionally, take a look on this [list of datasets](https://github.com/erikgahner/PolData) - Sometimes we start with a question and then search for the data. However, sometimes it's the opposite: there's data available, and we ask, 'What can I use it for?' - Merging dataframes are not as trivial, we will cover it in the future. But if you need it right now for you project, check this [tutorial](https://rpubs.com/williamsurles/293454) # Check List I undertsand how I can load .csv and .xlsx in R, and if I see some other unsual file extension, it will not scare me I know how to proceed with exploratory analysis: drawing graphs is fun and useful I know that there might be missing values, and I will keep this in mind when exploring the relationships between variables I know what a histogram and a boxplot is. I get how we can visually compare distributions