Last Quarter’s Review

Week 1

Published

January 9, 2025

Before we start

  • You are expected to have R and RStudio installed. If you do not, see the Installing R and RStudio section.

  • In the discussion section, we will focus on coding and practicing what we have learned in the lectures.

  • Office hours are on Tuesdays, 11:00-12:30, in Scott 110.

  • Questions?

Download script

Brief recap of the last quarter

Coding Terminology

Code Chunk

To insert a code chunk, press Ctrl+Alt+I on Windows or Cmd+Option+I on Mac. Run the whole chunk by clicking the green triangle, or run one or more selected lines with Ctrl+Enter on Windows or Cmd+Return on Mac.

print("Code Chunk")
[1] "Code Chunk"

Function and Arguments

Most of the functions we want to run require an argument. For example, the function print() above takes the argument “Code Chunk”.

function(argument)
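
As a small illustration (the values are made up for this sketch), functions often take more than one argument, and arguments can be passed by name:

```r
# round() takes the number to round and a named argument `digits`
rounded = round(3.14159, digits = 2)
rounded

# paste() accepts several arguments plus a named separator `sep`
greeting = paste("Hello", "world", sep = ", ")
greeting
```

Naming arguments (digits = 2, sep = ", ") makes the call easier to read than relying on their position.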

Data structures

There are many data structures, but the most important ones to know are the following.

  • Objects. Those are individual units, e.g. a number or a word.
number = 1
number

word = "Northwestern"
word
[1] 1
[1] "Northwestern"
  • Vectors. Vectors are collections of objects. To create one, use the c() function.
numbers = c(1, 2, 3)
numbers
[1] 1 2 3
  • Dataframes. Dataframes are the most used data structure. Last quarter you spent a lot of time working with them. A dataframe is a table of data. Columns are called variables, and each column is a vector. You can access a column using the $ operator.
df = data.frame(numbers, 
                numbers_multiplied = numbers * 2)
df
df$numbers_multiplied
  numbers numbers_multiplied
1       1                  2
2       2                  4
3       3                  6
[1] 2 4 6
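
The three structures build on one another. A brief sketch of how they connect, using made-up names (doubled, second, total) that are not part of the course script:

```r
# A vector of three numbers
numbers = c(1, 2, 3)

# Arithmetic on a vector is applied to every element
doubled = numbers * 2

# Square brackets pick elements by position; R starts counting at 1
second = numbers[2]

# A dataframe binds vectors of equal length into columns
df = data.frame(numbers, doubled)

# $ can also create a new column
df$total = df$numbers + df$doubled
df
```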

Data classes

We work with various classes of data, and the analysis we perform depends heavily on these classes.

  • Numeric. Continuous data.
numeric_class = c(1.2, 2.5, 7.3)
numeric_class
class(numeric_class)
[1] 1.2 2.5 7.3
[1] "numeric"
  • Integer. Whole numbers (e.g., count data).
integer_class = c(1:3)
class(integer_class)
[1] "integer"
  • Character. Usually represents textual data.
word
[1] "Northwestern"
class(word)
[1] "character"
  • Factor. Categorical variables, where each value is treated as an identifier for a category.
colors = c("blue", "green")
class(colors)
[1] "character"

As you can see, R stored the values as character rather than as a factor. We can convert them using the as.factor() function. Other conversions work the same way, with as.numeric(), as.integer(), and as.character().

colors = as.factor(colors)
class(colors)
[1] "factor"
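
One conversion deserves extra care. The sketch below, on a made-up vector (years_text is hypothetical), shows that calling as.numeric() directly on a factor returns the internal level codes rather than the original values:

```r
# Numbers stored as text
years_text = c("2000", "2010", "2000")

# Character to numeric works as expected
years_numeric = as.numeric(years_text)
years_numeric

# Convert the text to a factor with levels "2000" and "2010"
years_factor = as.factor(years_text)

# Factor to numeric returns the level codes (1, 2, 1), not the years
codes = as.numeric(years_factor)
codes

# Going through character first recovers the original values
recovered = as.numeric(as.character(years_factor))
recovered
```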

Libraries

Quite frequently, we use additional libraries to extend the capabilities of R. I’m sure you remember tidyverse. Let’s load it.

library(tidyverse)

If you recently updated or installed R, you may need to install the libraries again. You can do so with the install.packages() function, for example install.packages("tidyverse").

Pipes

Pipes (%>% or |>) help streamline your code by letting you write a sequence of operations in the order in which they happen. In plain English, a pipe translates to “take an object, and then”.

numbers %>%
  print()
[1] 1 2 3
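
To see why this helps, compare a nested call with its piped equivalent. This is a minimal sketch on made-up values; it uses the native pipe |>, which assumes R 4.1 or newer:

```r
values = c(1, 4, 9)

# Nested version: has to be read from the inside out
nested = round(mean(sqrt(values)), 1)

# Piped version: take values, and then sqrt, and then mean, and then round
piped = values |>
  sqrt() |>
  mean() |>
  round(1)

# Both give the same result
identical(nested, piped)
```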

Describing Data

First, let’s load the data.

load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))

This is the V-Dem dataset. For your reference, their codebook is available here.

The dataset is huge! Be careful when printing or viewing it.

nrow(vdem)
ncol(vdem)
[1] 27734
[1] 4607

Imagine you are interested in the relationship between regime type and physical violence. Let’s select the variables we will work with. Unfortunately, the variable names are not very intuitive: the regime index is e_v2x_polyarchy_5C and the physical violence index is v2x_clphy.

violence_data = vdem %>%
  select(country_name, year, e_v2x_polyarchy_5C, v2x_clphy) 

Let’s rename the variables so it’s easier to work with them.

violence_data = violence_data %>%
  rename(regime = e_v2x_polyarchy_5C,
         violence = v2x_clphy)

head(violence_data)
  country_name year regime violence
1       Mexico 1789      0    0.322
2       Mexico 1790      0    0.322
3       Mexico 1791      0    0.322
4       Mexico 1792      0    0.322
5       Mexico 1793      0    0.322
6       Mexico 1794      0    0.322
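
select() can also rename columns as it selects them, which combines the two steps above into one. A minimal sketch on a made-up data frame (toy and its columns are hypothetical, not V-Dem variables), assuming dplyr is installed:

```r
library(dplyr)

# Made-up data with one awkward column name and one column we do not need
toy = data.frame(country_name = c("A", "B"),
                 year = c(2000, 2000),
                 v2x_example = c(0.1, 0.9),
                 unused = c(TRUE, FALSE))

# new_name = old_name inside select() keeps the column and renames it
toy_small = toy %>%
  select(country_name, year, score = v2x_example)

names(toy_small)
```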

Now let’s analyze the regime data. We can describe it using various statistics. Let’s start with the minimum regime score.

min(violence_data$regime, na.rm = TRUE)
[1] 0

Check the max score for the regime variable below.

...(violence_data$regime, na.rm = TRUE)

Check the average score for the regime variable below.

mean(..., na.rm = TRUE)

Finally, use the summary() function to get the descriptive statistics.

summary(violence_data$regime)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.2224  0.2500  1.0000    1139 

Exercise

Select country name (country_name), year (year) and Political corruption index (v2x_corr) from the vdem dataset.

corruption = vdem %>%
  ...(country_name, ...)

Keep only the year 2010 using the filter() function.

corruption = corruption
  ...

Calculate the min, max, and median. Don’t forget about missing values: the na.rm = TRUE argument removes them.

...(corruption$v2x_corr)

Present the summary

You can use the table below for your reference.

Statistic            Function    Example Usage
Minimum              min()       min(x)
Maximum              max()       max(x)
Mean                 mean()      mean(x)
Median               median()    median(x)
Standard Deviation   sd()        sd(x)
Variance             var()       var(x)
Sum                  sum()       sum(x)
Summary              summary()   summary(x)
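
To see what na.rm does, here is a minimal sketch on a made-up vector (scores is hypothetical and is not taken from V-Dem):

```r
# Made-up scores with one missing value
scores = c(0.2, 0.8, NA, 0.5, 1.0)

# Without na.rm the missing value propagates and the result is NA
mean_with_na = mean(scores)
mean_with_na

# With na.rm = TRUE the missing value is dropped before the calculation
mean_clean = mean(scores, na.rm = TRUE)
median_clean = median(scores, na.rm = TRUE)
mean_clean
median_clean
```

The same argument works for min(), max(), sd(), var(), and sum().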

Visualization

Let’s begin by analyzing the distribution of the Violence Index in the year 2000. We need to filter the data for this task.

violence_2000 = violence_data %>%
  filter(year == 2000)
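
filter() accepts several conditions at once; rows are kept only if all of them are TRUE. A minimal sketch on a made-up data frame (toy is hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Made-up data: two countries, two years, one missing value
toy = data.frame(country_name = c("A", "A", "B", "B"),
                 year = c(1999, 2000, 1999, 2000),
                 violence = c(0.3, 0.4, 0.8, NA))

# Keep rows from the year 2000 that also have a non-missing violence score
toy_2000 = toy %>%
  filter(year == 2000, !is.na(violence))

toy_2000
```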

Higher values indicate better protection of physical integrity in a given country.

ggplot(data = violence_2000) +
  geom_histogram(aes(x = violence))

Let’s customize the plot a bit.

ggplot(data = violence_2000) +
  geom_histogram(aes(x = violence)) +
  labs(title = "Physical Integrity Rights Index",
       subtitle = "In 2000",
       x = "Violence Integrity Index",
       y = "") +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
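
The message above is ggplot2 suggesting that you choose the bin width yourself. A minimal sketch on made-up values (toy is hypothetical, and the width of 0.1 is just an assumption that suits a 0 to 1 index), assuming ggplot2 is installed:

```r
library(ggplot2)

# Made-up index values between 0 and 1
toy = data.frame(violence = c(0.05, 0.12, 0.33, 0.35, 0.58,
                              0.61, 0.64, 0.90, 0.95, 0.97))

# binwidth sets the width of each bar; boundary aligns the bin edges to 0
p = ggplot(data = toy) +
  geom_histogram(aes(x = violence), binwidth = 0.1, boundary = 0) +
  theme_bw()

p
```

Setting binwidth explicitly also silences the message.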

Imagine you ask the following question: is it true that democracies scored higher on the physical violence index in the year 2000?

First, we need to differentiate between the regimes. Using the mutate() function, we can create new variables and modify existing ones. Let’s do that!

violence_2000 = violence_2000 %>%
  mutate(democracy = case_when(regime >= 0.5 ~ "Democracy",
                               regime < 0.5 ~ "Autocracy"))

head(violence_2000)
  country_name year regime violence democracy
1       Mexico 2000   0.75    0.573 Democracy
2     Suriname 2000   0.75    0.871 Democracy
3       Sweden 2000   1.00    0.990 Democracy
4  Switzerland 2000   1.00    0.974 Democracy
5        Ghana 2000   0.75    0.946 Democracy
6 South Africa 2000   0.75    0.834 Democracy
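
It helps to know what case_when() does with missing values: a row that matches none of the conditions gets NA. A minimal sketch on made-up regime scores (toy is hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Made-up regime scores, including one missing value
toy = data.frame(regime = c(0, 0.25, 0.5, 1, NA))

# Conditions are checked in order; NA matches neither, so the result is NA
toy = toy %>%
  mutate(democracy = case_when(regime >= 0.5 ~ "Democracy",
                               regime < 0.5 ~ "Autocracy"))

toy$democracy
```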

There are multiple ways to visually compare two groups.

Let’s draw histograms first.

ggplot(violence_2000) +
  geom_histogram(aes(x = violence, fill = democracy))

Comparing distributions between groups is easier with a boxplot. Are the two groups different?

ggplot(violence_2000) +
  geom_boxplot(aes(x = democracy, y = violence))

Let’s treat each group as a sample and calculate 95% confidence intervals for the group means to double-check the result. We need to group_by() regime type, and then summarize() the data.

violence_ci = violence_2000 %>%
  group_by(democracy) %>%
  summarize(
    mean_violence = mean(violence, na.rm = TRUE),
    lower = mean_violence - 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()),
    upper = mean_violence + 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()))

violence_ci
# A tibble: 2 × 4
  democracy mean_violence lower upper
  <chr>             <dbl> <dbl> <dbl>
1 Autocracy         0.443 0.386 0.499
2 Democracy         0.827 0.791 0.864
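
The interval used above is the mean plus or minus 1.96 standard errors, where the standard error is the standard deviation divided by the square root of the number of observations. A minimal sketch of the same arithmetic as a helper function (mean_ci and toy are hypothetical names, not part of the course script). It counts only non-missing values, which matters when a variable contains NAs:

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
mean_ci = function(x, z = 1.96) {
  x = x[!is.na(x)]                  # drop missing values first
  se = sd(x) / sqrt(length(x))      # standard error of the mean
  c(lower = mean(x) - z * se,
    mean = mean(x),
    upper = mean(x) + z * se)
}

# Made-up values with one missing observation
toy = c(2, 4, 4, 4, 5, 5, 7, 9, NA)
toy_ci = mean_ci(toy)
toy_ci
```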

Finally, visualize it.

ggplot(violence_ci) +
  geom_linerange(aes(x = democracy,
                     ymin = lower,
                     ymax = upper)) +
  geom_point(aes(x = democracy,
                 y = mean_violence))

Exercise

Add a title, and rename the X and Y axes.

ggplot(violence_ci) +
  geom_linerange(aes(x = democracy,
                     ymin = lower,
                     ymax = upper)) +
  geom_point(aes(x = democracy,
                 y = mean_violence)) +
  labs()

Draw a histogram of the Political corruption index (v2x_corr) from the exercise above.

...(corruption) 
  geom_...()

Sampling

Lastly, sampling. Imagine we have data for the whole “population”. In our case, this is all countries in the year 2000. We know the population average of the violence index, which is

mean(violence_2000$violence)
[1] 0.6601921

Compare it to a sample of N = 15.

set.seed(1)
violence_sample_15 = sample(violence_2000$violence, size = 15)

And now calculate the average of the sample. Is it accurate?

mean(violence_sample_15)
[1] 0.6147333

Let’s repeat (iterate) the process multiple times. First, create an empty data frame to store the results.

sample_averages = data.frame() 

Second, repeat the process 100 times.

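for (i in 1:100) {
  # Draw 15 values (without replacement) from the population
  temporary_sample = sample(violence_2000$violence, size = 15)
  # Calculate the average of this sample
  temporary_sample_average = mean(temporary_sample)
  # Append the average as a new row
  sample_averages = rbind(sample_averages, temporary_sample_average)
}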

Check what we got.

colnames(sample_averages) = "average"
head(sample_averages)
    average
1 0.6158000
2 0.6850667
3 0.7190000
4 0.6084667
5 0.5105333
6 0.5528000

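Take a look at the average of the collected averages. Did it get closer to the real parameter? This repeated sampling approximates the sampling distribution of the mean. Bootstrapping relies on a similar idea, but it resamples with replacement from a single observed sample rather than from the population.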

mean(sample_averages$average) 
[1] 0.6461693
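
The loop can also be written in one line with replicate(), which repeats an expression and collects the results in a vector. A minimal sketch on a made-up population (population is a hypothetical stand-in generated with runif(), not the V-Dem data):

```r
set.seed(1)

# Made-up "population" of 180 values between 0 and 1
population = runif(180)

# Repeat 100 times: draw 15 values and take their mean
sample_means = replicate(100, mean(sample(population, size = 15)))

# Compare the average of the sample means with the population mean
mean(sample_means)
mean(population)
```

Here the results come back as a plain numeric vector, so there is no need to create an empty data frame first.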

Draw a histogram of the averages.

...(...) 
  geom_histogram(aes(...))

Useful Tidyverse Functions

You can use the table below for your reference.

Function      Description
select()      Selects specific columns from a data frame
mutate()      Adds new variables or modifies existing ones
filter()      Filters rows based on specified conditions
group_by()    Groups data by one or more variables for subsequent operations
summarize()   Summarizes data by applying a function (e.g., mean, sum)
case_when()   Modifies a variable based on conditional logic
rename()      Renames columns in a data frame

You can check how to use these commands in this script, or you can open the help page for any function by typing ? followed by its name, for example ?filter.
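
As a quick refresher, here is how group_by() and summarize() work together. This is a minimal sketch on a made-up data frame (toy, n_obs, and mean_value are hypothetical names), assuming dplyr is installed:

```r
library(dplyr)

# Made-up data: two groups, one missing value
toy = data.frame(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 3, 2, 4, NA))

# One row per group: number of non-missing values and their mean
toy_summary = toy %>%
  group_by(group) %>%
  summarize(n_obs = sum(!is.na(value)),
            mean_value = mean(value, na.rm = TRUE))

toy_summary
```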

Helpful to review

Installing R and RStudio

First, we need to install R. Click the button below, then click “Download and Install R” and choose your operating system. On Windows, download “base”; on macOS and Linux, choose the version that matches your system. Then run the installer.

Download R
Step one

Second, we need to install RStudio. Click the button below, then click “Download RStudio Desktop”. You will be redirected to the version for your system automatically. Then run the installer.

Download RStudio
Step two