print("Code Chunk")
[1] "Code Chunk"
Week 1
January 9, 2025
We are expected to have R and RStudio installed; if not, see the installing R section.
In the discussion section, we will focus on coding and practicing what we have learned in the lectures.
Office hours are on Tuesday, 11-12:30 Scott 110.
Questions?
To insert a code chunk, use Ctrl+Alt+I on Windows or Cmd+Option+I on Mac. Run the whole chunk by clicking the green triangle, or run one or more lines with Ctrl+Enter on Windows or Command+Return on Mac.
Most of the functions we want to run require an argument. For example, the function print() above takes the argument “Code Chunk”.
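As a small illustration of functions and arguments, the sketch below calls two base R functions. The object names (rounded, label) are made up for this example.

```r
# round() takes the number to round as its first argument and a
# named argument, digits, for the number of decimal places.
rounded <- round(3.14159, digits = 2)
print(rounded)

# paste() accepts several arguments and joins them into one string.
label <- paste("Code", "Chunk")
print(label)
```

Arguments can be passed by position (the first value given to round()) or by name (digits = 2).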
There are many data structures, but the most important ones to know are the following.
Vectors, which are created with c().
Data frames, whose columns are accessed with the $ operator.
We work with various classes of data, and the analysis we perform depends heavily on these classes.
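Here is a minimal sketch of c() and the $ operator in use. The object names and values are invented for illustration and are not taken from the course data.

```r
# A vector is built with c() and holds values of a single class.
scores <- c(0.25, 0.75, 1.00)

# A data frame is a table whose columns are vectors of equal length.
countries <- data.frame(
  country_name = c("Mexico", "Sweden", "Ghana"),
  regime = scores
)

# The $ operator pulls one column out of a data frame as a vector.
countries$regime

# class() reports how R is storing an object.
class(scores)
```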
As you noticed, R did not identify the class of the data correctly. We can change it using the as.factor() function. You can just as easily change the class of a variable with as.numeric(), as.integer(), or as.character().
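A short sketch of class conversion, assuming a made-up vector of regime scores that was read in as text:

```r
# Regime scores stored as text, as they might arrive from a file.
regime_raw <- c("0", "0.25", "0.5", "0.25")
class(regime_raw)            # "character"

# Convert to numbers so that arithmetic works.
regime_num <- as.numeric(regime_raw)
mean(regime_num)             # 0.25

# Convert to a factor to treat the values as categories.
regime_fct <- as.factor(regime_raw)
levels(regime_fct)           # "0" "0.25" "0.5"

# Careful: as.numeric() on a factor returns the level codes,
# not the original numbers. Go through as.character() first.
as.numeric(regime_fct)                 # 1 2 3 2
as.numeric(as.character(regime_fct))   # 0.00 0.25 0.50 0.25
```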
Quite frequently, we use additional libraries to extend the capabilities of R. I’m sure you remember tidyverse. Let’s load it.
If you updated R or downloaded it recently, you can easily install libraries using the function install.packages().
Pipes (%>% or |>) are helpful for streamlining code. They introduce linearity to the process of writing the code. In plain English, a pipe translates to “take an object, and then”.
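To make the “take an object, and then” reading concrete, here is a minimal sketch using the native pipe |> (available in R 4.1 and later) on an invented vector. The %>% pipe works the same way here once tidyverse is loaded.

```r
# A small vector of index scores with one missing value.
scores <- c(0.322, 0.871, NA, 0.990)

# Nested calls read inside-out: mean first, then round.
nested <- round(mean(scores, na.rm = TRUE), 2)

# The same steps with a pipe read left to right:
# take scores, and then average them, and then round the result.
piped <- scores |>
  mean(na.rm = TRUE) |>
  round(2)

identical(nested, piped)
```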
First task, let’s load the data
This is the V-Dem dataset. For your reference, their codebook is available here.
The dataset is huge! Be careful
Imagine you are interested in the relationship between regime type and physical violence. Let’s select the variables we will work with. Unfortunately, the variable names are not very straightforward: the regime index is e_v2x_polyarchy_5C and the physical violence index is v2x_clphy.
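The selection step might look like the sketch below. It runs on a hypothetical two-row stand-in (vdem_toy) that borrows the V-Dem column names. The v2x_corr values are invented, and the object names are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the V-Dem data: real column names, toy values.
vdem_toy <- data.frame(
  country_name = c("Mexico", "Sweden"),
  year = c(2000, 2000),
  e_v2x_polyarchy_5C = c(0.75, 1.00),
  v2x_clphy = c(0.573, 0.990),
  v2x_corr = c(0.60, 0.01)
)

# Keep only the four columns needed for the analysis.
violence_toy <- vdem_toy %>%
  select(country_name, year, e_v2x_polyarchy_5C, v2x_clphy)

names(violence_toy)
```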
Let’s rename the variables so it’s easier to work with them.
violence_data = violence_data %>%
  rename(regime = e_v2x_polyarchy_5C,
         violence = v2x_clphy)
head(violence_data)
country_name year regime violence
1 Mexico 1789 0 0.322
2 Mexico 1790 0 0.322
3 Mexico 1791 0 0.322
4 Mexico 1792 0 0.322
5 Mexico 1793 0 0.322
6 Mexico 1794 0 0.322
Now, analyze the regime data. We can describe regime data using various statistics. Let’s check the min score for the regime.
Check the max score for the regime variable below.
Check the average score for the regime variable below.
Finally, use the summary() function to get the descriptive statistics.
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.2224 0.2500 1.0000 1139
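The NA's column in the summary matters: most descriptive functions return NA when any value is missing unless told otherwise. A minimal sketch on an invented vector (not the V-Dem regime variable):

```r
# Toy regime scores with one missing value.
regime <- c(0, 0.25, NA, 1, 0.5)

min(regime)                  # NA, because one value is missing
min(regime, na.rm = TRUE)    # 0
max(regime, na.rm = TRUE)    # 1
mean(regime, na.rm = TRUE)   # 0.4375

# summary() drops missing values itself and reports how many there were.
summary(regime)
```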
Select country name (country_name), year (year), and the Political corruption index (v2x_corr) from the vdem dataset.
Keep only the year 2010 using the filter() function.
Calculate the min, max, and median. Don’t forget about missing values: the na.rm = TRUE argument removes them.
Present the summary
You can use the table below for your reference.
Statistic | Function | Example Usage |
---|---|---|
Minimum | min() | min(x) |
Maximum | max() | max(x) |
Mean | mean() | mean(x) |
Median | median() | median(x) |
Standard Deviation | sd() | sd(x) |
Variance | var() | var(x) |
Sum | sum() | sum(x) |
Summary | summary() | summary(x) |
Let’s begin with analyzing the distribution of the Violence Index in the year 2000. We need to filter the data for this task.
The higher the number, the better the physical integrity in a given country.
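The filtering step could be sketched as follows. The data frame violence_toy is a hypothetical miniature shaped like violence_data, with invented values.

```r
library(dplyr)

# Hypothetical rows shaped like violence_data; the values are invented.
violence_toy <- data.frame(
  country_name = c("Mexico", "Mexico", "Sweden", "Sweden"),
  year = c(1999, 2000, 1999, 2000),
  regime = c(0.5, 0.75, 1, 1),
  violence = c(0.55, 0.573, 0.99, 0.99)
)

# filter() keeps only the rows where the condition is TRUE.
violence_toy_2000 <- violence_toy %>%
  filter(year == 2000)

violence_toy_2000
```

Note the double equals sign: == tests for equality, while a single = assigns a value.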
Let’s customize the plot a bit.
ggplot(data = violence_2000) +
  geom_histogram(aes(x = violence)) +
  labs(title = "Physical Integrity Rights Index",
       subtitle = "In 2000",
       x = "Violence Integrity Index",
       y = "") +
  theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Imagine you asked the following question: is it true that, in the year 2000, democracies had a better physical integrity rights index?
First, we need to differentiate between the regimes. Using the mutate() function, we can create new variables and modify existing ones. Let’s do that!
violence_2000 = violence_2000 %>%
  mutate(democracy = case_when(regime >= 0.5 ~ "Democracy",
                               regime < 0.5 ~ "Autocracy"))
head(violence_2000)
country_name year regime violence democracy
1 Mexico 2000 0.75 0.573 Democracy
2 Suriname 2000 0.75 0.871 Democracy
3 Sweden 2000 1.00 0.990 Democracy
4 Switzerland 2000 1.00 0.974 Democracy
5 Ghana 2000 0.75 0.946 Democracy
6 South Africa 2000 0.75 0.834 Democracy
There are multiple ways to visually compare two groups.
Let’s draw histograms first.
Comparing distributions between groups is easier with a boxplot. But are they different?
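Both comparisons might be drawn as in the sketch below. It uses a made-up eight-row data frame (toy) rather than violence_2000. The bin width and transparency are arbitrary choices.

```r
library(ggplot2)

# Invented values: four autocracies and four democracies.
toy <- data.frame(
  democracy = rep(c("Autocracy", "Democracy"), each = 4),
  violence = c(0.21, 0.44, 0.52, 0.60, 0.75, 0.83, 0.87, 0.97)
)

# Overlaid histograms, one fill colour per regime type.
hist_plot <- ggplot(toy) +
  geom_histogram(aes(x = violence, fill = democracy),
                 binwidth = 0.1, alpha = 0.6, position = "identity")

# One box per regime type: median, quartiles, and range at a glance.
box_plot <- ggplot(toy) +
  geom_boxplot(aes(x = democracy, y = violence))

hist_plot
box_plot
```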
Let’s try to reverse engineer the data-generating process and calculate the confidence intervals for those samples to double-check the results. We need to group_by() the type of regime, and then summarize() the data.
violence_ci = violence_2000 %>%
  group_by(democracy) %>%
  summarize(
    mean_violence = mean(violence, na.rm = TRUE),
    lower = mean_violence - 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()),
    upper = mean_violence + 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()))
violence_ci
# A tibble: 2 × 4
democracy mean_violence lower upper
<chr> <dbl> <dbl> <dbl>
1 Autocracy 0.443 0.386 0.499
2 Democracy 0.827 0.791 0.864
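These intervals use the normal approximation: mean plus or minus 1.96 times the standard error, where the standard error is the standard deviation divided by the square root of the sample size. One detail worth checking is that n() counts every row in a group, including rows where violence is missing, while mean() and sd() drop them. If the column contains missing values, a variant that counts only observed values might look like this sketch, written in base R on an invented vector:

```r
# Toy index values with one missing observation.
violence <- c(0.2, 0.4, NA, 0.6, 0.8)

# Count only the values that are actually observed.
n_obs <- sum(!is.na(violence))               # 4, not 5
mean_violence <- mean(violence, na.rm = TRUE)
se <- sd(violence, na.rm = TRUE) / sqrt(n_obs)

# 95% interval under the normal approximation.
lower <- mean_violence - 1.96 * se
upper <- mean_violence + 1.96 * se
c(lower = lower, mean = mean_violence, upper = upper)
```

Inside summarize(), the same idea would replace n() with sum(!is.na(violence)). When no values are missing, both versions give identical results.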
Finally, visualize it.
ggplot(violence_ci) +
  geom_linerange(aes(x = democracy,
                     ymin = lower,
                     ymax = upper)) +
  geom_point(aes(x = democracy,
                 y = mean_violence))
Lastly, sampling. Imagine we have data for the whole “population”. In our case, these are all countries in the year 2000, so we can compute the true average of the violence index.
Compare it to the sample of N = 15
And now calculate the average of the sample. Is it accurate?
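Drawing one sample could be sketched like this. The population here is a hypothetical vector of 170 random values between 0 and 1, not the actual V-Dem scores, and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed so the draw can be reproduced

# Hypothetical population of 170 index values between 0 and 1.
population <- runif(170)

# Draw 15 values without replacement and compare the two means.
one_sample <- sample(population, size = 15)
mean(population)
mean(one_sample)
```

With only 15 observations, the sample mean can land noticeably above or below the population mean.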
Let’s repeat (iterate) the process multiple times. First, create an empty dataset to store the data
Second, repeat the process 100 times
Check what we have got
average
1 0.6158000
2 0.6850667
3 0.7190000
4 0.6084667
5 0.5105333
6 0.5528000
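The iteration might be written as a for loop like the one below. This is a sketch on the same kind of hypothetical population (170 random values), so the numbers it produces will differ from the output shown above.

```r
set.seed(123)  # arbitrary seed for reproducibility
population <- runif(170)  # hypothetical population, not V-Dem data

# Empty data frame that will collect one average per iteration.
sample_means <- data.frame(average = numeric(0))

# Repeat 100 times: draw 15 values, store their mean as a new row.
for (i in 1:100) {
  draw <- sample(population, size = 15)
  sample_means <- rbind(sample_means, data.frame(average = mean(draw)))
}

head(sample_means)

# The mean of the 100 sample means versus the population mean.
mean(sample_means$average)
mean(population)
```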
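Take a look at the average of the collected averages. Did it get closer to the real parameter? This is the same resampling idea that underlies bootstrapping, although bootstrapping resamples with replacement from a single sample rather than from the whole population.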
Draw a histogram of the averages.
You can use the table below for your reference.
Function | Description |
---|---|
select() | Selects specific columns from a data frame |
mutate() | Adds new variables or modifies existing ones |
filter() | Filters rows based on specified conditions |
group_by() | Groups data by one or more variables for subsequent operations |
summarize() | Summarizes data by applying a function (e.g., mean, sum) |
case_when() | Modifies a variable based on conditional logic |
rename() | Renames columns in a data frame |
You can check how to use these commands in this script, or you can simply open the help page by typing ? followed by the function name (for example, ?mutate).
First, we need to install R. Click the button below and click “Download and Install R”. Choose your OS. For Windows, you need to download “base”; for macOS and Linux, you have to choose the version that matches your OS. Install.
For Windows:
Second, we need to install RStudio. Click the button below and click “Download RStudio Desktop”. You will be redirected to your version automatically. Install.