--- title: "Last Quarter's Review" subtitle: "Week 1" date: 2025-01-09 format: html: embed-resources: true toc: true --- # Before we start - We are expected to have installed R and RStudio, if not see the [installing R section](#installingr). - In the discussion section, we will focus on coding and practicing what we have learned in the lectures. - Office hours are on Tuesday, 11-12:30 Scott 110. - Questions?
Download script
# Brief recap of the last quarter ## Coding Terminology ### Code Chunk To insert a **Code Chunk**, you can use `Ctrl+Alt+I` on Windows and `Cmd+Option+I` on Mac. Run the whole chunk by clicking the green triangle, or one/multiple lines by using `Ctrl + Enter` or `Command + Return` on Mac. ```{r} print("Code Chunk") ``` ### Function and Arguments Most of the **functions** we want to run require an **argument** For example, the function `print()` above takes the argument "Code Chunk". ```{r} #| eval: false function(argument) ``` ### Data structures There are many **data structures**, but the most important to know the following. - **Objects.** Those are individual units, e.g. a number or a word. ```{r} #| results: hold number = 1 number word = "Northwestern" word ``` - **Vectors**. Vectors are collections of objects. To create one, you will need to use function `c()`. ```{r} numbers = c(1, 2, 3) numbers ``` - **Dataframes**. Dataframes are the most used data structure. Last quarter you spend a lot of time working with it. It is a table with data. Columns are called variables, and those are vectors. You can access a column using `$` operator. ```{r} #| results: hold df = data.frame(numbers, numbers_multiplied = numbers * 2) df df$numbers_multiplied ``` ### Data classes We work with various **classes of data**, and the analysis we perform depends heavily on these classes. - **Numeric**. Continuous data. ```{r} #| results: hold numeric_class = c(1.2, 2.5, 7.3) numeric_class class(numeric_class) ``` - **Integer**. Whole numbers (e.g., count data). ```{r} #| results: hold integer_class = c(1:3) class(integer_class) ``` - **Character**. Usually, represent textual data. ```{r} word class(word) ``` - **Factor**. Categorical variables, where each value is treated as an identifier for a category. ```{r} #| results: hold colors = c("blue", "green") class(colors) ``` As you noticed, R did not identify the class of data correctly. We can change it using `as.factor()` function. You can easily change the class of your variable (`as.numeric()`, `as.integer()`, `as.character()`) ```{r} colors = as.factor(colors) class(colors) ``` ### Libraries Quite frequently, we use additional **libraries** to extend the capabilities of R. I'm sure you remember `tidyverse`. Let's load it. ```{r} #| message: false library(tidyverse) ``` If you updated your R or recently downloaded it, you can easily install libraries using the function `install.packages()`. ### Pipes **Pipes** (`%>%` or `|>`) are helpful for streamlining the coding. They introduce linearity to the process of writing the code. In plain English, a pipe translates to "take an object, and then". ```{r} numbers %>% print() ``` # Describing Data First task, let's load the data ```{r} load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData")) ``` This is the V-Dem dataset. For your reference, their codebook is available [here](https://v-dem.net/documents/38/V-Dem_Codebook_v14.pdf). The dataset is huge! Be careful ```{r} #| results: hold nrow(vdem) ncol(vdem) ``` Imagine you are interested in the relationship between regime type and physical violence. Let’s **select** the variables we will work with. Quite unfortunately, the names of the variables are not as straightforward. The regime index is `e_v2x_polyarchy_5C` and Physical violence index is `v2x_clphy`. ```{r} violence_data = vdem %>% select(country_name, year, e_v2x_polyarchy_5C, v2x_clphy) ``` Let's **rename** the variables so it's easier to work with them. ```{r} violence_data = violence_data %>% rename(regime = e_v2x_polyarchy_5C, violence = v2x_clphy) head(violence_data) ``` Now, analyze the regime data. We can describe regime data using various statistics. Let's check the min score for the regime. ```{r} min(violence_data$regime, na.rm = T) ``` Check the max score for the regime variable below. ```{r} #| eval: false ...(violence_data$regime, na.rm = T) ``` Check the average score for the regime variable below. ```{r} #| eval: false mean(..., na.rm = T) ``` Finally, use the `summary()` function to get the descriptive statistics. ```{r} summary(violence_data$regime) ``` ::: callout-note ## Exercise Select country name (`contry_name`), year (`year`) and Political corruption index (`v2x_corr`) from the `vdem` dataset. ```{r} #| eval: false corruption = vdem %>% ...(contry_name, ...) ``` Leave only year 2010 using `filter()` function ```{r} #| eval: false corruption = corruption ... ``` Calculate min, max and median. Don't forget about missing values, `na.rm = TRUE` argument should remove them. ```{r} #| eval: false #| ...(corruption$v2x_corr) ``` Present the summary ```{r} #| eval: false ``` ::: You can use the table below for your reference. | Statistic | Function | Example Usage | |--------------------|----------------|--------------------------| | Minimum | `min()` | `min(x)` | | Maximum | `max()` | `max(x)` | | Mean | `mean()` | `mean(x)` | | Median | `median()` | `median(x)` | | Standard Deviation | `sd()` | `sd(x)` | | Variance | `var()` | `var(x)` | | Sum | `sum()` | `sum(x)` | | Summary | `summary()` | `summary(x)` | # Vizualization Let's begin with analyzing the distribution of the Violence Index in the year 2000. We need to **filter** the data for this task. ```{r} violence_2000 = violence_data %>% filter(year == 2000) ``` The higher the number, the better the physical integrity in a given country. ```{r} #| warning: false ggplot(data = violence_2000) + geom_histogram(aes(x = violence)) ``` Let's customize the plot a bit. ```{r} ggplot(data = violence_2000) + geom_histogram(aes(x = violence)) + labs(title = "Physical Integrity Rights Index", subtitle = "In 2000", x = "Violence Integrity Index", y = "") + theme_bw() ``` Imagine, you asked the following question. Is it true that in the year 2000 democracies had better violence integrity rights index? Firstly, you we need to differentiate between the regimes. Using `mutate()` function we can create and modify existing variables. Let's do that! ```{r} violence_2000 = violence_2000 %>% mutate(democracy = case_when(regime >= 0.5 ~ "Democracy", regime < 0.5 ~ "Autocracy")) head(violence_2000) ``` There are multiple ways to visually compare between two groups. Let's draw histograms first. ```{r} #| message: false ggplot(violence_2000) + geom_histogram(aes(x = violence, fill = democracy)) ``` But comparing distributions between groups is easier with the boxplot. But are they different? ```{r} ggplot(violence_2000) + geom_boxplot(aes(x = democracy, y = violence)) ``` Let's try to reverse engineer the data-generating process and calculate the confidence intervals for those samples to double-check the results. We need to `group_by()` type of the regime, and then `summarize()` the data. ```{r} violence_ci = violence_2000 %>% group_by(democracy) %>% summarize( mean_violence = mean(violence, na.rm = TRUE), lower = mean_violence - 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()), upper = mean_violence + 1.96 * sd(violence, na.rm = TRUE) / sqrt(n())) violence_ci ``` Finally, visualize it. ```{r} ggplot(violence_ci) + geom_linerange(aes(x = democracy, ymin = lower, ymax = upper)) + geom_point(aes(x = democracy, y = mean_violence)) ``` ::: callout-note ## Exercise Add title, and rename X and Y axis. ```{r} #| eval: false ggplot(violence_ci) + geom_linerange(aes(x = democracy, ymin = lower, ymax = upper)) + geom_point(aes(x = democracy, y = mean_violence)) + labs() ``` Draw a histogram of Political corruption index (`v2x_corr`) from the exercises above. ```{r} #| eval: false ...(corruption) geom_...() ``` ::: # Sampling Lastly, sampling. Imagine, we have data for the whole "population". In our case, these are all countries in the year 2000. We know average of violence index, which is ```{r} mean(violence_2000$violence) ``` Compare it to the sample of N = 15 ```{r} set.seed(1) violence_sample_15 = sample(violence_2000$violence, size = 15) ``` And now calculate the average of the sample. Is it accurate? ```{r} mean(violence_sample_15) ``` Let's repeat (iterate) the process multiple times. First, create an empty dataset to store the data ```{r} sample_averages = data.frame() ``` Second, repeat the process 100 times ```{r} for(i in 1:100){ temporary_sample = sample(violence_2000$violence, size = 15) temporary_sample_average = mean(temporary_sample) sample_averages = rbind(sample_averages, temporary_sample_average) } ``` Check what we have got ```{r} colnames(sample_averages) = "average" head(sample_averages) ``` Take a look on the average of the collected averages. Did it get closer to the real parameter? This is essentially **bootstrapping**. ```{r} mean(sample_averages$average) ``` Draw a histogram of the averages. ```{r} #| eval: false ...(...) geom_histogram(aes(...)) ``` # Useful Tidyverse Functions You can use the table below for your reference. | Function | Description | |---------------|----------------------------------------------------------------| | `select()` | Selects specific columns from a data frame | | `mutate()` | Adds new variables or modifies existing ones | | `filter()` | Filters rows based on specified conditions | | `group_by()` | Groups data by one or more variables for subsequent operations | | `summarize()` | Summarizes data by applying a function (e.g., mean, sum) | | `case_when()` | Modifies a variable based on conditional logic | | `rename()` | Renames columns in a data frame | You can check how to use these commands in this scipt, or you can simply use the help option `?function()`. # Helpful to review - [Summarizing Random Variables](https://gustavodiaz.org/ps403/slides/week3.html#/expected-value) - [Random Samples](https://gustavodiaz.org/ps403/slides/week4.html#/ingredients-for-statistical-inference) - [p-values](https://gustavodiaz.org/ps403/slides/week8.html#/p-values) # Installing R and RStudio {#installingr} **First**, we need to install R. Click the button below and click "Download and Install R". Choose your OS. For Windows you need to download "**base**"; for MacOS and Linux you have to choose the version of your OS. Install.
[Download R](https://cran.rstudio.com){.btn .btn-primary .btn role="button" target="_blank" style="width:200px"}
Step one
![](https://raw.githubusercontent.com/gustavo-diaz/NUmathcamp/main/r/images/install_r_cran.png)
For windows:
![](https://raw.githubusercontent.com/gustavo-diaz/NUmathcamp/main/r/images/install_r_base.png)
**Second**, we need to install RStudio. Click the button below and click "Download RStudio Desktop". You will be redirected to your version automatically. Install.
[Download RStudio](https://posit.co/download/rstudio-desktop/){.btn .btn-primary .btn role="button" target="_blank" style="width:200px"}
Step two
![](https://raw.githubusercontent.com/gustavo-diaz/NUmathcamp/main/r/images/install_rstudio_posit.png)