---
title: "Last Quarter's Review"
subtitle: "Week 1"
date: 2025-01-09
format: 
  html:
    embed-resources: true
toc: true
---

# Before we start

-   We are expected to have installed R and RStudio, if not see the [installing R section](#installingr).

-   In the discussion section, we will focus on coding and practicing what we have learned in the lectures.

-   Office hours are on Tuesday, 11-12:30 Scott 110.

-   Questions?

<center>
<a href="https://artur-baranov.github.io/nu-ps405-ds/scripts/ps405-d_1.qmd" class="btn btn-primary" role="button" download="ps405-d_1.qmd" style="width:200px" target="_blank">Download script</a>
</center>

# Brief recap of the last quarter

## Coding Terminology

### Code Chunk

To insert a **Code Chunk**, you can use `Ctrl+Alt+I` on Windows and `Cmd+Option+I` on Mac. Run the whole chunk by clicking the green triangle, or one/multiple lines by using `Ctrl + Enter` or `Command + Return` on Mac.

```{r}
print("Code Chunk")
```

### Function and Arguments

Most of the **functions** we want to run require an **argument** For example, the function `print()` above takes the argument "Code Chunk".

```{r}
#| eval: false

function(argument)
```


### Data structures

There are many **data structures**, but the most important to know the following.

-   **Objects.** Those are individual units, e.g. a number or a word.

```{r}
#| results: hold
number = 1
number

word = "Northwestern"
word
```


-   **Vectors**. Vectors are collections of objects. To create one, you will need to use function `c()`.

```{r}
numbers = c(1, 2, 3)
numbers
```

-   **Dataframes**. Dataframes are the most used data structure. Last quarter you spend a lot of time working with it. It is a table with data. Columns are called variables, and those are vectors. You can access a column using `$` operator.

```{r}
#| results: hold
df = data.frame(numbers, 
                numbers_multiplied = numbers * 2)
df
df$numbers_multiplied
```


### Data classes

We work with various **classes of data**, and the analysis we perform depends heavily on these classes.

-   **Numeric**. Continuous data. 

```{r}
#| results: hold
numeric_class = c(1.2, 2.5, 7.3)
numeric_class
class(numeric_class)
```

-   **Integer**. Whole numbers (e.g., count data).

```{r}
#| results: hold
integer_class = c(1:3)
class(integer_class)
```

-   **Character**. Usually, represent textual data.

```{r}
word
class(word)
```

-   **Factor**. Categorical variables, where each value is treated as an identifier for a category.

```{r}
#| results: hold

colors = c("blue", "green")
class(colors)
```

As you noticed, R did not identify the class of data correctly. We can change it using `as.factor()` function. You can easily change the class of your variable (`as.numeric()`, `as.integer()`, `as.character()`)

```{r}
colors = as.factor(colors)
class(colors)
```

### Libraries

Quite frequently, we use additional **libraries** to extend the capabilities of R. I'm sure you remember `tidyverse`. Let's load it.

```{r}
#| message: false

library(tidyverse)
```

If you updated your R or recently downloaded it, you can easily install libraries using the function `install.packages()`.

### Pipes

**Pipes** (`%>%` or `|>`) are helpful for streamlining the coding. They introduce linearity to the process of writing the code. In plain English, a pipe translates to "take an object, and then". 

```{r}
numbers %>%
  print()
```

# Describing Data

First task, let's load the data

```{r}
load(url("https://github.com/vdeminstitute/vdemdata/raw/6bee8e170578fe8ccdc1414ae239c5e870996bc0/data/vdem.RData"))
```

This is the V-Dem dataset. For your reference, their codebook is available [here](https://v-dem.net/documents/38/V-Dem_Codebook_v14.pdf).

The dataset is huge! Be careful

```{r}
#| results: hold
nrow(vdem)
ncol(vdem)
```

Imagine you are interested in the relationship between regime type and physical violence. Let’s **select** the variables we will work with. Quite unfortunately, the names of the variables are not as straightforward. The regime index is `e_v2x_polyarchy_5C` and Physical violence index is `v2x_clphy`. 

```{r}
violence_data = vdem %>%
  select(country_name, year, e_v2x_polyarchy_5C, v2x_clphy) 
```

Let's **rename** the variables so it's easier to work with them.

```{r}
violence_data = violence_data %>%
  rename(regime = e_v2x_polyarchy_5C,
         violence = v2x_clphy)

head(violence_data)
```

Now, analyze the regime data. We can describe regime data using various statistics. Let's check the min score for the regime.

```{r}
min(violence_data$regime, na.rm = T)
```

Check the max score for the regime variable below.

```{r}
#| eval: false
...(violence_data$regime, na.rm = T)
```

Check the average score for the regime variable below.

```{r}
#| eval: false
mean(..., na.rm = T)
```

Finally, use the `summary()` function to get the descriptive statistics.

```{r}
summary(violence_data$regime)
```

::: callout-note
## Exercise

Select country name (`contry_name`), year (`year`) and Political corruption index (`v2x_corr`) from the `vdem` dataset.

```{r}
#| eval: false

corruption = vdem %>%
  ...(contry_name, ...)
```

Leave only year 2010 using `filter()` function

```{r}
#| eval: false

corruption = corruption
  ...
```

Calculate min, max and median. Don't forget about missing values, `na.rm = TRUE` argument should remove them.

```{r}
#| eval: false
#| 
...(corruption$v2x_corr)
```

Present the summary

```{r}
#| eval: false


```

:::


You can use the table below for your reference.

| Statistic          | Function       | Example Usage            |
|--------------------|----------------|--------------------------|
| Minimum            | `min()`        | `min(x)`                 |
| Maximum            | `max()`        | `max(x)`                 |
| Mean               | `mean()`       | `mean(x)`                |
| Median             | `median()`     | `median(x)`              |
| Standard Deviation | `sd()`         | `sd(x)`                  |
| Variance           | `var()`        | `var(x)`                 |
| Sum                | `sum()`        | `sum(x)`                 |
| Summary            | `summary()`    | `summary(x)`             |


# Vizualization

Let's begin with analyzing the distribution of the Violence Index in the year 2000. We need to **filter** the data for this task.

```{r}
violence_2000 = violence_data %>%
  filter(year == 2000)
```

The higher the number, the better the physical integrity in a given country. 

```{r}
#| warning: false

ggplot(data = violence_2000) +
  geom_histogram(aes(x = violence))
```

Let's customize the plot a bit.

```{r}
ggplot(data = violence_2000) +
  geom_histogram(aes(x = violence)) +
  labs(title = "Physical Integrity Rights Index",
       subtitle = "In 2000",
       x = "Violence Integrity Index",
       y = "") +
  theme_bw()
``` 

Imagine, you asked the following question. Is it true that in the year 2000 democracies had better violence integrity rights index?

Firstly, you we need to differentiate between the regimes. Using `mutate()` function we can create and modify existing variables. Let's do that!

```{r}
violence_2000 = violence_2000 %>%
  mutate(democracy = case_when(regime >= 0.5 ~ "Democracy",
                               regime < 0.5 ~ "Autocracy"))

head(violence_2000)
```

There are multiple ways to visually compare between two groups. 

Let's draw histograms first.

```{r}
#| message: false

ggplot(violence_2000) +
  geom_histogram(aes(x = violence, fill = democracy))
```

But comparing distributions between groups is easier with the boxplot. But are they different?

```{r}
ggplot(violence_2000) +
  geom_boxplot(aes(x = democracy, y = violence))
```

Let's try to reverse engineer the data-generating process and calculate the confidence intervals for those samples to double-check the results. We need to `group_by()` type of the regime, and then `summarize()` the data.

```{r}
violence_ci = violence_2000 %>%
  group_by(democracy) %>%
  summarize(
    mean_violence = mean(violence, na.rm = TRUE),
    lower = mean_violence - 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()),
    upper = mean_violence + 1.96 * sd(violence, na.rm = TRUE) / sqrt(n()))

violence_ci
```

Finally, visualize it.

```{r}
ggplot(violence_ci) +
  geom_linerange(aes(x = democracy,
                     ymin = lower,
                     ymax = upper)) +
  geom_point(aes(x = democracy,
                 y = mean_violence))
```

::: callout-note
## Exercise

Add title, and rename X and Y axis.

```{r}
#| eval: false

ggplot(violence_ci) +
  geom_linerange(aes(x = democracy,
                     ymin = lower,
                     ymax = upper)) +
  geom_point(aes(x = democracy,
                 y = mean_violence)) +
  labs()
```

Draw a histogram of Political corruption index (`v2x_corr`) from the exercises above.

```{r}
#| eval: false

...(corruption) 
  geom_...()
```


:::


# Sampling

Lastly, sampling. Imagine, we have data for the whole "population". In our case, these are all countries in the year 2000. We know average of violence index, which is

```{r}
mean(violence_2000$violence)
```

Compare it to the sample of N = 15

```{r}
set.seed(1)
violence_sample_15 = sample(violence_2000$violence, size = 15)
```

And now calculate the average of the sample. Is it accurate?

```{r}
mean(violence_sample_15)
```

Let's repeat (iterate) the process multiple times. First, create an empty dataset to store the data

```{r}
sample_averages = data.frame() 
```

Second, repeat the process 100 times

```{r}
for(i in 1:100){
  temporary_sample = sample(violence_2000$violence, size = 15) 
  temporary_sample_average = mean(temporary_sample)  
  sample_averages = rbind(sample_averages, temporary_sample_average) 
}
```

Check what we have got

```{r}
colnames(sample_averages) = "average"
head(sample_averages)
```

Take a look on the average of the collected averages. Did it get closer to the real parameter? This is essentially **bootstrapping**.

```{r}
mean(sample_averages$average) 
```

Draw a histogram of the averages. 

```{r}
#| eval: false

...(...) 
  geom_histogram(aes(...))
```


# Useful Tidyverse Functions

You can use the table below for your reference.

| Function      | Description                                                    |
|---------------|----------------------------------------------------------------|
| `select()`    | Selects specific columns from a data frame                     |
| `mutate()`    | Adds new variables or modifies existing ones                   |
| `filter()`    | Filters rows based on specified conditions                     |
| `group_by()`  | Groups data by one or more variables for subsequent operations |
| `summarize()` | Summarizes data by applying a function (e.g., mean, sum)       |
| `case_when()` | Modifies a variable based on conditional logic                 |
| `rename()`    | Renames columns in a data frame                                |

You can check how to use these commands in this scipt, or you can simply use the help option `?function()`.


# Helpful to review

-   [Summarizing Random Variables](https://gustavodiaz.org/ps403/slides/week3.html#/expected-value)

-   [Random Samples](https://gustavodiaz.org/ps403/slides/week4.html#/ingredients-for-statistical-inference)

-   [p-values](https://gustavodiaz.org/ps403/slides/week8.html#/p-values)


# Installing R and RStudio {#installingr}

**First**, we need to install R. Click the button below and click "Download and Install R". Choose your OS. For Windows you need to download "**base**"; for MacOS and Linux you have to choose the version of your OS. Install.
  
<center>[Download R](https://cran.rstudio.com){.btn .btn-primary .btn role="button" target="_blank" style="width:200px"}</center>

<details>
  <summary>Step one</summary>
  <center>![](https://raw.githubusercontent.com/gustavo-diaz/NUmathcamp/main/r/images/install_r_cran.png)</center>
  
  For windows:
  
  <center>![](https://raw.githubusercontent.com/gustavo-diaz/NUmathcamp/main/r/images/install_r_base.png)</center>
</details>

**Second**, we need to install RStudio. Click the button below and click "Download RStudio Desktop". You will be redirected to your version automatically. Install.

<center>[Download RStudio](https://posit.co/download/rstudio-desktop/){.btn .btn-primary .btn role="button" target="_blank" style="width:200px"}</center>

<details>
  <summary>Step two</summary>
  <center>![](https://raw.githubusercontent.com/gustavo-diaz/NUmathcamp/main/r/images/install_rstudio_posit.png)</center>
</details>