Introduction to Statistical Methods

Week 1

Published

April 3, 2025

Before we start

artur-baranov.github.io/nu-ps312-ds

library(qrcode)
qr_code('https://artur-baranov.github.io/nu-ps312-ds') |>
  plot()

Getting to know each other

First of all, introduction time!

Your name
Your interests in political science
Your previous experience with R

Agenda

Getting used to working in RStudio
Loading data into R
Practicing asking statistical questions

Introduction to the DS & Software

How we work

Create a folder for the PS312 discussion section.
Download the script and the data, and put them into the created folder.
Follow the script throughout the discussion section, run the code.
Explore the syntax. Amend the code after the class, play with graphs and functions. Experiment!

You can install R and RStudio; instructions are available here. Alternatively, feel free to use Posit Cloud.

Navigating RStudio

It will take some time to understand how everything works in RStudio, but once you do, it’s quite straightforward. The default layout consists of four panes.

Source. Here we write text and the code we want to run.
Environment. This pane shows the objects (such as datasets) loaded into the current R session and lets you interact with them.
Console. This pane provides an area to execute code interactively.
Files. By default, this pane shows your working directory. From here you can access files associated with the project.

Programming

Code Chunk

To insert a Code Chunk, use Ctrl+Alt+I on Windows or Cmd+Option+I on Mac. Run the whole chunk by clicking the green triangle, or run one or several selected lines with Ctrl+Enter on Windows or Cmd+Return on Mac.

print("Code Chunk")

[1] "Code Chunk"

Coding Task

Insert a chunk. Print “Northwestern”.

Function and Arguments

Most of the functions we want to run require an argument. For example, the function print() above takes the argument “Code Chunk”.

function(argument)
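
As a small illustration of this pattern, the sketch below calls the base R function round(). The input number is arbitrary and chosen only for the example.

```r
# round() takes the number to round as its first argument
# and the number of decimal places as the named argument `digits`.
rounded = round(3.14159, digits = 2)

print(rounded)
```

Arguments can be passed by position (the first one here) or by name (digits = 2). Naming them makes the code easier to read.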

Pipes

Pipes (%>% from the tidyverse, or the native |>) help streamline code: they let you write a sequence of steps in the order they happen. In plain English, a pipe translates to “take an object, and then”.

numbers = c(1, 2, 3) # a vector object

numbers |>
  print()

[1] 1 2 3
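
To see why pipes make code more linear, here is a minimal sketch that compares a nested call with the same steps written as a pipe. The vector is made up for illustration, and the native pipe |> is assumed to be available (R 4.1 or newer).

```r
values = c(1, 4, 9) # a made-up vector

# Nested version: read from the inside out
nested = sqrt(sum(values))

# Piped version: take values, and then sum them, and then take the square root
piped = values |>
  sum() |>
  sqrt()

identical(nested, piped) # both versions compute the same number
```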

Libraries

Quite often we rely on functions that do not come with base R. To use them, we first need to install and then load libraries. Let’s do it step by step.

First, let’s install the tidyverse package. Run the chunk below only once! Generally, it’s considered good practice to remove the install.packages() command from scripts.

install.packages("tidyverse")

Make sure you have removed the chunk with the install.packages() function. Now, load the library into the current R session. We’ll be working with this library extensively throughout the quarter.

library(tidyverse)
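
If you are not sure whether a package is already installed, one possible check is sketched below. It only reports what to do and does not install anything; the variable names are illustrative.

```r
# requireNamespace() returns TRUE if the package is installed and can be loaded
pkg = "tidyverse"
is_installed = requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Only a reminder is printed; installation is left to you
  message("Run install.packages(\"", pkg, "\") once in the Console.")
}
```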

Asking Questions

Today we are working with the World Happiness Report. The codebook is here to help us. It’s quite useful to look for one when dealing with third-party data. Here are some variables listed in the codebook:

Country_name is the name of the country
Ladder_score is the happiness score
Logged_GDP_per_capita is the log of GDP per capita

And many others.

Let’s load the data.

whr = read_csv("data/WHR.csv")
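
The path "data/WHR.csv" is relative to the working directory, so R looks for a folder called data inside it. If loading fails, a quick check along the following lines may help. This is a sketch using base R only, and data_path is an illustrative name.

```r
getwd()                                   # shows the current working directory

data_path = file.path("data", "WHR.csv")  # builds the relative path "data/WHR.csv"

file.exists(data_path)                    # TRUE if the file is where R expects it
```

If file.exists() returns FALSE, either the file is in a different folder or the working directory is not the folder you created for the discussion section.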

To print the first six rows of the dataset, you can use the head() function.

head(whr)

# A tibble: 6 × 20
  Country_name Ladder_score Standard_error_of_ladder…¹ upperwhisker lowerwhisker
  <chr>               <dbl>                      <dbl>        <dbl>        <dbl>
1 Finland              7.80                      0.036         7.88         7.73
2 Denmark              7.59                      0.041         7.67         7.51
3 Iceland              7.53                      0.049         7.62         7.43
4 Israel               7.47                      0.032         7.54         7.41
5 Netherlands          7.40                      0.029         7.46         7.35
6 Sweden               7.40                      0.037         7.47         7.32
# ℹ abbreviated name: ¹Standard_error_of_ladder_score
# ℹ 15 more variables: Logged_GDP_per_capita <dbl>, Social_support <dbl>,
#   Healthy_life_expectancy <dbl>, Freedom_to_make_life_choices <dbl>,
#   Generosity <dbl>, Perceptions_of_corruption <dbl>,
#   Ladder_score_in_Dystopia <dbl>, Explained_by_Log_GDP_per_capita <dbl>,
#   Explained_by_Social_support <dbl>,
#   Explained_by_Healthy_life_expectancy <dbl>, …

Extracting Facts from the Data

Often we start with descriptive questions.

Which country is the happiest?

whr %>%
  select(Country_name, Ladder_score) %>% # choosing the variables
  arrange(-Ladder_score) %>%             # arranging data in descending order
  head(1)                                # leaving the first observation

# A tibble: 1 × 2
  Country_name Ladder_score
  <chr>               <dbl>
1 Finland              7.80

Which country is the least happy?

whr %>%
  select(Country_name, Ladder_score) %>% # choosing the variables
  arrange(Ladder_score) %>%              # arranging data in ascending order
  head(1)                                # leaving the first observation

# A tibble: 1 × 2
  Country_name Ladder_score
  <chr>               <dbl>
1 Afghanistan          1.86
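
The arrange-then-head pattern used above is one way to find extremes. An alternative sketch uses slice_max() and slice_min() from dplyr (part of the tidyverse). The small table below is made up for illustration and does not contain WHR values.

```r
library(dplyr)

# Made-up scores for illustration only; these are not WHR values
toy = tibble(country = c("A", "B", "C"),
             score   = c(5.2, 7.1, 3.4))

happiest = toy |>
  slice_max(score, n = 1)   # row with the highest score

least_happy = toy |>
  slice_min(score, n = 1)   # row with the lowest score

happiest
least_happy
```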

How happy are the countries?

ggplot(data = whr) +
  geom_histogram(aes(x = Ladder_score))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
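
The message above is a reminder that the default of 30 bins is rarely the best choice. A minimal sketch of setting the bar width explicitly is shown below. It uses made-up scores rather than the WHR data, and the value 0.5 is just one reasonable option to try.

```r
library(ggplot2)

# Made-up scores for illustration only; these are not WHR values
toy = data.frame(score = c(3.1, 4.5, 4.8, 5.2, 5.9, 6.3, 6.4, 7.0))

p = ggplot(data = toy) +
  geom_histogram(aes(x = score), binwidth = 0.5) # each bar covers 0.5 points

p
```

Try a few values of binwidth: bars that are too wide hide the shape of the distribution, while bars that are too narrow make it look noisy.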

These questions inform us only about the ‘facts’ in the data we are working with. Though purely descriptive, they provide some intuition about what’s happening. What potentially intriguing patterns do you observe in the graph above?

Relationships between the variables

Let’s start to explore more interesting patterns. Often, the starting point is a normative question. For example: how can we make our country happier?

Question

How would you answer this question, and what would you need in order to answer it?
Think the question through: how can you turn it into a statistical research question?

Ultimately, we are interested in causal relationships, but in the first steps we quite often start with correlations. Let’s focus on the relationship between happiness and wealth, measured as the log of GDP per capita.

Question

What type of relationship exists between happiness and wealth? Is it strong? What is its direction?

cor(whr$Ladder_score, whr$Logged_GDP_per_capita)

[1] 0.7843673
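
By default, cor() returns the Pearson correlation coefficient, which ranges from -1 to 1: the sign gives the direction of the relationship and the absolute value its strength. To see what the function does, the sketch below computes the coefficient by hand on two short made-up vectors (not WHR data) and compares it with cor().

```r
# Made-up numbers for illustration only
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

# Pearson's r: how x and y vary together, scaled by how much each varies alone
r_manual = sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y) # matches the manual calculation
```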

Let’s visualize this relationship. As a rule of thumb, when we have not established that a relationship is causal, we say there is an association between X and Y. Is there one?

ggplot(data = whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Wealth",
       y = "Happiness")
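
The line added by geom_smooth(method = "lm") is a straight line fitted with a linear model. As a rough sketch of what happens behind the scenes, the same kind of line can be fitted with the base R function lm(). The numbers below are made up for illustration and are not WHR data.

```r
# Made-up numbers for illustration only
x = c(1, 2, 3, 4, 5)
y = c(2, 4, 5, 4, 5)

fit = lm(y ~ x) # y is the dependent variable, x the independent variable

coef(fit)       # intercept and slope of the fitted line
```

A positive slope means that higher values of x are associated with higher values of y. It does not by itself show that x causes y.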

Tips for Thinking About Statistical Relationships

Ask yourself: what are you interested in? Then identify your dependent variable (i.e., what you are explaining) and your independent variables (i.e., the explanatory factors).
Do not underestimate exploratory analysis. Explore your data! Get as much information as possible, find patterns, and explore the relationships between the variables you are interested in.
Think about causality. Are you sure that your independent variable causes the change in your dependent variable?

Check List

I know how to insert a chunk of code and run it

I know how directories work and how to load data in R

I am familiar with the concepts of independent and dependent variables

I am starting to understand how to ask statistical research questions, and the word causality does not scare me