library(qrcode)
qr_code('https://artur-baranov.github.io/nu-ps312-ds') |>
plot()
Introduction to Statistical Methods
Week 1
Before we start
Getting to know each other
First of all, introduction time!
Your name
Your interest in political science
Your previous experience with R
Agenda
Getting used to working in RStudio
Loading data into R
Practicing asking statistical questions
Introduction to the DS & Software
How we work
Create a folder for the PS312 discussion section.
Download the script and the data, and put them into the created folder.
Follow the script throughout the discussion section, run the code.
Explore the syntax. Amend the code after the class, play with graphs and functions. Experiment!
You can install R and RStudio; instructions are available here. Alternatively, feel free to use Posit.Cloud.
Programming
Code Chunk
To insert a Code Chunk, you can use Ctrl+Alt+I on Windows and Cmd+Option+I on Mac. Run the whole chunk by clicking the green triangle, or run one or multiple lines by using Ctrl+Enter on Windows or Command+Return on Mac.
print("Code Chunk")
[1] "Code Chunk"
Function and Arguments
Most of the functions we want to run require an argument. For example, the function print() above takes the argument “Code Chunk”.
function(argument)
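As a small illustration of the function(argument) pattern, the sketch below calls the base R function round(). The specific number is just an example value, not something from the course script. It shows that a function can take more than one argument and that arguments can be passed by name.

```r
# round() is a base R function; the first argument is the number to round.
round(3.14159)

# Many functions accept additional, named arguments.
# Here `digits` tells round() how many decimal places to keep.
rounded = round(3.14159, digits = 2)
print(rounded)
```

Naming arguments explicitly (digits = 2) tends to make code easier to read later on.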
Pipes
Pipes (%>% or |>) are helpful for streamlining code. They introduce linearity to the process of writing code. In plain English, a pipe translates to “take an object, and then”.
numbers = c(1, 2, 3) # a vector object

numbers |>
  print()
[1] 1 2 3
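To see the “take an object, and then” reading in action, here is a minimal sketch that compares a nested call with the same computation written as a pipe. The vector values are chosen for illustration only. It uses the native pipe |>, which is available in R 4.1 and later.

```r
values = c(1, 4, 9) # an example vector

# Nested version: read from the inside out.
nested = sum(sqrt(values))

# Piped version: take values, and then take square roots, and then sum.
piped = values |>
  sqrt() |>
  sum()

# Both approaches give the same result.
identical(nested, piped)
```

The piped version reads top to bottom in the order the steps are executed, which is the linearity mentioned above.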
Libraries
Quite often we rely on non-native functions. To use them, we need to first install and then load libraries. Let’s do it step by step.
First, let’s install the tidyverse package. Run the chunk below only once! Generally, it is considered good practice to remove the install.packages() command from scripts.
install.packages("tidyverse")
Make sure you have removed the chunk with the install.packages() function. Now, load the library into the current R session. We’ll be working with this library extensively throughout the quarter.
library(tidyverse)
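If you are unsure whether a package is already installed, one possible check is sketched below. The helper name needs_install is hypothetical and not part of the course script; it relies on the base R function requireNamespace(). The actual installation line is left commented out so that running the chunk never triggers a download by accident.

```r
# Hypothetical helper: returns TRUE when a package is NOT installed yet.
needs_install = function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# The "stats" package ships with R, so it never needs installing.
stats_missing = needs_install("stats")
print(stats_missing)

# Possible use for tidyverse (uncomment to run it once):
# if (needs_install("tidyverse")) install.packages("tidyverse")
```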
Asking Questions
Today we are working with the World Happiness Report. The codebook is here to help us: it is useful to look for one whenever you deal with third-party data. Here are some variables listed in the codebook:
Country_name is the name of the country
Ladder_score is the happiness score
Logged_GDP_per_capita is the log of GDP per capita
And many others.
Let’s load the data.
whr = read_csv("data/WHR.csv")
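The path "data/WHR.csv" is relative: R looks for it starting from the current working directory. The sketch below illustrates this with a throwaway folder and a toy file, so it runs without the course data. The folder name ps312 and the file name WHR_toy.csv are assumptions made for the example, the two rows are copied from the first rows of the dataset, and base R's read.csv() stands in for read_csv().

```r
# Build a temporary project folder with a data/ subfolder (names are made up).
project_dir = file.path(tempdir(), "ps312")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny toy file so there is something to load.
toy = data.frame(Country_name = c("Finland", "Denmark"),
                 Ladder_score = c(7.80, 7.59))
write.csv(toy, file.path(project_dir, "data", "WHR_toy.csv"), row.names = FALSE)

# Move into the project folder and remember where we were before.
old_wd = setwd(project_dir)
getwd()                          # the folder relative paths start from
file.exists("data/WHR_toy.csv")  # check the file before loading it

whr_toy = read.csv("data/WHR_toy.csv")
setwd(old_wd)                    # go back to the original working directory
```

If file.exists() returns FALSE for your own data, the working directory is usually not the folder you created for the discussion section.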
To print the first 6 rows of the dataset, you can use the head() function.
head(whr)
# A tibble: 6 × 20
Country_name Ladder_score Standard_error_of_ladder…¹ upperwhisker lowerwhisker
<chr> <dbl> <dbl> <dbl> <dbl>
1 Finland 7.80 0.036 7.88 7.73
2 Denmark 7.59 0.041 7.67 7.51
3 Iceland 7.53 0.049 7.62 7.43
4 Israel 7.47 0.032 7.54 7.41
5 Netherlands 7.40 0.029 7.46 7.35
6 Sweden 7.40 0.037 7.47 7.32
# ℹ abbreviated name: ¹Standard_error_of_ladder_score
# ℹ 15 more variables: Logged_GDP_per_capita <dbl>, Social_support <dbl>,
# Healthy_life_expectancy <dbl>, Freedom_to_make_life_choices <dbl>,
# Generosity <dbl>, Perceptions_of_corruption <dbl>,
# Ladder_score_in_Dystopia <dbl>, Explained_by_Log_GDP_per_capita <dbl>,
# Explained_by_Social_support <dbl>,
# Explained_by_Healthy_life_expectancy <dbl>, …
Extracting Facts from the Data
Often we start with descriptive questions.
What is the happiest country?
whr %>%
  select(Country_name, Ladder_score) %>% # choosing the variables
  arrange(-Ladder_score) %>% # arranging data in descending order
  head(1) # leaving the first observation
# A tibble: 1 × 2
Country_name Ladder_score
<chr> <dbl>
1 Finland 7.80
What is the least happy country?
whr %>%
  select(Country_name, Ladder_score) %>% # choosing the variables
  arrange(Ladder_score) %>% # arranging data in ascending order
  head(1) # leaving the first observation
# A tibble: 1 × 2
Country_name Ladder_score
<chr> <dbl>
1 Afghanistan 1.86
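The two pipelines above sort the data and keep the first row. An alternative sketch, assuming the dplyr package (part of the tidyverse) is installed, uses slice_max() and slice_min() to pick the extreme rows directly. To keep it runnable without the data file, it works on a toy data frame whose values are copied from the outputs shown above.

```r
library(dplyr)

# Toy stand-in for whr, using scores printed earlier in this section.
toy = data.frame(
  Country_name = c("Finland", "Denmark", "Iceland", "Afghanistan"),
  Ladder_score = c(7.80, 7.59, 7.53, 1.86)
)

# Row with the highest happiness score.
happiest = toy |>
  slice_max(Ladder_score, n = 1)

# Row with the lowest happiness score.
least_happy = toy |>
  slice_min(Ladder_score, n = 1)

happiest
least_happy
```

Note that slice_max() and slice_min() keep tied rows by default, so they may return more than one row when several countries share the extreme score.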
How happy are the countries?
ggplot(data = whr) +
geom_histogram(aes(x = Ladder_score))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These questions allow us to be informed only about the ‘facts’ or the data we are working with. Though quite descriptive, they provide some intuition about what’s happening. What are some potentially intriguing patterns you observe in the graph above?
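The message printed by the histogram call suggests choosing a bin width explicitly. A minimal sketch of how that could look is below. The value 0.5 is an arbitrary assumption rather than a recommendation, and the toy data frame reuses a handful of scores printed earlier so the code runs without the data file.

```r
library(ggplot2)

# Toy stand-in for whr: a few happiness scores shown earlier in this section.
toy = data.frame(Ladder_score = c(7.80, 7.59, 7.53, 7.47, 7.40, 7.40, 1.86))

# binwidth sets the width of each bar in units of the x variable;
# 0.5 here is only an example value.
p = ggplot(data = toy) +
  geom_histogram(aes(x = Ladder_score), binwidth = 0.5)

p
```

With the full dataset, replace toy with whr and try a few bin widths to see how the shape of the distribution changes.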
Relationships between the variables
Let’s start to explore more interesting patterns. Often, the starting point is a normative question. For example: how can we make our country happier?
Ultimately, we are interested in causal relationships, but in the initial steps we often start with correlations. Let’s focus on the relationship between happiness and wealth, measured as the log of GDP per capita.
Let’s visualize this relationship. As a rule of thumb, when speaking about a relationship whose causal nature we have not definitively determined, we say there is an association between X and Y. Is there one?
ggplot(data = whr, aes(x = Logged_GDP_per_capita, y = Ladder_score)) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Wealth",
y = "Happiness")
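The plot shows the association visually; it can also be summarised with numbers. The sketch below is one possible way to do that, using a hypothetical helper called describe_association that reports the correlation and the slope of the fitted line (the same kind of line that geom_smooth(method = "lm") draws). Since the course data file is not bundled here, the built-in mtcars dataset stands in for whr.

```r
# Hypothetical helper: correlation and linear slope between two columns.
describe_association = function(data, x, y) {
  # Build the formula y ~ x from the column names and fit a linear model.
  fit = lm(reformulate(x, response = y), data = data)
  list(
    # Rows with missing values are dropped before computing the correlation.
    correlation = cor(data[[x]], data[[y]], use = "complete.obs"),
    slope = unname(coef(fit)[2])
  )
}

# Stand-in example with a dataset that ships with R.
result = describe_association(mtcars, x = "wt", y = "mpg")
result

# With the course data loaded, the call would be:
# describe_association(whr, x = "Logged_GDP_per_capita", y = "Ladder_score")
```

A correlation close to 1 or -1 indicates a strong linear association, but neither number says anything about causality on its own.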
Tips for Thinking About Statistical Relationships
Ask yourself: what are you interested in? Then identify your dependent variable (i.e., what you are explaining) and your independent variable (i.e., the explanatory factor).
Do not underestimate exploratory analysis. Explore your data! Get as much information as possible, find patterns, and explore the relationships between the variables you are interested in.
Think about causality. Are you sure that your independent variable causes a change in the dependent variable?
Check List
I know how to insert a chunk of code and run it
I know how directories work and how to load data into R
I am familiar with the concepts of independent and dependent variables
I am starting to understand how to ask statistical research questions, and the word causality does not scare me