---
title: "GLMs and Quarter Review"
subtitle: "Week 9"
date: 2025-03-06
format: 
  html:
    embed-resources: true
    toc: true
---

# Before we start
Download script

Download data
# Optional Review of the previous week

::: callout-note
## Review

Load the `tidyverse` library.

```{r}
#| message: false
library(tidyverse)
```

Load the European Social Survey data (`ess.csv`).

```{r}

```

Rename the variables:

- `pplfair` to `trust`
- `trstplt` to `trust_politicians`
- `rlgdgr` to `religion`

```{r}

```

Run the following model. DV: `happy`; IVs: `trust`, `trust_politicians`, and `religion`. Present the `summary()`.

```{r}

```

Using `bptest()` from the `lmtest` library, test whether the model is homoscedastic. Interpret the results of the Breusch-Pagan test.

```{r}

```

Draw a residuals vs fitted plot. Does it match the BP test results? You can use `plot(..., which = 1)`.

```{r}

```

Calculate the cutoff for the Cook's distance using the formula $\frac{4}{N-k-1}$. Plot the Cook's distance using `plot()` and add a cutoff line using `abline()`. Are there any influential points?

```{r}

```

Set up a model without the influential points. How different are the estimates? Present a simple `modelsummary()` comparison.

```{r}
#| eval: false
model_no = lm(happy ~ trust + trust_politicians + religion, ess[-c(...),])

library(...)
modelsummary(list(...), stars = T)
```

Draw a Q-Q plot for the full model using `plot(..., which = 2)`. Are the residuals distributed normally?

```{r}

```

Using `shapiro.test()`, test the normality of the `residuals` of the model. Interpret the results.

```{r}

```

What else do we have to check? How would you do that? No code is needed, but feel free to rely on the previous lab for this question.
:::

# Agenda

- Recap of what we did
- And a short introduction to GLMs along the way

# Recap

Today we are working with the Transitional Justice Evaluation Tools dataset. You can explore the codebook [here](https://transitionaljusticedata.org/en/downloads.html). Let's load the data.

```{r}
#| eval: false
tjet = read.csv("data/tjet.csv")
```

Print out the first observations.

```{r}
#| eval: false
...(tjet)
```

We are going to use the following variables:

- `country` and `year`
- `dtr`: democratic transition
- `tcs_operated`: truth commission operated
- `trials_domestic`: count of domestic trials per country and start year
- `region`: United Nations region

Subset the above variables. You can do this with `select()`.

```{r}
#| eval: false
tjet_sub = ... %>% 
  ...(country, year, dtr, tcs_operated, trials_domestic, region)
```

Keep only the more recent observations, say, starting from 1975.

```{r}
#| eval: false
tjet_sub = tjet_sub %>% 
  ...(year >= 1975)
```

First, let's explore the distribution of the `trials_domestic` variable. Draw a histogram.

```{r}
#| eval: false
...
```

Now, draw a boxplot of `trials_domestic` by `region`.

```{r}
#| eval: false
ggplot(tjet_sub) + 
  ...(aes(x = region, y = ...))
```

Now, present the descriptive statistics for the same variable (`trials_domestic`).

```{r}
#| eval: false
...
```

Now, let's do a slightly more detailed subgroup analysis. Calculate the average and median `trials_domestic` for all the regions. Save that to a new object. Why is the average so different from the median?

```{r}
#| eval: false
trials_region = tjet_sub %>% 
  ...(region) %>% 
  ...(average_trials = ...(trials_domestic))

trials_region
```

Now, let's count the number of transitions (`dtr`). Why are there so many transitions?

```{r}
#| eval: false
transitions_region = ...

transitions_region
```
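If the `group_by()` and `summarise()` pattern still feels unfamiliar, here is a minimal, self-contained sketch of the same idea using the built-in `mtcars` data. The grouping variable and summary names below are purely illustrative and are not part of the exercise.

```{r}
#| message: false
library(tidyverse)

# group cars by number of cylinders and compare the mean and median mpg;
# a large gap between the two is a hint that the distribution is skewed
mtcars %>% 
  group_by(cyl) %>% 
  summarise(average_mpg = mean(mpg, na.rm = TRUE),
            median_mpg = median(mpg, na.rm = TRUE))
```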
Let's visualize various statistics! Combine `trials_region` and `transitions_region` using `left_join()`. Draw a scatterplot with `transitions` on the X axis and `average_trials` on the Y axis. Add `geom_text()` with the `label` argument set to `region`, and use `vjust = 1.5`.

```{r}
#| eval: false
trials_region %>% 
  ... 
  ggplot(...(x = ..., y = average_trials)) + 
  geom_point(size = 4) + 
  geom_text(aes(label = region), vjust = 1.5)
```

Using `ggpairs()`, visualize the following columns: `"trials_domestic", "region", "tcs_operated", "dtr"`. What do you notice?

```{r}
#| eval: false
library(...)
...
```

Set up a multiple linear regression. Predict democratic transition (`dtr`) by `trials_domestic` and `tcs_operated`. Given the dependent variable, what type of model did we create? Present the results.

```{r}
#| eval: false
model = ...

summary(model)
```

Load the `ggeffects` library.

```{r}
#| warning: false
library(ggeffects)
```

Visualize the model with the terms `trials_domestic` and `tcs_operated`.

```{r}
#| eval: false
ggpredict(model, terms = c("trials_domestic", "tcs_operated")) %>% 
  plot()
```

Check the normality of the residuals.

```{r}
#| eval: false
...(model, which = 2)
```

Set up the linear probability model again, but now include country and year fixed effects.

```{r}
#| eval: false
model_lpm = lm(dtr ~ trials_domestic + tcs_operated + ..., tjet_sub)
```

Check normality again, using a Q-Q plot.

```{r}
#| eval: false
plot(model_lpm, 2)
```

Now, let's check if the residuals are homoscedastic. Use `which = 1` for `plot()`. What is that pattern? Can we fix it? And should we even try?

```{r}
#| eval: false
plot(model_lpm, 1)
```

# Logistic Regression

As you can see, one of the assumptions is violated. You might try to solve it with robust SEs, but again, these are not a panacea. A slightly more common approach to dealing with a binary dependent variable is logistic regression. Let's briefly try it out!

```{r}
#| eval: false
model_glm = glm(dtr ~ trials_domestic + tcs_operated, tjet_sub, 
                family = binomial(link = "logit"))

summary(model_glm)
```

Now, visualize how it looks. Usually, we would expect to see a slightly more "S"-shaped curve. But still - good!

```{r}
#| eval: false
ggpredict(model_glm, terms = c("trials_domestic", "tcs_operated")) %>% 
  plot(show_data = T)
```

Compare this to the linear probability model **without** country and year fixed effects.

```{r}
#| eval: false
ggpredict(model, terms = c("trials_domestic", "tcs_operated")) %>% 
  plot(show_data = T)
```

Present both models: the linear probability model without fixed effects and the logistic regression. The problem, though, is that logit coefficients are way harder to interpret! Stick to marginal effects, which we covered in [lab 5](https://artur-baranov.github.io/nu-ps405-ds/ps405-d_5.html)!

```{r}
#| eval: false
library(modelsummary)
modelsummary(list("LPM" = model, "Logit" = model_glm), stars = T)
```

A final note: in most cases you would expect to see logistic regression presented something like this.

```{r}
#| warning: false
glm(am ~ hp + wt, data = mtcars, family = binomial) %>% 
  ggpredict(terms = c("hp")) %>% 
  plot(show_data = T)
```
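If you do want to read the raw logit output, one common option is to exponentiate the coefficients into odds ratios, or to ask `predict()` for probabilities instead of log-odds. Below is a minimal sketch of both, reusing the built-in `mtcars` model from the chunk above; it is an aside, not part of the lab exercises.

```{r}
#| warning: false
# refit the illustrative logit from above and store it
logit_example = glm(am ~ hp + wt, data = mtcars, family = binomial)

# exponentiated coefficients are odds ratios: the multiplicative change in the
# odds of am = 1 for a one-unit increase in each predictor
exp(coef(logit_example))

# predicted probabilities (rather than log-odds) for the first few cars
head(predict(logit_example, type = "response"))
```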
# What's next?

Congrats on finishing the class! No exercises this time.

- [Helpful resource for R](https://rstudio.github.io/cheatsheets/)
- Check the [main page](https://artur-baranov.github.io/nu-ps405-ds/) for additional resources
- Learning to code is like learning a new language: if you're interested, keep practicing!
- I highly recommend taking POLI_SCI 406-0 Quantitative Causal Inference

# What we have covered in coding

| Library | Functions | Description |
|--------------|------------------------------------|------------------------------------------------------|
| tidyverse | `filter()`, `mutate()`, `ggplot()` | data wrangling and visualization |
| modelsummary | `modelsummary()` | present good looking tables |
| broom | `tidy()` | extract additional information from the models |
| ggeffects | `ggpredict()` | calculate and visualize marginal effects |
| GGally | `ggpairs()` | extension to ggplot |
| lmtest | `bptest()` | additional statistical tests for diagnostics |
| estimatr | `lm_robust()` | regression with robust and clustered standard errors |
| car | `vif()` | additional statistical tests for diagnostics |

# Datasets we have used

| Dataset | Description | Link |
|---------------------------------------------------|--------------------------------------------|----------------------------------------------------------|
| **V-Dem** | Measures democracy worldwide | [V-Dem](https://www.v-dem.net/) |
| **World Happiness Report** | Annual happiness report | [World Happiness Report](https://worldhappiness.report/) |
| **Who Governs** | Dataset on political elites | [Who Governs](https://doi.org/10.1017/S0003055420000490) |
| **SIPRI** | Data on military operations | [SIPRI](https://www.sipri.org/databases) |
| **Comparative Political Data Set** | A dataset covering political institutions | [CPDS](https://www.cpds-data.org/) |
| **European Social Survey** | Survey measuring attitudes across Europe | [ESS](https://www.europeansocialsurvey.org/) |
| **Transitional Justice Evaluation Tools Dataset** | Dataset assessing transitional justice | [TJET Dataset](https://transitionaljusticedata.org) |

# Check list

- I know that every lab has a check list below, and I will use it to navigate what we have learned
- R doesn't scare me anymore
- I have developed the intuition behind the application of quantitative methods
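As a parting recap, here is a minimal sketch that strings a few of the functions from the coding table above into one small workflow, using the built-in `mtcars` data. The variable names (such as `automatic`) are made up for the illustration.

```{r}
#| eval: false
library(tidyverse)
library(modelsummary)

# wrangle: keep heavier cars and add an indicator variable
cars_sub = mtcars %>% 
  filter(wt > 2) %>% 
  mutate(automatic = ifelse(am == 0, 1, 0))

# model and present: predict fuel efficiency and format the output as a table
model_recap = lm(mpg ~ hp + automatic, data = cars_sub)
modelsummary(list("Recap" = model_recap), stars = T)
```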