Data Visualization

Week 4

Published

October 10, 2025

Before we start

  • Any questions?
Download script

Download data


Review

First, load the tidyverse library.

library(tidyverse)

Load the World Values Survey data from the previous week’s class.

wvs = ...

Rename the variables:

  • B_COUNTRY_ALPHA to country

  • Q191 to violence

  • Q194 to political_violence

...
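As a reminder of the syntax, rename() takes pairs of the form new_name = old_name. Here is a minimal sketch on made-up data; the data frame toy and its columns OLD_CODE and Q1 are hypothetical and are not the survey variables.

```r
library(tidyverse)

# Toy data with hypothetical column names (not the real survey variables)
toy = data.frame(OLD_CODE = c("AAA", "BBB"),
                 Q1 = c(1, 2))

toy_renamed = toy %>%
  rename(code = OLD_CODE,   # new_name = old_name
         answer = Q1)

names(toy_renamed)
```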

Calculate the average and standard deviation of political_violence by country using group_by() and summarize(). Save that to a new pviolence_mean object.

pviolence_mean = wvs %>%
  ... %>%
  summarize(mean = ...,
            sd = ...)
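If the group_by() and summarize() pattern feels rusty, the sketch below shows it on made-up data. The data frame toy and the column score are hypothetical, and na.rm = TRUE is an assumption that you want to ignore missing values.

```r
library(tidyverse)

# Toy data (hypothetical values, not from the World Values Survey)
toy = data.frame(country = c("AAA", "AAA", "BBB", "BBB", "BBB"),
                 score = c(1, 3, 2, NA, 4))

toy_summary = toy %>%
  group_by(country) %>%                        # one group per country
  summarize(mean = mean(score, na.rm = TRUE),  # average, ignoring missing values
            sd = sd(score, na.rm = TRUE))      # standard deviation, ignoring missing values

toy_summary
```

The result has one row per group and one column per summary statistic.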

Show the first observations of pviolence_mean using head().

head(pviolence_mean)

Leave only North American countries.

north_america = c("CAN", "USA", "MEX")

pviolence_mean_filtered = ...

Load the tinytable library and present a table describing the North American countries using tt(). Round the numbers to two digits. Change the background color of Mexico’s (MEX) standard-deviation cell to yellow. If any text contains underscores or other special symbols (for example, a variable name such as political_violence), don’t forget to use format_tt(escape = TRUE).

library(tinytable)

pviolence_mean_filtered %>%
  tt() %>%
  format_tt(...) %>%
  ...(i = 2,
      ...,
      background = "yellow")

And a small preview of what’s going to come!

ggplot(wvs) +
  geom_point(aes(x = violence,
                 y = political_violence))

Now, clear the environment using the broom icon!

Agenda

  • Summarizing Data

  • Presenting Data with Graphs

  • Exploring Relationships Visually

Exploring Data

Today we are working with the WhoGov dataset, which provides information on members of cabinets. As usual, I recommend taking a look at the codebook.

whogov = read.csv("data/WhoGov.csv")

These are the variables we are going to work with today:

  • country_name: the country name

  • n_individuals: the number of unique persons in the cabinet

  • leaderexperience_continuous: the number of years the person has been leader of the country in total

  • system_category: the regime type

  • average_total: the average tenure of people in the cabinet

Say you are interested in the relationship between the leader’s time in office (leaderexperience_continuous) and the tenure of people in the cabinet (average_total).

One Continuous Variable

Start by exploring the distribution of the average tenure of people in the cabinet (average_total).

ggplot(whogov) +
  geom_histogram(aes(x = average_total)) 
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

You can inform your reader by providing extra information. Add the average to the plot using geom_vline() and change the number of bins to 20. Save the graph to the avgind_plot object.

avgind_plot = ggplot(whogov) +
  geom_histogram(aes(x = average_total), bins = 20) +
  geom_vline(aes(xintercept = mean(average_total, na.rm = TRUE)), color = "#EE4B2B") 

avgind_plot
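An alternative way to draw the reference line is to compute the mean once and pass it to geom_vline() as a constant, outside aes(). This is a sketch on made-up data, not the approach used above; toy, tenure, toy_mean, and toy_plot are hypothetical names.

```r
library(tidyverse)

# Toy data (hypothetical values), including one missing value
toy = data.frame(tenure = c(1, 2, 3, 4, 5, NA))

# Compute the mean once, ignoring the missing value
toy_mean = mean(toy$tenure, na.rm = TRUE)

toy_plot = ggplot(toy) +
  geom_histogram(aes(x = tenure), bins = 5) +
  geom_vline(xintercept = toy_mean, color = "#EE4B2B")  # constant, so no aes() needed
```

Storing the mean in its own object also lets you reuse it later, for example in a subtitle.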

Let’s name the axes and customize the graph. Feel free to explore ggplot themes here.

avgind_plot = avgind_plot +
  labs(x = "Average Tenure",
       y = "Count",
       title = "Distribution of Average Tenure",
       subtitle = "In Cabinets") +
  theme_bw()

avgind_plot

Change the breaks to a more readable format.

avgind_plot +
  scale_x_continuous(breaks = seq(0,                         # start sequence from
                                  max(whogov$average_total,
                                      na.rm = TRUE),         # end sequence with
                                  3))                        # increment
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
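To see what the breaks argument actually receives, it can help to run seq() on its own. A small sketch with a made-up maximum (toy_max is hypothetical):

```r
toy_max = 10.5                    # hypothetical maximum of the variable

# seq(from, to, by): start at 0, stop at or below toy_max, step by 3
toy_breaks = seq(0, toy_max, 3)

toy_breaks
# 0 3 6 9
```

The sequence stops at the last value that does not exceed the maximum, so the largest observation may lie to the right of the last break.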

Two Continuous Variables

Now, continue exploring the relationship we are interested in. Let’s plot the two distributions as densities. Make sure to go through the syntax. And be careful: the X axis is now named incorrectly!

  • Compare the distributions

  • What alternative plot would be more suitable for comparing distributions?

Then add xlim(-5, 20) to limit the X axis.

ggplot(whogov) +
  geom_density(aes(x = average_total), fill = "red", alpha = 0.5) +
  geom_density(aes(x = leaderexperience_continuous), fill = "blue", alpha = 0.5) 

Let’s draw a scatter plot using geom_point(). A scatter plot is more appropriate when we are interested in a relationship rather than in comparing distributions. What’s going on?

ggplot(whogov) +
  geom_point(aes(x = leaderexperience_continuous,
                 y = average_total))
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_point()`).

Let’s zoom in to explore two political regimes. Use system_category to leave only Military dictatorship and Parliamentary democracy.

two_regimes = whogov %>%
  filter(system_category %in% c("Military dictatorship",
                                "Parliamentary democracy"))
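The %in% operator checks, row by row, whether a value appears in a vector of allowed values, and filter() keeps the rows where the check is TRUE. A minimal sketch on made-up data (toy, keep, and toy_filtered are hypothetical names):

```r
library(tidyverse)

# Toy data (hypothetical values)
toy = data.frame(country = c("AAA", "BBB", "CCC", "DDD"),
                 value = c(10, 20, 30, 40))

keep = c("AAA", "CCC")            # the categories we want to keep

toy_filtered = toy %>%
  filter(country %in% keep)       # TRUE for rows whose country is in keep

toy_filtered
```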

Then, color the points.

regime_tenure = ggplot(two_regimes) +
  geom_point(aes(x = leaderexperience_continuous,
                 y = average_total,
                 color = system_category))

regime_tenure
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

The patterns are clearer now! But let’s also split the graph into panels with facet_grid(). Add the scales = "free" argument to facet_grid().

regime_tenure = regime_tenure +
  facet_grid(system_category ~ .)

regime_tenure
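To illustrate what scales = "free" does in facet_grid(), here is a sketch on made-up data in which the two groups live on very different scales. The names toy, group, and toy_facets are hypothetical.

```r
library(tidyverse)

# Toy data (hypothetical values): group B is 10 to 100 times larger than group A
toy = data.frame(x = c(1, 2, 3, 10, 20, 30),
                 y = c(1, 2, 3, 100, 200, 300),
                 group = rep(c("A", "B"), each = 3))

toy_facets = ggplot(toy) +
  geom_point(aes(x = x, y = y, color = group)) +
  facet_grid(group ~ ., scales = "free")   # one row of panels per group

toy_facets
```

With panels stacked in rows, each row gets its own Y axis, while the X axis is still shared because all panels sit in a single column.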

Add a title, name the axes, and apply theme_minimal().

regime_tenure +
  labs(y = "Average Cabinet Tenure",
       x = "Leader's Tenure",
       title = "Leader's Tenure and Cabinet Tenure",
       color = "Political Regime") +
  theme_minimal()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

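Finally, let’s look at the correlation between the number of people in the cabinet (n_individuals) and the leader’s tenure (leaderexperience_continuous). Is it a strong one?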

cor(two_regimes$n_individuals, two_regimes$leaderexperience_continuous)
[1] 0.2308111
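Keep in mind that cor() returns NA as soon as either input contains a missing value, which may matter if you repeat the calculation with a variable that has missing observations. A minimal sketch on made-up vectors (x and y are hypothetical) of how the use argument handles this:

```r
# Toy vectors (hypothetical values); x has one missing value
x = c(1, 2, 3, 4, NA)
y = c(2, 4, 6, 8, 10)

cor_default = cor(x, y)                           # NA: missing values are not dropped
cor_complete = cor(x, y, use = "complete.obs")    # uses only rows where both are observed

cor_default
cor_complete
```

Because the complete pairs in this toy example lie exactly on a line, the second correlation is 1 (up to floating-point precision).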

For your reference, take a look at the cheatsheet.

Check list

Extra

Plotting a categorical variable

You can use a bar chart to show the number of observations in each category: count the observations first, and then plot the counts. Let’s try this out with political regimes (system_category). First, count the number of observations in each group. The result is a data frame.

regime_count = whogov %>%
  group_by(system_category) %>%
  count()

regime_count
# A tibble: 9 × 2
# Groups:   system_category [9]
  system_category               n
  <chr>                     <int>
1 Civilian dictatorship      2630
2 Crown Colony                  2
3 French Overseas Territory    10
4 Military dictatorship      1680
5 Mixed democratic           1125
6 Parliamentary democracy    1647
7 Part of Yugoslavia            9
8 Presidential democracy     1406
9 Royal dictatorship          644
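As a side note, count() can also take the grouping variable directly, which saves the group_by() step. A sketch on made-up data (toy, regime, and toy_count are hypothetical names):

```r
library(tidyverse)

# Toy data (hypothetical values)
toy = data.frame(regime = c("A", "B", "A", "C", "A", "B"))

toy_count = toy %>%
  count(regime)      # one row per category, with the count in column n

toy_count
```

Unlike group_by() followed by count(), this shortcut returns an ungrouped data frame.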

Now, plot it using geom_col().

regime_count %>%
  ggplot(aes(x = system_category,
             y = n)) +
  geom_col() 

Change the order using reorder(), so that the bars are sorted by count.

regime_count %>%
  ggplot(aes(x = reorder(system_category, n),
             y = n)) +
  geom_col() 
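Behind the scenes, reorder() turns the categories into a factor whose levels are sorted by the second argument, and ggplot draws the bars in the order of those levels. A small sketch on made-up vectors (regime and n here are hypothetical toy values):

```r
# Toy vectors (hypothetical values)
regime = c("A", "B", "C")
n = c(30, 10, 20)

ascending = levels(reorder(regime, n))     # sorted from smallest to largest count
descending = levels(reorder(regime, -n))   # a minus sign reverses the order

ascending
# "B" "C" "A"
descending
# "A" "C" "B"
```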

Now, customize the graph. Change the color of the bars with the fill argument in geom_col(), and flip the axes with coord_flip() to make the regime names easier to read. Name the axes with labs(), and change the theme to theme_dark().

regime_count %>%
  ggplot(aes(x = reorder(system_category, n),
             y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(x = "Political Regime",
       y = "Count") +
  theme_dark()

Plotting overlapping plots

Imagine you are interested in comparing two distributions: average_total (the average tenure of the cabinet) and leaderexperience_continuous (the tenure of the leader). We plotted them with geom_density() at the beginning.

ggplot(whogov) +
  geom_density(aes(x = average_total), fill = "blue", alpha = 0.5) +
  geom_density(aes(x = leaderexperience_continuous), fill = "red", alpha = 0.5) +
  xlim(-5,20)

If you want to compare two groups by stacking one distribution above the other, you can do so with ggridges. But you also need to restructure the data into a long format.

library(ggridges)

whogov_long = whogov %>%
  select(average_total,
         leaderexperience_continuous) %>% # choose the variables
  pivot_longer(1:2)                       # make the data of a longer format

head(whogov_long)
# A tibble: 6 × 2
  name                        value
  <chr>                       <dbl>
1 average_total                1   
2 leaderexperience_continuous  1   
3 average_total                1.88
4 leaderexperience_continuous  2   
5 average_total                1.82
6 leaderexperience_continuous  1   
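To make the reshaping step easier to follow, here is a sketch of pivot_longer() on a tiny made-up data frame. The names toy_wide, toy_long, a, b, and variable are hypothetical, and the names_to and values_to arguments are used here only to label the new columns explicitly.

```r
library(tidyverse)

# Toy wide data (hypothetical values): one column per variable
toy_wide = data.frame(a = c(1, 2),
                      b = c(3, 4))

toy_long = toy_wide %>%
  pivot_longer(cols = everything(),     # stack all columns
               names_to = "variable",   # old column names go here
               values_to = "value")     # old cell values go here

toy_long
```

Each row of the wide data becomes one row per stacked column, so two rows and two columns turn into four rows.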

Finally, plot the restructured data.

ggplot(whogov_long) +
  geom_density_ridges2(aes(x = value,
                           y = name)) +
  xlim(-5, 20)
Picking joint bandwidth of 0.48
Warning: Removed 609 rows containing non-finite outside the scale range
(`stat_density_ridges()`).

Optional Exercises

Explore the n_individuals variable. Draw a histogram.

...

Explore a new function. Use facet_wrap() with system_category. Compare it with facet_grid(). What’s the difference?

Solution
...

Set scales = "free" in facet_wrap().

Solution
...

Now, explore the relationship between leaderexperience_continuous (X) and n_individuals (Y). Draw a scatter plot with geom_point().

Solution
...

Add a facet_wrap() layer to the previous graph, by system_category.

Solution
...

Customize the graph. Name the axes, make the scales free, and color the points based on system_category.

Solution
...