Data Structures and Classes

Week 3

Published

April 17, 2025

Agenda

  • Delving into data structures

  • Discussing data classes

Data Structures

Vectors

Let’s say we asked a group of people their ages in an Evanston coffee shop. We obtained the following data and assigned it to an object called age. The collection of obects called vector.

age = c(18, 20, 21, 19, 24, 21, 20, 22)

You can access the data using indexing. Let’s say you entered the data into your object age as you were receiving the results. You can access the age of your second respondent by indicating [i] to an object.

age[2]
[1] 20

we can calculate the average age in our surveyed group.

mean(age)
[1] 20.625

Alternatively, we can describe our data with minimum and maximum values.

min(age)
max(age)
[1] 18
[1] 24

We also asked people in the coffee shop about their majors and received the following data:

major = c("computer science", "sociology", "sociology", "political science", "political science", "political science", "computer science", "sociology")
Coding Task

Insert a chunk and access major of the 3rd respondent.

Dataframes

A dataframe is one of the most commonly used data structures in data analysis. It is a simple table, similar to those you have probably seen in Excel. Let’s create one. We have two vectors, age and major. We can combine them into one table.

respondents = data.frame(age, major)
respondents
  age             major
1  18  computer science
2  20         sociology
3  21         sociology
4  19 political science
5  24 political science
6  21 political science
7  20  computer science
8  22         sociology

Columns are vectors. In a table format they are referred to as variables (and thus these labels are used interchangeably). Rows are called observations. There are some useful commands that provide information about your dataframe.

  • Names of your variables
names(respondents)
[1] "age"   "major"
  • Number of rows in your dataframe
nrow(respondents)
[1] 8
  • Number of columns in your dataframe
ncol(respondents)
[1] 2
  • Number of dimensions (number of rows and columns)
dim(respondents)
[1] 8 2

To access a variable as vector you can use $ sign.

respondents$age
[1] 18 20 21 19 24 21 20 22

This would allow you to manipulate this variable. For example, let’s visualize this data!

hist(respondents$age)

You can easily combine previously used functions. For example, indexing provides access to any observation.

respondents$major[8]
[1] "sociology"
Coding Task

Load the data below and explore it. This is the embedded to R dataset, so you don’t need to load it externally.

cars_information = mtcars

How many cars are listed in the dataframe (how many rows are there)?

Describe mpg variable. Include average, minimum and maximum.

What is the minimum horsepower (hp)?

Are there any cars that have horse power greater than 200 but less than 250?

Create a new variable called power_to_weight, which calculates the ratio of horsepower (hp) to weight (wt). You need to divide horsepower by weight. Use the snippet below.

cars_information$power_to_weight = ... / ...

What is an average Power to Weight Ratio?

What is a minimum Power to Weight Ratio?

Create a histogram of power_to_weight variable. Hint: you can use previous discussion section scripts to get the code.

Data classes

As you have noticed, we deal with different classes of data. Sometimes these are words (e.g., names of cars or majors) and numbers (e.g., age or horsepower). The analysis we perform is highly dependent on data classes. But before discussing it in a detail, we need to install one library that would help us to grasp this difference.

library(DiagrammeR)
mermaid("
graph LR
    D[Data] --> C[Categorical]
    D --> N[Numerical]
    C --> no[Nominal]
    C --> Or[Ordinal]
    N --> di[Discrete]
    N --> co[Continuous]
    no --> c[Character]
    Or --> f[Factor]
    di --> i[Integer]
    co --> n[Numeric]
")

These are the basic classes of data in R. Some examples might include:

  • Nominal: Names, Labels, Brands, Country names, etc.

  • Ordinal: Educational Levels (High School-BA-MA-PhD), Customer Rating (Unsatisfied-Neutral-Satisfied), etc.

  • Discrete: Number of customers per day, number of seats won by political parties, etc.

  • Continuous: Height of people, voter turnout, etc.

For each object, vector, or variable, you can check its class. For example,

class(cars_information$mpg)
class(respondents$major)
[1] "numeric"
[1] "character"

Alternatively, you can check if this variable is of specific class

is.integer(cars_information$mpg)
is.character(cars_information)
[1] FALSE
[1] FALSE
Coding Task

Take a look on the cyl variable in cars_information dataset. What is its class?

...

Do you think R classified it properly? If a variable is identified incorrectly, you can change it.

For example, you can change it to a factor.

cars_information$cyl = as.factor(cars_information$cyl)

Importantly, the incorrect data class can hinder calculation of our models and visualization of our plots. So be careful! We will discuss it in the future sections.

Check List

I know the base data structures in R: vectors and dataframes do not confuse me

I have a basic understanding of how to differentiate between nominal and ordinal variables, as well as between discrete and continuous variables

I can easily change the class of the variable using as.factor(), as.numeric(), and so on.