age = c(18, 20, 21, 19, 24, 21, 20, 22)
Data Structures and Classes
Week 3
Agenda
Delving into data structures
Discussing data classes
Data Structures
Vectors
Let’s say we asked a group of people their ages in an Evanston coffee shop. We recorded the responses and assigned them to an object called age. A collection of values like this is called a vector.
You can access the data using indexing. Suppose you entered the data into your object age in the order you received the responses. You can then access the age of your second respondent by appending its position in square brackets, [i], to the object name.
age[2]
[1] 20
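Indexing in R starts at 1, and the square brackets accept more than a single position. The sketch below re-creates the age vector so that it runs on its own; the names first_three, without_first, and over_20 are illustrative and not part of the original notes.

```r
# Re-create the survey data so the example runs on its own
age = c(18, 20, 21, 19, 24, 21, 20, 22)

# Several positions at once: the first three respondents
first_three = age[1:3]

# A negative index drops that position instead of selecting it
without_first = age[-1]

# A logical condition keeps only the elements for which it is TRUE
over_20 = age[age > 20]

print(first_three)    # 18 20 21
print(without_first)  # 20 21 19 24 21 20 22
print(over_20)        # 21 24 21 22
```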
We can also calculate the average age in our surveyed group.
mean(age)
[1] 20.625
Alternatively, we can describe our data with minimum and maximum values.
min(age)
max(age)
[1] 18
[1] 24
We also asked people in the coffee shop about their majors and received the following data:
major = c("computer science", "sociology", "sociology", "political science", "political science", "political science", "computer science", "sociology")
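Functions such as mean() do not apply to a vector of words, but we can still summarize it by counting. A minimal sketch, assuming the major vector above; major_counts is an illustrative name.

```r
# Re-create the survey data so the example runs on its own
major = c("computer science", "sociology", "sociology", "political science",
          "political science", "political science", "computer science", "sociology")

# Count how many respondents reported each major
major_counts = table(major)
print(major_counts)

# List each distinct major once
print(unique(major))

# Note: mean(major) would return NA with a warning,
# because the values are text rather than numbers
```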
Dataframes
A dataframe is one of the most commonly used data structures in data analysis. It is a simple table, similar to those you have probably seen in Excel. Let’s create one. We have two vectors, age and major. We can combine them into one table.
respondents = data.frame(age, major)
respondents
  age             major
1  18  computer science
2  20         sociology
3  21         sociology
4  19 political science
5  24 political science
6  21 political science
7  20  computer science
8  22         sociology
Columns are vectors. In a table format they are referred to as variables (and thus these labels are used interchangeably). Rows are called observations. There are some useful commands that provide information about your dataframe.
- Names of your variables
names(respondents)
[1] "age" "major"
- Number of rows in your dataframe
nrow(respondents)
[1] 8
- Number of columns in your dataframe
ncol(respondents)
[1] 2
- Dimensions of your dataframe (number of rows, then number of columns)
dim(respondents)
[1] 8 2
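Two further base R functions, str() and head(), are often used alongside the commands listed above. The sketch below rebuilds the respondents dataframe so that it runs on its own; first_rows is an illustrative name.

```r
# Rebuild the dataframe so the example runs on its own
age = c(18, 20, 21, 19, 24, 21, 20, 22)
major = c("computer science", "sociology", "sociology", "political science",
          "political science", "political science", "computer science", "sociology")
respondents = data.frame(age, major)

# Compact overview: number of observations, variable names, and classes
str(respondents)

# Only the first three rows, handy when a table is too long to print in full
first_rows = head(respondents, 3)
print(first_rows)
```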
To access a variable as a vector, you can use the $ sign.
respondents$age
[1] 18 20 21 19 24 21 20 22
This would allow you to manipulate this variable. For example, let’s visualize this data!
hist(respondents$age)
You can easily combine the tools introduced so far. For example, adding an index after the variable name gives you access to any single observation.
respondents$major[8]
[1] "sociology"
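A dataframe can also be indexed directly with square brackets in the form [row, column], which makes it possible to pull out whole rows or only the rows that meet a condition. A minimal sketch, assuming the respondents dataframe built above; third_row and sociologists are illustrative names.

```r
# Rebuild the dataframe so the example runs on its own
age = c(18, 20, 21, 19, 24, 21, 20, 22)
major = c("computer science", "sociology", "sociology", "political science",
          "political science", "political science", "computer science", "sociology")
respondents = data.frame(age, major)

# Row 8, column "major": the same value as respondents$major[8]
print(respondents[8, "major"])

# Leaving the column blank returns the entire row
third_row = respondents[3, ]
print(third_row)

# A logical condition in the row position keeps only the matching observations
sociologists = respondents[respondents$major == "sociology", ]
print(sociologists)

# Average age among sociology majors
print(mean(sociologists$age))  # 21
```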
Data classes
As you have noticed, we deal with different classes of data. Sometimes these are words (e.g., names of cars or majors), and sometimes they are numbers (e.g., age or horsepower). The analysis we perform is highly dependent on data classes. But before discussing them in detail, we need to load a library that will help us visualize the differences.
library(DiagrammeR)
mermaid("
graph LR
D[Data] --> C[Categorical]
D --> N[Numerical]
C --> no[Nominal]
C --> Or[Ordinal]
N --> di[Discrete]
N --> co[Continuous]
no --> c[Character]
Or --> f[Factor]
di --> i[Integer]
co --> n[Numeric]
")
These are the basic classes of data in R. Some examples might include:
Nominal: Names, Labels, Brands, Country names, etc.
Ordinal: Educational Levels (High School-BA-MA-PhD), Customer Rating (Unsatisfied-Neutral-Satisfied), etc.
Discrete: Number of customers per day, number of seats won by political parties, etc.
Continuous: Height of people, voter turnout, etc.
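To make the mapping concrete, here is a small sketch with one vector per category. All values and object names (country, education, customers_per_day, height_cm) are made up for illustration.

```r
# Nominal: labels with no inherent order, stored as character
country = c("France", "Japan", "Brazil")

# Ordinal: categories with a meaningful order, stored as an ordered factor
education = factor(c("BA", "High School", "PhD"),
                   levels = c("High School", "BA", "MA", "PhD"),
                   ordered = TRUE)

# Discrete: whole-number counts, stored as integer (note the L suffix)
customers_per_day = c(120L, 95L, 143L)

# Continuous: measurements that can take any value, stored as numeric
height_cm = c(172.5, 160.2, 181.0)

print(class(country))            # "character"
print(class(education))          # "ordered" "factor"
print(class(customers_per_day))  # "integer"
print(class(height_cm))          # "numeric"

# Because the factor is ordered, its values can be compared
print(education[1] < education[3])  # TRUE: BA comes before PhD
```

Without the L suffix, R stores whole numbers such as 120 as numeric, which is why a count variable often shows up as "numeric" rather than "integer".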
For each object, vector, or variable, you can check its class. For example,
class(cars_information$mpg)
class(respondents$major)
[1] "numeric"
[1] "character"
Alternatively, you can check whether a variable is of a specific class.
is.integer(cars_information$mpg)
is.character(cars_information)
[1] FALSE
[1] FALSE
Do you think R classified it properly? If a variable is identified incorrectly, you can change it.
For example, you can change it to a factor.
cars_information$cyl = as.factor(cars_information$cyl)
Importantly, an incorrect data class can hinder the estimation of our models and the drawing of our plots, so be careful! We will discuss this in future sections.
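One common pitfall when converting classes is turning a factor back into numbers. The sketch below uses a short, made-up vector cyl as a stand-in for cars_information$cyl, since that dataframe is not defined in this section.

```r
# Hypothetical cylinder counts, standing in for cars_information$cyl
cyl = c(4, 6, 8, 6, 4)

# Convert to a factor: each distinct value becomes a level
cyl_factor = as.factor(cyl)
print(class(cyl_factor))   # "factor"
print(levels(cyl_factor))  # "4" "6" "8"

# as.numeric() on a factor returns the internal level codes, not the original values
print(as.numeric(cyl_factor))  # 1 2 3 2 1

# To recover the original numbers, convert to character first
print(as.numeric(as.character(cyl_factor)))  # 4 6 8 6 4
```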
Check List
I know the base data structures in R: vectors and dataframes do not confuse me
I have a basic understanding of how to differentiate between nominal and ordinal variables, as well as between discrete and continuous variables
I can easily change the class of a variable using as.factor(), as.numeric(), and so on.