Data Structures and Classes
Week 3
Agenda
- Delving into data structures
- Discussing data classes
Data Structures
Vectors
Let’s say we asked a group of people their ages in an Evanston coffee shop. We obtained the following data and assigned it to an object called age. A collection of values like this is called a vector.
age = c(18, 20, 21, 19, 24, 21, 20, 22)
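As a minimal sketch of what makes age a vector, the lines below rebuild it with c() and inspect it. The checks with is.vector() and length() are additions for illustration and are not part of the original walkthrough.

```r
# Rebuild the survey data so the example runs on its own
age = c(18, 20, 21, 19, 24, 21, 20, 22)

# c() combines individual values into a single vector
is.vector(age)   # TRUE

# One element per respondent
length(age)      # 8
```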
You can access the data using indexing. Suppose you entered the data into age in the order you received the responses. You can then access the age of your second respondent by appending [i] to the object name, where i is the position of the element.
age[2]
[1] 20
We can also calculate the average age in our surveyed group.
mean(age)
[1] 20.625
Alternatively, we can describe our data with minimum and maximum values.
min(age)
[1] 18
max(age)
[1] 24
We also asked people in the coffee shop about their majors and received the following data:
major = c("computer science", "sociology", "sociology", "political science", "political science", "political science", "computer science", "sociology")
Dataframes
A dataframe is one of the most commonly used data structures in data analysis. It is a simple table, similar to those you have probably seen in Excel. Let’s create one. We have two vectors, age and major. We can combine them into one table.
respondents = data.frame(age, major)
respondents
  age             major
1  18  computer science
2  20         sociology
3  21         sociology
4  19 political science
5  24 political science
6  21 political science
7  20  computer science
8  22         sociology
Columns are vectors. In a table they are referred to as variables, so the two terms are used interchangeably. Rows are called observations. There are some useful commands that provide information about your dataframe.
- Names of your variables
names(respondents)
[1] "age"   "major"
- Number of rows in your dataframe
nrow(respondents)
[1] 8
- Number of columns in your dataframe
ncol(respondents)
[1] 2
- Number of dimensions (number of rows and columns)
dim(respondents)
[1] 8 2
To access a variable as a vector, use the $ sign.
respondents$age
[1] 18 20 21 19 24 21 20 22
This allows you to manipulate the variable. For example, let’s visualize this data!
hist(respondents$age)
You can easily combine previously used functions. For example, indexing provides access to any observation.
respondents$major[8]
[1] "sociology"
Data classes
As you have noticed, we deal with different classes of data. Sometimes these are words (e.g., names of cars or majors) and sometimes numbers (e.g., age or horsepower). The analysis we perform depends heavily on the data class. But before discussing this in detail, we need to load one library (install it first if you have not already) that will help us grasp this difference.
library(DiagrammeR)
mermaid("
graph LR
    D[Data] --> C[Categorical]
    D --> N[Numerical]
    C --> no[Nominal]
    C --> Or[Ordinal]
    N --> di[Discrete]
    N --> co[Continuous]
    no --> c[Character]
    Or --> f[Factor]
    di --> i[Integer]
    co --> n[Numeric]
")
These are the basic classes of data in R. Some examples might include:
- Nominal: Names, Labels, Brands, Country names, etc. 
- Ordinal: Educational Levels (High School-BA-MA-PhD), Customer Rating (Unsatisfied-Neutral-Satisfied), etc. 
- Discrete: Number of customers per day, number of seats won by political parties, etc. 
- Continuous: Height of people, voter turnout, etc. 
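The mapping from these four kinds of data to R classes can be sketched as follows. All values are hypothetical examples, not survey data, and using an ordered factor for the ordinal case is one reasonable choice rather than the only one.

```r
# Hypothetical example values, chosen only to illustrate each class

# Nominal: labels with no inherent order, stored as character
brand = c("Ford", "Toyota", "Ford")

# Ordinal: categories with an order, stored as an (ordered) factor
rating = factor(c("Neutral", "Satisfied", "Unsatisfied"),
                levels = c("Unsatisfied", "Neutral", "Satisfied"),
                ordered = TRUE)

# Discrete: whole-number counts, stored as integer (note the L suffix)
customers = c(12L, 30L, 25L)

# Continuous: measurements that can take any value, stored as numeric
height = c(1.62, 1.80, 1.75)

class(brand)      # "character"
class(rating)     # "ordered" "factor"
class(customers)  # "integer"
class(height)     # "numeric"

# Because rating is ordered, comparisons between levels are meaningful
rating[1] < rating[2]   # TRUE: Neutral is below Satisfied
```

Without the L suffix, R stores whole numbers such as 12 as numeric, which is why age in the survey data is numeric rather than integer.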
For each object, vector, or variable, you can check its class. For example,
class(cars_information$mpg)
[1] "numeric"
class(respondents$major)
[1] "character"
Alternatively, you can check whether a variable is of a specific class.
is.integer(cars_information$mpg)
[1] FALSE
is.character(cars_information)
[1] FALSE
Do you think R classified it properly? If a variable is identified incorrectly, you can change it.
For example, you can change it to a factor.
cars_information$cyl = as.factor(cars_information$cyl)
Importantly, an incorrect data class can hinder the calculation of our models and the visualization of our plots. So be careful! We will discuss this in future sections.
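The same kind of conversion can be sketched with the major vector from the coffee shop survey. The age_text vector is a hypothetical addition, included only to show what happens when numbers are stored as text.

```r
# Majors from the coffee shop survey, stored as character
major = c("computer science", "sociology", "sociology", "political science",
          "political science", "political science", "computer science", "sociology")

# Convert to a factor so R treats the values as categories
major_factor = as.factor(major)
class(major_factor)    # "factor"
levels(major_factor)   # the three distinct majors
table(major_factor)    # number of respondents per major

# Hypothetical: ages that were read in as text instead of numbers
age_text = c("18", "20", "21")
class(age_text)        # "character"

# mean(age_text) would return NA with a warning, because the class is wrong.
# Converting to numeric first makes the calculation possible.
age_fixed = as.numeric(age_text)
mean(age_fixed)
```

The data did not change in either case, only the class did, and that alone decides whether functions such as table() or mean() give a useful result.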
Check List
- I know the base data structures in R: vectors and dataframes do not confuse me
- I have a basic understanding of how to differentiate between nominal and ordinal variables, as well as between discrete and continuous variables
- I can easily change the class of a variable using as.factor(), as.numeric(), and so on.