2 Introduction to R

https://learn.datacamp.com/courses/free-introduction-to-r


2.1 Intro to basics

How it Works:

We can use ‘#’ to add comments in R. Comments are notes or reminders for the reader; R ignores them when the program runs.

Example:

#This code chunk will run but not return anything since this is just a text/comment.

Arithmetic with R:-

#Addition (will return 9 when we run this chunk)
4 + 5
## [1] 9
#Subtraction (will return 5 when we run this chunk)
9 - 4
## [1] 5
#Multiplication (will return 25 when we run this chunk)
5 * 5
## [1] 25
#Division (will return 5 when we run this chunk)
10 / 2
## [1] 5
#Exponentiation (will return 8 when we run this chunk)
2^3
## [1] 8
#Modulo (will return the remainder 1 when 5 is divided by 2)
5 %% 2 
## [1] 1
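
When several operators appear in one expression, R follows the usual order of operations. The short sketch below illustrates this, along with integer division (%/%), which pairs with the modulo operator; the variable names are illustrative only.

```r
# Multiplication is evaluated before addition
precedence_default <- 2 + 3 * 4      # 14

# Parentheses change the order of evaluation
precedence_grouped <- (2 + 3) * 4    # 20

# Exponentiation is evaluated before multiplication
power_first <- 2 * 3^2               # 18

# Integer division and modulo split a division into quotient and remainder
integer_quotient <- 17 %/% 5         # 3
remainder <- 17 %% 5                 # 2
```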

Variable Assignment:-

We can store a value in a variable with the assignment operator, for example var_name <- 44. We can then do arithmetic with the variable names instead of typing the values themselves. This is most helpful when the value is a very long string or a large number.

Example:

var_name <- "43567565343 is a huge number and is annoying to type everytime"
var_name
## [1] "43567565343 is a huge number and is annoying to type everytime"
num_car <- 67 
num_doughnut <- 34

Potato <- num_car - num_doughnut 
Potato
## [1] 33

Basic Data Types in R:

Decimal values like 4.5 are called numerics. Whole numbers like 4 are called integers. Integers are also numerics. Note that R stores a number typed as 4 as a numeric unless it is written with the L suffix (4L). Boolean values (TRUE or FALSE) are called logical. Text (or string) values are called characters.

We can check what kind of data we are working with by using the class() function. This is important because many operations only work on compatible data types; for example, adding a number to a character value produces an error.

a <- 4.2
b <- "The Elephant"
c <- TRUE
d <- 88

class(a)
## [1] "numeric"
class(b)
## [1] "character"
class(c)
## [1] "logical"
class(d)
## [1] "numeric"
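
As a small sketch of how R distinguishes these types (the variable names are illustrative and not part of the course), the code below compares a plain whole number with one written using the L suffix, and shows what happens when types are mixed in arithmetic.

```r
# A bare whole number is stored as a double, which class() reports as "numeric"
whole_number <- 4
class(whole_number)            # "numeric"

# The L suffix asks R for a true integer
whole_integer <- 4L
class(whole_integer)           # "integer"

# Integers still count as numerics
is.numeric(whole_integer)      # TRUE

# Arithmetic on a character value fails; tryCatch() captures the error
mixed_result <- tryCatch("4" + 1, error = function(e) "error")
mixed_result                   # "error"
```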

2.2 Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.

Example:

music_vector <- c("Country", "Blues", "Indie" )
music_vector
## [1] "Country" "Blues"   "Indie"

We should choose vector names carefully so that the name gives an idea of what the vector contains. We can also label the elements of a vector by assigning a vector of names to it with the names() function.

# Total expenditure for week 1
week1_exp <- c(140, 30, 20, 220, 40)

# Total expenditure for week 2
week2_exp <- c(247, 750, 108, 308, 810)

# The variable days_vector
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
 
# Assign the names of the day to week1_exp and week2_exp
names(week1_exp) <- days_vector
names(week2_exp) <- days_vector

week1_exp
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       140        30        20       220        40
week2_exp
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       247       750       108       308       810

We can then add the two vectors, element by element, to find the total expenditure for each day over the 2 weeks.

total_dailyexp <- week1_exp + week2_exp
total_dailyexp
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       387       780       128       528       850

We can use the sum() function to calculate the sum of all elements of a vector.

total_week1 <- sum(week1_exp)
total_week2 <- sum(week2_exp)

total_exp <- total_week1 + total_week2
total_exp
## [1] 2673

We can compare two values using > or <, which returns either TRUE or FALSE.

Example:

total_week1 < total_week2
## [1] TRUE

We can use square brackets [] to select individual elements from a vector by their position.

Example:

total_thur <- week1_exp[4] + week2_exp[4]
total_thur
## Thursday 
##      528

We can select multiple elements from a vector and make a new vector.

Example:

week1_midweek <- week1_exp[3:5]
week1_midweek
## Wednesday  Thursday    Friday 
##        20       220        40
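
Elements of a named vector can also be selected by name instead of by position. The sketch below uses a hypothetical vector, week_exp, built with the same values as week 1 so that it runs on its own.

```r
# Hypothetical expenditure vector with named elements
week_exp <- c(Monday = 140, Tuesday = 30, Wednesday = 20,
              Thursday = 220, Friday = 40)

# Select a single element by its name
thursday_exp <- week_exp["Thursday"]

# Select several elements by passing a character vector of names
ends_of_week <- week_exp[c("Monday", "Friday")]

# Average of the selected elements: (140 + 40) / 2
mean_ends <- mean(ends_of_week)   # 90
```

Selecting by name keeps working even if the order of the elements changes, which positional indexing does not guarantee.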

We can check whether each element of a vector fulfils a certain condition.

For example:

week1exp_100plus <- week1_exp >= 100
week1exp_100plus
##    Monday   Tuesday Wednesday  Thursday    Friday 
##      TRUE     FALSE     FALSE      TRUE     FALSE

We can get the desired elements from a vector by placing the logical vector between the square brackets that follow the main vector. Only the elements where the logical vector is TRUE are returned.

Example:

week1_exp[week1exp_100plus]
##   Monday Thursday 
##      140      220
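
The comparison and the selection can also be combined in one step. This is a minimal sketch with a hypothetical vector, week_exp, that repeats the week 1 values so the block is self-contained.

```r
# Hypothetical expenditure vector with named elements
week_exp <- c(Monday = 140, Tuesday = 30, Wednesday = 20,
              Thursday = 220, Friday = 40)

# The condition inside the brackets produces the logical vector on the fly
big_days <- week_exp[week_exp >= 100]
names(big_days)                      # "Monday" "Thursday"

# TRUE counts as 1 in arithmetic, so sum() counts how many days qualify
n_big_days <- sum(week_exp >= 100)   # 2
```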

2.3 Matrices

A matrix is created with the matrix() function, for example: matrix(1:9, byrow = TRUE, nrow = 3)

  1. The first argument, 1:9, is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9).

  2. The argument byrow = TRUE indicates that the matrix is filled row by row. If we want the matrix to be filled column by column, we use byrow = FALSE.

  3. The third argument nrow indicates that the matrix should have three rows.

Example:

matrix(1:25, byrow = TRUE, nrow = 5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25
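
To see the effect of the byrow argument, the sketch below builds the same six numbers into a matrix both ways. The names by_row and by_col are illustrative.

```r
# Filled row by row: the first row is 1 2 3
by_row <- matrix(1:6, byrow = TRUE, nrow = 2)

# Filled column by column: the first column is 1 2
by_col <- matrix(1:6, byrow = FALSE, nrow = 2)

by_row[1, ]   # 1 2 3
by_col[1, ]   # 1 3 5

# Both matrices have 2 rows and 3 columns
dim(by_col)   # 2 3
```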

The following code makes a matrix with row names and column names:

jan_exp <- c(700, 400)
feb_exp <- c(800, 500)
mar_exp <- c(400, 600)

monthly_exp_matrix <- matrix(c(jan_exp, feb_exp, mar_exp), nrow = 3, byrow = TRUE)

time_interval <- c("1st Half", "2nd Half")
months <- c("January", "February", "March")

colnames(monthly_exp_matrix) <- time_interval

rownames(monthly_exp_matrix) <- months

monthly_exp_matrix
##          1st Half 2nd Half
## January       700      400
## February      800      500
## March         400      600

We can use the rowSums() and colSums() functions to calculate the row-wise and column-wise sums of a matrix.

rowSums(monthly_exp_matrix)
##  January February    March 
##     1100     1300     1000
colSums(monthly_exp_matrix)
## 1st Half 2nd Half 
##     1900     1500

We can use rbind() and cbind() to add new rows and columns, respectively, to a matrix.

April <- c(550, 300)
new_matrix <- rbind(monthly_exp_matrix, April)

Extra <- c(31,45,32,12)
updated_matrix <- cbind(new_matrix, Extra)
updated_matrix
##          1st Half 2nd Half Extra
## January       700      400    31
## February      800      500    45
## March         400      600    32
## April         550      300    12

Similar to vectors, we can use [] to select one or more elements from a matrix. The format is my_matrix[row, column].

extra_march <- updated_matrix[3,3]
extra_march
## [1] 32
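
Leaving the row or the column position empty selects a whole column or a whole row. The following sketch assumes a small hypothetical matrix, exp_matrix, with the same January to March values as above.

```r
# Hypothetical expenditure matrix with row and column names
exp_matrix <- matrix(c(700, 400, 800, 500, 400, 600), nrow = 3, byrow = TRUE,
                     dimnames = list(c("January", "February", "March"),
                                     c("1st Half", "2nd Half")))

# Whole second row (February): leave the column position empty
february_row <- exp_matrix[2, ]          # 800 500

# Whole second column (2nd Half): leave the row position empty
second_half <- exp_matrix[, 2]           # 400 500 600

# Several rows at once give a smaller matrix
first_two_months <- exp_matrix[1:2, ]

# Row and column names work as well as positions
march_first_half <- exp_matrix["March", "1st Half"]   # 400
```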

Standard arithmetic operators like +, -, /, *, etc. work element-wise on matrices in R. We can even use these operators on two separate matrices of the same dimensions.

Example:

mean_spend_perday <- updated_matrix/30
mean_spend_perday_withtax <- updated_matrix * 1.10
mean_tax <- 0.10 * mean_spend_perday
mean_exp_withtax_perday <- mean_tax + mean_spend_perday

mean_spend_perday
##          1st Half 2nd Half    Extra
## January  23.33333 13.33333 1.033333
## February 26.66667 16.66667 1.500000
## March    13.33333 20.00000 1.066667
## April    18.33333 10.00000 0.400000
mean_spend_perday_withtax
##          1st Half 2nd Half Extra
## January       770      440  34.1
## February      880      550  49.5
## March         440      660  35.2
## April         605      330  13.2
mean_exp_withtax_perday
##          1st Half 2nd Half    Extra
## January  25.66667 14.66667 1.136667
## February 29.33333 18.33333 1.650000
## March    14.66667 22.00000 1.173333
## April    20.16667 11.00000 0.440000
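
As a sketch of arithmetic on two separate matrices (the matrices below are made up for illustration), note that * multiplies matching elements, while standard matrix multiplication uses the %*% operator.

```r
# Two hypothetical 2 x 2 matrices, filled column by column
first_matrix <- matrix(1:4, nrow = 2)                 # columns: (1, 2) and (3, 4)
second_matrix <- matrix(c(10, 20, 30, 40), nrow = 2)  # columns: (10, 20) and (30, 40)

# Element-wise addition and multiplication
elementwise_sum <- first_matrix + second_matrix       # 11 22 33 44
elementwise_product <- first_matrix * second_matrix   # 10 40 90 160

# Matrix multiplication in the linear algebra sense
matrix_product <- first_matrix %*% second_matrix      # 70 100 150 220
```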

2.4 Factors

There are two types of variables: quantitative (or continuous) and categorical. Quantitative variables consist only of values that are numbers, which can be counted or measured. Ex: Number of students, Price, etc.

Categorical variables take values that are labels rather than quantities. Ex: Cities, Countries, Sex, Type, Month, etc.

The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.

We can convert a vector to a factor with the factor() function and then check its factor levels.

Example :

monthly_yearly <- c("Monthly", "Yearly", "Daily")
factor_monthly_yearly <- factor(monthly_yearly)
factor_monthly_yearly
## [1] Monthly Yearly  Daily  
## Levels: Daily Monthly Yearly
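
The levels can also be inspected directly. In this sketch (with an illustrative vector named billing), R sorts the levels of a character vector alphabetically by default and stores each value internally as the position of its level.

```r
# Hypothetical categorical data
billing <- c("Monthly", "Yearly", "Daily", "Monthly")
billing_factor <- factor(billing)

# Levels are sorted alphabetically, not in order of appearance
levels(billing_factor)       # "Daily" "Monthly" "Yearly"

# Number of distinct categories
nlevels(billing_factor)      # 3

# Each value is stored as the index of its level
as.integer(billing_factor)   # 2 3 1 2
```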

There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. Ex : Animals, Planets, Cities

Ordinal variables do have a natural ordering. Ex: Tall, Short or Spicy, Spicier, Spiciest.

We can change the names of the factor levels with the function levels(). R sorts the levels of a character vector alphabetically, so here the levels are "f" and "j", and the new names must be given in that same order.

Example:

months <- c("j", "f", "j", "j", "f", "j")
factor_months <- factor(months)
levels(factor_months) <- c("February", "January")
factor_months
## [1] January  February January  January  February January 
## Levels: February January

We can use the summary() function to get a quick overview of the contents of a variable.

Example :

summary(factor_months)
## February  January 
##        2        4

To turn a factor into an ordinal factor, i.e. a factor whose categories have a natural ordering, we use:

factor(some_vector, ordered = TRUE, levels = c("lev1", "lev2" …))

Example :

factor_month_ordinal <- factor(factor_months, ordered = TRUE, levels = c("January", "February"))
summary(factor_month_ordinal)
##  January February 
##        4        2
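
One benefit of an ordered factor is that its elements can be compared with > and <. The sketch below uses the spice example from earlier in this section; the variable names are illustrative.

```r
# Hypothetical ordinal data, with levels listed from lowest to highest
spice <- c("Spicy", "Spiciest", "Spicier", "Spicy")
spice_factor <- factor(spice, ordered = TRUE,
                       levels = c("Spicy", "Spicier", "Spiciest"))

first_dish <- spice_factor[1]    # Spicy
second_dish <- spice_factor[2]   # Spiciest

# Comparison follows the order of the levels, not the alphabet
second_is_hotter <- second_dish > first_dish     # TRUE

# max() also respects the level order
hottest <- as.character(max(spice_factor))       # "Spiciest"
```

The same comparison on an unordered factor would not be meaningful, since R has no ordering to use.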

2.5 Data frames

A data frame has the variables of a data set as columns and the observations as rows.

The function head() enables you to show the first observations of a data frame.

The function tail() prints out the last observations in your data set.

The function str() shows you the structure of your data set.

To create a data frame:

day <- c("Sunday", "Monday", "Tuesday", 
          "Wednesday", "Thursday", "Friday", 
          "Saturday")
day_type <- c("Lazy", 
          "Hard Work", 
          "Hard Work", 
          "Hard Work", "Hard Work", 
          "Medium Work", "Lazy")
water_intake <- c(2.5, 5.9, 7.0, 6.5, 
              7.2, 4.1, 4.0)
excercise_time <- c(20, 100, 120, 90, 
             70, 60, 45)
morning_jog <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

fitness_data <- data.frame(day,day_type, water_intake, excercise_time, morning_jog)

str(fitness_data)
## 'data.frame':    7 obs. of  5 variables:
##  $ day           : chr  "Sunday" "Monday" "Tuesday" "Wednesday" ...
##  $ day_type      : chr  "Lazy" "Hard Work" "Hard Work" "Hard Work" ...
##  $ water_intake  : num  2.5 5.9 7 6.5 7.2 4.1 4
##  $ excercise_time: num  20 100 120 90 70 60 45
##  $ morning_jog   : logi  FALSE TRUE TRUE FALSE TRUE FALSE ...
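
The head() and tail() functions described above can be tried on any data frame. This is a minimal sketch with a made-up data frame called step_log; both functions accept an n argument and return six rows by default.

```r
# Hypothetical data frame with 10 observations
step_log <- data.frame(day = 1:10,
                       steps = seq(1000, 10000, by = 1000))

# First three observations
first_rows <- head(step_log, n = 3)

# Last two observations
last_rows <- tail(step_log, n = 2)

# Without n, head() returns the first six rows
default_rows <- nrow(head(step_log))   # 6
```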

We can use [] to extract the values in particular rows and columns of the data set.

Example :

#Prints out value for Water Intake on Sunday
fitness_data[1,3]
## [1] 2.5
#Prints out value for all columns on Sunday
fitness_data[1,]
##      day day_type water_intake excercise_time morning_jog
## 1 Sunday     Lazy          2.5             20       FALSE
#Prints out value for Water Intake for each day of the week
fitness_data$water_intake
## [1] 2.5 5.9 7.0 6.5 7.2 4.1 4.0

We can also select the rows that meet a condition using the subset() function.

subset(fitness_data, subset = (morning_jog == TRUE))
##        day  day_type water_intake excercise_time morning_jog
## 2   Monday Hard Work          5.9            100        TRUE
## 3  Tuesday Hard Work          7.0            120        TRUE
## 5 Thursday Hard Work          7.2             70        TRUE
## 7 Saturday      Lazy          4.0             45        TRUE
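
The condition passed to subset() can be any logical expression on the columns, and the same rows can be selected with square brackets. The sketch below uses a shortened, hypothetical version of the fitness data so that it runs on its own.

```r
# Hypothetical, shortened fitness data
fitness <- data.frame(day = c("Sunday", "Monday", "Tuesday"),
                      water_intake = c(2.5, 5.9, 7.0),
                      morning_jog = c(FALSE, TRUE, TRUE))

# Rows where water intake is above 5
high_water <- subset(fitness, subset = water_intake > 5)

# The same selection written with square brackets
high_water_brackets <- fitness[fitness$water_intake > 5, ]

# Conditions can be combined with & (and) or | (or)
jog_and_high_water <- subset(fitness, subset = morning_jog & water_intake > 6)
```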

We can sort/order data according to a certain variable in the data set. In R, this is done with the help of the function order().

sorted_fitness <- order(fitness_data$excercise_time)
fitness_data[sorted_fitness, ]
##         day    day_type water_intake excercise_time morning_jog
## 1    Sunday        Lazy          2.5             20       FALSE
## 7  Saturday        Lazy          4.0             45        TRUE
## 6    Friday Medium Work          4.1             60       FALSE
## 5  Thursday   Hard Work          7.2             70        TRUE
## 4 Wednesday   Hard Work          6.5             90       FALSE
## 2    Monday   Hard Work          5.9            100        TRUE
## 3   Tuesday   Hard Work          7.0            120        TRUE
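
order() returns the row positions in sorted order rather than the sorted values, which is why its result is used inside the square brackets. A short sketch with a hypothetical data frame shows this, along with the decreasing argument for sorting from largest to smallest.

```r
# Hypothetical, shortened fitness data
fitness <- data.frame(day = c("Sunday", "Monday", "Tuesday", "Wednesday"),
                      exercise_time = c(20, 100, 120, 90))

# Positions of the rows from smallest to largest exercise time
ascending <- order(fitness$exercise_time)                      # 1 4 2 3

# Positions from largest to smallest
descending <- order(fitness$exercise_time, decreasing = TRUE)  # 3 2 4 1

# Use the positions to rearrange the rows
fitness_desc <- fitness[descending, ]
```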

Summary so far:

Vectors (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.

Matrices (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.

Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.

2.6 Lists

We can construct a list with the list() function. Its components can be vectors, matrices, data frames, or other lists.

intro_list <- list(total_dailyexp, monthly_exp_matrix, fitness_data)
intro_list
## [[1]]
##    Monday   Tuesday Wednesday  Thursday    Friday 
##       387       780       128       528       850 
## 
## [[2]]
##          1st Half 2nd Half
## January       700      400
## February      800      500
## March         400      600
## 
## [[3]]
##         day    day_type water_intake excercise_time morning_jog
## 1    Sunday        Lazy          2.5             20       FALSE
## 2    Monday   Hard Work          5.9            100        TRUE
## 3   Tuesday   Hard Work          7.0            120        TRUE
## 4 Wednesday   Hard Work          6.5             90       FALSE
## 5  Thursday   Hard Work          7.2             70        TRUE
## 6    Friday Medium Work          4.1             60       FALSE
## 7  Saturday        Lazy          4.0             45        TRUE

We can change the name of each list component using:

names(intro_list) <- c("daily_exp", "monthly_exp", "fitness")

#or name the components when creating the list

intro_list <- list(daily_exp = total_dailyexp, monthly_exp = monthly_exp_matrix, fitness = fitness_data)

We can select components from a list using [[ ]], and then select elements within a component using []. If we want to select the 3rd element (the water_intake column) from the 3rd component of the list:

intro_list[[3]][3]
##   water_intake
## 1          2.5
## 2          5.9
## 3          7.0
## 4          6.5
## 5          7.2
## 6          4.1
## 7          4.0
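
The difference between single and double brackets on a list is easy to miss. The sketch below uses a small hypothetical list, example_list: double brackets (or $ with a name) return the component itself, while single brackets return a smaller list.

```r
# Hypothetical named list with two components
example_list <- list(numbers = c(10, 20, 30), label = "demo")

# Double brackets return the component itself (here a numeric vector)
class(example_list[["numbers"]])   # "numeric"

# Single brackets return a list containing that component
class(example_list["numbers"])     # "list"

# $ is a shortcut for [[ ]] with a name; then [] selects within the vector
second_number <- example_list$numbers[2]   # 20

# Positions work too: the first element of the second component
label_value <- example_list[[2]][1]        # "demo"
```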