10 Cleaning Data in R

https://learn.datacamp.com/courses/cleaning-data-in-r

library(tidyverse)
## ── Attaching packages ───────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(gapminder)

10.1 Common Data Problems

Data Type Constraints:

Firstly, we need to make sure that each variable has the right type assigned to it. We can use the glimpse() function from the dplyr package to see what data type is assigned to each variable.

glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134,…

Another way to check data types is to use the is.numeric() function, or the assert_is_numeric() function from the assertive package, which throws an error and stops the script from running if the assertion fails.

library(assertive)
## 
## Attaching package: 'assertive'
## The following objects are masked from 'package:purrr':
## 
##     is_atomic, is_character, is_double, is_empty, is_formula,
##     is_function, is_integer, is_list, is_logical, is_null, is_numeric,
##     is_vector
## The following object is masked from 'package:tibble':
## 
##     has_rownames
is.numeric(gapminder$pop)
## [1] TRUE
assert_is_numeric(gapminder$pop)

We can use the class() function on a column to see its type.

class(gapminder$continent)
## [1] "factor"

To remove something from the observations of a variable we use str_remove(variable, "what we want to remove") from the stringr package. Note that str_remove() removes only the first match; use str_remove_all() to remove every occurrence.

salary <- c("3,4444", "1,234", "4,567")

library(stringr)

salary_trimmed <- str_remove(salary, ",")
salary_trimmed
## [1] "34444" "1234"  "4567"

To convert a variable to a numeric type we use the as.numeric() function; as.character(), as.logical(), and as.factor() convert to the other types in the same way.

salary_trimmed_numeric <- as.numeric(salary_trimmed) 
salary_trimmed_character <- as.character(salary_trimmed) 
salary_trimmed_logical <- as.logical(salary_trimmed) 
salary_trimmed_factor <- as.factor(salary_trimmed) 

Range Constraints:

To see if any data is out of range, we can use assert_all_are_in_closed_range(dataset$variable_name, lower = lowest allowed value, upper = highest allowed value). If any values fall outside the range, it throws an error listing them.

#assert_all_are_in_closed_range(salary_trimmed, lower = 1000, upper = 3000)
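
As a minimal sketch (assuming the assertive package is installed), an in-range vector passes silently, while an out-of-range value throws an error:

```r
library(assertive)

ratings <- c(3, 4, 5, 2)
# All values lie within [1, 5], so this passes silently
assert_all_are_in_closed_range(ratings, lower = 1, upper = 5)

# A vector containing 6 would throw an error listing the offending value:
# assert_all_are_in_closed_range(c(3, 6), lower = 1, upper = 5)
```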

We can remove the out-of-range values if there are only a few, set them to NA, or replace them with the range limit (e.g. a rating of 6 on a 1-5 scale can be replaced with 5 so that it is in range).

To remove, we use filter():

gapminder_1977to2000 <- gapminder %>% filter(year > 1977, year < 2000)

To replace values with NA, we use the replace(col, condition, replacement) function:

b <- gapminder_1977to2000 %>% mutate(replace_NA = replace(year, year == 1982, NA))

We can use the assert_all_are_in_past(dataset$variable) function to check whether any dates lie in the future.
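
A hedged sketch of the same idea, assuming the assertive package is available: past dates pass silently, while a future date would error.

```r
library(assertive)

dates <- as.Date(c("2001-01-01", "2010-06-15"))
# Both dates are in the past, so this passes silently
assert_all_are_in_past(dates)

# A future date would throw an error:
# assert_all_are_in_past(Sys.Date() + 30)
```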

Uniqueness constraints:

We can use the duplicated(df) function to check for duplicated observations.

sum(duplicated(gapminder))
## [1] 0
#to check which observations are duplicated 

filter(gapminder, duplicated(gapminder))
## # A tibble: 0 x 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## #   lifeExp <dbl>, pop <int>, gdpPercap <dbl>

We can drop full duplicates using the distinct(df) function.

gapminder <- distinct(gapminder)

Partial duplicates are rows that match on some columns but have different values in others.

To remove partial duplicates, we use the distinct() function again, this time passing the specific column names to match on and setting .keep_all = TRUE, which keeps all columns rather than only the ones used for deduplication.

a <- gapminder %>% distinct(country, continent, .keep_all = TRUE)

10.2 Categorical and Text Data

Categorical Data: Categorical variables have a fixed and known set of possible values. They are stored as factors in R. Factors have something called levels which are all possible values that a factor can hold.
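
For example, we can inspect all the levels of a factor with levels():

```r
# Factors store their full set of possible values as levels
continents <- factor(c("Asia", "Europe", "Asia"))
levels(continents)
## [1] "Asia"   "Europe"
```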

We can use the anti_join() or semi_join() functions to find invalid factor values in our data. The anti_join() function returns the rows of our data frame that have no match in the data frame we are joining it with.
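
A small sketch with hypothetical data frames: anti_join() returns the rows whose category is not in the set of valid values, and semi_join() returns the rows that are.

```r
library(dplyr)

animals <- tibble(animal = c("dog", "cat", "lizzard"))
valid   <- tibble(animal = c("dog", "cat", "lizard"))

# Rows in animals with no match in valid (the invalid values)
anti_join(animals, valid, by = "animal")   # "lizzard"

# Rows in animals that do have a match in valid
semi_join(animals, valid, by = "animal")   # "dog", "cat"
```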

Categorical Data Problems:

"Lizards" and "lizards" should fall under the same category regardless of the capitalized initial letter. "Pug", "Lab", and "Boxer" should all fall under one category, "Dog", since they refer to dogs.

sfo_survey <- readRDS("sfo_survey_ch2_1.rds")

sfo_survey %>% count(dest_size)
##   dest_size    n
## 1   Small      1
## 2       Hub    1
## 3       Hub 1756
## 4     Large  143
## 5   Large      1
## 6    Medium  682
## 7     Small  225

To fix capitalization and spacing issues we use the following functions: str_trim(), str_to_upper(), and str_to_lower().

# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)
##   dest_size_trimmed    n
## 1               Hub 1757
## 2             Large  144
## 3            Medium  682
## 4             Small  226
# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)
##   cleanliness_lower    n
## 1           average  433
## 2             clean  970
## 3             dirty    2
## 4    somewhat clean 1254
## 5    somewhat dirty   30
## 6              <NA>  120

If some categories have an insignificant number of observations compared to the rest of the data frame, we can combine them into a single category called "Other". We can use the fct_collapse() function to remap category values like this.

# Count categories of dest_region
sfo_survey %>%
  count(dest_region)
##             dest_region   n
## 1                  Asia 260
## 2 Australia/New Zealand  66
## 3         Canada/Mexico 220
## 4 Central/South America  29
## 5               East US 498
## 6                Europe 401
## 7           Middle East  79
## 8            Midwest US 281
## 9               West US 975
# Categories to map to Europe
europe_categories <- c("EU", "Europ", "eur")

# Add a new col dest_region_collapsed
sfo_survey %>%
  # Map all categories in europe_categories to Europe
  mutate(dest_region_collapsed = fct_collapse(dest_region, 
                                     Europe = europe_categories)) %>%
  # Count categories of dest_region_collapsed
 count(dest_region_collapsed)
## Warning: Problem with `mutate()` input `dest_region_collapsed`.
## ℹ Unknown levels in `f`: EU, Europ, eur
## ℹ Input `dest_region_collapsed` is `fct_collapse(dest_region, Europe = europe_categories)`.
## Warning: Unknown levels in `f`: EU, Europ, eur
##   dest_region_collapsed   n
## 1                  Asia 260
## 2 Australia/New Zealand  66
## 3         Canada/Mexico 220
## 4 Central/South America  29
## 5               East US 498
## 6                Europe 401
## 7           Middle East  79
## 8            Midwest US 281
## 9               West US 975

Cleaning Text Data:

Unstructured data problems include formatting inconsistency ("61718279912" vs "(868) 992-4498", "123 123 123 123" vs "123123123123"), information inconsistency ("+1 617 717 71777" vs "617-717-71777", "Sam Johnson" vs "Josiah"), and invalid data (a 3-digit phone number, a 2-digit zip code).

To take care of these issues we use the following functions:

# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(phone, "-"))

# Remove parentheses from the phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

# Add phone_no_parens as a column and
# replace all hyphens in phone_no_parens with spaces
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens,
         phone_clean = str_replace_all(phone_no_parens, "-", " "))

# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12)

# Remove rows with invalid numbers
sfo_survey %>%
  filter(str_length(phone) == 12)

10.3 Advanced Data Problems

Uniformity:

Uniformity issues arise when continuous data points have different units or formats. Ex: temperature in °C or °F, weight in kg, g, or lb, money in USD or GBP, or dates in different formats.

To apply certain changes to specific observations we can use ifelse(condition, value_if_true, value_if_false).

a <- gapminder %>% select(country, year , gdpPercap)
a <- a %>% filter(year > 1989)
a %>% mutate(generation = ifelse(year >= 1990 & year < 2000, "GenZ", "Millennial"))
## # A tibble: 568 x 4
##    country      year gdpPercap generation
##    <fct>       <int>     <dbl> <chr>     
##  1 Afghanistan  1992      649. GenZ      
##  2 Afghanistan  1997      635. GenZ      
##  3 Afghanistan  2002      727. Millennial
##  4 Afghanistan  2007      975. Millennial
##  5 Albania      1992     2497. GenZ      
##  6 Albania      1997     3193. GenZ      
##  7 Albania      2002     4604. Millennial
##  8 Albania      2007     5937. Millennial
##  9 Algeria      1992     5023. GenZ      
## 10 Algeria      1997     4797. GenZ      
## # … with 558 more rows

Dates:

We can parse dates so that they are all in the same format. If we type ?strptime in the R console we can see all the date formats that R supports. To convert all dates to one specific format, we use the parse_date_time() function from the lubridate package.

accounts <- readRDS("ch3_1_accounts (1).rds")

# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")

# Convert dates to the same format
#accounts %>%
  #mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))
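
As a runnable sketch with made-up dates (assuming lubridate is installed), parse_date_time() tries each format in orders until one matches:

```r
library(lubridate)

dates <- c("2019-03-05", "March 5, 2019", "05/03/2019")
formats <- c("%Y-%m-%d", "%B %d, %Y", "%d/%m/%Y")

# Each string is matched against the candidate formats and
# converted to a common date-time representation
parse_date_time(dates, orders = formats)
```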

Cross field validation:

Cross field validation is essentially a sanity check on our data: checking that one value makes sense based on other values in the dataset.

What to validate depends on the dataset. For the accounts dataset, we can add up the three funds, assign the sum to a new variable called theoretical_total, and check whether it matches the total variable.

Example code:

# 1. Find invalid totals
accounts %>%
  # theoretical_total: sum of the three funds
  mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
  # Find accounts where total doesn't match theoretical_total
  filter(total != theoretical_total)

# 2. Find invalid acct_age
accounts %>%
  # theoretical_age: age of account based on date_opened
  mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  # Filter for rows where acct_age is different from theoretical_age
  filter(acct_age != theoretical_age)

Completeness:

Missing data occurs when no value is stored for a variable in an observation (NA, NaN, or sentinel values such as 0 or 99).

We can find missing values in a data set using the is.na(df) function.

sum(is.na(gapminder))
## [1] 0

We can use the vis_miss(df) function from the visdat package to visualize missing data.

# library(visdat)  # not supported for R 4.0.2, so not run here
# vis_miss(accounts)

We can filter missing values by the following code:

filter_values <- accounts %>%
  filter(!is.na(id), !is.na(total))

10.4 Record Linkage

Comparing strings:

To calculate edit distance in R, we can use the stringdist() function from the stringdist package: stringdist("string1", "string2", method = "method_we_want_to_use").

Different methods include: Damerau-Levenshtein ("dl"), Levenshtein ("lv"), longest common subsequence ("lcs"), Jaccard ("jaccard"), etc.

library(stringdist)
## 
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
## 
##     extract
stringdist("baboon", "typhoon", method = "dl")
## [1] 4
stringdist("baboon", "typhoon", method = "lcs")
## [1] 7
stringdist("baboon", "typhoon", method = "jaccard")
## [1] 0.75

The fuzzyjoin package allows us to join based on string distance. The code to do that looks something like this:

#library(fuzzyjoin)
#stringdist_left_join(df1, df2, by = "common factor", method = "dl", max_dist = "max dist you want to allow ")
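
A hypothetical sketch, assuming fuzzyjoin is installed: misspelled city names within Damerau-Levenshtein distance 2 still join to the lookup table.

```r
library(fuzzyjoin)
library(dplyr)

typos  <- tibble(city = c("new yrok", "bostn"))
lookup <- tibble(city = c("new york", "boston"),
                 state = c("NY", "MA"))

# Join rows whose city names are within edit distance 2 of each other
stringdist_left_join(typos, lookup, by = "city",
                     method = "dl", max_dist = 2)
```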

Record Linkage:

Record linkage involves linking together data that come from multiple sources that don't share a common identifier but contain data regarding the same entities.

To generate pairs, first we need to check whether any observation of tibble A matches any observation of tibble B.

To generate pairs of rows in R we use the reclin package. The blocking variable is a variable we expect matching pairs to agree on.

#pair_blocking(df_A, df_B, blocking_var = "blocking var you choose")

Scoring and linking example codes:

zagat <- readRDS("zagat (1).rds")
fodors <- readRDS("fodors (1).rds")
library(reclin)
## Loading required package: lvec
## 
## Attaching package: 'lvec'
## The following object is masked from 'package:base':
## 
##     order
## Loading required package: ldat
## Loading required package: Rcpp
## 
## Attaching package: 'ldat'
## The following objects are masked from 'package:base':
## 
##     append, match, table, which
## 
## Attaching package: 'reclin'
## The following object is masked from 'package:base':
## 
##     identical
# Create pairs
mmm <- pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m() %>%
  # Link data
  link()
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.