10 Cleaning Data in R
https://learn.datacamp.com/courses/cleaning-data-in-r
## ── Attaching packages ───────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
10.1 Common Data Problems
Data Type Constraints:
First, we need to make sure that each variable has the right type assigned to it. We can use the glimpse() function from the dplyr package to see what data type is assigned to each variable.
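For example, glimpsing the gapminder data (presumably the call behind the output below):
library(gapminder)
glimpse(gapminder)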
## Rows: 1,704
## Columns: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghani…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,…
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134,…
Another way to check data types is to use the is.numeric() function, or the assert_is_* functions (e.g. assert_is_numeric()) from the assertive package. These throw an error and stop the script from running when the assertion fails.
##
## Attaching package: 'assertive'
## The following objects are masked from 'package:purrr':
##
## is_atomic, is_character, is_double, is_empty, is_formula,
## is_function, is_integer, is_list, is_logical, is_null, is_numeric,
## is_vector
## The following object is masked from 'package:tibble':
##
## has_rownames
## [1] TRUE
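As a standalone illustration with assertive attached (not necessarily the call that produced the TRUE above):
is.numeric(gapminder$lifeExp)           # TRUE
assert_is_numeric(gapminder$lifeExp)    # passes silently
# assert_is_numeric(gapminder$country)  # would throw an error and halt the script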
We can use the class() function on the revenue column to see its type. Here it turns out to be a factor rather than a character:
## [1] "factor"
To remove a pattern from the values of a variable we use str_remove(variable, "what we want to remove") from the stringr package.
salary <- c("3,4444", "1,234", "4,567")
library(stringr)
salary_trimmed <- str_remove(salary, ",")
salary_trimmed
## [1] "34444" "1234" "4567"
To convert a variable to a numeric type we use the as.numeric() function; as.character(), as.logical(), and as.factor() work analogously:
salary_trimmed_numeric <- as.numeric(salary_trimmed)
salary_trimmed_character <- as.character(salary_trimmed)
salary_trimmed_logical <- as.logical(salary_trimmed)
salary_trimmed_factor <- as.factor(salary_trimmed)
Range Constraints:
To check whether any data is out of range, we can use assert_all_are_in_closed_range(dataset$variable_name, lower = lower_bound, upper = upper_bound). It throws an error listing the values that are out of range.
We can remove the out-of-range values if there are only a few, set them to NA, or replace them with the range limit (e.g. a rating of 6 can be replaced with 5 so that it is in range).
To remove out-of-range values, we can filter them out.
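A minimal sketch, assuming a hypothetical movies data frame with an avg_rating column on a 1-5 scale:
# Keep only in-range ratings
movies %>%
  filter(avg_rating >= 1, avg_rating <= 5)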
To replace the values with NA, we use the replace(col, condition, replacement) function:
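A sketch with the same hypothetical movies data:
# Set out-of-range ratings to NA
movies %>%
  mutate(avg_rating = replace(avg_rating, avg_rating > 5, NA))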
We can use the assert_all_are_in_past(dataset$variable) function to check whether any dates lie in the future.
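For example, with a hypothetical accounts data frame that has a date_opened column:
# Throws an error if any account was opened at a future date
assert_all_are_in_past(accounts$date_opened)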
Uniqueness Constraints:
We can use the duplicated() function to check for duplicated observations: duplicated(df).
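The output below is presumably from calls like these on gapminder:
# Count full duplicates
sum(duplicated(gapminder))
# Show the duplicated rows (none here)
gapminder %>% filter(duplicated(gapminder))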
## [1] 0
## # A tibble: 0 x 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## # lifeExp <dbl>, pop <int>, gdpPercap <dbl>
We can drop full duplicates using the distinct(df) function.
Partial duplicates are rows that match on some columns but differ in a few values.
To remove partial duplicates, we use the distinct() function again, this time passing the specific column names to deduplicate on and setting .keep_all = TRUE, which keeps all columns rather than only the ones named (e.g. first and last name), as shown below.
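A minimal sketch, assuming a hypothetical credit_scores data frame keyed by first and last name:
# Keep one row per first/last name pair, retaining all other columns
credit_scores %>%
  distinct(first_name, last_name, .keep_all = TRUE)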
10.2 Categorical and Text Data
Categorical Data: Categorical variables have a fixed and known set of possible values. They are stored as factors in R. Factors have something called levels which are all possible values that a factor can hold.
We can use the anti_join() or semi_join() functions to find invalid factor values in our data. The anti_join() function returns the rows of our data frame that have no match in the data frame we join it with.
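A minimal sketch, assuming hypothetical study_data and blood_types data frames that share a blood_type column:
# Rows of study_data whose blood_type is NOT in blood_types (invalid)
study_data %>%
  anti_join(blood_types, by = "blood_type")
# Rows of study_data whose blood_type IS in blood_types (valid)
study_data %>%
  semi_join(blood_types, by = "blood_type")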
Categorical Data Problems:
"Lizards" and "lizards" should fall under the same category regardless of capitalization, and "Pug", "Lab", and "Boxer" should all fall under one category, "Dog", since they all refer to dogs.
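Counting dest_size in the sfo_survey data (presumably the call behind the output below) reveals near-duplicate categories that differ only in capitalization or surrounding whitespace:
sfo_survey %>%
  count(dest_size)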
## dest_size n
## 1 Small 1
## 2 Hub 1
## 3 Hub 1756
## 4 Large 143
## 5 Large 1
## 6 Medium 682
## 7 Small 225
To fix capitalization and whitespace issues we use the following functions: str_trim(), str_to_upper(), and str_to_lower().
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed = str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)
## dest_size_trimmed n
## 1 Hub 1757
## 2 Large 144
## 3 Medium 682
## 4 Small 226
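Similarly for cleanliness_lower (presumably the call behind the output below):
sfo_survey %>%
  count(cleanliness_lower)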
## cleanliness_lower n
## 1 average 433
## 2 clean 970
## 3 dirty 2
## 4 somewhat clean 1254
## 5 somewhat dirty 30
## 6 <NA> 120
If we have several categories that each contain an insignificant number of observations compared to the rest of the data frame, we can collapse them all into a single category such as "Other". We can use the fct_collapse() function from the forcats package to do this.
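Counting dest_region (presumably the call behind the output below):
sfo_survey %>%
  count(dest_region)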
## dest_region n
## 1 Asia 260
## 2 Australia/New Zealand 66
## 3 Canada/Mexico 220
## 4 Central/South America 29
## 5 East US 498
## 6 Europe 401
## 7 Middle East 79
## 8 Midwest US 281
## 9 West US 975
# Categories to map to Europe
europe_categories <- c("EU", "Europ", "eur")

# Add a new column dest_region_collapsed
sfo_survey %>%
  # Map all categories in europe_categories to Europe
  mutate(dest_region_collapsed = fct_collapse(dest_region,
                                              Europe = europe_categories)) %>%
  # Count categories of dest_region_collapsed
  count(dest_region_collapsed)
## Warning: Problem with `mutate()` input `dest_region_collapsed`.
## ℹ Unknown levels in `f`: EU, Europ, eur
## ℹ Input `dest_region_collapsed` is `fct_collapse(dest_region, Europe = europe_categories)`.
## Warning: Unknown levels in `f`: EU, Europ, eur
## dest_region_collapsed n
## 1 Asia 260
## 2 Australia/New Zealand 66
## 3 Canada/Mexico 220
## 4 Central/South America 29
## 5 East US 498
## 6 Europe 401
## 7 Middle East 79
## 8 Midwest US 281
## 9 West US 975
Cleaning Text Data:
Unstructured data problems include formatting inconsistency ("61718279912" vs. "(868) 992-4498", or "123 123 123 123" vs. "123123123123"), information inconsistency ("+1 617 717 71777" vs. "617-717-71777", or a full name like "Sam Johnson" vs. a first name only like "Josiah"), and invalid data (a 3-digit phone number or a 2-digit zip code).
To take care of these issues we use the following functions:

# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(phone, "-"))

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

# Add phone_no_parens as a column, replacing hyphens with spaces as well
sfo_survey %>%
  mutate(phone_no_parens = phone_no_parens,
         # Replace all hyphens in phone_no_parens with spaces
         phone_clean = str_replace_all(phone_no_parens, "-", " "))

# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12)

# Remove rows with invalid numbers
sfo_survey %>%
  filter(str_length(phone) == 12)
10.3 Advanced Data Problems
Uniformity:
Uniformity issues arise when continuous data points have different units or formats. Ex: temperature in C or F, weight in kg, g, or lb, money in USD or GBP, or dates in different formats.
To apply changes to specific observations we can use ifelse(condition, value_if_true, value_if_false).
a <- gapminder %>% select(country, year, gdpPercap)
a <- a %>% filter(year > 1989)
a %>% mutate(generation = ifelse(year >= 1990 & year < 2000, "GenZ", "Millenial"))
## # A tibble: 568 x 4
## country year gdpPercap generation
## <fct> <int> <dbl> <chr>
## 1 Afghanistan 1992 649. GenZ
## 2 Afghanistan 1997 635. GenZ
## 3 Afghanistan 2002 727. Millenial
## 4 Afghanistan 2007 975. Millenial
## 5 Albania 1992 2497. GenZ
## 6 Albania 1997 3193. GenZ
## 7 Albania 2002 4604. Millenial
## 8 Albania 2007 5937. Millenial
## 9 Algeria 1992 5023. GenZ
## 10 Algeria 1997 4797. GenZ
## # … with 558 more rows
Dates:
We can parse dates to put them all in the same format. Typing ?strptime in the R console shows all the date formats that R supports. To convert all dates to one specific format, we use the parse_date_time() function from the lubridate package.
accounts <- readRDS("ch3_1_accounts (1).rds")
# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")
# Convert dates to the same format
# accounts %>%
#   mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))
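A self-contained sketch of how parse_date_time() behaves with these formats (not the course data):
library(lubridate)
parse_date_time(c("2020-01-15", "January 15, 2020"), orders = formats)
# both values parse to "2020-01-15 UTC"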
Cross field validation:
Cross field validation is essentially a sanity check on our data: checking that one value makes sense based on other values in the dataset.
What to validate depends on the dataset. For the accounts dataset, we can add up the three funds, assign the sum to a new variable called theoretical_total, and check whether it matches the total variable.
Example code:

# Find invalid totals
accounts %>%
  # theoretical_total: sum of the three funds
  mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
  # Find accounts where total doesn't match theoretical_total
  filter(total != theoretical_total)

# Find invalid acct_age
accounts %>%
  # theoretical_age: age of acct based on date_opened
  mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
  # Filter for rows where acct_age is different from theoretical_age
  filter(acct_age != theoretical_age)
Completeness:
Missing data occurs when no value is stored for a variable in an observation (NA, NaN, 0, 99, ...).
We can find missing values in a data set using the is.na(df) function.
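The 0 below presumably comes from totalling the missing values, e.g.:
sum(is.na(accounts))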
## [1] 0
We can use the vis_miss(df) function from the visdat package to visualize missing data.
# library(visdat)  # not supported for R 4.0.2
# vis_miss(accounts)
We can filter missing values by the following code:
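A minimal sketch, assuming we want the rows of accounts whose total is missing:
# Rows with a missing total
accounts %>%
  filter(is.na(total))
# Or keep only the complete rows
accounts %>%
  filter(!is.na(total))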
10.4 Record Linkage
Comparing strings:
To calculate edit distance in R, we can use the stringdist() function from the stringdist package: stringdist("comparing_element1", "comparing_element2", method = "method_we_want_to_use").
Available methods include Damerau-Levenshtein, Levenshtein, Longest Common Subsequence, Jaccard, and others.
##
## Attaching package: 'stringdist'
## The following object is masked from 'package:tidyr':
##
## extract
## [1] 4
## [1] 7
## [1] 0.75
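As a self-contained illustration with stringdist attached (these are not the calls that produced the numbers above):
# Classic example: turning "kitten" into "sitting" takes 3 edits
stringdist("kitten", "sitting", method = "lv")   # Levenshtein: 3
stringdist("kitten", "sitting", method = "dl")   # Damerau-Levenshtein: 3
stringdist("kitten", "sitting", method = "lcs")  # longest common subsequence distance: 5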
The fuzzyjoin package allows us to join based on string distance. The code to do that looks something like this (max_dist takes a number, the largest distance to still count as a match):
# library(fuzzyjoin)
# stringdist_left_join(df1, df2, by = "common_column", method = "dl", max_dist = 1)
Record Linkage:
Record linkage involves linking together data from multiple sources that don't share a common identifier but contain data regarding the same entity.
To generate pairs, we first check whether any of the observations of tibble A match any of the observations of tibble B.
To generate pairs of rows in R we use the reclin package. The blocking variable is a variable we expect true matches to agree on, which limits the pairs that are generated.
Scoring and linking example code:
## Loading required package: lvec
##
## Attaching package: 'lvec'
## The following object is masked from 'package:base':
##
## order
## Loading required package: ldat
## Loading required package: Rcpp
##
## Attaching package: 'ldat'
## The following objects are masked from 'package:base':
##
## append, match, table, which
##
## Attaching package: 'reclin'
## The following object is masked from 'package:base':
##
## identical
# Create pairs
mmm <- pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m() %>%
  # Link data
  link()
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.