2018-03-22
The janitor functions expedite the initial data exploration and cleaning that comes with any new data set. This catalog describes the usage for each function.
Functions for everyday use.
clean_names()Call this function every time you read data.
It works in a %>% pipeline, and handles problematic variable names, especially those that are so well-preserved by readxl::read_excel() and readr::read_csv().
œ to oe.# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
"REPEAT VALUE", "REPEAT VALUE", "")Clean the variable names, returning a data.frame:
test_df %>%
clean_names()
#> first_name abc percent_successful_2009 repeat_value repeat_value_2 x
#> 1 NA NA NA NA NA NACompare to what base R produces:
make.names(names(test_df))
#> [1] "firstName" "ábc..." "X..successful..2009."
#> [4] "REPEAT.VALUE" "REPEAT.VALUE" "X"tabyl() - a better version of table()tabyl() is a tidyverse-oriented replacement for table(). It counts combinations of one, two, or three variables, and then can be formatted with a suite of adorn_* functions to look just how you want. For instance:
mtcars %>%
tabyl(gear, cyl) %>%
adorn_totals("col") %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
adorn_title()
#> cyl
#> gear 4 6 8 Total
#> 3 6.67% (1) 13.33% (2) 80.00% (12) 100.00% (15)
#> 4 66.67% (8) 33.33% (4) 0.00% (0) 100.00% (12)
#> 5 40.00% (2) 20.00% (1) 40.00% (2) 100.00% (5)Learn more in the tabyls vignette.
get_dupes()This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn’t be any.
For example, in a tidy data.frame you might expect to have a unique ID repeated for each year, but no duplicated pairs of unique ID & year. Say you want to check for and study any such duplicated records.
get_dupes() returns the records (and inserts a count of duplicates) so you can examine the problematic cases:
get_dupes(mtcars, wt, cyl) # or mtcars %>% get_dupes(wt, cyl) if you prefer to pipe
#> # A tibble: 4 x 12
#> wt cyl dupe_count mpg disp hp drat qsec vs am gear
#> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3.44 6. 2 19.2 168. 123. 3.92 18.3 1. 0. 4.
#> 2 3.44 6. 2 17.8 168. 123. 3.92 18.9 1. 0. 4.
#> 3 3.57 8. 2 14.3 360. 245. 3.21 15.8 0. 0. 3.
#> 4 3.57 8. 2 15.0 301. 335. 3.54 14.6 0. 1. 5.
#> # ... with 1 more variable: carb <dbl>Smaller functions for use in particular situations. More human-readable than the equivalent code they replace.
excel_numeric_to_date()Ever load data from Excel and see a value like 42223 where a date should be? This function converts those serial numbers to class Date, and contains an option for specifying the alternate date system for files created with Excel for Mac 2008 and earlier versions (which count from a different starting point).
excel_numeric_to_date(41103)
#> [1] "2012-07-13"
excel_numeric_to_date(41103, date_system = "mac pre-2011")
#> [1] "2016-07-14"remove_empty() rows and columnsDoes what it says. For cases like cleaning Excel files that contain empty rows and columns after being read into R.
q <- data.frame(v1 = c(1, NA, 3),
v2 = c(NA, NA, NA),
v3 = c("a", NA, "b"))
q %>%
remove_empty(c("rows", "cols"))
#> v1 v3
#> 1 1 a
#> 3 3 bJust a simple wrapper for one-line functions, but it saves a little thinking for both the code writer and the reader.
round_half_up()R uses “banker’s rounding”, i.e., halves are rounded to the nearest even number. This function, an exact implementation of https://stackoverflow.com/questions/12688717/round-up-from-5/12688836#12688836, will round all halves up. Compare:
top_levels()Originally designed for use with Likert survey data stored as factors. Returns a tbl_df frequency table with appropriately-named rows, grouped into head/middle/tail groups.
NA values.f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
top_levels(f)
#> f n percent
#> strongly agree, agree 3 0.5000000
#> neutral 2 0.3333333
#> disagree, strongly disagree 1 0.1666667
top_levels(f, n = 1)
#> f n percent
#> strongly agree 2 0.3333333
#> agree, neutral, disagree 4 0.6666667
#> strongly disagree 0 0.0000000