knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(janitor)
The janitor functions expedite the initial data exploration and cleaning that comes with any new data set. This catalog describes the usage for each function.
Functions for everyday use.
Call clean_names() every time you read data.
It works in a %>% pipeline and handles problematic variable names, especially those that data-import functions preserve exactly as they appear in the source file.
# Create a data.frame with dirty names
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
                    "REPEAT VALUE", "REPEAT VALUE", "")
Clean the variable names, returning a data.frame:
test_df %>% clean_names()
Compare to what base R produces:
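As a rough sketch of that comparison, assuming make.names() is the base R function meant here, the same dirty names can be passed through it directly. The data.frame is rebuilt so the chunk runs on its own.

```r
# Rebuild the example data.frame so this chunk is self-contained
test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
                    "REPEAT VALUE", "REPEAT VALUE", "")

# make.names() only makes names syntactically valid; it does not
# change case or de-duplicate unless unique = TRUE is requested
base_names <- make.names(names(test_df))
base_names
```

Note that the two "REPEAT VALUE" columns stay duplicated in the base R result, and the empty name becomes a bare "X".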
This function is powered by the underlying exported function make_clean_names(), which accepts and returns a character vector of names (see below). This allows for cleaning the names of any object, not just a data.frame.
clean_names() is retained for its convenience in piped workflows, and can be called on an sf simple features object or a tbl_graph tidygraph object in addition to a data.frame.
Use this when you are given a set of data files that should be identical and you wish to read and combine them for analysis, but rbind() fails because the columns differ or because the column classes don't match across data.frames.
compare_df_cols() takes unquoted names of data.frames / tibbles, or a list of data.frames, and returns a summary of how they compare. See what the column types are, which are missing or present in the different inputs, and how column types differ.
df1 <- data.frame(a = 1:2, b = c("big", "small"))
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0, stringsAsFactors = TRUE) # here, column b is a factor
df3 <- df1 %>% dplyr::mutate(b = as.character(b))

compare_df_cols(df1, df2, df3)
compare_df_cols(df1, df2, df3, return = "mismatch")
compare_df_cols(df1, df2, df3, return = "mismatch", bind_method = "rbind") # default is dplyr::bind_rows
compare_df_cols_same() returns TRUE or FALSE, indicating whether the data.frames can be successfully row-bound with the given binding method:
compare_df_cols_same(df1, df3)
compare_df_cols_same(df2, df3)
tabyl() - a better version of table()
tabyl() is a tidyverse-oriented replacement for table(). It counts combinations of one, two, or three variables, and then can be formatted with a suite of adorn_* functions to look just how you want. For instance:
mtcars %>%
  tabyl(gear, cyl) %>%
  adorn_totals("col") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>%
  adorn_title()
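The example above is a two-way table. As a minimal sketch of the one-variable case, using the same built-in mtcars data, a single column can be counted on its own:

```r
library(janitor)

# One-way tabyl: one row per distinct value of cyl,
# with a count column and a proportion column
one_way <- tabyl(mtcars, cyl)
one_way
```

The result is still a data.frame, so it can be filtered, joined, or piped into the adorn_* functions like any other.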
Learn more in the tabyls vignette.
This is for hunting down and examining duplicate records during data cleaning - usually when there shouldn't be any.
For example, in a tidy data.frame you might expect to have a unique ID repeated for each year, but no duplicated pairs of unique ID & year. Say you want to check for and study any such duplicated records.
get_dupes() returns the records (and inserts a count of duplicates) so you can examine the problematic cases:
get_dupes(mtcars, wt, cyl) # or mtcars %>% get_dupes(wt, cyl) if you prefer to pipe
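The ID-and-year scenario described above can be sketched with a small made-up data.frame. The `panel` object and its columns are hypothetical, invented only for illustration.

```r
library(janitor)

# Hypothetical panel data: each id should appear at most once per year,
# but id 2 has two records for 2019
panel <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  year  = c(2019, 2020, 2019, 2019, 2020),
  value = c(10, 11, 20, 21, 22)
)

# Return only the rows whose (id, year) pair occurs more than once,
# along with a dupe_count column
dupes <- get_dupes(panel, id, year)
dupes
```

Only the two conflicting rows for id 2 in 2019 come back, which makes it easy to see how their other columns differ.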
Smaller functions for use in particular situations. More human-readable than the equivalent code they replace.
Like base R's make.names(), but with the stylings and case choice of the long-time janitor function clean_names(). While clean_names() is still offered for use in data.frame pipelines with %>%, make_clean_names() allows for more general usage, e.g., on a vector.
It can also be used as an argument to .name_repair in the newest version of tibble::as_tibble():
tibble::as_tibble(iris, .name_repair = janitor::make_clean_names)
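For the vector use case mentioned above, a minimal sketch follows. The `raw_names` vector is made up for illustration and reuses a few of the dirty names from earlier in this catalog.

```r
library(janitor)

# A plain character vector, not attached to any data.frame
raw_names <- c("firstName", "REPEAT VALUE", "REPEAT VALUE")

# Returns a character vector of the same length, in snake_case,
# with duplicates made unique by a numeric suffix
cleaned <- make_clean_names(raw_names)
cleaned
```

Because the input and output are both character vectors, the result can be assigned back with names(x) <- on any named object.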
remove_empty() rows and columns
Does what it says. For cases like cleaning Excel files that contain empty rows and columns after being read into R.
q <- data.frame(v1 = c(1, NA, 3),
                v2 = c(NA, NA, NA),
                v3 = c("a", NA, "b"))
q %>% remove_empty(c("rows", "cols"))
Just a simple wrapper for one-line functions, but it saves a little thinking for both the code writer and the reader.
Drops columns from a data.frame that contain only a single constant value (with an na.rm option to control whether NAs should be considered as different values from the constant).
remove_constant() and remove_empty() work on matrices as well as data.frames.
a <- data.frame(good = 1:3, boring = "the same")
a %>% remove_constant()
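The effect of the na.rm option can be sketched with a column that is constant apart from a missing value. The data.frame `b` is made up for illustration.

```r
library(janitor)

# mostly_same holds a single value plus one NA
b <- data.frame(good = 1:3, mostly_same = c("x", NA, "x"))

# Default (na.rm = FALSE): NA counts as a distinct value, so the column stays
kept <- remove_constant(b)

# na.rm = TRUE: NAs are ignored, so the column is treated as constant and dropped
dropped <- remove_constant(b, na.rm = TRUE)

names(kept)
names(dropped)
```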
R uses "banker's rounding", i.e., halves are rounded to the nearest even number. This function, an exact implementation of https://stackoverflow.com/questions/12688717/round-up-from-5/12688836#12688836, will round all halves up. Compare:
nums <- c(2.5, 3.5)
round(nums)
round_half_up(nums)
Say your data should only have values of quarters: 0, 0.25, 0.5, 0.75, 1, etc. But there are either user-entered bad values like 0.2 or values affected by floating-point precision problems.
round_to_fraction() will enforce the desired fractional distribution by rounding the values to the nearest value given the specified denominator.
There's also a digits argument for optional subsequent rounding.
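A minimal sketch of the quarters scenario, assuming a denominator of 4; the input values are made up for illustration. The second call shows the digits argument applied after rounding to thirds.

```r
library(janitor)

# Values that should be quarters but were entered imprecisely
messy <- c(0.2, 0.26, 0.7)

# Snap each value to the nearest multiple of 1/4
quarters <- round_to_fraction(messy, denominator = 4)
quarters

# Snap to the nearest third, then round the result to 2 decimal places
thirds <- round_to_fraction(0.3, denominator = 3, digits = 2)
thirds
```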
Ever load data from Excel and see a value like 42223 where a date should be? This function converts those serial numbers to class Date, with options for different Excel date encoding systems, preserving fractions of a date as time (in which case the returned value is of class POSIXlt), and specifying a time zone.
excel_numeric_to_date(41103)
excel_numeric_to_date(41103.01) # ignores decimal places, returns Date object
excel_numeric_to_date(41103.01, include_time = TRUE) # returns POSIXlt object
excel_numeric_to_date(41103.01, date_system = "mac pre-2011")
Building on excel_numeric_to_date(), the new functions convert_to_date() and convert_to_datetime() are more robust to a mix of inputs. Handy when reading many spreadsheets that should have the same column formats, but don't.
For instance, here a vector with a date and an Excel datetime sees both values successfully converted to Date class:
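A sketch of such a call, assuming convert_to_date() with its default settings; the two input strings are illustrative. The first is an ISO-formatted date and the second is an Excel serial number with a fractional (time) part, both stored as text.

```r
library(janitor)

# Mixed input: a date string and an Excel datetime serial number as text
mixed <- c("2020-02-29", "40000.1")

# Both elements come back as class Date; the fractional part is dropped
dates <- convert_to_date(mixed)
dates
class(dates)
```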
If a data.frame has the intended variable names stored in one of its rows, row_to_names() will elevate the specified row to become the names of the data.frame and optionally (by default) remove the row in which names were stored and/or the rows above it.
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
                   X_2 = c(NA, "Value", 4:6))
row_to_names(dirt, 2)
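The optional row removal can be sketched by switching off remove_rows_above, reusing the same example data. This is an illustration of the option, not an exhaustive tour of the arguments.

```r
library(janitor)

dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
                   X_2 = c(NA, "Value", 4:6))

# Default: the header row and everything above it are removed
default_result <- row_to_names(dirt, 2)

# Keep the rows above the header row; only the header row itself is removed
keep_above <- row_to_names(dirt, 2, remove_rows_above = FALSE)

nrow(default_result)
nrow(keep_above)
```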
Originally designed for use with Likert survey data stored as factors. Returns a tbl_df frequency table with appropriately-named rows, grouped into head/middle/tail groups.
f <- factor(c("strongly agree", "agree", "neutral", "neutral", "disagree", "strongly agree"),
            levels = c("strongly agree", "agree", "neutral", "disagree", "strongly disagree"))
top_levels(f)
top_levels(f, n = 1)