In sfirke/janitor: Simple Tools for Examining and Cleaning Dirty Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
options(width = 110)

janitor

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" (New York Times, 2014)

!Monthly Downloads !Downloads

janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can perform many of these tasks already, but with janitor they can do it faster and save their thinking for the fun stuff.

The main janitor functions:

perfectly format data.frame column names;
create and format frequency tables of one, two, or three variables - think an improved table(); and
provide other tools for cleaning and examining data.frames.

The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.

janitor is a #tidyverse-oriented package. Specifically, it plays nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

Installation

You can install:

the most recent officially-released version from CRAN with

install.packages("janitor")

the latest development version from GitHub with

# install.packages("remotes")
remotes::install_github("sfirke/janitor")
# or from r-universe
install.packages("janitor", repos = c("https://sfirke.r-universe.dev", "https://cloud.r-project.org"))

Using janitor

A full description of each function, organized by topic, can be found in janitor's catalog of functions vignette. There you will find functions not mentioned in this README, like compare_df_cols() which provides a summary of differences in column names and types when given a set of data.frames.

Below are quick examples of how janitor tools are commonly used.

Cleaning dirty data

Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx: All kinds of dirty.

Dirtiness includes:

A header at the top
Dreadful column names
Rows and columns containing Excel formatting but no data
Dates in two different formats in a single column (MM/DD/YYYY and numbers)
Values spread inconsistently over the "Certification" columns

Here's that data after being read in to R:

library(readxl)
library(janitor)
library(dplyr)
library(here)

roster_raw <- read_excel(here("dirty_data.xlsx")) # available at https://github.com/sfirke/janitor
glimpse(roster_raw)

Now, to clean it up, starting with the column names.

Name cleaning comes in two flavors. make_clean_names() operates on character vectors and can be used during data import:

roster_raw_cleaner <- read_excel(here("dirty_data.xlsx"),
  skip = 1,
  .name_repair = make_clean_names
)
glimpse(roster_raw_cleaner)

clean_names() is a convenience version of make_clean_names() that can be used for piped data.frame workflows. The equivalent steps with clean_names() would be:

roster_raw <- roster_raw %>%
  row_to_names(row_number = 1) %>%
  clean_names()

The data.frame now has clean names. Let's tidy it up further:

roster <- roster_raw %>%
  remove_empty(c("rows", "cols")) %>%
  remove_constant(na.rm = TRUE, quiet = FALSE) %>% # remove the column of all "Yes" values
  mutate(
    hire_date = convert_to_date(
      hire_date, # handle the mixed-format dates
      character_fun = lubridate::mdy
    ),
    cert = dplyr::coalesce(certification, certification_2)
  ) %>%
  select(-certification, -certification_2) # drop unwanted columns

roster

Examining dirty data

Finding duplicates

Use get_dupes() to identify and examine duplicate records during data cleaning. Let's see if any teachers are listed more than once:

roster %>% get_dupes(contains("name"))

Yes, some teachers appear twice. We ought to address this before counting employees.

Tabulating tools

A variable (or combinations of two or three variables) can be tabulated with tabyl(). The resulting data.frame can be tweaked and formatted with the suite of adorn_ functions for quick analysis and printing of pretty results in a report. adorn_ functions can be helpful with non-tabyls, too.

`tabyl()`

Like table(), but pipe-able, data.frame-based, and fully featured.

tabyl() can be called two ways:

On a vector, when tabulating a single variable: tabyl(roster$subject)
On a data.frame, specifying 1, 2, or 3 variable names to tabulate: roster %>% tabyl(subject, employee_status).
- Here the data.frame is passed in with the %>% pipe; this allows tabyl to be used in an analysis pipeline

One variable:

roster %>%
  tabyl(subject)

Two variables:

roster %>%
  filter(hire_date > as.Date("1950-01-01")) %>%
  tabyl(employee_status, full_time)

Three variables:

roster %>%
  tabyl(full_time, subject, employee_status, show_missing_levels = FALSE)

Adorning tabyls

The adorn_ functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:

roster %>%
  tabyl(employee_status, full_time) %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>%
  adorn_ns() %>%
  adorn_title("combined")

Pipe that right into knitr::kable() in your RMarkdown report.

These modular adornments can be layered to reduce R's deficit against Excel and SPSS when it comes to quick, informative counts. Learn more about tabyl() and the adorn_ functions from the tabyls vignette.