knitr::opts_chunk$set(collapse = T, comment = "#>")
Analysts do a lot of counting. Indeed, it's been said that "data science is mostly counting things." But the base R function for counting,
table(), leaves much to be desired:
tabyl() is an approach to tabulating variables that addresses these shortcomings. It's part of the janitor package because counting is such a fundamental part of data cleaning and exploration.
tabyl() is tidyverse-aligned and is primarily built upon the dplyr and tidyr packages.
On its surface,
tabyl() produces frequency tables using 1, 2, or 3 variables. Under the hood,
tabyl() also attaches a copy of these counts as an attribute of the resulting data.frame.
The result looks like a basic data.frame of counts, but because it's also a
tabyl containing this metadata, you can use
adorn_ functions to add additional information and pretty formatting.
adorn_ functions are built to work on
tabyls, but have been adapted to work with similar, non-tabyl data.frames that need formatting.
This vignette demonstrates
tabyl in the context of studying humans in the
starwars dataset from dplyr:
library(dplyr) humans <- starwars %>% filter(species == "Human")
Tabulating a single variable is the simplest kind of tabyl:
library(janitor) t1 <- humans %>% tabyl(eye_color) t1
NA values are present,
tabyl() also displays "valid" percentages, i.e., with missing values removed from the denominator. And while
tabyl() is built to take a data.frame and column names, you can also produce a one-way tabyl by calling it directly on a vector:
x <- c("big", "big", "small", "small", "small", NA) tabyl(x)
adorn_ helper functions are built for 2-way tabyls, but those that make sense for a 1-way tabyl do work:
t1 %>% adorn_totals("row") %>% adorn_pct_formatting()
This is often called a "crosstab" or "contingency" table. Calling
tabyl on two columns of a data.frame produces the same result as the common combination of
dplyr::count(), followed by
tidyr::pivot_wider() to wide form:
t2 <- humans %>% tabyl(gender, eye_color) t2
Since it's a
tabyl, we can enhance it with
adorn_ helper functions. For instance:
t2 %>% adorn_percentages("row") %>% adorn_pct_formatting(digits = 2) %>% adorn_ns()
Adornments have options to control axes, rounding, and other relevant formatting choices (more on that below).
table() accepts three variables, so does
tabyl(), producing a list of tabyls:
t3 <- humans %>% tabyl(eye_color, skin_color, gender) # the result is a tabyl of eye color x skin color, split into a list by gender t3
adorn_ helper functions are called on a list of data.frames - like the output of a three-way
tabyl call - they will call
purrr::map() to apply themselves to each data.frame in the list:
library(purrr) humans %>% tabyl(eye_color, skin_color, gender, show_missing_levels = FALSE) %>% adorn_totals("row") %>% adorn_percentages("all") %>% adorn_pct_formatting(digits = 1) %>% adorn_ns %>% adorn_title
This automatic mapping supports interactive data analysis that switches between combinations of 2 and 3 variables. That way, if a user starts with
humans %>% tabyl(eye_color, skin_color), adds some
adorn_ calls, then decides to split the tabulation by gender and modifies their first line to
humans %>% tabyl(eye_color, skin_color, gender), they don't have to rewrite the subsequent adornment calls to use
However, if feels more natural to call these with
lapply(), that is still supported. For instance,
t3 %>% lapply(adorn_percentages) would produce the same result as
t3 %>% adorn_percentages.
tabylwill show missing levels (levels not present in the data) in the result
NAvalues can be displayed or suppressed
tabylsprint without displaying row numbers
You can call
fisher.test() on a two-way tabyl to perform those statistical tests, just like on a base R
These modular functions build on a
tabyl to approximate the functionality of a PivotTable in Microsoft Excel. They print elegant results for interactive analysis or for sharing in a report, e.g., with
knitr::kable(). For example:
humans %>% tabyl(gender, eye_color) %>% adorn_totals(c("row", "col")) %>% adorn_percentages("row") %>% adorn_pct_formatting(rounding = "half up", digits = 0) %>% adorn_ns() %>% adorn_title("combined") %>% knitr::kable()
adorn_totals(): Add totals row, column, or both.
adorn_percentages(): Calculate percentages along either axis or over the entire tabyl
adorn_pct_formatting(): Format percentage columns, controlling the number of digits to display and whether to append the
adorn_rounding(): Round a data.frame of numbers (usually the result of
adorn_percentages), either using the base R
round()function or using janitor's
round_half_up()to round all ties up (thanks, StackOverflow).
adorn_rounding()returns columns of class
numeric, allowing for graphing, sorting, etc. It's a less-aggressive substitute for
adorn_pct_formatting(); these two functions should not be called together.
adorn_ns(): add Ns to a tabyl. These can be drawn from the tabyl's underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user.
adorn_title(): add a title to a tabyl (or other data.frame). Options include putting the column title in a new row on top of the data.frame or combining the row and column titles in the data.frame's first name slot.
These adornments should be called in a logical order, e.g., you probably want to add totals before percentages are calculated. In general, call them in the order they appear above.
You can also call
adorn_ functions on other data.frames, not only the results of calls to
mtcars %>% adorn_totals("col") %>% adorn_percentages("col") performs as expected, despite
mtcars not being a
This can be handy when you have a data.frame that is not a simple tabulation generated by
tabyl but would still benefit from the
adorn_ formatting functions.
A simple example: calculate the proportion of records meeting a certain condition, then format the results.
percent_above_165_cm <- humans %>% group_by(gender) %>% summarise(pct_above_165_cm = mean(height > 165, na.rm = TRUE), .groups = "drop") percent_above_165_cm %>% adorn_pct_formatting()
You can control which columns are adorned by using the
... argument. It accepts the tidyselect helpers. That is, you can specify columns the same way you would using
For instance, say you have a numeric column that should not be included in percentage formatting and you wish to exempt it. Here, only the
proportion column is adorned:
mtcars %>% count(cyl, gear) %>% rename(proportion = n) %>% adorn_percentages("col", na.rm = TRUE, proportion) %>% adorn_pct_formatting(,,,proportion) # the commas say to use the default values of the other arguments
Here we specify that only two consecutive numeric columns should be totaled (
year is numeric but should not be included):
cases <- data.frame( region = c("East", "West"), year = 2015, recovered = c(125, 87), died = c(13, 12) ) cases %>% adorn_totals(c("col", "row"), fill = "-", na.rm = TRUE, name = "Total Cases", recovered:died)
Here's a more complex example that uses a data.frame of means, not counts. We create a table containing the mean of a 3rd variable when grouped by two other variables, then use
adorn_ functions to round the values and append Ns. The first part is pretty straightforward:
library(tidyr) # for spread() mpg_by_cyl_and_am <- mtcars %>% group_by(cyl, am) %>% summarise(mpg = mean(mpg), .groups = "drop") %>% spread(am, mpg) mpg_by_cyl_and_am
adorn_ it. Since this is not the result of a
tabyl() call, it doesn't have the underlying Ns stored in the
core attribute, so we'll have to supply them:
mpg_by_cyl_and_am %>% adorn_rounding() %>% adorn_ns( ns = mtcars %>% # calculate the Ns on the fly by calling tabyl on the original data tabyl(cyl, am) ) %>% adorn_title("combined", row_name = "Cylinders", col_name = "Is Automatic")
If needed, Ns can be manipulated in their own data.frame before they are appended. Here a tabyl with values in the thousands has its Ns formatted to include the separating character
, as typically seen in American numbers, e.g.,
First we create the tabyl to adorn:
set.seed(1) raw_data <- data.frame(sex = rep(c("m", "f"), 3000), age = round(runif(3000, 1, 102), 0)) raw_data$agegroup = cut(raw_data$age, quantile(raw_data$age, c(0, 1/3, 2/3, 1))) comparison <- raw_data %>% tabyl(agegroup, sex, show_missing_levels = F) %>% adorn_totals(c("row", "col")) %>% adorn_percentages("col") %>% adorn_pct_formatting(digits = 1) comparison
At this point, the Ns are unformatted:
comparison %>% adorn_ns()
Now we format them to insert the thousands commas. A tabyl's raw Ns are stored in its
"core" attribute. Here we retrieve those with
attr(), then apply the base R function
format() to all numeric columns. Lastly, we append these Ns using
formatted_ns <- attr(comparison, "core") %>% # extract the tabyl's underlying Ns adorn_totals(c("row", "col")) %>% # to match the data.frame we're appending to dplyr::mutate_if(is.numeric, format, big.mark = ",") comparison %>% adorn_ns(position = "rear", ns = formatted_ns)
File an issue on GitHub if you have suggestions related to
tabyl() and its
adorn_ helpers or encounter problems while using them.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.