set.seed(23479)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 4)
Santoku is a package for cutting data into intervals. It provides chop(), a replacement for base R's cut() function, as well as several convenience functions to cut different kinds of intervals.
To install santoku, run:
install.packages("santoku")
Use chop() like cut(), to cut numeric data into intervals between a set of breaks.
library(santoku)
x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
data.frame(x, chopped)
chop() returns a factor.
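A minimal sketch to confirm this on a small vector. The seed and the name base_cut are illustrative only. Base R's cut() is run on the same breaks for comparison, with right = FALSE because cut() closes intervals on the right by default.

```r
library(santoku)

set.seed(1)                       # arbitrary seed, for illustration only
x <- runif(10, 0, 10)

chopped <- chop(x, breaks = 0:10)
is.factor(chopped)                # TRUE: the result is an ordinary factor
levels(chopped)                   # the interval labels

# base R comparison; cut() is right-closed unless right = FALSE
base_cut <- cut(x, breaks = 0:10, right = FALSE)
table(base_cut)
```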
If data lies beyond the limits of breaks, the breaks will be extended automatically:
chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)
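The difference from base R is easiest to see side by side. In this sketch the vector x is a hypothetical example chosen so that two values fall outside the breaks.

```r
library(santoku)

x <- c(1, 4, 6, 9)                 # 1 and 9 fall outside the breaks 3:7

base_cut <- cut(x, breaks = 3:7)   # base R: out-of-range values become NA
chopped  <- chop(x, breaks = 3:7)  # santoku: breaks are extended to cover them

anyNA(base_cut)   # TRUE
anyNA(chopped)    # FALSE
data.frame(x, base_cut, chopped)
```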
To chop a single number into a separate category, put the number twice in breaks:
x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)
To quickly produce a table of chopped data, use tab():
tab(1:10, c(2, 5, 8))
To chop into fixed-width intervals, starting at the minimum value, use chop_width():
chopped <- chop_width(x, 2)
data.frame(x, chopped)
To chop into a fixed number of intervals, each with the same width, use chop_evenly():
chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)
To chop into groups with a fixed number of members, use chop_n():
chopped <- chop_n(x, 4)
table(chopped)
To chop into a fixed number of groups, each with the same number of elements, use chop_equally():
chopped <- chop_equally(x, groups = 5)
table(chopped)
To chop data up by quantiles, use chop_quantiles():
chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
To chop data up by proportions of the data range, use chop_proportions():
chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)
You can think of these six functions as logically arranged in a table.

| To chop into... | Sizing intervals by... | |
|:------------------------------|:--------------------------|:----------------------|
| | number of elements: | interval width: |
| a specific number of equal intervals... | chop_equally() | chop_evenly() |
| intervals of one specific size... | chop_n() | chop_width() |
| intervals of different specific sizes... | chop_quantiles() | chop_proportions() |

: Different ways to chop by size
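The two columns of the table can be illustrated with base R alone. This is a sketch of the underlying idea, not santoku's implementation; the data and the names width_breaks and count_breaks are made up for the example.

```r
# Base R sketch (not santoku's implementation) of the two sizing ideas
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Sizing by interval width: three intervals of equal width
width_breaks <- seq(min(x), max(x), length.out = 4)
by_width <- cut(x, width_breaks, include.lowest = TRUE)

# Sizing by number of elements: three intervals holding equal counts
count_breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
by_count <- cut(x, count_breaks, include.lowest = TRUE)

table(by_width)   # counts are 8, 0, 1: equal widths, unequal group sizes
table(by_count)   # counts are 3, 3, 3: equal group sizes, unequal widths
```

With skewed data like this, the choice of column matters: equal-width intervals leave one interval empty, while equal-count intervals spread the data evenly.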
To chop data by standard deviations around the mean, use chop_mean_sd():
chopped <- chop_mean_sd(x)
data.frame(x, chopped)
To chop data into attractive intervals, use chop_pretty(). This selects intervals which are a multiple of 2, 5 or 10. It's useful for producing bar plots.
chopped <- chop_pretty(x)
data.frame(x, chopped)
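For intuition about what "attractive" breaks look like, base R's pretty() picks round values that cover a range. The sketch below uses only base R. That chop_pretty() builds its breaks in a similar way is an assumption here, not something the text above states.

```r
# Base R only: round break points covering the range of some data
x <- c(0.3, 2.2, 4.9, 7.1, 9.7)

pretty_breaks <- pretty(x)
pretty_breaks          # 0 2 4 6 8 10

# cutting at those breaks gives equal-width, easy-to-read intervals
table(cut(x, pretty_breaks, include.lowest = TRUE))
```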
tab_n(), tab_width(), and friends act similarly to tab(), calling the related chop_* function and then table() on the result.
tab_n(x, 4)
tab_width(x, 2)
tab_evenly(x, 5)
tab_mean_sd(x)
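A small check of that relationship, under the assumption that the tab_* functions pass the same defaults to the matching chop_* function. The seed and variable names are illustrative.

```r
library(santoku)

set.seed(1)                     # arbitrary seed, for illustration only
x <- runif(10, 0, 10)

via_tab   <- tab_width(x, 2)
via_table <- table(chop_width(x, 2))

# the interval counts should agree
as.vector(via_tab)
as.vector(via_table)
```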
By default, santoku labels intervals using mathematical notation:
* [0, 1] means all numbers between 0 and 1 inclusive.
* (0, 1) means all numbers strictly between 0 and 1, not including the endpoints.
* [0, 1) means all numbers between 0 and 1, including 0 but not 1.
* (0, 1] means all numbers between 0 and 1, including 1 but not 0.
* {0} means just the number 0.
To override these labels, provide names to the breaks argument:
chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)
Or, you can specify factor labels with the labels argument:
chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)
You need as many labels as there are intervals: one fewer than length(breaks) if your data doesn't extend beyond breaks, one more than length(breaks) if it extends beyond them at both ends.
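The label count can be checked on two small hypothetical vectors, one that stays inside the breaks and one that goes beyond them at both ends.

```r
library(santoku)

breaks <- c(2, 5, 8)

# data inside the breaks: length(breaks) - 1 = 2 intervals, so 2 labels
inside <- c(2, 3, 6, 7)
two_labels <- chop(inside, breaks, labels = c("Low", "Higher"))

# data beyond the breaks at both ends: length(breaks) + 1 = 4 intervals
beyond <- c(1, 3, 6, 9)
four_labels <- chop(beyond, breaks,
  labels = c("Lowest", "Low", "Higher", "Highest"))

data.frame(inside, two_labels)
data.frame(beyond, four_labels)
```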
To label intervals with a dash, use lbl_dash():
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)
To label integer data, use lbl_discrete(). It uses more informative right endpoints:
chopped <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)
You can customize the first or last labels:
chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)
To label intervals in order use lbl_seq():
chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)
You can use numerals or even roman numerals:
chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
chop(x, c(2, 5, 8), labels = lbl_seq("i."))
Other labelling functions include:
* lbl_endpoints() - use left endpoints as labels
* lbl_midpoints() - use interval midpoints as labels
* lbl_glue() - specify labels flexibly with the {glue} package
By default, chop() extends breaks if necessary. If you don't want that, set extend = FALSE:
chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)
Data outside the range of breaks will become NA.
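A short sketch of this on a hypothetical vector where the first and last values lie outside the breaks:

```r
library(santoku)

x <- c(1, 4, 6, 9)    # 1 and 9 lie outside the breaks
chopped <- chop(x, c(3, 5, 7), extend = FALSE)

is.na(chopped)        # TRUE FALSE FALSE TRUE
sum(is.na(chopped))   # 2 values were turned into NA
```

Counting the NAs like this is a cheap way to notice when breaks fail to cover the data.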
By default, intervals are closed on the left, i.e. they include their left endpoints. If you want right-closed intervals, set left = FALSE:
y <- 1:5
data.frame(
  y = y,
  left_closed = chop(y, 1:5),
  right_closed = chop(y, 1:5, left = FALSE)
)
By default, the last interval is closed on both ends. If you want to keep the last interval open at the end, set close_end = FALSE:
data.frame(
  y = y,
  end_closed = chop(y, 1:5),
  end_open = chop(y, 1:5, close_end = FALSE)
)
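One way to see what close_end does is to switch off break extension at the same time. In this sketch, which combines two arguments described in this document, one would expect the value 5 to fall outside the final interval when the end is left open, and so become NA.

```r
library(santoku)

y <- 1:5

# With extension switched off, close_end decides whether the final
# break (5) belongs to the last interval.
end_closed <- chop(y, 1:5, extend = FALSE)
end_open   <- chop(y, 1:5, extend = FALSE, close_end = FALSE)

data.frame(y, end_closed, end_open)
is.na(end_open)
```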
You can chop many kinds of vectors with santoku, including Date objects...
y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
  y2k = y2k,
  chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)
... and POSIXct (date-time) objects:
# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"),
  length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")
data.frame(
  crew_dragon = crew_dragon,
  chopped = chop(crew_dragon, c(liftoff, dock),
    labels = c("pre-flight", "flight", "docked"))
)
Note how santoku correctly handles the different timezones.
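This works because POSIXct values are stored as seconds since the epoch, so comparisons do not depend on the display timezone. The base R sketch below shows that property directly. That santoku relies on these ordinary comparisons is an assumption, and the name liftoff_gmt is introduced only for the example.

```r
# Base R only: two clock times in different zones, same instant
liftoff_ny  <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
liftoff_gmt <- as.POSIXct("2020-05-30 19:22", tz = "GMT")

liftoff_ny == liftoff_gmt        # TRUE: New York was 4 hours behind GMT
as.numeric(liftoff_ny)           # seconds since 1970-01-01 00:00 UTC

# comparisons therefore work across zones:
# 18:00 GMT is 14:00 in New York, before the 15:22 liftoff
as.POSIXct("2020-05-30 18:00", tz = "GMT") < liftoff_ny   # TRUE
```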
You can use chop_width() with objects from the lubridate package, to chop by irregular periods such as months:
library(lubridate)
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1))
)
You can format labels using format strings from strptime(). lbl_discrete() is useful here:
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1), labels = lbl_discrete(fmt = "%e %b"))
)
You can also chop vectors with units, using the units package:
library(units)
x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
  x = x,
  chopped = chop(x, br)
)
You should be able to chop anything that has a comparison operator. You can even chop character data using lexical ordering. By default santoku emits a warning in this case, to avoid accidentally misinterpreting results:
chop(letters[1:10], c("d", "f"))
If you find a type of data that you can't chop, please file an issue.