Introduction to santoku

set.seed(23479)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(digits = 4)

Introduction

Santoku is a package for cutting data into intervals. It provides chop(), a replacement for base R's cut() function, as well as several convenience functions to cut different kinds of intervals.

To install santoku, run:

install.packages("santoku")

Basic usage

Use chop() like cut(), to cut numeric data into intervals between a set of breaks.

library(santoku)

x <- runif(10, 0, 10)
(chopped <- chop(x, breaks = 0:10))
data.frame(x, chopped)

chop() returns a factor.

If data is beyond the limits of breaks, they will be extended automatically:

chopped <- chop(x, breaks = 3:7)
data.frame(x, chopped)

To chop a single number into a separate category, put the number twice in breaks:

x_fives <- x
x_fives[1:5] <- 5
chopped <- chop(x_fives, c(2, 5, 5, 8))
data.frame(x_fives, chopped)

To quickly produce a table of chopped data, use tab():

tab(1:10, c(2, 5, 8))

More ways to chop

To chop into fixed-width intervals, starting at the minimum value, use chop_width():

chopped <- chop_width(x, 2)
data.frame(x, chopped)

To chop into a fixed number of intervals, each with the same width, use chop_evenly():

chopped <- chop_evenly(x, intervals = 3)
data.frame(x, chopped)

To chop into groups with a fixed number of members, use chop_n():

chopped <- chop_n(x, 4)
table(chopped)

To chop into a fixed number of groups, each with the same number of elements, use chop_equally():

chopped <- chop_equally(x, groups = 5)
table(chopped)

To chop data up by quantiles, use chop_quantiles():

chopped <- chop_quantiles(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)

To chop data up by proportions of the data range, use chop_proportions():

chopped <- chop_proportions(x, c(0.25, 0.5, 0.75))
data.frame(x, chopped)

You can think of these six functions as logically arranged in a table.

To chop into... | Sizing intervals by... |
:------------------------------|:--------------------------|:----------------------   | number of elements: | interval width: a specific number of equal intervals... | chop_equally() | chop_evenly() intervals of one specific size... | chop_n() | chop_width() intervals of different specific sizes... | chop_quantiles() | chop_proportions()

: Different ways to chop by size

To chop data by standard deviations around the mean, use chop_mean_sd():

chopped <- chop_mean_sd(x)
data.frame(x, chopped)

To chop data into attractive intervals, use chop_pretty(). This selects intervals which are a multiple of 2, 5 or 10. It's useful for producing bar plots.

chopped <- chop_pretty(x)
data.frame(x, chopped)

tab_n(), tab_width(), and friends act similarly to tab(), calling the related chop_* function and then table() on the result.

tab_n(x, 4)
tab_width(x, 2)
tab_evenly(x, 5)
tab_mean_sd(x)

Specifying labels

By default, santoku labels intervals using mathematical notation:

To override these labels, provide names to the breaks argument:

chopped <- chop(x, c(Lowest = 1, Low = 2, Higher = 5, Highest = 8))
data.frame(x, chopped)

Or, you can specify factor labels with the labels argument:

chopped <- chop(x, c(2, 5, 8), labels = c("Lowest", "Low", "Higher", "Highest"))
data.frame(x, chopped)

You need as many labels as there are intervals - one fewer than length(breaks) if your data doesn't extend beyond breaks, one more than length(breaks) if it does.

To label intervals with a dash, use lbl_dash():

chopped <- chop(x, c(2, 5, 8), labels = lbl_dash())
data.frame(x, chopped)

To label integer data, use lbl_discrete(). It uses more informative right endpoints:

chopped  <- chop(1:10, c(2, 5, 8), labels = lbl_discrete())
chopped2 <- chop(1:10, c(2, 5, 8), labels = lbl_dash())
data.frame(x = 1:10, lbl_discrete = chopped, lbl_dash = chopped2)

You can customize the first or last labels:

chopped <- chop(x, c(2, 5, 8), labels = lbl_dash(first = "< 2", last = "8+"))
data.frame(x, chopped)

To label intervals in order use lbl_seq():

chopped <- chop(x, c(2, 5, 8), labels = lbl_seq())
data.frame(x, chopped)

You can use numerals or even roman numerals:

chop(x, c(2, 5, 8), labels = lbl_seq("(1)"))
chop(x, c(2, 5, 8), labels = lbl_seq("i."))

Other labelling functions include:

Specifying breaks

By default, chop() extends breaks if necessary. If you don't want that, set extend = FALSE:

chopped <- chop(x, c(3, 5, 7), extend = FALSE)
data.frame(x, chopped)

Data outside the range of breaks will become NA.

By default, intervals are closed on the left, i.e. they include their left endpoints. If you want right-closed intervals, set left = FALSE:

y <- 1:5
data.frame(
        y = y, 
        left_closed = chop(y, 1:5), 
        right_closed = chop(y, 1:5, left = FALSE)
      )

By default, the last interval is closed on both ends. If you want to keep the last interval open at the end, set close_end = FALSE:

data.frame(
  y = y,
  end_closed = chop(y, 1:5),
  end_open   = chop(y, 1:5, close_end = FALSE)
)

Chopping dates, times and other vectors

You can chop many kinds of vectors with santoku, including Date objects...

y2k <- as.Date("2000-01-01") + 0:10 * 7
data.frame(
  y2k = y2k,
  chopped = chop(y2k, as.Date(c("2000-02-01", "2000-03-01")))
)

... and POSIXct (date-time) objects:

# hours of the 2020 Crew Dragon flight:
crew_dragon <- seq(as.POSIXct("2020-05-30 18:00", tz = "GMT"), 
                     length.out = 24, by = "hours")
liftoff <- as.POSIXct("2020-05-30 15:22", tz = "America/New_York")
dock    <- as.POSIXct("2020-05-31 10:16", tz = "America/New_York")

data.frame(
  crew_dragon = crew_dragon,
  chopped = chop(crew_dragon, c(liftoff, dock), 
                   labels = c("pre-flight", "flight", "docked"))
)

Note how santoku correctly handles the different timezones.

You can use chop_width() with objects from the lubridate package, to chop by irregular periods such as months:

library(lubridate)
data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1))
)

You can format labels using format strings from strptime(). lbl_discrete() is useful here:

data.frame(
  y2k = y2k,
  chopped = chop_width(y2k, months(1), labels = lbl_discrete(fmt = "%e %b"))
)

You can also chop vectors with units, using the units package:

library(units)

x <- set_units(1:10 * 10, cm)
br <- set_units(1:3, ft)
data.frame(
  x = x,
  chopped = chop(x, br)
)

You should be able to chop anything that has a comparison operator. You can even chop character data using lexical ordering. By default santoku emits a warning in this case, to avoid accidentally misinterpreting results:

chop(letters[1:10], c("d", "f"))

If you find a type of data that you can't chop, please file an issue.



Try the santoku package in your browser

Any scripts or data that you put into this service are public.

santoku documentation built on Oct. 12, 2023, 5:13 p.m.