factors: A cheaper version of 'factor()' along with cheaper utilities
In cheapr: Simple Functions to Save Time and Memory

factor_

R Documentation

Description

A fast version of factor() using the collapse package.

There are also some additional utilities, most of which begin with the prefix 'levels_':

`as_factor()`	An efficient way to coerce both vectors and factors to a factor.
`levels_factor()`	Returns the levels of a factor, as a factor.
`levels_used()`	Returns the used levels of a factor.
`levels_unused()`	Returns the unused levels of a factor.
`levels_add()`	Adds the specified levels onto the existing levels.
`levels_rm()`	Removes the specified levels.
`levels_add_na()`	Adds an explicit `NA` level.
`levels_drop_na()`	Drops the `NA` level.
`levels_drop()`	Drops unused factor levels.
`levels_rename()`	Renames levels.
`levels_lump()`	Returns the top n levels and lumps all others into the same category.
`levels_count()`	Returns the counts of each level.
`levels_reorder()`	Reorders the levels of `x` based on `y`, using the ordered median values of `y` for each level.
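
For orientation, the idea behind a few of these utilities can be sketched in base R. This is only an illustration of the concepts (used versus unused levels, dropping unused levels, an explicit NA level) and is not how cheapr implements them; the object names are made up for the example.

```r
# A factor with four declared levels, only two of which occur in the data
f <- factor(c("a", "c", "a"), levels = c("a", "b", "c", "d"))

# Count how often each level occurs (zero for unused levels)
counts <- tabulate(as.integer(f), nbins = nlevels(f))

# Used and unused levels, as character vectors
used <- levels(f)[counts > 0]
unused <- levels(f)[counts == 0]

# Dropping unused levels keeps only the levels that occur
dropped <- droplevels(f)

# An explicit NA level: NA becomes a level rather than a missing value
g <- addNA(factor(c("a", NA, "b")))
```

The cheapr functions described above cover the same ground while aiming to use less time and memory.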

Usage

factor_(
  x = integer(),
  levels = NULL,
  order = TRUE,
  na_exclude = TRUE,
  ordered = is.ordered(x)
)

as_factor(x)

levels_factor(x)

levels_used(x)

levels_unused(x)

levels_rm(x, levels)

levels_add(x, levels, where = c("last", "first"))

levels_add_na(x, name = NA, where = c("last", "first"))

levels_drop_na(x)

levels_drop(x)

levels_reorder(x, order_by, decreasing = FALSE)

levels_rename(x, ..., .fun = NULL)

levels_lump(
  x,
  n,
  prop,
  other_category = "Other",
  ties = c("min", "average", "first", "last", "random", "max")
)

levels_count(x)

Arguments

`x`	A vector.
`levels`	Optional factor levels.
`order`	Should factor levels be sorted? Default is `TRUE`. It is typically faster to set this to `FALSE`, in which case the levels are ordered by first appearance.
`na_exclude`	Should `NA` values be excluded from the factor levels? Default is `TRUE`.
`ordered`	Should the result be an ordered factor?
`where`	Where should the new levels (or the `NA` level) be placed? Either first or last.
`name`	Name of `NA` level.
`order_by`	A vector to order the levels of `x` by using the medians of `order_by`.
`decreasing`	Should the reordered levels be in decreasing order? Default is `FALSE`.
`...`	Key-value pairs where the key is the new name and the value is the existing level name to be replaced. For example `levels_rename(x, new = old)` replaces the level "old" with the level "new".
`.fun`	Renaming function applied to each level.
`n`	Number of top levels to keep; all other levels are lumped together.
`prop`	Top proportion of levels to calculate. This is a proportion of the total unique levels in x.
`other_category`	Name of 'other' category.
`ties`	Ties method to use. See `?rank`.
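
The meaning of `order`, `order_by` and `decreasing` can be illustrated with a small base R sketch. It mimics the documented semantics only (sorted levels versus first appearance, and levels ordered by per-level medians); it is not cheapr's implementation, and the variable names are hypothetical.

```r
x <- c("b", "a", "c", "b", "a", "c")
y <- c(10, 1, 5, 12, 3, 7)

# Analogue of order = TRUE: levels are sorted
sorted_levels <- levels(factor(x))

# Analogue of order = FALSE: levels follow first appearance
appearance_levels <- levels(factor(x, levels = unique(x)))

# Analogue of order_by: compute the median of y within each level of x
# (a = 2, b = 11, c = 6), then order the levels by those medians
medians <- vapply(split(y, x), median, numeric(1))
reordered <- factor(x, levels = names(sort(medians)))

# Analogue of decreasing = TRUE: largest median first
reordered_desc <- factor(x, levels = names(sort(medians, decreasing = TRUE)))
```

Here the sorted levels are a, b, c, the first-appearance levels are b, a, c, and ordering by median gives a, c, b (or b, c, a when decreasing).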

Details

This operates similarly to collapse::qF().
The main internal difference is that collapse::funique() is used, and therefore S3 methods can be written for it.
Furthermore, for date-times, factor_() differs in that it distinguishes all instants in time, whereas factor() distinguishes only calendar (clock) times. Using a daylight saving example where the clocks go back:
factor(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5)) produces 4 levels whereas
factor_(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5)) produces 5 levels.
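
The base R side of this example can be checked without cheapr. The sketch below shows why factor() ends up with 4 levels: the five instants are distinct as numbers, but two of them share the same clock time once the clocks go back. It assumes the "Europe/London" time zone is available on the system, and uses .POSIXct() only so that it also runs on older R versions where as.POSIXct() on a number requires an origin.

```r
# Five instants one hour apart, spanning the end of British Summer Time
# (same instants as as.POSIXct(1729984360, tz = "Europe/London") + 3600 * (1:5))
x <- .POSIXct(1729984360, tz = "Europe/London") + 3600 * (1:5)

# As numbers (seconds since the epoch) all five instants are distinct
n_instants <- length(unique(as.numeric(x)))

# As clock times they are not: 01:12:40 occurs twice,
# once in summer time and once after the clocks have gone back
clock_times <- format(x, "%Y-%m-%d %H:%M:%S")
n_clock_times <- length(unique(clock_times))

# Base factor() builds its levels from the character representation
n_base_levels <- nlevels(factor(x))
```

factor_() keeps one level per instant, so it reports 5 levels for the same input, as stated above.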

levels_lump() is a cheaper version of forcats::fct_lump_n() but returns levels in order of highest frequency to lowest. This can be very useful for plotting.
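
As a rough illustration of what lumping means, here is a base R sketch that keeps the `n` most frequent values and collapses the rest into one category. It is not cheapr's implementation: it ignores the `ties` and `prop` arguments, and placing the "Other" level last is an assumption made for this example.

```r
x <- c("a", "a", "a", "b", "b", "c", "d")
n <- 2

# Frequency of each value, most frequent first
counts <- sort(table(x), decreasing = TRUE)

# The n most frequent values ("a" and "b" here)
top <- names(counts)[seq_len(n)]

# Keep the top values, replace everything else with "Other",
# and order the levels from most to least frequent
lumped <- factor(ifelse(x %in% top, x, "Other"), levels = c(top, "Other"))

table(lumped)
```

Because the levels already run from most to least frequent, a bar chart of `lumped` is ordered without any further sorting.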

Value

A factor, or a character vector in the case of levels_used() and levels_unused(). levels_count() returns a data frame of counts and proportions for each level.

Examples

library(cheapr)

x <- factor_(sample(letters[sample.int(26, 10)], 100, TRUE), levels = letters)
x
# Used/unused levels

levels_used(x)
levels_unused(x)

# Drop unused levels
levels_drop(x)

# Top 3 letters by frequency
lumped_letters <- levels_lump(x, 3)
levels_count(lumped_letters)

# To remove the "other" category, use `levels_rm()`

levels_count(levels_rm(lumped_letters, "Other"))

# We can use levels_lump to create a generic top n function for non-factors too

get_top_n <- function(x, n){
  f <- levels_lump(factor_(x, order = FALSE), n = n)
  levels_count(f)
}

get_top_n(x, 3)

# A neat way to order the levels of a factor by frequency
# is the following:

levels(levels_lump(x, prop = 1)) # Highest to lowest
levels(levels_lump(x, prop = -1)) # Lowest to highest

cheapr documentation built on June 8, 2025, 11:35 a.m.