discretize: Discretize Numeric Variables


discretize() converts a numeric vector into a factor with bins having approximately the same number of data points (based on a training set).


discretize(x, ...)

## Default S3 method:
discretize(x, ...)

## S3 method for class 'numeric'
discretize(
  x,
  cuts = 4,
  labels = NULL,
  prefix = "bin",
  keep_na = TRUE,
  infs = TRUE,
  min_unique = 10,
  ...
)

## S3 method for class 'discretize'
predict(object, new_data, ...)



x: A numeric vector.

...: Options to pass to stats::quantile() that should not include x or probs.

cuts: An integer defining how many cuts to make of the data.

labels: A character vector defining the factor levels that will be in the new factor (from smallest to largest). This should have length cuts+1 and should not include a level for missing values (see keep_na below).

prefix: A single character value to be used as a prefix for the factor levels (e.g. bin1, bin2, ...). If the string is not a valid R name, it is coerced to one. If prefix = NULL, the factor levels are labelled according to the output of cut().

keep_na: A logical for whether a factor level should be created to identify missing values in x. If keep_na is set to TRUE, then na.rm = TRUE is used when calling stats::quantile().

infs: A logical indicating whether the smallest and largest cut points should be infinite.

min_unique: An integer defining the minimum sample size needed for binning. If (the number of unique values)/(cuts+1) is less than min_unique, no discretization takes place.

object: An object of class discretize.

new_data: A new numeric object to be binned.
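
Two of the rules in the argument descriptions above can be checked by hand. The following is a minimal base R sketch, not the package's own code: the toy vector is made up, and the use of make.names() as the coercion to a valid R name is an assumption made for illustration.

```r
# Toy vector with 7 distinct values (made up for illustration).
x <- c(1, 2, 2, 3, 5, 8, 13, 21)
cuts <- 4
min_unique <- 10

# Ratio compared against min_unique: 7 / (4 + 1) = 1.4
ratio <- length(unique(x)) / (cuts + 1)
too_few <- ratio < min_unique
too_few  # TRUE, so under the stated rule no binning would take place

# Assumption: make.names() stands in for the coercion to a valid R name.
coerced <- make.names("maybe a bad idea to bin")
coerced
```

With the default min_unique = 10 and cuts = 4, the stated rule implies that x needs at least 50 unique values before any binning happens.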


discretize() estimates the cut points from x using percentiles. For example, if cuts = 4, the function estimates the quartiles of x and uses these as the cut points. If cuts = 2, the bins are defined as being above or below the median of x.
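
The percentile idea can be illustrated with base R alone. This is a rough sketch on made-up data rather than the function's internals: it computes the quartiles with stats::quantile() and passes them to cut(), which gives bins of equal size.

```r
# Toy data: the integers 1 to 20 (made up for illustration).
x <- as.numeric(1:20)

# Quartiles of x serve as the interior cut points.
quartiles <- stats::quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

# Pad with infinite end points so that every value falls into a bin.
breaks <- c(-Inf, quartiles, Inf)

binned <- cut(x, breaks = breaks, labels = paste0("bin", 1:4))
counts <- table(binned)
counts  # each of the four bins holds 5 values
```

With ties or small samples the bin counts are only approximately equal, which is why the description speaks of approximately the same number of data points.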

The predict method can then be used to turn numeric vectors into factor vectors.
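
The estimate-then-apply pattern behind the predict method can be sketched as follows. The helpers fit_bins() and apply_bins() and the argument n_bins are hypothetical names introduced only for this illustration; they are not part of the package.

```r
# Hypothetical helper: estimate break points from training data.
fit_bins <- function(x, n_bins = 4) {
  probs <- seq(0, 1, length.out = n_bins + 1)
  breaks <- stats::quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  # Open the outer bins so that any new value can be assigned.
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf
  unique(breaks)
}

# Hypothetical helper: apply stored break points to new values.
apply_bins <- function(breaks, new_data) {
  labels <- paste0("bin", seq_len(length(breaks) - 1))
  cut(new_data, breaks = breaks, labels = labels)
}

train <- c(2, 4, 6, 8, 10, 12, 14, 16)
breaks <- fit_bins(train, n_bins = 2)   # -Inf, 9 (the median), Inf
new_bins <- apply_bins(breaks, c(1, 11, 100))
new_bins
```

The point of the split is that the break points come only from the training data, so new values are binned on the same scale without being used to re-estimate the cut points.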

If keep_na = TRUE, a suffix of "_missing" is used as a factor level (see the examples below).
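
A dedicated level for missing values can be mimicked in base R as shown below. This is only a sketch: the level name "bin_missing" assumes that the suffix is appended to the prefix, and the position of that level among the others is chosen arbitrarily here.

```r
# Toy data with one missing value (made up for illustration).
x <- c(1.5, NA, 7.2, 3.3)

# Ordinary binning leaves the missing value as NA.
binned <- cut(x, breaks = c(-Inf, 3.3, Inf), labels = c("bin1", "bin2"))

# Assumption: the missing level is named by appending "_missing" to the prefix.
binned_chr <- as.character(binned)
binned_chr[is.na(binned_chr)] <- "bin_missing"
with_missing <- factor(binned_chr, levels = c("bin_missing", "bin1", "bin2"))
with_missing
```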

If infs = FALSE and a new value is greater than the largest value of x, a missing value will result.
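
The effect of finite versus infinite end points can be reproduced with cut() directly. The sketch below uses made-up training values and plain base R, so it illustrates the behavior described above rather than the package's code.

```r
train <- c(10, 20, 30, 40, 50)

# Finite end points: the training minimum, median, and maximum (10, 30, 50).
finite_breaks <- stats::quantile(train, probs = c(0, 0.5, 1), names = FALSE)

# Infinite end points: only the interior cut point is kept.
infinite_breaks <- c(-Inf, finite_breaks[2], Inf)

new_values <- c(25, 60)

with_finite <- cut(new_values, breaks = finite_breaks,
                   labels = c("bin1", "bin2"), include.lowest = TRUE)
with_infinite <- cut(new_values, breaks = infinite_breaks,
                     labels = c("bin1", "bin2"))

is.na(with_finite)    # FALSE TRUE: 60 is above the training maximum
is.na(with_infinite)  # FALSE FALSE
```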


discretize returns an object of class discretize and predict.discretize returns a factor vector.


data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

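# Median split; by default the outer cut points are infinite
discretize(biomass_tr$carbon, cuts = 2)
# Finite outer cut points
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE)
# No separate level for missing values
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE, keep_na = FALSE)
# This prefix is not a valid R name, so it is coerced to one
discretize(biomass_tr$carbon, cuts = 2, prefix = "maybe a bad idea to bin")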

carbon_binned <- discretize(biomass_tr$carbon)
table(predict(carbon_binned, biomass_tr$carbon))

carbon_no_infs <- discretize(biomass_tr$carbon, infs = FALSE)
predict(carbon_no_infs, c(50, 100))

recipes documentation built on Aug. 26, 2023, 1:08 a.m.