optbin: Optimal Binning function
In OneR: One Rule Machine Learning Classification Algorithm with Enhancements


Discretizes all numerical data in a data frame into categorical bins whose cut points are optimally aligned with the target categories, so that each numeric column is returned as a factor. When building a OneR model, this can result in fewer rules with enhanced accuracy.

optbin(x, ...)

## S3 method for class 'formula'
optbin(formula, data, method = c("logreg", "infogain",
  "naive"), na.omit = TRUE, ...)

## S3 method for class 'data.frame'
optbin(x, method = c("logreg", "infogain", "naive"),
  na.omit = TRUE, ...)

`x`	data frame with the last column containing the target variable.
`...`	arguments passed to or from other methods.
`formula`	formula; the argument `data` is also required.
`data`	data frame containing the data; only needed when using the formula interface.
`method`	character string specifying the method for optimal binning, see 'Details'; can be abbreviated.
`na.omit`	logical value indicating whether instances with missing values should be removed.

The cut points are calculated by pairwise logistic regressions (method "logreg"), by information gain (method "infogain"), or as the means of the expected values of the respective classes (method "naive"). The function is likely to give unsatisfactory results when the distributions of the respective classes are not (linearly) separable. Method "naive" should only be used when the distributions are (approximately) normal; in that case "logreg" should give comparable results, which makes "logreg" the preferable (and therefore default) method.
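As a rough illustration of the "naive" idea, the following base R sketch places a cut point midway between each pair of adjacent class means. The helper `naive_cuts` is hypothetical and not part of OneR; the package's actual computation may differ in detail.

```r
# Hypothetical helper (not part of OneR): midpoints between sorted class means.
naive_cuts <- function(x, target) {
  # Mean of the numeric variable within each class, sorted ascending
  class_means <- sort(vapply(split(x, target), mean, numeric(1)))
  # Midpoint between each pair of neighbouring class means
  unname((head(class_means, -1) + tail(class_means, -1)) / 2)
}

cuts <- naive_cuts(iris$Petal.Length, iris$Species)

# Bin the variable at those cut points and compare the bins with the target
binned <- cut(iris$Petal.Length, breaks = c(-Inf, cuts, Inf))
table(binned, iris$Species)
```

With three classes this yields two cut points and therefore three bins, one per class, which is why the approach only makes sense when the class distributions overlap little.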

Method "infogain" is an entropy-based method which calculates cut points based on information gain. The idea is that uncertainty is minimized by making the resulting bins as pure as possible. This is the standard method of many decision tree algorithms.
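To make the information gain criterion concrete, here is a minimal base R sketch that searches for the single best binary cut point of one numeric variable. The functions `entropy`, `info_gain` and `best_cut` are illustrative names, not OneR functions, and the package may search for several cut points rather than one.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                      # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of splitting y by the rule x <= cut
info_gain <- function(x, y, cut) {
  left  <- y[x <= cut]
  right <- y[x > cut]
  weighted <- (length(left) * entropy(left) +
               length(right) * entropy(right)) / length(y)
  entropy(y) - weighted
}

# Try every midpoint between neighbouring distinct values and keep the best
best_cut <- function(x, y) {
  v <- sort(unique(x))
  candidates <- (head(v, -1) + tail(v, -1)) / 2
  gains <- vapply(candidates, function(cp) info_gain(x, y, cp), numeric(1))
  candidates[which.max(gains)]
}

cut_point <- best_cut(iris$Petal.Length, iris$Species)
cut_point
```

For `Petal.Length` the chosen cut falls in the gap between the setosa flowers and the other two species, because that split produces one completely pure bin.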

Character strings and logical values are coerced into factors. Matrices are coerced into data frames. If the target is numeric, it is turned into a factor with one level per distinct value, and a warning is given.
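The effect of these coercions can be reproduced in base R. The sketch below uses a small made-up data frame and only mirrors the behaviour described here; it is not taken from the package source.

```r
# Made-up example data: numeric, logical and character columns, numeric target
df <- data.frame(
  size   = c(1.2, 3.4, 2.2, 5.1),
  flag   = c(TRUE, FALSE, TRUE, TRUE),
  colour = c("red", "blue", "red", "green"),
  target = c(0, 1, 0, 2),
  stringsAsFactors = FALSE
)

# Character and logical columns become factors
to_factor <- vapply(df, function(col) is.character(col) || is.logical(col),
                    logical(1))
df[to_factor] <- lapply(df[to_factor], factor)

# A numeric target becomes a factor with one level per distinct value
df$target <- factor(df$target)
nlevels(df$target)
```

A numeric target with many distinct values would produce just as many levels, which is presumably why the function warns in that situation.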

When "na.omit = FALSE" an additional level "NA" is added to each factor with missing values. If the target contains unused factor levels (e.g. due to subsetting) these are ignored and a warning is given.
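Both behaviours have close base R analogues, shown in the sketch below on toy data. It assumes that `addNA()` and `droplevels()` are fair stand-ins for what is described here; the package may implement these steps differently.

```r
# Missing values kept as their own level, as with na.omit = FALSE
f <- factor(c("a", NA, "b", "a"))
f_na <- addNA(f)
levels(f_na)                         # NA now appears among the levels
table(f_na, useNA = "ifany")

# Unused target levels left over after subsetting
sub <- iris[iris$Species != "setosa", ]
levels(sub$Species)                  # still lists all three species
target <- droplevels(sub$Species)
levels(target)                       # unused level removed
```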

A data frame with the target variable in the last column.

formula: method for formulas.
data.frame: method for data frames.

Holger von Jouanne-Diedrich

https://github.com/vonjd/OneR

OneR, bin

library(OneR)

data <- iris # without optimal binning
model <- OneR(data, verbose = TRUE)
summary(model)

data_opt <- optbin(iris) # with optimal binning
model_opt <- OneR(data_opt, verbose = TRUE)
summary(model_opt)

## The same with the formula interface:
data_opt <- optbin(Species ~ ., data = iris)
model_opt <- OneR(data_opt, verbose = TRUE)
summary(model_opt)