This function implements several basic unsupervised methods to convert a continuous variable into a categorical variable (factor) using different binning strategies. For convenience, a whole data.frame can be discretized (i.e., all numeric columns are discretized).
x 
a numeric vector (continuous variable). 
method 
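discretization method. Available are: "interval" (equal interval width), "frequency" (equal frequency, the default), "cluster" (k-means clustering) and "fixed" (user-specified interval boundaries given in breaks).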
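breaks, categories 
either the number of categories or, as used with method "fixed", a vector with the interval boundaries.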
labels 
character vector; labels for the levels of the resulting category. By default, labels are constructed using "(a,b]" interval notation. If 
include.lowest 
logical; should the first interval be closed to the left? 
right 
logical; should the intervals be closed on the right (and open on the left) or vice versa? 
dig.lab 
integer; number of digits used to create labels. 
ordered_result 
logical; return an ordered factor? 
infinity 
logical; should the first/last break boundary be changed to +/-Inf? 
onlycuts 
logical; return only computed interval boundaries? 
... 
for method "cluster", further arguments are passed on to kmeans.
df 
data.frame; each numeric column in the data.frame is discretized. 
methods 
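named list of lists or a data.frame; the named list contains lists of discretization parameters (see parameters of discretize) for individual columns. If a data.frame is supplied, its discretization scheme is applied to df.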
default 
named list; parameters for discretize that are applied to all columns not specified in methods.
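The arguments right, include.lowest and dig.lab share their names with arguments of base R's cut(). The following is a minimal sketch of how these arguments behave in cut() itself; that discretize() treats them the same way is an assumption, and its defaults may differ.

```r
# Base R only; the values and breaks are made up for illustration.
x <- c(1, 2, 3, 4, 5)
b <- c(1, 3, 5)

# right = TRUE: intervals are (a,b], so the minimum value 1 falls outside
r1 <- cut(x, breaks = b, right = TRUE)

# include.lowest = TRUE closes the first interval on the left: [1,3]
r2 <- cut(x, breaks = b, right = TRUE, include.lowest = TRUE)

# right = FALSE: intervals are [a,b); include.lowest = TRUE now closes
# the last interval on the right instead: [3,5]
r3 <- cut(x, breaks = b, right = FALSE, include.lowest = TRUE)

levels(r2)
levels(r3)
```

In cut(), a value exactly on the outermost boundary becomes NA unless include.lowest = TRUE.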
discretize only implements unsupervised discretization. See packages arulesCBA, discretization or RWeka for supervised discretization.
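To make the unsupervised binning strategies concrete, here is a rough base R sketch of how equal-width, equal-frequency and k-means boundaries could be computed by hand. It is not the implementation used by discretize(). In particular, placing the cluster boundaries halfway between neighboring cluster centers is an assumption made for illustration.

```r
x <- iris[, 1]   # Sepal.Length from the built-in iris data
k <- 3           # number of categories

# Equal interval width: k bins of the same width across the range of x
b_interval <- seq(min(x), max(x), length.out = k + 1)

# Equal frequency: boundaries at the 0, 1/k, ..., 1 quantiles
b_frequency <- quantile(x, probs = seq(0, 1, length.out = k + 1),
                        names = FALSE)

# Clustering: 1-D k-means; boundaries are placed halfway between
# neighboring (sorted) cluster centers (an illustrative choice)
set.seed(1)
centers <- sort(kmeans(x, centers = k)$centers[, 1])
b_cluster <- c(min(x), head(centers, -1) + diff(centers) / 2, max(x))

# Apply each set of boundaries with cut()
t_interval  <- table(cut(x, breaks = b_interval,  include.lowest = TRUE))
t_frequency <- table(cut(x, breaks = b_frequency, include.lowest = TRUE))
t_cluster   <- table(cut(x, breaks = b_cluster,   include.lowest = TRUE))
```

None of these strategies uses a class label, which is what makes them unsupervised.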
discretizeDF applies discretization to each numeric column. Individual discretization parameters can be specified in the form: methods = list(column_name1 = list(method = ,...), column_name2 = list(...)). If no discretization method is specified for a column, then the discretization in default is applied (NULL invokes the default method in discretize()). The special method "none" can be specified to suppress discretization for a column.
A factor representing the categorized continuous variable with attributes "discretized:breaks" indicating the used breaks and "discretized:method" giving the used method. If onlycuts = TRUE is used, a vector with the calculated interval boundaries is returned.
discretizeDF returns a discretized data.frame.
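As a short sketch of how the return value described above might be inspected, the following reads the two attributes with attr() and compares them with the onlycuts = TRUE result. It assumes that discretize() is provided by the arules package and that the package is installed.

```r
library(arules)  # assumption: discretize() comes from arules

x <- iris[, 1]
d <- discretize(x, breaks = 3)

# Attributes attached to the resulting factor
attr(d, "discretized:breaks")   # interval boundaries that were used
attr(d, "discretized:method")   # name of the method that was used

# onlycuts = TRUE returns the interval boundaries instead of a factor
cuts <- discretize(x, breaks = 3, onlycuts = TRUE)
```

Keeping the boundaries around is useful when the same binning has to be applied to other data later.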
Michael Hahsler
library(arules) # discretize() and discretizeDF() are provided by arules
data(iris)
x <- iris[, 1]
### look at the distribution before discretizing
hist(x, breaks = 20, main = "Data")
def.par <- par(no.readonly = TRUE) # save default
layout(mat = rbind(1:2,3:4))
### convert continuous variables into categories (there are 3 types of flowers)
### the default method is equal frequency
table(discretize(x, breaks = 3))
hist(x, breaks = 20, main = "Equal Frequency")
abline(v = discretize(x, breaks = 3,
  onlycuts = TRUE), col = "red")
# Note: the frequencies are not exactly equal because of ties in the data
### equal interval width
table(discretize(x, method = "interval", breaks = 3))
hist(x, breaks = 20, main = "Equal Interval length")
abline(v = discretize(x, method = "interval", breaks = 3,
  onlycuts = TRUE), col = "red")
### k-means clustering
table(discretize(x, method = "cluster", breaks = 3))
hist(x, breaks = 20, main = "K-Means")
abline(v = discretize(x, method = "cluster", breaks = 3,
  onlycuts = TRUE), col = "red")
### user-specified (with labels)
table(discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf),
  labels = c("small", "large")))
hist(x, breaks = 20, main = "Fixed")
abline(v = discretize(x, method = "fixed", breaks = c(-Inf, 6, Inf),
  onlycuts = TRUE), col = "red")
par(def.par) # reset to default
### prepare the iris data set for association rule mining
### use default discretization
irisDisc <- discretizeDF(iris)
head(irisDisc)
### discretize all numeric columns differently
irisDisc <- discretizeDF(iris, default = list(method = "interval", breaks = 2,
  labels = c("small", "large")))
head(irisDisc)
### specify discretization for the petal columns and don't discretize the others
irisDisc <- discretizeDF(iris, methods = list(
  Petal.Length = list(method = "frequency", breaks = 3,
    labels = c("short", "medium", "long")),
  Petal.Width = list(method = "frequency", breaks = 2,
    labels = c("narrow", "wide"))
  ),
  default = list(method = "none")
)
head(irisDisc)
### discretize new data using the same discretization scheme as the
### data.frame supplied in methods. Note: NAs may occur if a new
### value falls outside the range of values observed in the
### originally discretized table (use argument infinity = TRUE in
### discretize to prevent this case.)
discretizeDF(iris[sample(1:nrow(iris), 5),], methods = irisDisc)
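The note about NAs for out-of-range values can be illustrated with base R's cut() alone. This is a sketch of the general effect, not of discretize() itself: the toy vectors train and new are made up, and widening the outer boundaries by hand only mimics what infinity = TRUE is described as doing.

```r
train <- c(4.3, 5.0, 5.8, 6.4, 7.9)   # hypothetical original data
new   <- c(4.0, 6.0, 8.5)             # 4.0 and 8.5 lie outside range(train)

# Boundaries derived from the original data only
b <- c(min(train), 6, max(train))
out_finite <- cut(new, breaks = b, include.lowest = TRUE)
out_finite   # values outside [4.3, 7.9] become NA

# Replace the first and last boundary with -Inf and Inf
b_inf <- b
b_inf[c(1, length(b_inf))] <- c(-Inf, Inf)
out_inf <- cut(new, breaks = b_inf, include.lowest = TRUE)
out_inf      # every value now falls into some interval
```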
