bin: Binning function


Description

Discretizes all numerical data in a data frame into categorical bins of equal length, of equal content, or based on automatically determined clusters.

Usage

bin(data, nbins = 5, labels = NULL, method = c("length", "content",
  "clusters"), na.omit = TRUE)

Arguments

data

data frame or vector which contains the data.

nbins

number of bins (= levels).

labels

character vector of labels for the resulting categories.

method

character string specifying the binning method, see 'Details'; can be abbreviated.

na.omit

logical value indicating whether instances with missing values should be removed.
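
The behaviour of "method" and "labels" can be illustrated with base R alone. The following is a minimal sketch, not the package's code: "pick_method" is a hypothetical helper, and it assumes that abbreviation of "method" works the way base R's match.arg() resolves unambiguous prefixes. The second part uses cut() as a stand-in to show how custom labels replace interval names, one label per bin.

```r
# Hypothetical helper, only to illustrate partial matching of `method`.
pick_method <- function(method = c("length", "content", "clusters")) {
  match.arg(method)
}

pick_method()           # no argument: falls back to the first choice
pick_method("cont")     # unambiguous prefix of "content"
pick_method("cluster")  # unambiguous prefix of "clusters"

# Custom labels replace the interval names, one label per bin.
# cut() is used here as a base-R stand-in for equal-length binning.
x <- c(4.3, 5.0, 5.8, 6.4, 7.9)
labelled <- cut(x, breaks = 3, labels = c("small", "medium", "large"))
labelled
```

Under this assumption a prefix such as "c" would be rejected, because it matches both "content" and "clusters".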

Details

Character strings and logical values are coerced into factors. Matrices are coerced into data frames. When called with a single vector, only the respective factor (and not a data frame) is returned. Method "length" gives intervals of equal length; method "content" gives intervals of equal content (via quantiles). Method "clusters" determines "nbins" clusters via 1D kmeans with deterministic seeding of the initial cluster centres (Jenks natural breaks optimization).
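
The three methods can be approximated in base R to see how they differ. This is a sketch under stated assumptions, not the package's implementation: equal length is shown with cut(), equal content with quantile() break points, and the cluster variant with kmeans() started from fixed, quantile-based centres. The seeding rule and the midpoint break points are illustrative choices and may differ from what bin() does internally.

```r
nbins <- 3

# "length": nbins intervals of equal width across the range of the data.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
by_length <- cut(x, breaks = nbins)

# "content": break points at quantiles, so bins hold similar counts.
q <- quantile(x, probs = seq(0, 1, length.out = nbins + 1))
by_content <- cut(x, breaks = q, include.lowest = TRUE)

table(by_length)   # the outlier stretches the range, so bins are uneven
table(by_content)  # roughly equal counts per bin

# "clusters": 1D k-means with fixed starting centres (assumed seeding rule).
y <- c(1, 2, 3, 10, 11, 12, 50, 51, 52)
start <- quantile(y, probs = (seq_len(nbins) - 0.5) / nbins)
km <- kmeans(y, centers = start)

# Place break points halfway between neighbouring cluster centres.
centres <- sort(km$centers[, 1])
breaks <- c(min(y), head(centres, -1) + diff(centres) / 2, max(y))
by_cluster <- cut(y, breaks = breaks, include.lowest = TRUE)
table(by_cluster)
```

Because the starting centres are computed from the data rather than drawn at random, repeated runs give the same bins, which is the point of deterministic seeding.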

When "na.omit = FALSE" an additional level "NA" is added to each factor with missing values.
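
A comparable effect can be reproduced in base R with addNA(). This is only a sketch of the idea: it assumes the extra level is stored as the literal string "NA", which is how it appears in the printed levels, and it is not taken from the package source.

```r
x <- c(1:10, NA)

# Equal-length binning leaves the missing value as NA in the factor.
f <- cut(x, breaks = 2)

# addNA() turns NA into a level of its own; renaming it to the string "NA"
# makes it an ordinary category (assumed to mirror na.omit = FALSE).
f_na <- addNA(f)
levels(f_na)[is.na(levels(f_na))] <- "NA"

levels(f_na)
table(f_na)
```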

Value

A data frame, or a single factor when called with a vector.

Author(s)

Holger von Jouanne-Diedrich

References

https://github.com/vonjd/OneR

See Also

OneR, optbin

Examples

data <- iris
str(data)
str(bin(data))
str(bin(data, nbins = 3))
str(bin(data, nbins = 3, labels = c("small", "medium", "large")))

## Difference between methods "length" and "content"
set.seed(1); table(bin(rnorm(900), nbins = 3))
set.seed(1); table(bin(rnorm(900), nbins = 3, method = "content"))

## Method "clusters"
intervals <- paste(levels(bin(faithful$waiting, nbins = 2, method = "cluster")), collapse = " ")
hist(faithful$waiting, main = paste("Intervals:", intervals))
abline(v = c(42.9, 67.5, 96.1), col = "blue")

## Missing values
bin(c(1:10, NA), nbins = 2, na.omit = FALSE) # adds new level "NA"
bin(c(1:10, NA), nbins = 2)                  # omits missing values by default (with warning)

Example output

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: Factor w/ 5 levels "(4.3,5.02]","(5.02,5.74]",..: 2 1 1 1 1 2 1 1 1 1 ...
 $ Sepal.Width : Factor w/ 5 levels "(2,2.48]","(2.48,2.96]",..: 4 3 3 3 4 4 3 3 2 3 ...
 $ Petal.Length: Factor w/ 5 levels "(0.994,2.18]",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : Factor w/ 5 levels "(0.0976,0.58]",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: Factor w/ 3 levels "(4.3,5.5]","(5.5,6.7]",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.Width : Factor w/ 3 levels "(2,2.8]","(2.8,3.6]",..: 2 2 2 2 2 3 2 2 2 2 ...
 $ Petal.Length: Factor w/ 3 levels "(0.994,2.97]",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : Factor w/ 3 levels "(0.0976,0.9]",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Sepal.Width : Factor w/ 3 levels "small","medium",..: 2 2 2 2 2 3 2 2 2 2 ...
 $ Petal.Length: Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Petal.Width : Factor w/ 3 levels "small","medium",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

(-3.01,-0.735]  (-0.735,1.54]    (1.54,3.82] 
           212            623             65 

(-3.01,-0.423] (-0.423,0.444]   (0.444,3.82] 
           300            300            300 
 [1] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (5.5,10]   
 [7] (5.5,10]    (5.5,10]    (5.5,10]    (5.5,10]    NA         
Levels: (0.991,5.5] (5.5,10] NA
 [1] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (5.5,10]   
 [7] (5.5,10]    (5.5,10]    (5.5,10]    (5.5,10]   
Levels: (0.991,5.5] (5.5,10]
Warning message:
In bin(c(1:10, NA), nbins = 2) :
  1 instance(s) removed due to missing values

OneR documentation built on May 2, 2019, 9:33 a.m.