bin | R Documentation |

Tool to easily group the values of a given variable.

bin(x, bin)

`x` |
A vector whose values have to be grouped. Can be of any type but must be atomic. |

`bin` |
A list of values to be grouped, a vector, a formula, or the special values `"bin::digit"` or `"cut::values"`. |

It returns a vector of the same length as `x`.

Numeric vectors can be cut easily into: a) equal parts, b) user-specified bins.

Use `"cut::n"` to cut the vector into `n` (roughly) equal parts. Percentiles are used to partition the data, hence some data distributions can yield fewer than `n` parts (for example if P0 is the same as P50).
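As an illustration of why a percentile-based split can yield fewer parts than requested, here is a minimal base R sketch. It does not reproduce the internals of `bin`; the data vector and the use of `quantile()` are assumptions made for the example only.

```r
# Skewed data: more than half of the values are identical
x <- c(rep(0, 6), 1, 2, 3, 4)

# Break points for a split into 2 parts: P0, P50 and P100
breaks <- quantile(x, probs = c(0, 0.5, 1), names = FALSE)

# P0 equals P50 here, so one of the two requested parts collapses
n_parts <- length(unique(breaks)) - 1
print(breaks)
print(n_parts)
```

With this vector the minimum and the median coincide, so only one part remains even though two were requested.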

The user can specify custom bins with the following syntax: `"cut::a]b]c]etc"`. Here the numbers `a`, `b`, `c`, etc., are a sequence of increasing numbers, each followed by an open or closed square bracket. The numbers can be specified as plain numbers (e.g. `"cut::5]12[32["`), quartiles (e.g. `"cut::q1]q3["`), or percentiles (e.g. `"cut::p10]p15]p90]"`). Values of different types can be mixed: `"cut::5]q2[p80["` is valid provided the median (`q2`) is indeed greater than `5`; otherwise an error is thrown.

The square bracket to the right of each number tells whether that number is included in or excluded from the current bin. For example, say `x` takes integer values from 0 to 100; then `"cut::5]"` creates two bins: one from 0 to 5 and a second from 6 to 100. With `"cut::5["` the bins would be 0-4 and 5-100.
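The inclusion rule can be mimicked in base R to check the bounds quoted above. This is only a sketch of the semantics, assuming integer data from 0 to 100; it does not call `bin`, and the `"low"`/`"high"` group names are made up for the example.

```r
x <- 0:100

# Equivalent of "cut::5]": the value 5 belongs to the first bin
closed <- ifelse(x <= 5, "low", "high")

# Equivalent of "cut::5[": the value 5 belongs to the second bin
open <- ifelse(x < 5, "low", "high")

# Min and max of each bin, which is what the default labels report
closed_low  <- range(x[closed == "low"])
closed_high <- range(x[closed == "high"])
open_low    <- range(x[open == "low"])
open_high   <- range(x[open == "high"])
```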

A factor is returned. The labels report the min and max values in each bin.

To use custom bin labels, add them to the character vector after `"cut::values"`. You don't need to provide all of them, and `NA` values fall back to the default label. For example, `bin = c("cut::4", "Q1", NA, "Q3")` modifies only the first and third labels, which are displayed as `"Q1"` and `"Q3"`.
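The fallback rule for labels can be sketched in base R as follows. The default labels below are hypothetical placeholders (the real ones report the min and max of each bin), and the code only illustrates the rule; it is not how `bin` is implemented.

```r
# Hypothetical default labels for the four bins created by "cut::4"
default_labels <- c("bin 1", "bin 2", "bin 3", "bin 4")

# Custom labels as given after "cut::4": the second is NA, the fourth is missing
custom <- c("Q1", NA, "Q3")

# Pad the custom labels with NA so that both vectors have the same length
length(custom) <- length(default_labels)

# NA or missing entries fall back to the default label
final_labels <- ifelse(is.na(custom), default_labels, custom)
print(final_labels)
```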

`bin` vs `ref`

The functions `bin` and `ref` can do the same thing, so why use one rather than the other? Here are the differences:

`ref` always returns a factor. By contrast, `bin` returns, when possible, a vector of the same type as the input vector.

`ref` always places the modified values first in the factor levels. `bin`, on the other hand, tries not to modify the ordering of the levels. It is possible to make `bin` mimic the behavior of `ref` by adding an `"@"` as the first element of the list in the argument `bin`.

When a vector (and not a list) is given as input, `ref` places each element of the vector first in the factor levels. The behavior of `bin` is entirely different: `bin` transforms all the values in the vector into a single value in `x` (i.e. it bins them).

Laurent Berge

To re-factor variables: `ref`.

    data(airquality)
    month_num = airquality$Month
    table(month_num)

    # Grouping the first two values
    table(bin(month_num, 5:6))

    # ... plus changing the name to '10'
    table(bin(month_num, list("10" = 5:6)))

    # ... and grouping 7 to 9
    table(bin(month_num, list("g1" = 5:6, "g2" = 7:9)))

    # Grouping every two months
    table(bin(month_num, "bin::2"))

    # ... every 2 consecutive elements
    table(bin(month_num, "!bin::2"))

    # ... idem starting from the last one
    table(bin(month_num, "!!bin::2"))

    # Using .() for list():
    table(bin(month_num, .("g1" = 5:6)))

    #
    # with non numeric data
    #

    month_lab = c("may", "june", "july", "august", "september")
    month_fact = factor(month_num, labels = month_lab)

    # Grouping the first two elements
    table(bin(month_fact, c("may", "jun")))

    # ... using regex
    table(bin(month_fact, "@may|jun"))

    # ...changing the name
    table(bin(month_fact, list("spring" = "@may|jun")))

    # Grouping every 2 consecutive months
    table(bin(month_fact, "!bin::2"))

    # ...idem but starting from the last
    table(bin(month_fact, "!!bin::2"))

    # Relocating the months using "@d" in the name
    table(bin(month_fact, .("@5" = "may", "@1 summer" = "@aug|jul")))

    # Putting "@" as first item means subsequent items will be placed first
    table(bin(month_fact, .("@", "aug", "july")))

    #
    # "Cutting" numeric data
    #

    data(iris)
    plen = iris$Petal.Length

    # 3 parts of (roughly) equal size
    table(bin(plen, "cut::3"))

    # Three custom bins
    table(bin(plen, "cut::2]5]"))

    # .. same, excluding 5 in the 2nd bin
    table(bin(plen, "cut::2]5["))

    # Using quartiles
    table(bin(plen, "cut::q1]q2]q3]"))

    # Using percentiles
    table(bin(plen, "cut::p20]p50]p70]p90]"))

    # Mixing all
    table(bin(plen, "cut::2[q2]p90]"))

    # NOTA:
    # -> the labels always contain the min/max values in each bin

    # Custom labels can be provided, just give them in the char. vector
    # NA values lead to the default label
    table(bin(plen, c("cut::2[q2]p90]", "<2", "]2; Q2]", NA, ">90%")))

    #
    # With a formula
    #

    data(iris)
    plen = iris$Petal.Length

    # We need to use "x"
    table(bin(plen, list("< 2" = ~x < 2, ">= 2" = ~x >= 2)))

