bin | R Documentation |
Tool to easily group the values of a given variable.
bin(x, bin)
x | A vector whose values have to be grouped. Can be of any type but must be atomic. |
bin | A list of values to be grouped, a vector, a formula, or the special values |
It returns a vector of the same length as x.
Numeric vectors can easily be cut into: a) equal parts, b) user-specified bins.
Use "cut::n" to cut the vector into n (roughly) equal parts. Percentiles are used to partition the data, hence some data distributions can yield fewer than n parts (for example if P0 is the same as P50).
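The collapse to fewer than n parts can be reproduced in base R. The sketch below is not how bin is implemented; it only assumes, as stated above, that the breakpoints are percentiles. The helper equal_parts_breaks is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: percentile breakpoints for n (roughly) equal parts.
# Duplicated breakpoints are dropped, which is why fewer parts can result.
equal_parts_breaks = function(x, n) {
  probs = seq(0, 1, length.out = n + 1)
  unique(as.numeric(quantile(x, probs, na.rm = TRUE)))
}

# Well-spread data: 5 distinct breakpoints, hence 4 parts
x_spread = 1:100
parts_spread = length(equal_parts_breaks(x_spread, 4)) - 1
parts_spread

# Heavily tied data: P0, P25 and P50 are all equal to 0,
# so only 3 distinct breakpoints remain, hence 2 parts
x_tied = c(rep(0, 60), 1:40)
parts_tied = length(equal_parts_breaks(x_tied, 4)) - 1
parts_tied
```

When a requested "cut::n" returns fewer levels than expected, checking the percentiles of x for ties in this way is a reasonable first diagnostic.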
The user can specify custom bins with the following syntax: "cut::a]b]c]". Here the numbers a, b, c, etc., are a sequence of increasing numbers, each followed by an open or closed square bracket. The numbers can be specified as either plain numbers (e.g. "cut::5]12[32["), quartiles (e.g. "cut::q1]q3["), or percentiles (e.g. "cut::p10]p15]p90]"). Values of different types can be mixed: "cut::5]q2[p80[" is valid provided the median (q2) is indeed greater than 5, otherwise an error is thrown.
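To see why mixed breakpoints must be increasing, it can help to resolve them to numbers by hand. The following base R sketch is only an illustration under assumptions: resolve_cut_value is a hypothetical helper, not part of the package, and it assumes that qK means the K-th quartile and pK the K-th percentile as computed by quantile() with its default settings.

```r
# Hypothetical helper: turn "5", "q2" or "p80" into a number for the vector x
resolve_cut_value = function(token, x) {
  kind = substr(token, 1, 1)
  if (kind == "q") {
    prob = as.numeric(substring(token, 2)) / 4    # quartile
  } else if (kind == "p") {
    prob = as.numeric(substring(token, 2)) / 100  # percentile
  } else {
    return(as.numeric(token))                     # plain number
  }
  as.numeric(quantile(x, prob, na.rm = TRUE))
}

plen = iris$Petal.Length
tokens = c("2", "q2", "p90")
breaks = vapply(tokens, function(tk) resolve_cut_value(tk, plen), numeric(1))
breaks

# The breakpoints are only usable if they are strictly increasing
stopifnot(all(diff(breaks) > 0))
```

If the median of the data were below the plain number placed before q2, the final check would fail, which mirrors the error described above.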
The square bracket to the right of each number tells whether that number is included in or excluded from the current bin. For example, say x takes integer values ranging from 0 to 100, then "cut::5]" will create two bins: one from 0 to 5 and a second from 6 to 100. With "cut::5[" the bins would have been 0-4 and 5-100.
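The same inclusion rule can be checked with base R's cut(). This is a sketch for comparison only, not the implementation of bin: a single cut() call applies one side to every breakpoint, whereas the bracket syntax above chooses the side per breakpoint.

```r
x = 0:100

# Equivalent of "5]": 5 belongs to the first bin (right-closed intervals)
bin_closed = cut(x, breaks = c(-Inf, 5, Inf), right = TRUE)
range(x[as.integer(bin_closed) == 1])   # 0 5
range(x[as.integer(bin_closed) == 2])   # 6 100

# Equivalent of "5[": 5 belongs to the second bin (left-closed intervals)
bin_open = cut(x, breaks = c(-Inf, 5, Inf), right = FALSE)
range(x[as.integer(bin_open) == 1])     # 0 4
range(x[as.integer(bin_open) == 2])     # 5 100
```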
A factor is always returned. The labels always report the min and max values in each bin.
To have user-specified bin labels, just add them in the character vector following 'cut::values'. You don't need to provide all of them, and NA values fall back to the default label. For example, bin = c("cut::4", "Q1", NA, "Q3") will modify only the first and third labels, which will be displayed as "Q1" and "Q3".
bin vs ref
The functions bin and ref are able to do the same thing, so why use one instead of the other? Here are the differences:
ref always returns a factor. This is in contrast with bin, which returns, when possible, a vector of the same type as the input vector.
ref always places the modified values first in the factor levels. On the other hand, bin tries not to modify the ordering of the levels. It is possible to make bin mimic the behavior of ref by adding an "@" as the first element of the list in the argument bin.
When a vector (and not a list) is given as input, ref will place each element of the vector first in the factor levels. The behavior of bin is totally different: bin will transform all the values in the vector into a single value in x (i.e. it's binning).
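The distinction between merging values and moving them to the front of the levels can be illustrated with plain factors. The sketch below uses base R only and does not call bin or ref; the label "spring" and the variable names are chosen for this illustration.

```r
month_lab = c("may", "june", "july", "august", "september")
f = factor(month_lab, levels = month_lab)

# Binning: several existing values are merged into a single value,
# the order of the remaining levels is left untouched
f_binned = f
levels(f_binned)[levels(f_binned) %in% c("may", "june")] = "spring"
levels(f_binned)

# Re-factoring: each listed value is moved to the front, none is merged
first = c("august", "july")
f_reordered = factor(f, levels = c(first, setdiff(levels(f), first)))
levels(f_reordered)
```

In the first case the factor loses a level, in the second it keeps all five and only their order changes.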
Laurent Berge
To re-factor variables: ref.
data(airquality)
month_num = airquality$Month
table(month_num)
# Grouping the first two values
table(bin(month_num, 5:6))
# ... plus changing the name to '10'
table(bin(month_num, list("10" = 5:6)))
# ... and grouping 7 to 9
table(bin(month_num, list("g1" = 5:6, "g2" = 7:9)))
# Grouping every two months
table(bin(month_num, "bin::2"))
# ... every 2 consecutive elements
table(bin(month_num, "!bin::2"))
# ... idem starting from the last one
table(bin(month_num, "!!bin::2"))
# Using .() for list():
table(bin(month_num, .("g1" = 5:6)))
#
# with non numeric data
#
month_lab = c("may", "june", "july", "august", "september")
month_fact = factor(month_num, labels = month_lab)
# Grouping the first two elements
table(bin(month_fact, c("may", "jun")))
# ... using regex
table(bin(month_fact, "@may|jun"))
# ...changing the name
table(bin(month_fact, list("spring" = "@may|jun")))
# Grouping every 2 consecutive months
table(bin(month_fact, "!bin::2"))
# ...idem but starting from the last
table(bin(month_fact, "!!bin::2"))
# Relocating the months using "@d" in the name
table(bin(month_fact, .("@5" = "may", "@1 summer" = "@aug|jul")))
# Putting "@" as first item means subsequent items will be placed first
table(bin(month_fact, .("@", "aug", "july")))
#
# "Cutting" numeric data
#
data(iris)
plen = iris$Petal.Length
# 3 parts of (roughly) equal size
table(bin(plen, "cut::3"))
# Three custom bins
table(bin(plen, "cut::2]5]"))
# .. same, excluding 5 in the 2nd bin
table(bin(plen, "cut::2]5["))
# Using quartiles
table(bin(plen, "cut::q1]q2]q3]"))
# Using percentiles
table(bin(plen, "cut::p20]p50]p70]p90]"))
# Mixing all
table(bin(plen, "cut::2[q2]p90]"))
# NOTA:
# -> the labels always contain the min/max values in each bin
# Custom labels can be provided, just give them in the char. vector
# NA values lead to the default label
table(bin(plen, c("cut::2[q2]p90]", "<2", "]2; Q2]", NA, ">90%")))
#
# With a formula
#
data(iris)
plen = iris$Petal.Length
# We need to use "x"
table(bin(plen, list("< 2" = ~x < 2, ">= 2" = ~x >= 2)))