categorize | R Documentation |
This function divides the range of variables into intervals and recodes
the values inside these intervals according to their related interval.
It is basically a wrapper around base R's cut(), providing a simplified
and more accessible way to define the interval breaks (cut-off values).
categorize(x, ...)
## S3 method for class 'numeric'
categorize(
x,
split = "median",
n_groups = NULL,
range = NULL,
lowest = 1,
breaks = "exclusive",
labels = NULL,
verbose = TRUE,
...
)
## S3 method for class 'data.frame'
categorize(
x,
select = NULL,
exclude = NULL,
split = "median",
n_groups = NULL,
range = NULL,
lowest = 1,
breaks = "exclusive",
labels = NULL,
append = FALSE,
ignore_case = FALSE,
regex = FALSE,
verbose = TRUE,
...
)
x |
A (grouped) data frame, numeric vector or factor. |
... |
not used. |
split |
Character vector, indicating at which breaks to split variables,
or numeric values with values indicating breaks. If character, may be one
of |
n_groups |
If |
range |
If |
lowest |
Minimum value of the recoded variable(s). If |
breaks |
Character, indicating whether breaks for categorizing data are
|
labels |
Character vector of value labels. If not |
verbose |
Toggle warnings. |
select |
Variables that will be included when performing the required tasks. Can be either
If |
exclude |
See |
append |
Logical or string. If |
ignore_case |
Logical, if |
regex |
Logical, if |
x, recoded into groups. By default, x is numeric, unless labels is
specified. In this case, a factor is returned, where the factor levels
(i.e. the recoded groups) are labelled accordingly.
Breaks are exclusive by default, which means that these values indicate
the lower bound of the next group or interval to begin. Take a simple
example, a numeric variable with values from 1 to 9. The median would be 5,
thus the first interval ranges from 1-4 and is recoded into 1, while 5-9
would turn into 2 (compare cbind(1:9, categorize(1:9))). The same variable,
using split = "quantile" and n_groups = 3, would define breaks at 3.67
and 6.33 (see quantile(1:9, probs = c(1/3, 2/3))), which means that values
from 1 to 3 belong to the first interval and are recoded into 1 (because
the next interval starts at 3.67), 4 to 6 into 2 and 7 to 9 into 3.
The opposite behaviour can be achieved using breaks = "inclusive", in which
case the break values indicate the upper bound of the previous group or
interval.
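The arithmetic of the example above can be reproduced with base R alone. The following is only a sketch of the exclusive-break logic using findInterval(); it is not the internal implementation of categorize(), and the object names are chosen here for illustration.

```r
# Base R sketch of "exclusive" breaks: each break value is the lower
# bound of the next group. This does not call categorize().
x <- 1:9

# Median split: the break is 5, so 1-4 form group 1 and 5-9 form group 2.
median_groups <- findInterval(x, median(x)) + 1

# Split into three groups: breaks fall at about 3.67 and 6.33.
tertile_breaks <- quantile(x, probs = c(1 / 3, 2 / 3))
tertile_groups <- findInterval(x, tertile_breaks) + 1

cbind(x, median_groups, tertile_groups)
```

findInterval() counts how many break values are less than or equal to each element of x, which matches the "lower bound of the next group" reading of a break.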
split = "equal_length" and split = "equal_range" try to divide the
range of x into intervals of similar (or same) length. The difference is
that split = "equal_length" will divide the range of x into n_groups
pieces, thereby defining the intervals used as breaks (hence, it is
equivalent to cut(x, breaks = n_groups)), while split = "equal_range"
will cut x into intervals that all have the length of range, where the
first interval by default starts at 1. The lowest (or starting) value
of that interval can be defined using the lowest argument.
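To make the difference between the two strategies concrete, here is a base R sketch. It only approximates what the two split options describe: the variable names (range_width, range_breaks) are assumptions made for this illustration, and the equal-range part is built from seq() and findInterval() rather than taken from categorize() itself.

```r
# Base R sketch; approximates the two strategies, not categorize() itself.
x <- c(1, 10, 20, 21, 30, 40, 41, 50)

# "equal_length": split the observed range of x into n_groups pieces,
# as in cut(x, breaks = n_groups).
n_groups <- 2
equal_length <- as.numeric(cut(x, breaks = n_groups))

# "equal_range": intervals of fixed width, the first one starting at `lowest`.
range_width <- 20
lowest <- 1
range_breaks <- seq(lowest, max(x) + range_width, by = range_width)
equal_range <- findInterval(x, range_breaks)

cbind(x, equal_length, equal_range)
```

With these values, the equal-length split yields two groups of width roughly 25, while the equal-range split yields three groups (1-20, 21-40, 41-60) regardless of how many groups were requested.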
select argument
For most functions that have a select argument (including this function),
the complete input data frame is returned, even when select only selects
a range of variables. That is, the function is only applied to those variables
that have a match in select, while all other variables remain unchanged.
In other words: for this function, select will not omit any non-included
variables, so that the returned data frame will include all variables
from the input data frame.
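The behaviour described above can be mimicked in base R. In this sketch, recode_selected() is a hypothetical helper written only for illustration (it is not part of any package), and it applies a simple median split to the selected columns while leaving every other column untouched.

```r
# Base R sketch of the select behaviour; recode_selected() is a
# hypothetical helper, not a function of any package.
recode_selected <- function(data, select) {
  for (col in select) {
    # median split with an exclusive break (break = lower bound of group 2)
    data[[col]] <- findInterval(data[[col]], median(data[[col]])) + 1
  }
  data
}

df <- data.frame(a = 1:9, b = 11:19, id = letters[1:9])
out <- recode_selected(df, select = "a")

names(out) # all columns of the input are still present
out$b      # not selected, hence unchanged
```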
Functions to rename stuff: data_rename(), data_rename_rows(), data_addprefix(), data_addsuffix()
Functions to reorder or remove columns: data_reorder(), data_relocate(), data_remove()
Functions to reshape, pivot or rotate data frames: data_to_long(), data_to_wide(), data_rotate()
Functions to recode data: rescale(), reverse(), categorize(), recode_values(), slide()
Functions to standardize, normalize, rank-transform: center(), standardize(), normalize(), ranktransform(), winsorize()
Split and merge data frames: data_partition(), data_merge()
Functions to find or select columns: data_select(), extract_column_names()
Functions to filter rows: data_match(), data_filter()
set.seed(123)
x <- sample(1:10, size = 50, replace = TRUE)
table(x)
# by default, at median
table(categorize(x))
# into 3 groups, based on distribution (quantiles)
table(categorize(x, split = "quantile", n_groups = 3))
# into 3 groups, user-defined break
table(categorize(x, split = c(3, 5)))
set.seed(123)
x <- sample(1:100, size = 500, replace = TRUE)
# into 5 groups, try to recode into intervals of similar length,
# i.e. the range within groups is the same for all groups
table(categorize(x, split = "equal_length", n_groups = 5))
# into 5 groups, try to return same range within groups,
# i.e. 1-20, 21-40, 41-60, etc. Since the range of "x" is
# 1-100, and we have a range of 20, this results in 5
# groups, and thus is, for this particular case, identical
# to the previous result.
table(categorize(x, split = "equal_range", range = 20))
# return factor with value labels instead of numeric value
set.seed(123)
x <- sample(1:10, size = 30, replace = TRUE)
categorize(x, "equal_length", n_groups = 3)
categorize(x, "equal_length", n_groups = 3, labels = c("low", "mid", "high"))
# cut numeric into groups with the mean or median as a label name
x <- sample(1:10, size = 30, replace = TRUE)
categorize(x, "equal_length", n_groups = 3, labels = "mean")
categorize(x, "equal_length", n_groups = 3, labels = "median")
# cut numeric into groups with the requested range as a label name
# each category has the same range, and labels indicate this range
categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
# in this example, each category has the same range, but labels only refer
# to the ranges of the actual values (present in the data) inside each group
categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "observed")