Enhanced cut that supports, among other things, factor inputs, optimal grouping, and flexible formatting.
smart_cut(x, i, what = c("breaks", "groups", "n_by_group", "n_intervals",
  "width", "cluster", "bins", "rough"), labels = NULL,
  closed = c("left", "right"), expand = TRUE, crop = FALSE,
  simplify = TRUE, squeeze = FALSE, open_end = FALSE,
  brackets = c("(", "[", ")", "]"), sep = ",", output = c("ordered",
  "factor", "character", "numeric", "breaks", "labels"),
  format_fun = formatC, ...)
x |
numeric vector to classify into intervals |
i |
numeric, character, or list; the main parameter, whose meaning depends on what |
what |
character, choices can be abbreviated |
labels |
character vector with one label per resulting bin, or a function (or formula) applied to the values of each bin. |
closed |
character, which side of the intervals should be closed |
expand |
logical, if TRUE cuts are added if necessary to cover min and max values |
crop |
logical, if TRUE intervals which go past the min or max values will be cropped |
simplify |
logical, if TRUE categories containing only one distinct value will be named by it |
squeeze |
logical, if TRUE all bins are cropped so they are closed on both sides on their min and max values, useful for sparse data and factors |
open_end |
logical, if TRUE values that fall on the last cutpoint are included in the last interval, on its open side |
sep, brackets |
character, used to build the default labels |
output |
character, class of output |
format_fun |
formatting function |
... |
additional arguments passed to format_fun |
a factor variable with levels of the form "[a,b]" or formatted means (character strings), unless onlycuts is TRUE, in which case a numeric vector is returned
Depending on the value of what, i is:
"breaks": the actual cut points
"groups": the number of desired groups. By default cuts are calculated as quantiles, which might not always give i groups for some distributions; see help on optim_fun below to handle these cases
"n_by_group": the number of desired items by group, with the same caveat as above
"n_intervals": the number of desired intervals
"width": the interval width, which will be centered on 0 by default or at a different value (see dedicated section)
"cluster": the number of clusters
"bins": the bin values, useful if using an external clustering function
"rough": cuts into i groups of equal size (if possible). Two elements of the same value can end up in different buckets, hence the "rough" adjective. Default brackets might be misleading when making this choice
If what is "groups" or "width", i can be a list in which the second element is a function; see the following sections.
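The caveat about quantile-based groups can be seen with base R alone. The sketch below does not call smart_cut; it only illustrates, assuming cut points are computed much like stats::quantile does, how tied values can collapse two cut points into one so that fewer than i groups remain. The vector x_tied is made up for illustration.

```r
# Hypothetical, heavily tied data (not taken from the package documentation)
x_tied <- c(rep(1, 8), 2, 3)

# Ask for 2 groups: cut points at the 0%, 50% and 100% quantiles
breaks <- quantile(x_tied, probs = seq(0, 1, length.out = 3))

# The minimum and the median coincide, so one cut point is redundant
distinct_breaks <- unique(unname(breaks))
n_groups_obtained <- length(distinct_breaks) - 1

print(breaks)
print(n_groups_obtained)
```

With data like this, only one group can be formed from the quantile cut points, which is the situation the optimization options are meant to handle.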
If what = "groups" then i can be a list in which the second element is a function that will be applied on all possible combinations. It will be fed the size of bins as its first argument and the cuts as its second. From the results, the combination that gives the minimum result will be kept.
Alternatively the parameter can be any of the following strings:
Returns a combination with the minimal group size variance
Returns a combination that has the biggest smallest bin, to avoid narrow intervals
Returns a combination that has the smallest biggest bin, to avoid wide intervals
In practice the results should be quite similar, and "balanced" should be enough most of the time. For continuous data of a decent size without singular points, optimizing groups is not necessary and will be resource-expensive.
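The idea of scoring every combination of cut points with a function of (bin sizes, cuts) and keeping the minimum can be sketched in base R. This is not the package's implementation: size_variance is a hypothetical objective with the argument order described above, and the brute-force search over distinct values is an assumption made for illustration.

```r
x <- c(rep(1, 3), rep(2, 2), 3:6, 17:20)

# Hypothetical objective: first argument = bin sizes, second = cut points.
# Minimising the variance of sizes mimics a "balanced" grouping.
size_variance <- function(sizes, cuts) var(sizes)

n_groups <- 3
# Candidate interior cut points: every distinct value except the minimum
candidates <- sort(unique(x))[-1]
combos <- combn(candidates, n_groups - 1, simplify = FALSE)

# Score each combination using left-closed bins [a, b)
score <- vapply(combos, function(cuts) {
  breaks <- c(min(x), cuts, max(x) + 1)
  sizes <- as.vector(table(cut(x, breaks, right = FALSE)))
  size_variance(sizes, cuts)
}, numeric(1))

# Keep the combination with the minimum score
best <- combos[[which.min(score)]]
best_sizes <- as.vector(table(cut(x, c(min(x), best, max(x) + 1), right = FALSE)))
print(best)
print(best_sizes)
```

The number of combinations grows quickly with the number of distinct values, which is why this kind of search is expensive on large continuous data.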
custom left boundary for what = "width"
If what = "width" then i can be either a single numeric value setting the width of the interval, or a list in which the second element is a function that will be applied on x (as a first parameter) and the cut points (as a second parameter). The output of this function will determine where the leftmost interval starts. Formula notation is supported.
Alternatively the parameter can be a numeric value or any of the following strings:
First interval starts at the first data point
Last interval stops at the last data point
Margins are balanced on both sides
Interval containing zero is centered on zero
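The effect of the left boundary on fixed-width intervals can be worked out by hand. The base R sketch below is one plausible reading of the first three options above, not the package's code; breaks_from and the start_* names are hypothetical helpers introduced for illustration.

```r
x <- c(rep(1, 3), rep(2, 2), 3:6, 17:20)
w <- 7                                   # interval width
span <- diff(range(x))                   # distance covered by the data
n <- ceiling(span / w)                   # intervals needed to cover it

# Hypothetical helper: n intervals of width w from a given left boundary
breaks_from <- function(start) start + w * 0:n

start_first    <- min(x)                     # first interval starts at first point
start_last     <- max(x) - n * w             # last interval stops at last point
start_balanced <- min(x) - (n * w - span) / 2  # equal margins on both sides

print(breaks_from(start_first))
print(breaks_from(start_last))
print(breaks_from(start_balanced))
```

All three sets of breaks have the same width and the same number of intervals; only the position of the leftmost boundary changes, which in turn changes how the values are distributed across bins.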
Uses function stats::kmeans to cluster x into i groups
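For reference, a direct call to stats::kmeans on a numeric vector looks like the sketch below. It shows only the base R clustering step; how smart_cut turns the clusters into labelled intervals is not shown, and the seed and nstart values are arbitrary choices for reproducibility.

```r
x <- c(rep(1, 3), rep(2, 2), 3:6, 17:20)

set.seed(1)                              # kmeans starts from random centers
km <- kmeans(x, centers = 3, nstart = 25)

# One cluster id per element of x; these could serve as externally
# computed bin values
bins <- km$cluster

# Range of the values that ended up in each cluster
cluster_ranges <- tapply(x, bins, range)
print(table(bins))
print(cluster_ranges)
```

Cluster ids are arbitrary integers, so they carry no ordering by themselves; the ranges per cluster are what would define interval boundaries.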
The original base::cut uses formatC in its code to format the labels, while the commonly used Hmisc::cut2 uses format. smart_cut lets you choose the formatting function and pass it additional parameters through the ... argument.
Any formatting function can be used as long as it takes as a first argument a vector of characters and returns one.
The function format_metric included in cutr permits additional formatting especially well suited for smart_cut.
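The difference between the two base formatters, and the shape of a custom one, can be seen without smart_cut. In the sketch below, fixed_1_dec is a hypothetical formatter written for illustration, and the example assumes the function is handed numeric cut points, as formatC and format accept.

```r
cut_points <- c(1234.5, 1e6)

# The two base formatters mentioned above, with their defaults
print(formatC(cut_points))
print(format(cut_points))

# Hypothetical custom formatter: one decimal, thousands separator.
# It accepts and ignores extra arguments so it can be called with '...'.
fixed_1_dec <- function(x, ...) {
  formatC(x, format = "f", digits = 1, big.mark = ",")
}

custom_labels <- fixed_1_dec(cut_points)
print(custom_labels)
```

Any function with this shape, a vector in and a character vector of the same length out, is a candidate for format_fun.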
?cut
?Hmisc::cut2
?format
?formatC
?format_metric
?kmeans
x <- c(rep(1,3),rep(2,2),3:6,17:20)
# fixed breaks
cuts <- c(4,10,18) # illustrative cut points; 'cuts' was not defined in the original example
table(smart_cut(x,cuts,"breaks"))
# groups defined by quantiles
table(smart_cut(x,2,"groups"))
# optimized groups of equal size
table(smart_cut(x,list(2,"balanced"),"groups"))
# try to get 3 items by group using quantiles
table(smart_cut(x,3,"n_by_group"))
# try to get 3 items by group using optimization
table(smart_cut(x,list(3,"balanced"),"n_by_group"))
# intervals of equal width
table(smart_cut(x,3,"n_intervals"))
# intervals of a defined width
table(smart_cut(x,7,"width")) # start on 1st value
table(smart_cut(x,list(7,"right"),"width")) # end on last value
table(smart_cut(x,list(6,"centered"),"width")) # centered
table(smart_cut(x,list(6,"centered0"),"width")) # centered on 0
table(smart_cut(x,list(7,0),"width")) # starting on 0
# create groups by running a kmeans clustering
table(smart_cut(x,3,"cluster"))
# simplify
table(smart_cut(x, 5, "width"))
table(smart_cut(x, 5, "width", simplify = FALSE))
# expand
table(smart_cut(x,c(4,10,18)))
table(smart_cut(x,c(4,10,18),expand = FALSE))
# crop
table(smart_cut(x,c(0,10,30)))
table(smart_cut(x,c(0,10,30),crop = TRUE))
# squeeze
table(smart_cut(x,c(0,10,30)))
table(smart_cut(x,c(0,10,30),squeeze = TRUE))
# brackets
table(smart_cut(x,c(0,10,30), brackets = c("]","[","[","]")))
table(smart_cut(x,c(0,10,30), brackets = NULL, sep = "~", squeeze= TRUE))
# labels
table(smart_cut(x,c(4,10)))
table(smart_cut(x,c(4,10),labels = ~mean(.x))) # mean of values by interval
table(smart_cut(x,c(4,10),labels = ~mean(.y))) # center of interval
table(smart_cut(x,c(4,10),labels = ~median(.x))) # median
table(smart_cut(x,c(4,10),labels = ~paste(
sep="~",.y[1],round(mean(.x),2),.y[2]))) # a more sophisticated label
# format_fun
table(smart_cut(x^6 + x/100,5,"g"))
table(smart_cut(x^6 + x/100,5,"g",format_fun = format, digits = 3))
table(smart_cut(x^6,5,"g",format_fun = signif))
table(smart_cut(x^6,5,"g",format_fun = smart_signif))
table(smart_cut(x^6,5,"g",format_fun = format_metric))