smart_cut: Cut a Numeric Variable into Intervals


View source: R/smart_cut.R

Description

Enhanced version of cut that supports, among other things, factor inputs, optimal grouping, and flexible formatting.

Usage

smart_cut(x, i, what = c("breaks", "groups", "n_by_group", "n_intervals",
  "width", "cluster", "bins", "rough"), labels = NULL,
  closed = c("left", "right"), expand = TRUE, crop = FALSE,
  simplify = TRUE, squeeze = FALSE, open_end = FALSE,
  brackets = c("(", "[", ")", "]"), sep = ",", output = c("ordered",
  "factor", "character", "numeric", "breaks", "labels"),
  format_fun = formatC, ...)

Arguments

x

numeric vector to classify into intervals

i

numeric, character, or list, main parameter depending on what

what

character, choices can be abbreviated

labels

character of the same length as the resulting bins or function (or formula) to apply on the relevant bin's values.

closed

character, which side of the intervals should be closed

expand

logical, if TRUE cuts are added if necessary to cover min and max values

crop

logical, if TRUE intervals which go past the min or max values will be cropped

simplify

logical, if TRUE categories containing only one distinct value will be named by it

squeeze

logical, if TRUE all bins are cropped so they are closed on both sides on their min and max values, useful for sparse data and factors

open_end

logical, if TRUE the last interval includes, on its open side, the values that fall on the last cut point

sep, brackets

character, used to build the default labels

output

character, class of output

format_fun

formatting function

...

additional arguments passed to format_fun
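As a rough illustration of what closed and open_end control, the sketch below uses base::cut only (not smart_cut), assuming that closed plays the same role as the right argument of cut. It shows how a value sitting exactly on a cut point changes bin depending on which side is closed, and how a value on the last cut point is dropped when intervals are closed on the left.

```r
# Base R only; smart_cut itself is not called here.
breaks <- c(0, 5, 10)

# Intervals closed on the right: 5 belongs to (0,5]
right_closed <- as.character(cut(5, breaks, right = TRUE))

# Intervals closed on the left: 5 belongs to [5,10)
left_closed <- as.character(cut(5, breaks, right = FALSE))

# With left-closed intervals, a value on the last cut point falls outside
# every interval and becomes NA
on_last_cut <- cut(10, breaks, right = FALSE)

right_closed
left_closed
is.na(on_last_cut)
```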

Value

a factor variable with levels of the form "[a,b]" or formatted means (character strings) unless onlycuts is TRUE in which case a numeric vector is returned

what

Depending on the value of what, i is:

breaks

the actual cut points

groups

the number of desired groups. By default cuts are calculated as quantiles, which might not always give i groups for some distributions; see the "optimize groups" section below to handle these cases

n_by_group

the number of desired items by group, with the same caveat as above

n_intervals

the number of desired intervals

width

the interval width, which will be centered on 0 by default or at a different value (see dedicated section)

cluster

the number of clusters

bins

the bin values, useful if using an external clustering function

rough

cuts into i groups of equal size (if possible), 2 elements of same value can be in different buckets, hence the "rough" adjective. Default brackets might be misleading when making this choice

If what is "groups" or "width", i can be a list in which the second element is a function; see the following sections.
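The caveat about quantile-based groups, and the meaning of "rough", can be illustrated with base R alone. The sketch below is not smart_cut's implementation. It uses a toy vector with many ties (an assumption chosen for the illustration) to show that quantile cut points can coincide, and that a rank-based split produces equal-size buckets at the cost of separating identical values.

```r
# Base R only; toy data with heavy ties.
x <- c(rep(1, 10), 2, 3)

# Quantile cut points for 4 groups: several of them coincide because of ties
q <- quantile(x, probs = seq(0, 1, length.out = 5))
unique_breaks <- unique(unname(q))

# Number of intervals that can actually be formed from distinct cut points
n_groups <- length(unique_breaks) - 1

# "rough"-style split: 4 equal-size buckets by rank, so tied values
# can end up in different buckets
bucket_by_rank <- ceiling(seq_along(x) * 4 / length(x))
rough <- bucket_by_rank[rank(x, ties.method = "first")]

n_groups
table(rough)
```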

optimize groups

If what = "groups" then i can be a list in which the second element is a function that will be applied on all possible combinations. It will be fed the size of bins as its first argument and the cuts as its second. From the results the combination that gives the minimum result will be kept.

Alternatively the parameter can be any of the following strings:

"balanced"

Returns a combination with the minimal group size variance

"biggest_small_bin"

Returns a combination that has the biggest smallest bin, to avoid narrow intervals

"smallest_big_bin"

Returns a combination that has the smallest biggest bin, to avoid wide intervals

In practice the results should be quite similar, and "balanced" should be enough most of the time. For continuous data of a decent size without singular points, optimization of groups is not necessary and will be resource expensive.
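The search described above can be sketched in base R as a brute-force loop over combinations of cut points. This is only an illustration under stated assumptions: best_cuts and balanced are hypothetical helpers, candidate cut points are taken to be the distinct values of x, and bins are closed on the left. The scoring function receives the bin sizes first and the cuts second, as described in this section.

```r
x <- c(rep(1, 3), rep(2, 2), 3:6, 17:20)
n_groups <- 3

# Hypothetical helper: score every combination of interior cut points and
# keep the one with the lowest score.
best_cuts <- function(x, n_groups, score_fun) {
  candidates <- sort(unique(x))[-1]          # possible interior cut points
  combos <- combn(candidates, n_groups - 1)  # one column per combination
  scores <- apply(combos, 2, function(cuts) {
    # left-closed bins: [min, c1), [c1, c2), ..., [ck, Inf)
    sizes <- as.vector(table(cut(x, c(min(x), cuts, Inf), right = FALSE)))
    score_fun(sizes, cuts)
  })
  combos[, which.min(scores)]
}

# Scoring in the spirit of "balanced": variance of the bin sizes
balanced <- function(sizes, cuts) var(sizes)

cuts  <- best_cuts(x, n_groups, balanced)
sizes <- as.vector(table(cut(x, c(min(x), cuts, Inf), right = FALSE)))

cuts
sizes
```

The number of combinations grows quickly with the number of distinct values, which is consistent with the cost warning above.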

custom left boundary for what = "width"

If what = "width" then i can be either a single numeric value setting the width of the interval, or a list in which the second element is a function that will be applied on x (as a first parameter) and the cut points (as a second parameter). The output of this function will determine where the leftmost interval starts. Formula notation is supported.

Alternatively the parameter can be a numeric value or any of the following strings:

"left"

First interval starts at the first data point

"right"

Last interval stops at the last data point

"centered"

Margins are balanced on both sides

"centered0"

Interval containing zero is centered on zero
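One way to read the four options above is as different choices for where the first break sits. The sketch below computes equal-width breaks in base R under that reading. width_breaks is a hypothetical helper and the formulas are an interpretation of the descriptions, not smart_cut's own code.

```r
x <- c(rep(1, 3), rep(2, 2), 3:6, 17:20)
w <- 7

# Hypothetical helper: equal-width breaks from `start` until max(x) is covered
width_breaks <- function(x, w, start) {
  n <- max(1, ceiling((max(x) - start) / w))
  start + w * (0:n)
}

span <- max(x) - min(x)
n    <- ceiling(span / w)   # number of intervals needed to cover the data

left      <- width_breaks(x, w, min(x))                        # starts at first data point
right     <- width_breaks(x, w, max(x) - n * w)                # ends at last data point
centered  <- width_breaks(x, w, min(x) - (n * w - span) / 2)   # equal margins on both sides
centered0 <- width_breaks(x, w, w * floor(min(x) / w + 0.5) - w / 2)  # a bin centered on 0

left
right
centered
centered0
```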

cluster

Uses function stats::kmeans to cluster x into i groups
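A minimal sketch of the underlying idea, calling stats::kmeans directly rather than smart_cut: cluster the values, then read off the range covered by each cluster. The seed and the nstart value are assumptions made for reproducibility; how smart_cut turns clusters into intervals is not shown here.

```r
set.seed(1)  # kmeans starts from random centers
x <- c(rep(1, 3), rep(2, 2), 3:6, 17:20)

km <- stats::kmeans(x, centers = 3, nstart = 10)

# Range of values in each cluster, ordered by cluster minimum
ranges <- t(sapply(split(x, km$cluster), range))
ranges <- ranges[order(ranges[, 1]), , drop = FALSE]
colnames(ranges) <- c("min", "max")

ranges
```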

format_fun

The original base::cut uses formatC in its code to format the labels, while the commonly used Hmisc::cut2 uses format. smart_cut lets one choose the function and pass additional parameters to it through ... .

Any formatting function can be used as long as it takes as a first argument a vector of characters and returns one.

The function format_metric included in cutr permits additional formatting especially well suited for smart_cut
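To see why the choice of formatting function matters, the sketch below applies formatC, format, and a custom function to the same values in base R. in_thousands is a hypothetical helper written for this illustration; it is not part of cutr.

```r
breaks <- c(1234.5678, 0.000123456)

# Two base R formatters applied to the same cut points
with_formatC <- formatC(breaks, digits = 3)
with_format  <- format(breaks, digits = 3)

# Hypothetical custom formatter: extra arguments are accepted and ignored
in_thousands <- function(x, ...) {
  paste0(formatC(x / 1000, format = "f", digits = 1), "k")
}
custom <- in_thousands(c(1500, 25000))

with_formatC
with_format
custom
```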

See Also

?cut ?Hmisc::cut2 ?format ?formatC ?format_metric ?kmeans

Examples

x <- c(rep(1,3),rep(2,2),3:6,17:20)

# fixed breaks
cuts <- c(4,10,18) # example cut points (assumed values)
table(smart_cut(x,cuts,"breaks"))

# groups defined by quantiles
table(smart_cut(x,2,"groups"))

# optimized groups of equal size
table(smart_cut(x,list(2,"balanced"),"groups"))

# try to get 3 items by group using quantiles
table(smart_cut(x,3,"n_by_group"))

# try to get 3 items by group using optimization
table(smart_cut(x,list(3,"balanced"),"n_by_group"))

# intervals of equal width
table(smart_cut(x,3,"n_intervals"))

# intervals of equal defined width
table(smart_cut(x,7,"width"))                       # start on 1st value
table(smart_cut(x,list(7,"right"),"width"))         # end on last value
table(smart_cut(x,list(6,"centered"),"width"))      # centered
table(smart_cut(x,list(6,"centered0"),"width"))     # centered on 0
table(smart_cut(x,list(7,0),"width"))               # starting on 0

# create groups by running a kmeans clustering
table(smart_cut(x,3,"cluster"))

# simplify
table(smart_cut(x, 5, "width"))
table(smart_cut(x, 5, "width", simplify = FALSE))

# expand
table(smart_cut(x,c(4,10,18)))
table(smart_cut(x,c(4,10,18),expand = FALSE))

# crop
table(smart_cut(x,c(0,10,30)))
table(smart_cut(x,c(0,10,30),crop = TRUE))

# squeeze
table(smart_cut(x,c(0,10,30)))
table(smart_cut(x,c(0,10,30),squeeze = TRUE))

# brackets
table(smart_cut(x,c(0,10,30), brackets = c("]","[","[","]")))
table(smart_cut(x,c(0,10,30), brackets = NULL, sep = "~", squeeze= TRUE))

# labels
table(smart_cut(x,c(4,10)))
table(smart_cut(x,c(4,10),labels = ~mean(.x)))   # mean of values by interval
table(smart_cut(x,c(4,10),labels = ~mean(.y)))   # center of interval
table(smart_cut(x,c(4,10),labels = ~median(.x))) # median
table(smart_cut(x,c(4,10),labels = ~paste(
  sep="~",.y[1],round(mean(.x),2),.y[2]))) # a more sophisticated label

# format_fun
table(smart_cut(x^6 + x/100,5,"g"))
table(smart_cut(x^6 + x/100,5,"g",format_fun = format, digits = 3))
table(smart_cut(x^6,5,"g",format_fun = signif))
table(smart_cut(x^6,5,"g",format_fun = smart_signif))
table(smart_cut(x^6,5,"g",format_fun = format_metric))

moodymudskipper/cutr documentation built on Aug. 23, 2019, 7:15 p.m.