View source: R/group_category.r
Sometimes discrete features have sparse categories. This function will group the sparse categories for a discrete feature based on a given threshold.
group_category(
  data,
  feature,
  threshold,
  measure,
  update = FALSE,
  category_name = "OTHER",
  exclude = NULL
)
data | input data
feature | name of the discrete feature to be collapsed.
threshold | the bottom x% of categories to be grouped, passed as a proportion, e.g., if set to 0.2 (20%), categories within the bottom 20% of cumulative frequency will be grouped.
measure | name of the feature to be used as an alternative measure.
update | logical, indicating if the data should be modified. The default is FALSE.
category_name | name of the new category if update is set to TRUE. The default is "OTHER".
exclude | categories to be excluded from grouping when update is set to TRUE.
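To make the threshold argument concrete, the following base R sketch builds a cumulative frequency table by hand for a toy vector. It is an illustration only, not the package's implementation: the helper name `cum_freq_table` is hypothetical, and the cutoff rule (a category is kept while its cumulative share is at or below 1 - threshold) is an assumption inferred from the example output on this page.

```r
# Hypothetical helper: frequency table sorted from most to least common.
cum_freq_table <- function(x) {
  counts <- sort(table(as.character(x)), decreasing = TRUE)
  out <- data.frame(
    category = names(counts),
    cnt = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out$pct <- out$cnt / sum(out$cnt)   # share of each category
  out$cum_pct <- cumsum(out$pct)      # running share, largest first
  out
}

x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
freq <- cum_freq_table(x)

# Assumed rule: categories beyond the top (1 - threshold) share are sparse.
threshold <- 0.2
kept <- freq$category[freq$cum_pct <= 1 - threshold]
sparse <- setdiff(freq$category, kept)
print(freq)
```

Under this assumed rule, only the categories listed in `kept` stay as they are; those in `sparse` are the candidates for grouping.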
If a continuous feature is passed to the argument feature, it will be coerced to character class.
If update is set to FALSE, returns the categories, with their counts and cumulative frequencies, that fall outside the bottom threshold share (in the examples, a threshold of 0.2 returns the rows with cum_pct up to about 0.8). The output class will match the class of input data.
If update is set to TRUE, updated data will be returned, and the output class will match the class of input data.
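The effect of update = TRUE together with exclude can be sketched in base R as follows. This is a minimal approximation under stated assumptions, not the package's code: `collapse_sparse` is a hypothetical helper, and it assumes that excluded categories are simply left untouched while every other sparse category is relabelled with category_name.

```r
# Hypothetical helper: relabel sparse categories, leaving excluded ones as-is.
collapse_sparse <- function(x, threshold, category_name = "OTHER", exclude = NULL) {
  x <- as.character(x)  # mirror the coercion to character class
  counts <- sort(table(x), decreasing = TRUE)
  cum_pct <- cumsum(as.numeric(counts) / sum(counts))
  # Assumed rule: keep categories within the top (1 - threshold) share.
  kept <- names(counts)[cum_pct <= 1 - threshold]
  x[!(x %in% c(kept, exclude))] <- category_name
  x
}

x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
print(table(collapse_sparse(x, 0.2)))
print(table(collapse_sparse(x, 0.2, exclude = "c3")))
```

In the second call, "c3" survives as its own category even though it is sparse, which is the behaviour the exclude argument is described as providing.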
# Load packages
library(data.table)
library(DataExplorer) # provides group_category() and plot_bar()
# Generate data
data <- data.table("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 500))
# View cumulative frequency without collapsing categories
group_category(data, "a", 0.2)
# View cumulative frequency based on another measure
group_category(data, "a", 0.2, measure = "b")
# Group bottom 20% categories based on cumulative frequency
group_category(data, "a", 0.2, update = TRUE)
plot_bar(data)
# Exclude categories from being grouped
dt <- data.table("a" = c(rep("c1", 25), rep("c2", 10), "c3", "c4"))
group_category(dt, "a", 0.8, update = TRUE, exclude = c("c3", "c4"))
plot_bar(dt)
# Return from non-data.table input
df <- data.frame("a" = as.factor(round(rnorm(50, 10, 5))), "b" = rexp(50, 10))
group_category(df, "a", 0.2)
group_category(df, "a", 0.2, measure = "b", update = TRUE)
group_category(df, "a", 0.2, update = TRUE)
a cnt pct cum_pct
1: 12 46 0.092 0.092
2: 9 46 0.092 0.184
3: 10 43 0.086 0.270
4: 14 38 0.076 0.346
5: 6 35 0.070 0.416
6: 13 27 0.054 0.470
7: 7 26 0.052 0.522
8: 11 26 0.052 0.574
9: 5 23 0.046 0.620
10: 8 21 0.042 0.662
11: 3 20 0.040 0.702
12: 15 19 0.038 0.740
13: 4 19 0.038 0.778
a cnt pct cum_pct
1: 12 0.10012218 0.10694611 0.1069461
2: 10 0.09729841 0.10392988 0.2108760
3: 9 0.08300857 0.08866611 0.2995421
4: 14 0.06484007 0.06925931 0.3688014
5: 11 0.06189590 0.06611448 0.4349159
6: 6 0.05960164 0.06366385 0.4985797
7: 7 0.05512230 0.05887922 0.5574590
8: 8 0.05346934 0.05711359 0.6145726
9: 4 0.05007506 0.05348798 0.6680605
10: 13 0.04121208 0.04402093 0.7120815
11: 15 0.03273911 0.03497048 0.7470520
12: 5 0.03233986 0.03454402 0.7815960
a cnt pct cum_pct
1 12 9 0.18 0.18
2 8 5 0.10 0.28
3 6 4 0.08 0.36
4 9 4 0.08 0.44
5 15 3 0.06 0.50
6 16 3 0.06 0.56
7 18 2 0.04 0.60
8 4 2 0.04 0.64
9 11 2 0.04 0.68
10 5 2 0.04 0.72
11 14 2 0.04 0.76
12 1 2 0.04 0.80
a b
1 12 0.115254198
2 6 0.258347734
3 8 0.052535813
4 0 0.412786508
5 10 0.214859255
6 12 0.481402894
7 18 0.037532669
8 4 0.220318532
9 OTHER 0.017036013
10 8 0.055730848
11 9 0.047173320
12 6 0.165592977
13 6 0.118630901
14 12 0.135878158
15 OTHER 0.062431240
16 9 0.136812119
17 OTHER 0.041311125
18 12 0.185764604
19 OTHER 0.015328154
20 5 0.372875017
21 12 0.013750471
22 OTHER 0.101196198
23 OTHER 0.055318712
24 8 0.001034942
25 OTHER 0.045970334
26 12 0.100596245
27 OTHER 0.026762635
28 9 0.121663844
29 OTHER 0.083116284
30 OTHER 0.109269384
31 6 0.190818894
32 12 0.335121804
33 18 0.250531057
34 13 0.207319097
35 OTHER 0.114294293
36 OTHER 0.098117830
37 9 0.015213069
38 OTHER 0.170404701
39 12 0.053436490
40 OTHER 0.117301571
41 12 0.005281956
42 OTHER 0.024137254
43 8 0.125539012
44 OTHER 0.053449245
45 OTHER 0.048471661
46 4 0.013832413
47 OTHER 0.032088334
48 5 0.243352995
49 8 0.190877097
50 OTHER 0.053654599
a b
1 12 0.115254198
2 6 0.258347734
3 8 0.052535813
4 OTHER 0.412786508
5 OTHER 0.214859255
6 12 0.481402894
7 18 0.037532669
8 4 0.220318532
9 OTHER 0.017036013
10 8 0.055730848
11 9 0.047173320
12 6 0.165592977
13 6 0.118630901
14 12 0.135878158
15 11 0.062431240
16 9 0.136812119
17 OTHER 0.041311125
18 12 0.185764604
19 OTHER 0.015328154
20 5 0.372875017
21 12 0.013750471
22 OTHER 0.101196198
23 15 0.055318712
24 8 0.001034942
25 14 0.045970334
26 12 0.100596245
27 OTHER 0.026762635
28 9 0.121663844
29 16 0.083116284
30 1 0.109269384
31 6 0.190818894
32 12 0.335121804
33 18 0.250531057
34 OTHER 0.207319097
35 11 0.114294293
36 16 0.098117830
37 9 0.015213069
38 OTHER 0.170404701
39 12 0.053436490
40 OTHER 0.117301571
41 12 0.005281956
42 16 0.024137254
43 8 0.125539012
44 1 0.053449245
45 15 0.048471661
46 4 0.013832413
47 15 0.032088334
48 5 0.243352995
49 8 0.190877097
50 14 0.053654599