group_category: Group categories for discrete features
In DataExplorer: Automate Data Exploration and Treatment


Sometimes discrete features have sparse categories. This function will group the sparse categories for a discrete feature based on a given threshold.

group_category(
  data,
  feature,
  threshold,
  measure,
  update = FALSE,
  category_name = "OTHER",
  exclude = NULL
)

`data`	input data
`feature`	name of the discrete feature to be collapsed.
`threshold`	proportion (between 0 and 1) of the lowest-ranked categories to be grouped, e.g., if set to 0.2, the categories that make up the bottom 20% of cumulative frequency will be grouped.
`measure`	name of a feature to be used as an alternative measure in place of the row count. If omitted, categories are ranked by frequency.
`update`	logical, indicating if the data should be modified. The default is `FALSE`. If the input is a data.table, setting this to `TRUE` modifies the object directly; for any other input, a modified object of the input class is returned.
`category_name`	name of the new category if update is set to `TRUE`. The default is "OTHER".
`exclude`	categories to be excluded from grouping when update is set to `TRUE`.
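To make the `threshold` and `measure` arguments concrete, the base R sketch below builds the same kind of cumulative-frequency table (`cnt`, `pct`, `cum_pct`) that appears in the example output. The helper `cum_freq_table` is hypothetical and is not part of DataExplorer. It assumes that categories are ranked in descending order of count (or of the summed measure) and that a category is kept when its cumulative share is at most `1 - threshold`, which is what the example output on this page suggests.

```r
# Hypothetical helper, not part of DataExplorer: builds a cumulative
# frequency table for a discrete vector, optionally weighted by a measure.
cum_freq_table <- function(x, measure = NULL) {
  x <- as.character(x)
  # Without a measure every row weighs 1, so the sum is a plain count
  weights <- if (is.null(measure)) rep(1, length(x)) else measure
  cnt <- sort(vapply(split(weights, x), sum, numeric(1)), decreasing = TRUE)
  pct <- as.numeric(cnt) / sum(cnt)
  data.frame(
    category = names(cnt),
    cnt = as.numeric(cnt),
    pct = pct,
    cum_pct = cumsum(pct),
    stringsAsFactors = FALSE
  )
}

x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
tab <- cum_freq_table(x)
threshold <- 0.2
# Assumed rule: categories with cum_pct <= 1 - threshold are kept
kept <- tab$category[tab$cum_pct <= 1 - threshold]

# Ranking by a measure instead of by row count
tab_m <- cum_freq_table(c("a", "a", "b"), measure = c(1, 2, 7))
```

Here `c1` alone already accounts for about 68% of the rows and adding `c2` would exceed 80%, so under the assumed rule only `c1` is kept and the remaining categories are candidates for grouping.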

If a continuous feature is passed to the `feature` argument, it will be coerced to character class.
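As a small illustration of what this coercion implies, the base R snippet below converts a numeric vector with `as.character()`. This is only a stand-in for the function's internal conversion, which is not shown on this page. After conversion each distinct value is treated as its own category, so a truly continuous feature will usually produce many sparse categories.

```r
# Numeric values become category labels after coercion to character
x <- c(1.5, 2, 2, 10)
x_chr <- as.character(x)
x_chr

# Each distinct value is now counted as a separate category
table(x_chr)
```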

If update is set to FALSE, returns the frequency table (`cnt`, `pct` and `cum_pct`) of the categories that would be kept, i.e., those whose cumulative frequency does not exceed 1 minus the input threshold; all other categories are the ones that would be grouped. If update is set to TRUE, the updated data will be returned. In both cases the output class will match the class of the input data.
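The relabelling that `update = TRUE` performs can be sketched in base R as follows. The helper `collapse_categories` and its `keep` argument are hypothetical and are not part of DataExplorer. The sketch assumes that every category that is neither kept by the threshold nor listed in `exclude` is replaced by `category_name`.

```r
# Hypothetical helper, not part of DataExplorer: relabels every category
# that is neither kept nor explicitly excluded from grouping.
collapse_categories <- function(x, keep, category_name = "OTHER", exclude = NULL) {
  x <- as.character(x)
  protected <- unique(c(keep, exclude))
  x[!(x %in% protected)] <- category_name
  x
}

x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
# Suppose only "c1" survives the threshold and "c4" is excluded from grouping
grouped <- collapse_categories(x, keep = "c1", exclude = "c4")
table(grouped)
```

In this sketch `c2` and `c3` are merged into `OTHER`, while `c4` keeps its own label despite being sparse because it is listed in `exclude`.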

# Load packages
library(data.table)
library(DataExplorer)

# Generate data
data <- data.table("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 500))

# View cumulative frequency without collapsing categories
group_category(data, "a", 0.2)

# View cumulative frequency based on another measure
group_category(data, "a", 0.2, measure = "b")

# Group bottom 20% categories based on cumulative frequency
group_category(data, "a", 0.2, update = TRUE)
plot_bar(data)

# Exclude categories from being grouped
dt <- data.table("a" = c(rep("c1", 25), rep("c2", 10), "c3", "c4"))
group_category(dt, "a", 0.8, update = TRUE, exclude = c("c3", "c4"))
plot_bar(dt)

# Return from non-data.table input
df <- data.frame("a" = as.factor(round(rnorm(50, 10, 5))), "b" = rexp(50, 10))
group_category(df, "a", 0.2)
group_category(df, "a", 0.2, measure = "b", update = TRUE)
group_category(df, "a", 0.2, update = TRUE)

# Output of group_category(data, "a", 0.2)
     a cnt   pct cum_pct
 1: 12  46 0.092   0.092
 2:  9  46 0.092   0.184
 3: 10  43 0.086   0.270
 4: 14  38 0.076   0.346
 5:  6  35 0.070   0.416
 6: 13  27 0.054   0.470
 7:  7  26 0.052   0.522
 8: 11  26 0.052   0.574
 9:  5  23 0.046   0.620
10:  8  21 0.042   0.662
11:  3  20 0.040   0.702
12: 15  19 0.038   0.740
13:  4  19 0.038   0.778
# Output of group_category(data, "a", 0.2, measure = "b")
     a        cnt        pct   cum_pct
 1: 12 0.10012218 0.10694611 0.1069461
 2: 10 0.09729841 0.10392988 0.2108760
 3:  9 0.08300857 0.08866611 0.2995421
 4: 14 0.06484007 0.06925931 0.3688014
 5: 11 0.06189590 0.06611448 0.4349159
 6:  6 0.05960164 0.06366385 0.4985797
 7:  7 0.05512230 0.05887922 0.5574590
 8:  8 0.05346934 0.05711359 0.6145726
 9:  4 0.05007506 0.05348798 0.6680605
10: 13 0.04121208 0.04402093 0.7120815
11: 15 0.03273911 0.03497048 0.7470520
12:  5 0.03233986 0.03454402 0.7815960
# Output of group_category(df, "a", 0.2)
    a cnt  pct cum_pct
1  12   9 0.18    0.18
2   8   5 0.10    0.28
3   6   4 0.08    0.36
4   9   4 0.08    0.44
5  15   3 0.06    0.50
6  16   3 0.06    0.56
7  18   2 0.04    0.60
8   4   2 0.04    0.64
9  11   2 0.04    0.68
10  5   2 0.04    0.72
11 14   2 0.04    0.76
12  1   2 0.04    0.80
# Output of group_category(df, "a", 0.2, measure = "b", update = TRUE)
       a           b
1     12 0.115254198
2      6 0.258347734
3      8 0.052535813
4      0 0.412786508
5     10 0.214859255
6     12 0.481402894
7     18 0.037532669
8      4 0.220318532
9  OTHER 0.017036013
10     8 0.055730848
11     9 0.047173320
12     6 0.165592977
13     6 0.118630901
14    12 0.135878158
15 OTHER 0.062431240
16     9 0.136812119
17 OTHER 0.041311125
18    12 0.185764604
19 OTHER 0.015328154
20     5 0.372875017
21    12 0.013750471
22 OTHER 0.101196198
23 OTHER 0.055318712
24     8 0.001034942
25 OTHER 0.045970334
26    12 0.100596245
27 OTHER 0.026762635
28     9 0.121663844
29 OTHER 0.083116284
30 OTHER 0.109269384
31     6 0.190818894
32    12 0.335121804
33    18 0.250531057
34    13 0.207319097
35 OTHER 0.114294293
36 OTHER 0.098117830
37     9 0.015213069
38 OTHER 0.170404701
39    12 0.053436490
40 OTHER 0.117301571
41    12 0.005281956
42 OTHER 0.024137254
43     8 0.125539012
44 OTHER 0.053449245
45 OTHER 0.048471661
46     4 0.013832413
47 OTHER 0.032088334
48     5 0.243352995
49     8 0.190877097
50 OTHER 0.053654599
# Output of group_category(df, "a", 0.2, update = TRUE)
       a           b
1     12 0.115254198
2      6 0.258347734
3      8 0.052535813
4  OTHER 0.412786508
5  OTHER 0.214859255
6     12 0.481402894
7     18 0.037532669
8      4 0.220318532
9  OTHER 0.017036013
10     8 0.055730848
11     9 0.047173320
12     6 0.165592977
13     6 0.118630901
14    12 0.135878158
15    11 0.062431240
16     9 0.136812119
17 OTHER 0.041311125
18    12 0.185764604
19 OTHER 0.015328154
20     5 0.372875017
21    12 0.013750471
22 OTHER 0.101196198
23    15 0.055318712
24     8 0.001034942
25    14 0.045970334
26    12 0.100596245
27 OTHER 0.026762635
28     9 0.121663844
29    16 0.083116284
30     1 0.109269384
31     6 0.190818894
32    12 0.335121804
33    18 0.250531057
34 OTHER 0.207319097
35    11 0.114294293
36    16 0.098117830
37     9 0.015213069
38 OTHER 0.170404701
39    12 0.053436490
40 OTHER 0.117301571
41    12 0.005281956
42    16 0.024137254
43     8 0.125539012
44     1 0.053449245
45    15 0.048471661
46     4 0.013832413
47    15 0.032088334
48     5 0.243352995
49     8 0.190877097
50    14 0.053654599