group_category: Group categories for discrete features


View source: R/group_category.r

Description

Sometimes discrete features have sparse categories. This function will group the sparse categories for a discrete feature based on a given threshold.

Usage

group_category(
  data,
  feature,
  threshold,
  measure,
  update = FALSE,
  category_name = "OTHER",
  exclude = NULL
)

Arguments

data

input data

feature

name of the discrete feature to be collapsed.

threshold

the bottom x% of categories to be grouped, given as a proportion; e.g., if set to 0.2, the categories that make up the bottom 20% of cumulative frequency will be grouped.

measure

name of the feature to be used as an alternative measure, in place of the category frequency.

update

logical, indicating whether the data should be modified. The default is FALSE. Setting it to TRUE modifies an input data.table object directly; for any other input class, an updated object of that class is returned.

category_name

name of the new category if update is set to TRUE. The default is "OTHER".

exclude

categories to be excluded from grouping when update is set to TRUE.

Details

If a continuous feature is passed to the argument feature, it will be coerced to character class.
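To make the consequence of that coercion concrete, here is a minimal base R sketch. It assumes the conversion behaves like as.character(), which this page does not confirm; the vector x is made up for illustration.

```r
# Hypothetical continuous feature
x <- c(1.5, 2, 2, 3.25)

# Assumption: the coercion behaves like as.character()
x_chr <- as.character(x)
print(class(x_chr))

# After coercion, every distinct numeric value is its own category
counts <- table(x_chr)
print(counts)
```

A truly continuous feature therefore tends to produce many categories with very small counts, so rounding or binning it first is usually more useful than grouping it directly.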

Value

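If update is set to FALSE, returns a table of categories with their cnt, pct and cum_pct. In the example output, a threshold of 0.2 returns the most frequent categories, those with cumulative frequency up to 0.8 (1 - threshold); the categories left out of the table are the ones that would be grouped. If update is set to TRUE, the updated data is returned. In both cases the output class matches the class of the input data.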
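The cnt, pct and cum_pct columns can be reproduced by hand. The sketch below uses only base R and is not the package's implementation: the cut-off rule (keep categories while cum_pct <= 1 - threshold) is an assumption inferred from the example output, and the names freq, kept and grouped are illustrative.

```r
# Toy feature: four categories with counts 25, 10, 1 and 1
x <- c(rep("c1", 25), rep("c2", 10), "c3", "c4")
threshold <- 0.2        # proportion of the bottom categories to group
exclude <- "c3"         # category that must never be grouped

# Count each category and sort from most to least frequent
cnt <- sort(table(x), decreasing = TRUE)
freq <- data.frame(a = names(cnt), cnt = as.integer(cnt),
                   stringsAsFactors = FALSE)

# Share of the total and running (cumulative) share
freq$pct <- freq$cnt / sum(freq$cnt)
freq$cum_pct <- cumsum(freq$pct)

# Assumed rule: keep categories while the running share is within 1 - threshold
kept <- freq$a[freq$cum_pct <= 1 - threshold]
print(freq)
print(kept)

# Relabel everything that is neither kept nor excluded
grouped <- ifelse(x %in% c(kept, exclude), x, "OTHER")
print(table(grouped))
```

Here only c1 stays within the 0.8 limit, so c2 and c4 are relabelled as OTHER while c3 survives because it is excluded.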

Examples

# Load packages
library(data.table)
library(DataExplorer)

# Generate data
data <- data.table("a" = as.factor(round(rnorm(500, 10, 5))), "b" = rexp(500, 500))

# View cumulative frequency without collapsing categories
group_category(data, "a", 0.2)

# View cumulative frequency based on another measure
group_category(data, "a", 0.2, measure = "b")

# Group bottom 20% categories based on cumulative frequency
group_category(data, "a", 0.2, update = TRUE)
plot_bar(data)

# Exclude categories from being grouped
dt <- data.table("a" = c(rep("c1", 25), rep("c2", 10), "c3", "c4"))
group_category(dt, "a", 0.8, update = TRUE, exclude = c("c3", "c4"))
plot_bar(dt)

# Return from non-data.table input
df <- data.frame("a" = as.factor(round(rnorm(50, 10, 5))), "b" = rexp(50, 10))
group_category(df, "a", 0.2)
group_category(df, "a", 0.2, measure = "b", update = TRUE)
group_category(df, "a", 0.2, update = TRUE)

Example output

     a cnt   pct cum_pct
 1: 12  46 0.092   0.092
 2:  9  46 0.092   0.184
 3: 10  43 0.086   0.270
 4: 14  38 0.076   0.346
 5:  6  35 0.070   0.416
 6: 13  27 0.054   0.470
 7:  7  26 0.052   0.522
 8: 11  26 0.052   0.574
 9:  5  23 0.046   0.620
10:  8  21 0.042   0.662
11:  3  20 0.040   0.702
12: 15  19 0.038   0.740
13:  4  19 0.038   0.778

     a        cnt        pct   cum_pct
 1: 12 0.10012218 0.10694611 0.1069461
 2: 10 0.09729841 0.10392988 0.2108760
 3:  9 0.08300857 0.08866611 0.2995421
 4: 14 0.06484007 0.06925931 0.3688014
 5: 11 0.06189590 0.06611448 0.4349159
 6:  6 0.05960164 0.06366385 0.4985797
 7:  7 0.05512230 0.05887922 0.5574590
 8:  8 0.05346934 0.05711359 0.6145726
 9:  4 0.05007506 0.05348798 0.6680605
10: 13 0.04121208 0.04402093 0.7120815
11: 15 0.03273911 0.03497048 0.7470520
12:  5 0.03233986 0.03454402 0.7815960

    a cnt  pct cum_pct
1  12   9 0.18    0.18
2   8   5 0.10    0.28
3   6   4 0.08    0.36
4   9   4 0.08    0.44
5  15   3 0.06    0.50
6  16   3 0.06    0.56
7  18   2 0.04    0.60
8   4   2 0.04    0.64
9  11   2 0.04    0.68
10  5   2 0.04    0.72
11 14   2 0.04    0.76
12  1   2 0.04    0.80

       a           b
1     12 0.115254198
2      6 0.258347734
3      8 0.052535813
4      0 0.412786508
5     10 0.214859255
6     12 0.481402894
7     18 0.037532669
8      4 0.220318532
9  OTHER 0.017036013
10     8 0.055730848
11     9 0.047173320
12     6 0.165592977
13     6 0.118630901
14    12 0.135878158
15 OTHER 0.062431240
16     9 0.136812119
17 OTHER 0.041311125
18    12 0.185764604
19 OTHER 0.015328154
20     5 0.372875017
21    12 0.013750471
22 OTHER 0.101196198
23 OTHER 0.055318712
24     8 0.001034942
25 OTHER 0.045970334
26    12 0.100596245
27 OTHER 0.026762635
28     9 0.121663844
29 OTHER 0.083116284
30 OTHER 0.109269384
31     6 0.190818894
32    12 0.335121804
33    18 0.250531057
34    13 0.207319097
35 OTHER 0.114294293
36 OTHER 0.098117830
37     9 0.015213069
38 OTHER 0.170404701
39    12 0.053436490
40 OTHER 0.117301571
41    12 0.005281956
42 OTHER 0.024137254
43     8 0.125539012
44 OTHER 0.053449245
45 OTHER 0.048471661
46     4 0.013832413
47 OTHER 0.032088334
48     5 0.243352995
49     8 0.190877097
50 OTHER 0.053654599

       a           b
1     12 0.115254198
2      6 0.258347734
3      8 0.052535813
4  OTHER 0.412786508
5  OTHER 0.214859255
6     12 0.481402894
7     18 0.037532669
8      4 0.220318532
9  OTHER 0.017036013
10     8 0.055730848
11     9 0.047173320
12     6 0.165592977
13     6 0.118630901
14    12 0.135878158
15    11 0.062431240
16     9 0.136812119
17 OTHER 0.041311125
18    12 0.185764604
19 OTHER 0.015328154
20     5 0.372875017
21    12 0.013750471
22 OTHER 0.101196198
23    15 0.055318712
24     8 0.001034942
25    14 0.045970334
26    12 0.100596245
27 OTHER 0.026762635
28     9 0.121663844
29    16 0.083116284
30     1 0.109269384
31     6 0.190818894
32    12 0.335121804
33    18 0.250531057
34 OTHER 0.207319097
35    11 0.114294293
36    16 0.098117830
37     9 0.015213069
38 OTHER 0.170404701
39    12 0.053436490
40 OTHER 0.117301571
41    12 0.005281956
42    16 0.024137254
43     8 0.125539012
44     1 0.053449245
45    15 0.048471661
46     4 0.013832413
47    15 0.032088334
48     5 0.243352995
49     8 0.190877097
50    14 0.053654599

DataExplorer documentation built on Dec. 16, 2020, 1:07 a.m.