auto_grouping: Reduce cardinality in categorical variable by automatic...

View source: R/data_preparation.R

Description

Reduces the cardinality of a categorical input variable with respect to a target variable (binary only, for now), based on the accuracy and representativity of each category for both the input and the target variable. It uses a cluster model to create the new groups. Full documentation can be found at: https://livebook.datascienceheroes.com/data-preparation.html#high_cardinality_predictive_modeling
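
The idea can be sketched in base R. This is not the package's implementation, only an illustration under stated assumptions: each category is profiled by its share of positive target cases (perc_target) and its share of rows (perc_rows), the two names that appear in the cluster means of the example output, and the standardized profiles are then clustered. The toy data, the column names country and has_flu, and the choice of 3 groups are hypothetical.

```r
# Hypothetical toy data: six categories with fixed row counts and positives.
counts    <- c(AR = 50, BR = 40, CL = 10, DE = 30, FR = 120, JP = 20)
positives <- c(AR = 10, BR = 2,  CL = 0,  DE = 3,  FR = 30,  JP = 1)
toy <- data.frame(
  country = rep(names(counts), times = counts),
  has_flu = unlist(Map(function(n, p) rep(c(1, 0), times = c(p, n - p)),
                       counts, positives), use.names = FALSE)
)

# Profile each category: share of all positive cases and share of all rows.
sum_target <- tapply(toy$has_flu, toy$country, sum)
q_rows     <- tapply(toy$has_flu, toy$country, length)
profile <- data.frame(
  country     = names(q_rows),
  perc_target = as.numeric(sum_target) / sum(sum_target),
  perc_rows   = as.numeric(q_rows) / sum(q_rows)
)

# Assumption: standardize both features, then cluster the categories.
set.seed(999)
fit <- kmeans(scale(profile[, c("perc_target", "perc_rows")]), centers = 3)
profile$country_rec <- paste0("group_", fit$cluster)
print(profile)
```

Categories that land in the same cluster have a similar weight in the data and a similar contribution to the positive cases, which is what makes them reasonable candidates for merging.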

Usage

auto_grouping(data, input, target, n_groups, model = "kmeans", seed = 999)

Arguments

data

data frame source

input

string with the name of the categorical variable to regroup

target

string with the name of the target variable used to optimize the regrouping

n_groups

number of groups for the new category based on input, normally between 3 and 10.

model

clustering model used to create the grouping. Supported models: "kmeans" (default) or "hclust" (hierarchical clustering).

seed

optional; random seed used internally by k-means. Changing this value may change the resulting model.
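
The role of model and seed can be illustrated with base R alone. The sketch below does not call auto_grouping; it assumes a small hypothetical matrix of category profiles and shows that k-means depends on the random seed (so fixing it makes the grouping reproducible), while hierarchical clustering needs no seed. The linkage method and distance used here are R defaults, not necessarily what the package uses.

```r
# Hypothetical profiles for six categories (values are illustrative only).
x <- scale(cbind(
  perc_target = c(0.22, 0.04, 0.00, 0.07, 0.65, 0.02),
  perc_rows   = c(0.19, 0.15, 0.04, 0.11, 0.44, 0.07)
))

# k-means starts from random centers, so the seed must be fixed
# to get the same assignment on every run.
set.seed(999)
km_a <- kmeans(x, centers = 3)$cluster
set.seed(999)
km_b <- kmeans(x, centers = 3)$cluster
print(identical(km_a, km_b))

# Hierarchical clustering is deterministic: build the tree, then cut it
# into the requested number of groups.
hc_groups <- cutree(hclust(dist(x)), k = 3)
print(hc_groups)
```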

Value

A list containing 3 elements: recateg_results, which describes the target variable for each new group; df_equivalence, a data frame mapping each input category to its new category; and fit_cluster, the cluster model used for the regrouping.
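
A common next step is to attach the new category to the original data. The sketch below assumes an equivalence table shaped like df_equivalence in the example output (columns country and country_rec); the small data frames are hypothetical stand-ins built by hand so the snippet runs without the package.

```r
# Hypothetical original data with the high-cardinality column.
data_toy <- data.frame(
  country = c("France", "China", "Japan", "France", "Turkey"),
  has_flu = c(1, 0, 0, 1, 0)
)

# Stand-in for result$df_equivalence: one row per original category.
equivalence <- data.frame(
  country     = c("China", "Turkey", "Japan", "France"),
  country_rec = c("group_1", "group_1", "group_2", "group_3")
)

# Left join: every original row keeps its data and gains the new group.
data_rec <- merge(data_toy, equivalence, by = "country", all.x = TRUE)
print(data_rec)
```

Note that merge() sorts the result by the key column, so the row order of data_rec differs from data_toy.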

Examples

## Not run: 
# Reducing quantity of countries based on has_flu variable
auto_grouping(data=data_country, input='country', target="has_flu", n_groups=8)

## End(Not run)

Example output

Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, round.POSIXt, trunc.POSIXt, units

funModeling v.1.6.5 :)
Examples and tutorials at livebook.datascienceheroes.com

$recateg_results
  country_rec mean_target sum_target perc_target q_rows perc_rows
1     group_4       0.176         19       0.229    108     0.119
2     group_7       0.156         10       0.120     64     0.070
3     group_3       0.142         41       0.494    288     0.316
4     group_5       0.111         10       0.120     90     0.099
5     group_2       0.019          1       0.012     52     0.057
6     group_1       0.015          2       0.024    132     0.145
7     group_6       0.000          0       0.000     75     0.082
8     group_8       0.000          0       0.000    101     0.111

$df_equivalence
                     country country_rec
1                      China     group_1
2                     Turkey     group_1
3                    Belgium     group_2
4                      Japan     group_2
5                Netherlands     group_2
6                     France     group_3
7             United Kingdom     group_4
8                    Uruguay     group_4
9                  Australia     group_5
10                    Canada     group_5
11                   Germany     group_5
12       Asia/Pacific Region     group_6
13                   Austria     group_6
14                Bangladesh     group_6
15    Bosnia and Herzegovina     group_6
16                  Cambodia     group_6
17                     Chile     group_6
18                Costa Rica     group_6
19                   Croatia     group_6
20                    Cyprus     group_6
21            Czech Republic     group_6
22        Dominican Republic     group_6
23                     Egypt     group_6
24                   Finland     group_6
25                     Ghana     group_6
26                    Greece     group_6
27                  Honduras     group_6
28 Iran, Islamic Republic of     group_6
29                   Ireland     group_6
30               Isle of Man     group_6
31        Korea, Republic of     group_6
32                    Latvia     group_6
33                 Lithuania     group_6
34                Luxembourg     group_6
35                     Malta     group_6
36      Moldova, Republic of     group_6
37                Montenegro     group_6
38                   Morocco     group_6
39               New Zealand     group_6
40                  Pakistan     group_6
41     Palestinian Territory     group_6
42                      Peru     group_6
43        Russian Federation     group_6
44              Saudi Arabia     group_6
45                   Senegal     group_6
46                  Slovenia     group_6
47                    Taiwan     group_6
48                  Thailand     group_6
49                   Vietnam     group_6
50                 Argentina     group_7
51                    Israel     group_7
52                  Malaysia     group_7
53                    Mexico     group_7
54                  Portugal     group_7
55                   Romania     group_7
56                     Spain     group_7
57                    Sweden     group_7
58               Switzerland     group_7
59                    Brazil     group_8
60                  Bulgaria     group_8
61                   Denmark     group_8
62                 Hong Kong     group_8
63                 Indonesia     group_8
64                     Italy     group_8
65                    Norway     group_8
66               Philippines     group_8
67                    Poland     group_8
68                 Singapore     group_8
69              South Africa     group_8
70                   Ukraine     group_8

$fit_cluster
K-means clustering with 8 clusters of sizes 2, 3, 1, 2, 3, 38, 9, 12

Cluster means:
  perc_target  perc_rows
1 -0.03671967  1.4596260
2 -0.16604681  0.1205032
3  7.75524043  7.5545121
4  1.62028429  1.1217165
5  0.41592532  0.4709278
6 -0.23071038 -0.3056712
7 -0.01516515 -0.1603928
8 -0.23071038 -0.1193709

Clustering vector:
 [1] 7 7 7 4 4 7 7 5 3 7 5 5 7 7 7 2 1 6 6 6 2 6 8 8 6 6 1 6 6 6 6 8 6 6 6 6 6 6
[39] 8 8 6 6 6 8 2 6 6 6 6 6 6 6 6 6 8 6 6 6 8 8 6 6 6 8 6 8 6 6 8 6

Within cluster sum of squares by cluster:
[1] 0.07808412 0.03385951 0.00000000 0.30418806 0.20552532 0.03694805 0.12464413
[8] 0.04443054
 (between_SS / total_SS =  99.4 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
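
The columns of recateg_results are not defined on this page, but the printed numbers are consistent with a simple reading: mean_target looks like sum_target / q_rows, perc_target like each group's share of all positive cases, and perc_rows like each group's share of all rows. The sketch below retypes the table from the output above and checks that interpretation; the formulas are an inference from the numbers, not a statement from the package documentation.

```r
# recateg_results, copied from the example output.
recateg <- data.frame(
  country_rec = paste0("group_", c(4, 7, 3, 5, 2, 1, 6, 8)),
  mean_target = c(0.176, 0.156, 0.142, 0.111, 0.019, 0.015, 0.000, 0.000),
  sum_target  = c(19, 10, 41, 10, 1, 2, 0, 0),
  perc_target = c(0.229, 0.120, 0.494, 0.120, 0.012, 0.024, 0.000, 0.000),
  q_rows      = c(108, 64, 288, 90, 52, 132, 75, 101),
  perc_rows   = c(0.119, 0.070, 0.316, 0.099, 0.057, 0.145, 0.082, 0.111)
)

tol <- 0.001  # the printed values are rounded to 3 decimals

# Inferred definitions, compared against the printed columns.
ok_mean <- all(abs(recateg$sum_target / recateg$q_rows - recateg$mean_target) < tol)
ok_perc_target <- all(abs(recateg$sum_target / sum(recateg$sum_target) - recateg$perc_target) < tol)
ok_perc_rows <- all(abs(recateg$q_rows / sum(recateg$q_rows) - recateg$perc_rows) < tol)

print(c(mean_target = ok_mean, perc_target = ok_perc_target, perc_rows = ok_perc_rows))
```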

funModeling documentation built on July 1, 2020, 5:40 p.m.