View source: R/data_preparation.R
Reduce the cardinality of an input variable based on a target variable (binary only, for now). The grouping is driven by attributes of accuracy and representativity, for both the input and the target variable. It uses a cluster model to create the new groups. Full documentation can be found at: https://livebook.datascienceheroes.com/data-preparation.html#high_cardinality_predictive_modeling
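The idea described above can be sketched in base R. This is only an illustration under stated assumptions, not funModeling's actual implementation: the toy data frame, the `city` and `has_flu` column names, and the choice to cluster on the standardized share of positive cases and share of rows per category are all assumptions (the latter is suggested by the `perc_target` and `perc_rows` cluster means printed in the example output).

```r
# Toy data (hypothetical): one high-cardinality input and a binary target.
set.seed(999)
toy <- data.frame(
  city    = sample(letters[1:12], 600, replace = TRUE),
  has_flu = rbinom(600, 1, 0.1)
)

# Per-category profile: share of positive cases and share of rows.
sum_target <- tapply(toy$has_flu, toy$city, sum)
q_rows     <- tapply(toy$has_flu, toy$city, length)
profile <- data.frame(
  city        = names(q_rows),
  perc_target = as.numeric(sum_target) / sum(toy$has_flu),
  perc_rows   = as.numeric(q_rows) / nrow(toy)
)

# Cluster the standardized profile; each cluster becomes a new group.
fit <- kmeans(scale(profile[, c("perc_target", "perc_rows")]), centers = 3)
profile$city_rec <- paste0("group_", fit$cluster)
```

Here `profile` plays the role of an equivalence table: each original category is mapped to one of three new groups.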
auto_grouping(data, input, target, n_groups, model = "kmeans", seed = 999)
data: data frame source
input: categorical variable indicating
target: name (string) of the variable used to optimize the re-grouping
n_groups: number of groups for the new category based on input, normally between 3 and 10
model: the clustering model used to create the grouping; supported models are "kmeans" (default) and "hclust" (hierarchical clustering)
seed: optional; random number used internally for the k-means. Changing this value will change the model.
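To illustrate what the "hclust" option for model implies, here is a minimal base R sketch of hierarchical clustering cut into a fixed number of groups. The matrix values and row names are hypothetical, and the linkage method funModeling uses is not stated on this page, so the default of `hclust` (complete linkage) is an assumption.

```r
# Hypothetical per-category profile, standardized column by column.
x <- scale(cbind(
  perc_target = c(0.30, 0.28, 0.02, 0.01, 0.20, 0.19),
  perc_rows   = c(0.25, 0.22, 0.05, 0.04, 0.22, 0.22)
))
rownames(x) <- c("a", "b", "c", "d", "e", "f")

# Build the tree from pairwise Euclidean distances
# (complete linkage is the hclust default).
tree <- hclust(dist(x))

# Cut the tree so that exactly 3 groups remain (the analogue of n_groups).
groups <- cutree(tree, k = 3)
```

Unlike k-means, this procedure has no random initialization, so a seed has no effect on the resulting groups.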
A list containing 3 elements: recateg_results, which describes the target variable in terms of the new groups; df_equivalence, a data frame mapping each input category to its new category; and fit_cluster, the cluster model used to do the re-grouping.
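One plausible use of df_equivalence is to attach the new category to the original rows with a join. The sketch below uses base R `merge`; the two small data frames are hypothetical stand-ins (the three country-to-group pairs are copied from the example output on this page), not objects produced by an actual call.

```r
# Hypothetical source data with the original high-cardinality column.
data_toy <- data.frame(
  country = c("France", "China", "Turkey", "France"),
  has_flu = c(1, 0, 0, 0)
)

# Hand-built stand-in for the df_equivalence element of the result.
df_equivalence <- data.frame(
  country     = c("China", "Turkey", "France"),
  country_rec = c("group_1", "group_1", "group_3")
)

# Left join: keep every original row and add its new group.
data_rec <- merge(data_toy, df_equivalence, by = "country", all.x = TRUE)
```

With `all.x = TRUE`, a category missing from the equivalence table yields `NA` in `country_rec` rather than dropping the row.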
## Not run:
# Reducing quantity of countries based on has_flu variable
auto_grouping(data=data_country, input='country', target="has_flu", n_groups=8)
## End(Not run)
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
funModeling v.1.6.5 :)
Examples and tutorials at livebook.datascienceheroes.com
$recateg_results
country_rec mean_target sum_target perc_target q_rows perc_rows
1 group_4 0.176 19 0.229 108 0.119
2 group_7 0.156 10 0.120 64 0.070
3 group_3 0.142 41 0.494 288 0.316
4 group_5 0.111 10 0.120 90 0.099
5 group_2 0.019 1 0.012 52 0.057
6 group_1 0.015 2 0.024 132 0.145
7 group_6 0.000 0 0.000 75 0.082
8 group_8 0.000 0 0.000 101 0.111
$df_equivalence
country country_rec
1 China group_1
2 Turkey group_1
3 Belgium group_2
4 Japan group_2
5 Netherlands group_2
6 France group_3
7 United Kingdom group_4
8 Uruguay group_4
9 Australia group_5
10 Canada group_5
11 Germany group_5
12 Asia/Pacific Region group_6
13 Austria group_6
14 Bangladesh group_6
15 Bosnia and Herzegovina group_6
16 Cambodia group_6
17 Chile group_6
18 Costa Rica group_6
19 Croatia group_6
20 Cyprus group_6
21 Czech Republic group_6
22 Dominican Republic group_6
23 Egypt group_6
24 Finland group_6
25 Ghana group_6
26 Greece group_6
27 Honduras group_6
28 Iran, Islamic Republic of group_6
29 Ireland group_6
30 Isle of Man group_6
31 Korea, Republic of group_6
32 Latvia group_6
33 Lithuania group_6
34 Luxembourg group_6
35 Malta group_6
36 Moldova, Republic of group_6
37 Montenegro group_6
38 Morocco group_6
39 New Zealand group_6
40 Pakistan group_6
41 Palestinian Territory group_6
42 Peru group_6
43 Russian Federation group_6
44 Saudi Arabia group_6
45 Senegal group_6
46 Slovenia group_6
47 Taiwan group_6
48 Thailand group_6
49 Vietnam group_6
50 Argentina group_7
51 Israel group_7
52 Malaysia group_7
53 Mexico group_7
54 Portugal group_7
55 Romania group_7
56 Spain group_7
57 Sweden group_7
58 Switzerland group_7
59 Brazil group_8
60 Bulgaria group_8
61 Denmark group_8
62 Hong Kong group_8
63 Indonesia group_8
64 Italy group_8
65 Norway group_8
66 Philippines group_8
67 Poland group_8
68 Singapore group_8
69 South Africa group_8
70 Ukraine group_8
$fit_cluster
K-means clustering with 8 clusters of sizes 2, 3, 1, 2, 3, 38, 9, 12
Cluster means:
perc_target perc_rows
1 -0.03671967 1.4596260
2 -0.16604681 0.1205032
3 7.75524043 7.5545121
4 1.62028429 1.1217165
5 0.41592532 0.4709278
6 -0.23071038 -0.3056712
7 -0.01516515 -0.1603928
8 -0.23071038 -0.1193709
Clustering vector:
[1] 7 7 7 4 4 7 7 5 3 7 5 5 7 7 7 2 1 6 6 6 2 6 8 8 6 6 1 6 6 6 6 8 6 6 6 6 6 6
[39] 8 8 6 6 6 8 2 6 6 6 6 6 6 6 6 6 8 6 6 6 8 8 6 6 6 8 6 8 6 6 8 6
Within cluster sum of squares by cluster:
[1] 0.07808412 0.03385951 0.00000000 0.30418806 0.20552532 0.03694805 0.12464413
[8] 0.04443054
(between_SS / total_SS = 99.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
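The columns of $recateg_results in the output above are not defined on this page, but the printed numbers are consistent with the following relationships, which is an inference from the figures rather than a documented definition: mean_target is sum_target divided by q_rows, perc_target is each group's share of all positive cases, and perc_rows is each group's share of all rows. A short check in base R, using the sum_target and q_rows values copied from the printed table:

```r
# Values copied from the printed $recateg_results table (rows 1 to 8).
sum_target <- c(19, 10, 41, 10, 1, 2, 0, 0)
q_rows     <- c(108, 64, 288, 90, 52, 132, 75, 101)

# Inferred relationships between the columns.
mean_target <- sum_target / q_rows           # positive rate within each group
perc_target <- sum_target / sum(sum_target)  # share of all positive cases
perc_rows   <- q_rows / sum(q_rows)          # share of all rows

round(cbind(mean_target, perc_target, perc_rows), 3)
```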