dummify: Dummify discrete features to binary columns



Description

Data dummification is also known as one-hot encoding or feature binarization. It turns each category of a discrete feature into a distinct column with binary (numeric) values.
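As a rough illustration of the idea, the following base R sketch one-hot encodes a small factor by hand. It is not how dummify is implemented; the vector x and the "x_" column prefix are made up for this example.

```r
# Toy discrete feature with three categories (hypothetical data)
x <- factor(c("a", "b", "a", "c"))

# Build one 0/1 integer column per category
onehot <- sapply(levels(x), function(lv) as.integer(x == lv))

# Name the columns <feature>_<category>, similar to the output shown in the examples
colnames(onehot) <- paste0("x_", levels(x))

# Each row has exactly one 1, in the column of its category
print(onehot)
```

Every row of the resulting matrix sums to 1, because each observation belongs to exactly one category.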

Usage

dummify(data, maxcat = 50L, select = NULL)

Arguments

data

input data

maxcat

maximum categories allowed for each discrete feature. Default is 50.

select

names of selected features to be dummified. Default is NULL.

Details

Continuous features listed in select are ignored.

Features listed in select are also ignored if their number of categories exceeds maxcat.
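The two rules above can be sketched in base R as a filter over column names. This is only an approximation for illustration: eligible_features is a hypothetical helper, not a DataExplorer function, and it assumes that "discrete" means factor or character columns.

```r
# Hypothetical helper (not part of DataExplorer): returns the names of
# features that would qualify for dummification under the rules above.
eligible_features <- function(data, maxcat = 50L, select = NULL) {
  # Assumption: discrete features are factor or character columns
  is_discrete <- vapply(
    data,
    function(col) is.factor(col) || is.character(col),
    logical(1)
  )
  candidates <- names(data)[is_discrete]

  # Continuous features named in select drop out here
  if (!is.null(select)) candidates <- intersect(candidates, select)

  # Features whose category count exceeds maxcat drop out here
  n_cat <- vapply(data[candidates], function(col) length(unique(col)), integer(1))
  candidates[n_cat <= maxcat]
}

eligible_features(iris)                                         # "Species"
eligible_features(iris, select = c("Sepal.Length", "Species"))  # "Species"
eligible_features(iris, maxcat = 2L)                            # character(0)
```

In the second call, Sepal.Length is dropped because it is continuous; in the third, Species is dropped because its 3 categories exceed maxcat = 2.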

Value

Dataset in which the discrete features are dummified and all other original features are preserved. Column order may differ from the input.

Note

This differs from model.matrix, which aims to create a full-rank matrix for regression-like use cases. If you intend to create a design matrix, use model.matrix instead.
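To make the difference concrete, the sketch below uses only base R and a made-up data frame df. With the default treatment contrasts, model.matrix adds an intercept and drops the first factor level to keep the matrix full rank, whereas a one-hot encoding keeps one column per category.

```r
# Hypothetical toy data with a single three-level factor
df <- data.frame(
  species = factor(c("setosa", "versicolor", "virginica", "setosa"))
)

# Design matrix: intercept plus k - 1 indicator columns (first level dropped)
design <- model.matrix(~ species, data = df)
colnames(design)
# "(Intercept)" "speciesversicolor" "speciesvirginica"

# Removing the intercept yields one indicator column per level,
# which is closer to what dummification produces
indicators <- model.matrix(~ species - 1, data = df)
colnames(indicators)
# "speciessetosa" "speciesversicolor" "speciesvirginica"
```

The full set of indicator columns always sums to 1 per row, so combining it with an intercept would make the columns linearly dependent; that is why model.matrix drops a level by default.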

Examples

## Dummify iris dataset
str(dummify(iris))

## Dummify diamonds dataset, ignoring features with more than 5 categories
data("diamonds", package = "ggplot2")
str(dummify(diamonds, maxcat = 5))

## Dummify only the selected features
str(dummify(diamonds, select = c("cut", "color")))

Example output

'data.frame':	150 obs. of  7 variables:
 $ Sepal.Length      : num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width       : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length      : num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width       : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species_setosa    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Species_versicolor: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Species_virginica : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
2 features with more than 5 categories ignored!
color: 7 categories
clarity: 8 categories

Classes 'tbl_df', 'tbl' and 'data.frame':	53940 obs. of  14 variables:
 $ carat        : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ depth        : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table        : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price        : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x            : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y            : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z            : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
 $ color        : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity      : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ cut_Fair     : int  0 0 0 0 0 0 0 0 1 0 ...
 $ cut_Good     : int  0 0 1 0 1 0 0 0 0 0 ...
 $ cut_Ideal    : int  1 0 0 0 0 0 0 0 0 0 ...
 $ cut_Premium  : int  0 1 0 1 0 0 0 0 0 0 ...
 $ cut_Very.Good: int  0 0 0 0 0 1 1 1 0 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 
Classes 'tbl_df', 'tbl' and 'data.frame':	53940 obs. of  20 variables:
 $ carat        : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ depth        : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table        : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price        : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x            : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y            : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z            : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
 $ clarity      : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ cut_Fair     : int  0 0 0 0 0 0 0 0 1 0 ...
 $ cut_Good     : int  0 0 1 0 1 0 0 0 0 0 ...
 $ cut_Ideal    : int  1 0 0 0 0 0 0 0 0 0 ...
 $ cut_Premium  : int  0 1 0 1 0 0 0 0 0 0 ...
 $ cut_Very.Good: int  0 0 0 0 0 1 1 1 0 1 ...
 $ color_D      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ color_E      : int  1 1 1 0 0 0 0 0 1 0 ...
 $ color_F      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ color_G      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ color_H      : int  0 0 0 0 0 0 0 1 0 1 ...
 $ color_I      : int  0 0 0 1 0 0 1 0 0 0 ...
 $ color_J      : int  0 0 0 0 1 1 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 

DataExplorer documentation built on Dec. 16, 2020, 1:07 a.m.