onehot2sql: Prepare training data in R so that it is ready for XGBoost...

Description Usage Arguments Value Examples

Description

This function performs full one-hot encoding for all the categorical features inside the training data, with all NAs inside both categorical and numeric features preserved. Other than outputting a matrix model.matrix which is the data after processing, it also outputs meta information keeping track of all the transformation the function performs, while SQL query for the transformation is kept in output sql and write to the file specified by output_file_name. If meta is specified as input to the function, the transformation and the corresponding SQL query will follow what is kept in meta exactly.

Usage

1
2
3
onehot2sql(data, meta = NULL, sep = "_", ws_replace = TRUE,
  ws_replace_with = "", unique_id = NULL, output_file_name = NULL,
  input_table_name = NULL)

Arguments

data

Data object of class data.frame or data.table.

meta

Optional, a list keeps track of all the transformation that has been taken on the categorical features.

sep

Separation symbol between the categorical features and their levels, which will be the column names inside the output model.matrix, default to "_".

ws_replace

Boolean indicator controls whether white-space and punctuation inside categorical feature levels should be replaced, default to TRUE.

ws_replace_with

Replacing symbol, default to ” which means all white-space and punctuation should be removed.

unique_id

A row unique identifier is crucial for in-database scoring of XGBoost model. If not given, SQL query will be generated with id name "ROW_KEY".

output_file_name

Optional, a file name that the SQL query will write to.

input_table_name

Name of raw data table in the database, that the SQL query will select from. If not given, SQL query will be generated with table name "INPUT_TABLE".

Value

A list of 1). meta data tracking the transformation; 2). matrix model.matrix is the data after processing which is ready for XGBoost fitting; 3). SQL query sql performing the exact one-hot encoding in the database.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
library(data.table)
### load test data
df = data.frame(ggplot2::diamonds)
head(df)

d1 = data.frame(ggplot2::diamonds)
d1[1,2] = NA  # NA on 1st row cut
d1[2,5] = NA  # NA on 2nd row depth
head(d1)

d2 = data.table(ggplot2::diamonds)
d2[, cut:=factor(cut, ordered=FALSE)]
d2[, clarity:=as.character(clarity)]
d2[, tsdt:=as.IDate('2017-01-05')]
d2[1:3, tsdt:=tsdt-1]
head(d2)

### out is obtained for training data
out <- onehot2sql(df)
out1 <- onehot2sql(d1)  # NA is kept in the output
out2 <- onehot2sql(d2)  # all non-numeric features will be treated as categorical

### perform same transformation for new data when meta is given
# test-1: new data has column class change
newdata = df[1:5,]
newdata$cut = as.character(newdata$cut)
onehot2sql(newdata, meta=out$meta)$model.matrix

# test-2: new data has NA
newdata = df[1:5,]
newdata[1,1]=NA; newdata[2,1]=NA; newdata[3,2]=NA; newdata[3,3]=NA; newdata[5,4]=NA
onehot2sql(newdata, meta=out$meta)$model.matrix

# test-3: newdata has column with new elements
newdata = d2[1:5,]
newdata[5,clarity:='NEW']; newdata[1,tsdt:=as.IDate('2017-05-01')]
onehot2sql(newdata, meta=out2$meta)$model.matrix

# test-4: newdata has new columns
newdata = d2[1:5,]
newdata[,new_col:=1]
onehot2sql(newdata, meta=out2$meta)$model.matrix

# test-5: newdata is lacking some columns
newdata = d2[1:5,]
newdata[,cut:=NULL]
onehot2sql(newdata, meta=out2$meta)$model.matrix

Example output

  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
  carat       cut color clarity depth table price    x    y    z
1  0.23      <NA>     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1    NA    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
   carat       cut color clarity depth table price    x    y    z       tsdt
1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43 2017-01-04
2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31 2017-01-04
3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31 2017-01-04
4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63 2017-01-05
5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75 2017-01-05
6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48 2017-01-05
  (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
1           1  0.23          0          0           0           1           0
2           1  0.21          0          0           1           0           0
3           1  0.23          0          0           0           0           1
4           1  0.29          0          0           0           0           0
5           1  0.31          0          0           0           1           0
  clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
1           0            0            0       0       1       0       0       0
2           0            0            0       0       1       0       0       0
3           0            0            0       0       1       0       0       0
4           1            0            0       0       0       0       0       0
5           0            0            0       0       0       0       0       0
  color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
1       0       0        0        0         1           0            0  61.5
2       0       0        0        0         0           1            0  59.8
3       0       0        0        1         0           0            0  56.9
4       1       0        0        0         0           1            0  62.4
5       0       1        0        1         0           0            0  63.3
  price table    x    y    z
1   326    55 3.95 3.98 2.43
2   326    61 3.89 3.84 2.31
3   327    65 4.05 4.07 2.31
4   334    58 4.20 4.23 2.63
5   335    58 4.34 4.35 2.75
  (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
1           1    NA          0          0           0           1           0
2           1    NA          0          0           1           0           0
3           1  0.23          0          0           0           0           1
4           1  0.29          0          0           0           0           0
5           1  0.31         NA         NA          NA          NA          NA
  clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
1           0            0            0       0       1       0       0       0
2           0            0            0       0       1       0       0       0
3           0            0            0      NA      NA      NA      NA      NA
4           1            0            0       0       0       0       0       0
5          NA           NA           NA       0       0       0       0       0
  color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
1       0       0        0        0         1           0            0  61.5
2       0       0        0        0         0           1            0  59.8
3      NA      NA       NA       NA        NA          NA           NA  56.9
4       1       0        0        0         0           1            0  62.4
5       0       1        0        1         0           0            0  63.3
  price table    x    y    z
1   326    55 3.95 3.98 2.43
2   326    61 3.89 3.84 2.31
3   327    65 4.05 4.07 2.31
4   334    58 4.20 4.23 2.63
5   335    58 4.34 4.35 2.75
  (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
1           1  0.23          0          0           0           1           0
2           1  0.21          0          0           1           0           0
3           1  0.23          0          0           0           0           1
4           1  0.29          0          0           0           0           0
5           1  0.31          0          0           0           0           0
  clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
1           0            0            0       0       1       0       0       0
2           0            0            0       0       1       0       0       0
3           0            0            0       0       1       0       0       0
4           1            0            0       0       0       0       0       0
5           0            0            0       0       0       0       0       0
  color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
1       0       0        0        0         1           0            0  61.5
2       0       0        0        0         0           1            0  59.8
3       0       0        0        1         0           0            0  56.9
4       1       0        0        0         0           1            0  62.4
5       0       1        0        1         0           0            0  63.3
  price table tsdt_20170104 tsdt_20170105    x    y    z
1   326    55             0             0 3.95 3.98 2.43
2   326    61             1             0 3.89 3.84 2.31
3   327    65             1             0 4.05 4.07 2.31
4   334    58             0             1 4.20 4.23 2.63
5   335    58             0             1 4.34 4.35 2.75
  (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
1           1  0.23          0          0           0           1           0
2           1  0.21          0          0           1           0           0
3           1  0.23          0          0           0           0           1
4           1  0.29          0          0           0           0           0
5           1  0.31          0          0           0           1           0
  clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
1           0            0            0       0       1       0       0       0
2           0            0            0       0       1       0       0       0
3           0            0            0       0       1       0       0       0
4           1            0            0       0       0       0       0       0
5           0            0            0       0       0       0       0       0
  color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
1       0       0        0        0         1           0            0  61.5
2       0       0        0        0         0           1            0  59.8
3       0       0        0        1         0           0            0  56.9
4       1       0        0        0         0           1            0  62.4
5       0       1        0        1         0           0            0  63.3
  price table tsdt_20170104 tsdt_20170105    x    y    z
1   326    55             1             0 3.95 3.98 2.43
2   326    61             1             0 3.89 3.84 2.31
3   327    65             1             0 4.05 4.07 2.31
4   334    58             0             1 4.20 4.23 2.63
5   335    58             0             1 4.34 4.35 2.75
  (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2 clarity_VS1
1           1  0.23          0          0           0           1           0
2           1  0.21          0          0           1           0           0
3           1  0.23          0          0           0           0           1
4           1  0.29          0          0           0           0           0
5           1  0.31          0          0           0           1           0
  clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E color_F color_G color_H
1           0            0            0       0       1       0       0       0
2           0            0            0       0       1       0       0       0
3           0            0            0       0       1       0       0       0
4           1            0            0       0       0       0       0       0
5           0            0            0       0       0       0       0       0
  color_I color_J cut_Fair cut_Good cut_Ideal cut_Premium cut_VeryGood depth
1       0       0       NA       NA        NA          NA           NA  61.5
2       0       0       NA       NA        NA          NA           NA  59.8
3       0       0       NA       NA        NA          NA           NA  56.9
4       1       0       NA       NA        NA          NA           NA  62.4
5       0       1       NA       NA        NA          NA           NA  63.3
  price table tsdt_20170104 tsdt_20170105    x    y    z
1   326    55             1             0 3.95 3.98 2.43
2   326    61             1             0 3.89 3.84 2.31
3   327    65             1             0 4.05 4.07 2.31
4   334    58             0             1 4.20 4.23 2.63
5   335    58             0             1 4.34 4.35 2.75
Warning message:
In onehot2sql(newdata, meta = out2$meta) :
  Following columns are populated with NAs: 
cut

xgb2sql documentation built on May 2, 2019, 1:09 p.m.