options(width=250)
devtools::install_github("https://github.com/Zelazny7/onehot")
set.seed(100)
test <- data.frame(
  factor    = factor(sample(c(NA, letters[1:3]), 100, TRUE)),
  integer   = as.integer(runif(100) * 10),
  real      = rnorm(100),
  logical   = sample(c(TRUE, FALSE), 100, TRUE),
  character = sample(letters, 100, TRUE),
  stringsAsFactors = FALSE)
head(test)
A onehot object stores information about the columns of a data.frame. This information is used to transform a data.frame into a onehot encoded matrix. Save the object so that future datasets can be transformed into exactly the same layout.
library(onehot)
encoder <- onehot(test)
## print a summary
encoder
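To illustrate why the stored layout matters, here is a minimal base-R sketch. It does not use onehot itself: the `layout` list is a hypothetical stand-in for an encoder and holds only the factor levels seen during training. It is saved with `saveRDS`, restored with `readRDS`, and re-applied to new data so that the encoded columns line up with the training columns.

```r
# Hypothetical stand-in for an encoder: just the factor levels seen in training.
train <- data.frame(color = factor(c("red", "green", "blue", "green")))
layout <- list(color = levels(train$color))

# Persist the layout and load it back, as one would before scoring new data.
path <- tempfile(fileext = ".rds")
saveRDS(layout, path)
restored <- readRDS(path)

# New data that happens to contain only one of the training levels.
new_data <- data.frame(color = c("green", "green"), stringsAsFactors = FALSE)

# Re-apply the saved levels so the encoded columns match the training layout.
f <- factor(new_data$color, levels = restored$color)
encoded <- sapply(levels(f), function(lv) as.integer(f == lv))
colnames(encoded)
```

Without the saved levels, the new data would yield a single `green` column and no longer match a model trained on three columns.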
The onehot object has a predict method which may be used to transform a data.frame. Factors are onehot encoded. Character variables are skipped. However, calling predict with stringsAsFactors=TRUE will convert character vectors to factors first.
train_data <- predict(encoder, test)
head(train_data)
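For readers unfamiliar with the term, the following base-R sketch shows what onehot encoding a factor amounts to: one 0/1 indicator column per level. The helper `one_hot` is a hypothetical name used for illustration and is not part of the onehot package; the package's own implementation may differ.

```r
# One indicator column per factor level, built by comparing against each level.
# Assumes the factor has no NA entries.
one_hot <- function(f) {
  stopifnot(is.factor(f))
  m <- vapply(levels(f), function(lv) as.integer(f == lv), integer(length(f)))
  # vapply returns a plain vector when length(f) == 1, so force a matrix shape.
  matrix(m, nrow = length(f), dimnames = list(NULL, levels(f)))
}

x <- factor(c("a", "c", "b", "a"))
one_hot(x)

# A character vector has no levels, so it must be converted to a factor first.
ch <- c("x", "y", "x")
one_hot(factor(ch))
```

Each row of the result contains exactly one 1, in the column of that row's level.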
add_NA_factors=TRUE (the default) will create an NA indicator column for every factor column. If a factor already has NA as one of its levels, an indicator column is created for it even without this option.
encoder <- onehot(test, add_NA_factors=TRUE)
train_data <- predict(encoder, test)
head(train_data)
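The two cases described above can be sketched in base R, independent of the onehot package: a separate indicator that flags missing entries, and a factor in which NA is an explicit level (via `addNA`) and is therefore encoded like any other level. This is an illustration of the idea only, not the package's internal code.

```r
f <- factor(c("a", NA, "b", "a"))

# Case 1: a separate 0/1 indicator flagging the missing entries.
na_indicator <- as.integer(is.na(f))
na_indicator

# Case 2: make NA an explicit level, so it is treated like any other level.
f_na <- addNA(f)
levels(f_na)
```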
The sentinel=VALUE argument will replace all numeric NAs with the provided value. Some ML algorithms, such as randomForest, do not handle NA values. With a sentinel value, however, tree-based algorithms are usually able to separate the formerly missing entries given enough decision-tree splits. The default value is -999.
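As a rough illustration of sentinel replacement, the base-R sketch below fills NAs in numeric columns with a fixed value and leaves other column types untouched. The function name `replace_na_numeric` is hypothetical and only mirrors the behaviour described above; it is not the package's implementation.

```r
# Replace NAs in numeric columns with a sentinel; other columns are untouched.
replace_na_numeric <- function(df, sentinel = -999) {
  for (nm in names(df)) {
    col <- df[[nm]]
    if (is.numeric(col)) {
      col[is.na(col)] <- sentinel
      df[[nm]] <- col
    }
  }
  df
}

d <- data.frame(x = c(1.5, NA, 3), y = c("a", NA, "c"),
                stringsAsFactors = FALSE)
replace_na_numeric(d)
```

A sentinel should lie well outside the observed range of the data, so that a tree split can isolate it.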
onehot also provides support for predicting sparse, column compressed matrices from the Matrix package:
encoder <- onehot(test)
train_data <- predict(encoder, test, sparse=TRUE)
head(train_data)
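To show what a sparse, column compressed encoding looks like, the sketch below builds one directly with `Matrix::sparseMatrix` for a single factor. It assumes the factor has no NA entries and illustrates the storage format only, not how onehot constructs its output.

```r
library(Matrix)

f <- factor(c("a", "c", "b", "a"))

# Row i gets a single 1 in the column given by its integer level code.
sp <- sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
                   dims = c(length(f), nlevels(f)),
                   dimnames = list(NULL, levels(f)))
sp
```

Only the non-zero entries are stored, which saves memory when factors have many levels.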