categoryEncodings intends to provide a fast way to encode ‘factor’ or qualitative variables through various methods. The package uses data.table as the backend for speed, with as few other dependencies as possible. Most of the methods are based on Johannemann et al. (2019), Sufficient Representations for Categorical Variables (arXiv:1908.09874).
The current version infers automatically which variables are factors and uses a very simple heuristic to choose an encoding for each of them. Both choices can also be controlled manually.
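To make the idea of factor inference concrete, here is a minimal base R sketch of one possible heuristic: treat a column as categorical if it is already a factor or character vector, or if it has few distinct values. The helper name `infer_factor_columns` and the `max_levels` cutoff are hypothetical and chosen only for illustration; the rule the package actually applies may differ.

```r
# Hypothetical helper, not part of categoryEncodings.
# A column is flagged as categorical when it is a factor, a character vector,
# or has at most `max_levels` distinct values (an assumed cutoff).
infer_factor_columns <- function(df, max_levels = 10) {
  looks_categorical <- vapply(df, function(col) {
    is.factor(col) || is.character(col) || length(unique(col)) <= max_levels
  }, logical(1))
  names(df)[looks_categorical]
}

set.seed(1)
toy <- data.frame(
  x = rnorm(100),                                          # continuous
  letters_col = sample(letters[1:5], 100, replace = TRUE), # character
  binary = sample(1:2, 100, replace = TRUE),               # integer, 2 values
  stringsAsFactors = FALSE
)

inferred <- infer_factor_columns(toy)
print(inferred)
```

In this toy data the continuous column is left alone, while the character column and the two-valued integer column are both flagged.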
You can install the latest version of categoryEncodings from GitHub using the devtools package.
Soon the package will be submitted to CRAN, and hopefully will be accepted.
Here we want to encode all of the factors in a given data.frame.
library(categoryEncodings)

# five numeric columns plus one categorical column of letters
data_fm <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE)
)
# name only the sixth (categorical) column
colnames(data_fm)[6] <- "few_letters"

# encoding is done automatically, as is the inference of factors
result <- encode_categories(X = data_fm)
# note that due to the data.table backend, the result has to be saved to an object to be
# visible: otherwise printing is suppressed.
print(result)

data_fm <- cbind(
  data.frame(matrix(rnorm(5 * 100), ncol = 5)),
  sample(sample(letters, 10), 100, replace = TRUE),
  sample(sample(letters, 20), 100, replace = TRUE),
  sample(sample(1:10, 5), 100, replace = TRUE),
  sample(sample(1:50, 35), 100, replace = TRUE),
  sample(1:2, 100, replace = TRUE)
)
colnames(data_fm)[6:10] <- c(
  "few_letters", "many_letters", "some_numbers", "many_numbers", "binary"
)

# it does not matter how many factor variables there are, whether they are encoded as factors
# and whether you supply a method to encode them by - some simple inference of factors is done
# based on the number of distinct values in every variable - over a certain threshold
# a variable is deemed as essentially a factor, and treated as such for conversion
# you will be notified of which variables are being converted via a warning
result <- encode_categories(data_fm)
print(result)
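For a feel of what an encoded column can look like, the following base R sketch implements a simple means-style encoding in the spirit of the paper cited above: each level of a categorical column is replaced by the per-level means of the numeric columns. The function `means_encode` is a hypothetical illustration, not the package's implementation, and it uses base R rather than data.table to stay self-contained.

```r
# Hypothetical illustration, not part of categoryEncodings.
# Replaces `factor_col` with the per-level means of the other numeric columns.
means_encode <- function(df, factor_col) {
  is_num <- vapply(df, is.numeric, logical(1))
  numeric_cols <- setdiff(names(df)[is_num], factor_col)

  # one row per level, one column of means per numeric column
  level_means <- aggregate(
    df[numeric_cols],
    by = list(level = df[[factor_col]]),
    FUN = mean
  )
  names(level_means)[-1] <- paste0(factor_col, "_mean_", numeric_cols)

  # look up the row of means that belongs to each observation's level
  idx <- match(df[[factor_col]], level_means$level)
  encoded <- level_means[idx, -1, drop = FALSE]
  rownames(encoded) <- NULL

  # drop the original categorical column and append its encoding
  cbind(df[setdiff(names(df), factor_col)], encoded)
}

toy <- data.frame(
  x1 = c(1, 3, 10, 20),
  x2 = c(0, 2, 5, 7),
  group = c("a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

encoded <- means_encode(toy, "group")
print(encoded)
```

Here the rows in group "a" receive the means (2, 1) and the rows in group "b" receive (15, 6), so the single categorical column becomes two numeric columns.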
Pull requests are welcome. All contributions will be considered for acceptance, provided they are justifiable and the code is reasonable, regardless of who submits them. Please keep things civil; there is no need for negativity. Please also refrain from adding unnecessary dependencies (e.g. the pipe) to the package: pull requests that add an unnecessary dependency will be declined or put on hold until the code can be made dependency free.