View source: R/splendid_process.R
splendid_process | R Documentation |
Process the data by converting categorical predictors to dummy variables, standardizing continuous predictors, and applying subsampling techniques.
splendid_process(
data,
class,
algorithms,
convert = FALSE,
standardize = FALSE,
sampling = c("none", "up", "down", "smote"),
seed_samp = NULL
)
data | data frame with rows as samples, columns as features |
class | true/reference class vector used for supervised learning |
algorithms | character vector of algorithms to use for supervised learning. See Algorithms section for possible options. By default, this argument is |
convert | logical; if |
standardize | logical; if |
sampling | the default is "none", in which case no subsampling is performed. Other options include "up" (up-sampling the minority class), "down" (down-sampling the majority class), and "smote" (generating synthetic points for the minority class and down-sampling the majority class). Subsampling is only applicable to the training set. |
seed_samp | random seed used for reproducibility in subsampling training sets for model generation |
If all the variables in the original data are already continuous, nothing is
done. Otherwise, conversion is performed using dummify() if convert = TRUE.
If there are categorical variables and convert = FALSE, an error is thrown
indicating exactly which of the specified algorithms require data conversion.
The classification algorithms LDA and the MLR family have this limitation.
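As a rough illustration of what converting categorical predictors to dummy variables means, here is a minimal base R sketch. It is not the implementation of dummify(); the helper name to_dummies() is hypothetical, and the use of model.matrix() with its default treatment contrasts is an assumption made for the example.

```r
# Hypothetical helper (not part of the package): expand categorical columns
# into 0/1 indicator columns and leave all-continuous data untouched.
to_dummies <- function(df) {
  is_cat <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  if (!any(is_cat)) {
    return(df)  # nothing to convert
  }
  # model.matrix() expands factors with treatment contrasts by default;
  # the intercept column is dropped afterwards.
  mm <- model.matrix(~ ., data = df)
  as.data.frame(mm[, -1, drop = FALSE])
}

dummy_iris <- to_dummies(iris)
names(dummy_iris)
```

With treatment contrasts, a factor with k levels yields k - 1 indicator columns, so the first level acts as the reference category.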
Continuous predictors can be scaled to have zero mean and unit variance with
standardize = TRUE. Dummy variables coded to 0 or 1 are never standardized.
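The standardization rule can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own code: the helper name standardize_continuous() is hypothetical, and a column is treated as a dummy variable only if every value is 0 or 1.

```r
# Hypothetical helper: scale numeric columns to zero mean and unit variance,
# skipping columns that contain only 0/1 values (assumed to be dummies).
standardize_continuous <- function(df) {
  is_cont <- vapply(
    df,
    function(x) is.numeric(x) && !all(x %in% c(0, 1)),
    logical(1)
  )
  # scale() returns a one-column matrix; as.numeric() flattens it to a vector
  df[is_cont] <- lapply(df[is_cont], function(x) as.numeric(scale(x)))
  df
}

x <- data.frame(a = c(1, 2, 3, 4), b = c(0, 1, 1, 0))
z <- standardize_continuous(x)
z
```

Here column a is centered and scaled, while the 0/1 column b is returned unchanged.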
Subsampling techniques can be applied with sampling methods passed to
subsample().
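To make the "down" option concrete, the following base R sketch down-samples every class to the size of the smallest one. It is not the implementation of subsample(); the helper name down_sample() and its seed argument are hypothetical and only mimic the role of seed_samp.

```r
# Hypothetical helper: randomly keep as many rows per class as the
# smallest class has, so that all classes end up the same size.
down_sample <- function(df, class, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  class <- factor(class)          # drops unused levels
  n_min <- min(table(class))      # size of the smallest class
  idx <- unlist(lapply(
    split(seq_along(class), class),
    function(i) i[sample.int(length(i), n_min)]
  ))
  df[sort(idx), , drop = FALSE]
}

iris2 <- iris[1:130, ]            # 50 setosa, 50 versicolor, 30 virginica
down <- down_sample(iris2, iris2$Species, seed = 1)
table(down$Species)               # 30 rows per class
```

Fixing the seed makes the selected rows reproducible across runs, which is the purpose the seed_samp argument serves.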
A pre-processed data frame for model training
Derek Chiu
dummify(), subsample()
data(hgsc)
cl <- attr(hgsc, "class.true")
# Nothing happens if data is all continuous
data_same <- splendid_process(hgsc, class = cl, algorithms = "lda",
                              convert = TRUE)
identical(hgsc, data_same)
# Dummy variables created if there are categorical variables
data_dummy <- splendid_process(iris, class = iris$Species, algorithms = "lda",
                               convert = TRUE)
head(data_dummy)
# Some algorithms are robust to the covariate data structure
data_robust <- splendid_process(iris, class = iris$Species, algorithms = "rf",
                                convert = FALSE)
identical(iris, data_robust)
# Standardize and down-sample
iris2 <- iris[1:130, ]
data_scale_down <- splendid_process(iris2, class = iris2$Species,
                                    algorithms = "rf", standardize = TRUE,
                                    sampling = "down")
dim(data_scale_down)
# Other algorithms require conversion
## Not run:
splendid_process(iris, class = iris$Species, algorithms = "lda",
                 convert = FALSE)
## End(Not run)