Generates a tibble with features optimized for machine learning
preprocess_data(x, target = "Truth", reduce_cols = FALSE,
  factor_y = TRUE, impute = "zero", corr_cutoff = 0.9,
  freq_cut = 95/5, unique_cut = 10, k = 10, prepro_methods = NULL)
x |
data frame or tibble. |
target |
Name of the target (class label) column; defaults to "Truth". |
reduce_cols |
Logical. If TRUE, columns are reduced based on near-zero variance and correlation; if FALSE (default), no columns are removed. |
factor_y |
Logical. If TRUE (default), the target is recoded to a factor; if FALSE, the target is recoded to 0 and 1. |
impute |
Character. Method used to impute NA values: "knn", "mean", or "zero" (default). "knn" takes substantially longer to compute; "zero" replaces NA with 0. |
corr_cutoff |
Correlation coefficient cutoff above which highly correlated columns are removed; defaults to 0.90. |
freq_cut |
The cutoff for the ratio of the most common value to the second most common value; defaults to 95/5. |
unique_cut |
The cutoff for the percentage of distinct values out of the total number of samples; defaults to 10. |
k |
The number of nearest neighbours to use for knn imputation (defaults to 10). |
prepro_methods |
Character string or character vector of preprocessing methods, e.g. c("center", "scale", "BoxCox"). |
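The two quantities that freq_cut and unique_cut are compared against can be illustrated in base R. This is only a sketch of the metrics as the argument descriptions word them, not the code preprocess_data or nearZeroVar actually runs. The helper nzv_metrics and the toy data frame are hypothetical, and so is the assumption that a column is flagged when its frequency ratio exceeds freq_cut and its percentage of unique values falls below unique_cut.

```r
# Hypothetical helper: computes the two metrics described under
# freq_cut and unique_cut for a single column.
nzv_metrics <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  # Ratio of the most common value to the second most common value.
  # Assumption: a column with a single distinct value gets Inf here.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values out of the total number of samples.
  percent_unique <- 100 * length(counts) / length(v)
  c(freq_ratio = freq_ratio, percent_unique = percent_unique)
}

# Toy data: one nearly constant column and one fully varied column.
df <- data.frame(
  mostly_constant = c(rep(0, 98), 1, 2),
  varied = seq_len(100)
)

metrics <- sapply(df, nzv_metrics)

# Assumed flagging rule, using the documented defaults 95/5 and 10.
flagged <- names(df)[metrics["freq_ratio", ] > 95 / 5 &
                       metrics["percent_unique", ] < 10]
print(metrics)
print(flagged)
```

Raising freq_cut or lowering unique_cut makes this rule stricter, so fewer columns are flagged.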
Data is often messy and needs to be cleaned prior to use in machine learning. preprocess_data can help with this but isn't a complete solution.
Regardless of argument specification, this function will ungroup data if grouped and turn Inf values into NA. Beyond that, the user can specify whether to convert the target variable into a factor (default) or to 0 and 1 with factor_y = FALSE; whether to impute NAs using mean, knn, or replacement with 0 (default) using impute; and whether to reduce columns with reduce_cols. When reduce_cols is set to TRUE, freq_cut and unique_cut can also be set to exclude more or fewer columns. See the argument definitions in nearZeroVar for further information.
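The cleaning steps described above can be sketched in base R to show what each option roughly amounts to. This is an illustration under stated assumptions, not the function's actual implementation: the toy data frame is made up, only numeric columns are imputed here, and treating "yes" as the level coded 1 is an arbitrary choice for the example.

```r
# Toy data with an Inf and an NA in a numeric column (hypothetical).
x <- data.frame(a = c(1, Inf, 3, NA),
                Truth = c("yes", "no", "yes", "no"))

num_cols <- vapply(x, is.numeric, logical(1))

# Step applied regardless of arguments: Inf values become NA.
x[num_cols] <- lapply(x[num_cols],
                      function(v) replace(v, is.infinite(v), NA))

# Roughly what impute = "zero" does: replace NA with 0.
zero_imputed <- x
zero_imputed[num_cols] <- lapply(x[num_cols],
                                 function(v) replace(v, is.na(v), 0))

# Roughly what impute = "mean" does: replace NA with the column mean.
mean_imputed <- x
mean_imputed[num_cols] <- lapply(
  x[num_cols],
  function(v) replace(v, is.na(v), mean(v, na.rm = TRUE))
)

# factor_y = TRUE: target as a factor.
target_factor <- factor(x$Truth)

# factor_y = FALSE: target as 0/1 (assumption: "yes" maps to 1).
target_binary <- as.integer(x$Truth == "yes")
```

Zero imputation is cheap but can distort a column whose values sit far from 0, which is one reason to consider the mean or knn options.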
This function returns a tibble of optimized features.
"Dallin Webb <dallinwebb@byui.edu>"
preProcess, nearZeroVar, findCorrelation, cor
## Not run:
library(caret)
data(dhfr)

# Reduce columns using the default cutoffs
dhfr_reduced <- preprocess_data(dhfr, target = "Y", reduce_cols = TRUE)

# Reduce columns with custom cutoffs, mean imputation, and extra preprocessing
dhfr_reduced <- preprocess_data(dhfr,
                                target = "Y",
                                reduce_cols = TRUE,
                                impute = "mean",
                                freq_cut = 2,
                                unique_cut = 20,
                                prepro_methods = c("center", "scale", "BoxCox"))
## End(Not run)