preprocess_data: Pre-process cognostic data
In BYUIDSS/BYUImachine: Machine Learning Model Selection of Cognostic Data


Generates a tibble with features optimized for machine learning.

preprocess_data(x, target = "Truth", reduce_cols = FALSE,
  factor_y = TRUE, impute = "zero", corr_cutoff = 0.9,
  freq_cut = 95/5, unique_cut = 10, k = 10, prepro_methods = NULL)

`x`	A data frame or tibble.
`target`	Name of the classifier (target) column. Defaults to "Truth".
`reduce_cols`	Logical. If TRUE, columns are removed based on near-zero variance and high correlation; if FALSE (default), no columns are removed.
`factor_y`	Logical. If TRUE (default), the target is recoded as a factor; if FALSE, it is recoded to 0 and 1.
`impute`	Character. How to impute NA values: "knn", "mean", or "zero" (default). "knn" takes substantially longer to compute; "zero" replaces NA with 0.
`corr_cutoff`	Correlation coefficient level used to cut off highly correlated columns. Defaults to 0.9.
`freq_cut`	Cutoff for the ratio of the most common value to the second most common value. Defaults to 95/5.
`unique_cut`	Cutoff for the percentage of distinct values out of the total number of samples. Defaults to 10.
`k`	Number of nearest neighbours to use for knn imputation. Defaults to 10.
`prepro_methods`	String or character vector of preprocessing methods, for example c("center", "scale", "BoxCox"). Defaults to NULL.
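
To make the two near-zero-variance cutoffs concrete, the following base R sketch computes the frequency ratio and the percentage of distinct values for each column of a toy data frame, then applies the default cutoffs. The helper `nzv_stats` and the toy columns are hypothetical and only illustrate the definitions above; the package itself relies on nearZeroVar, whose exact flagging rule may differ in detail.

```r
# Hypothetical helper: the two statistics that freq_cut and unique_cut
# are compared against, computed for a single column.
nzv_stats <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values out of the number of samples
  pct_unique <- 100 * length(counts) / length(v)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Toy data (made up for illustration)
toy <- data.frame(
  mostly_zero = c(rep(0, 99), 1),  # one value dominates
  varied      = seq_len(100)       # every value is distinct
)

stats <- sapply(toy, nzv_stats)

# Assumed rule: flag a column when the frequency ratio is above freq_cut
# and the percentage of distinct values is below unique_cut.
flagged <- stats["freq_ratio", ] > 95 / 5 & stats["pct_unique", ] < 10
print(stats)
print(flagged)
```

Under this rule, lowering freq_cut or raising unique_cut flags more columns for removal, and the opposite changes flag fewer.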

Data is often messy and needs to be cleaned before it is used in machine learning. preprocess_data can help with this, but it is not a complete solution. Regardless of how the arguments are specified, this function ungroups the data if it is grouped and turns Inf values into NA. Beyond that, the user can specify whether to convert the target variable into a factor (default) or to 0 and 1 with factor_y = FALSE; whether to impute NAs using the mean, knn, or replacement with 0 (default) via impute; and whether to reduce columns with reduce_cols. When reduce_cols is set to TRUE, freq_cut and unique_cut can also be set to exclude more or fewer columns. See the argument definitions in nearZeroVar for further information.
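
The cleaning steps described above can be approximated in base R as follows. This is a minimal sketch on made-up data, not the package's implementation: the column names are hypothetical, knn imputation and ungrouping are left out, and which class is mapped to 0 or 1 is an assumption.

```r
# Toy data; column names and values are made up for illustration.
df <- data.frame(
  Truth = c("yes", "no", "yes", "no"),
  a     = c(1, Inf, 3, NA),
  b     = c(-Inf, 2, NA, 4)
)
num_cols <- setdiff(names(df), "Truth")

# Always applied: turn Inf and -Inf into NA.
df[num_cols] <- lapply(df[num_cols], function(v) replace(v, is.infinite(v), NA))

# impute = "zero": replace every NA with 0.
zero_imputed <- df
zero_imputed[num_cols] <- lapply(df[num_cols], function(v) replace(v, is.na(v), 0))

# impute = "mean": replace every NA with the column mean of the observed values.
mean_imputed <- df
mean_imputed[num_cols] <- lapply(
  df[num_cols],
  function(v) replace(v, is.na(v), mean(v, na.rm = TRUE))
)

# factor_y = TRUE: target as a factor.
target_factor <- factor(df$Truth)
# factor_y = FALSE: target as 0 and 1 (here the first factor level becomes 0,
# which is an assumption of this sketch).
target_binary <- as.integer(target_factor) - 1L
```

Comparing `zero_imputed` with `mean_imputed` shows the trade-off between the two cheap options: zero imputation is simple but can distort columns whose typical values are far from 0, while mean imputation keeps each column's observed mean unchanged.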

This function returns a tibble of optimized features.

Dallin Webb <dallinwebb@byui.edu>

preProcess, nearZeroVar, findCorrelation, cor

## Not run: 
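library(BYUImachine)  # provides preprocess_data
library(caret)        # provides the dhfr example data
data(dhfr)

# Reduce columns using the default cutoffs
dhfr_reduced <- preprocess_data(dhfr, target = "Y", reduce_cols = TRUE)

# Mean imputation, custom near-zero-variance cutoffs, and extra preprocessing
dhfr_reduced <- preprocess_data(dhfr,
                                target         = "Y",
                                reduce_cols    = TRUE,
                                impute         = "mean",
                                freq_cut       = 2,
                                unique_cut     = 20,
                                prepro_methods = c("center", "scale", "BoxCox"))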

## End(Not run)