preprocess_data: Pre-process cognostic data

Description

Generates a tibble with features optimized for machine learning

Usage

preprocess_data(x, target = "Truth", reduce_cols = FALSE,
  factor_y = TRUE, impute = "zero", corr_cutoff = 0.9,
  freq_cut = 95/5, unique_cut = 10, k = 10, prepro_methods = NULL)

Arguments

x

data frame or tibble.

target

name of the target (class label) column; defaults to "Truth"

reduce_cols

logical. If TRUE, columns are reduced based on near-zero variance and correlation; if FALSE (default), no columns are removed.

factor_y

logical. If TRUE (default), the target is recoded as a factor; if FALSE, it is recoded to 0 and 1.

impute

character. Method used to impute NA values: "knn", "mean", or "zero" (default). "knn" takes substantially longer to compute; "zero" replaces NA with 0.

corr_cutoff

Correlation coefficient level at which highly correlated columns are cut off; defaults to 0.9.

freq_cut

the cutoff for the ratio of the most common value to the second most common value

unique_cut

the cutoff for the percentage of distinct values out of the number of total samples

k

the number of nearest neighbours to use for knn imputation (defaults to 10)

prepro_methods

string or vector of strings of preprocessing methods
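
The two quantities that freq_cut and unique_cut are compared against can be illustrated in base R. This is a minimal sketch, not the package's or caret's implementation: nzv_metrics is a hypothetical helper, and the flagging rule (frequency ratio above freq_cut and percent unique below unique_cut) follows the rule documented for caret's nearZeroVar.

```r
# Hypothetical helper (not part of BYUImachine or caret): computes the two
# metrics described by the freq_cut and unique_cut arguments for one column.
nzv_metrics <- function(v) {
  counts <- sort(table(v), decreasing = TRUE)
  # Ratio of the most common value to the second most common value.
  # Assumption: a single-valued column is treated as an infinite ratio here.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values out of the total number of samples.
  pct_unique <- 100 * length(counts) / length(v)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# A column that is almost constant: 98 zeros, one 1, one 2.
mostly_constant <- c(rep(0, 98), 1, 2)
m <- nzv_metrics(mostly_constant)

# Compare against the documented defaults freq_cut = 95/5 and unique_cut = 10.
flagged <- m[["freq_ratio"]] > 95 / 5 && m[["pct_unique"]] < 10
print(m)
print(flagged)
```

Raising freq_cut or lowering unique_cut makes the rule harder to satisfy, so fewer columns are flagged for removal.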

Details

Data is often messy and needs to be cleaned before it is used in machine learning. preprocess_data can help with this, but it is not a complete solution. Regardless of the arguments supplied, the function ungroups grouped data and converts Inf values to NA. Beyond that, the user can specify whether to convert the target variable into a factor (the default) or to 0 and 1 with factor_y = FALSE; whether to impute NA values by mean, by knn, or by replacing them with 0 (the default) using impute; and whether to reduce columns with reduce_cols. When reduce_cols is TRUE, freq_cut and unique_cut can also be set to exclude more or fewer columns. See the argument definitions in nearZeroVar for further information.
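
The Inf-to-NA conversion and the "zero" and "mean" imputation options can be sketched in base R as follows. This only illustrates the behaviour described above and is not the function's internal code; the toy data frame is made up, and per-column mean imputation is an assumption about what "mean" does.

```r
# Toy data with infinite and missing values (illustrative only).
df <- data.frame(a = c(1, Inf, 3), b = c(NA, 2, -Inf))

# Step described as always applied: Inf and -Inf become NA.
df[] <- lapply(df, function(col) {
  if (is.numeric(col)) col[is.infinite(col)] <- NA
  col
})

# Behaviour described for impute = "zero": replace NA with 0.
zero_imputed <- df
zero_imputed[is.na(zero_imputed)] <- 0

# Assumed behaviour for impute = "mean": replace NA with the column mean.
mean_imputed <- df
mean_imputed[] <- lapply(mean_imputed, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

print(zero_imputed)
print(mean_imputed)
```

Because infinite values are turned into NA first, they are filled in by whichever imputation method is chosen.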

Value

This function returns a tibble of optimized features

Author(s)

Dallin Webb <dallinwebb@byui.edu>

See Also

preProcess, nearZeroVar, findCorrelation, cor

Examples

## Not run: 
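library(BYUImachine)  # provides preprocess_data
library(caret)        # provides the dhfr data set
data(dhfr)

# Reduce columns using the default cutoffs
dhfr_reduced <- preprocess_data(dhfr, target = "Y", reduce_cols = TRUE)

# Reduce columns with mean imputation, custom cutoffs, and extra
# preprocessing methods
dhfr_reduced <- preprocess_data(dhfr,
                                target         = "Y",
                                reduce_cols    = TRUE,
                                impute         = "mean",
                                freq_cut       = 2,
                                unique_cut     = 20,
                                prepro_methods = c("center", "scale", "BoxCox"))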

## End(Not run)

BYUIDSS/BYUImachine documentation built on May 3, 2019, 5:22 p.m.