preprocess_data: Preprocess data prior to running machine learning

Description

Function to preprocess your data for input into run_ml().

Usage

preprocess_data(
  dataset,
  outcome_colname,
  method = c("center", "scale"),
  remove_var = "nzv",
  collapse_corr_feats = TRUE,
  to_numeric = TRUE,
  group_neg_corr = TRUE,
  prefilter_threshold = 1
)

Arguments

dataset

Data frame with an outcome variable and other columns as features.

outcome_colname

Column name as a string of the outcome variable (default NULL; the first column will be chosen automatically).

method

Methods to preprocess the data, described in caret::preProcess() (default: c("center","scale"), use NULL for no normalization).

remove_var

Whether to remove variables with near-zero variance ('nzv'; default), zero variance ('zv'), or none (NULL).

collapse_corr_feats

Whether to keep only one feature from each set of perfectly correlated features.

to_numeric

Whether to change features to numeric where possible.

group_neg_corr

Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)).

prefilter_threshold

Remove features that have non-zero, non-NA values in N rows or fewer (default: 1). Set this to -1 to keep all columns at this step. This step is also skipped if to_numeric is set to FALSE.
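
To make the prefilter_threshold rule concrete, the base R sketch below counts, for each feature, the rows that hold a non-zero, non-NA value and keeps only the features whose count exceeds the threshold. It illustrates the documented criterion on a made-up data frame; the names toy_data, feats, n_informative, and keep are hypothetical, and this is not the package's internal implementation.

```r
# Toy data set (hypothetical): one outcome column and three features
toy_data <- data.frame(
  outcome = c("a", "b", "a", "b"),
  feat1 = c(0, 0, 0, 5),   # non-zero, non-NA in 1 row
  feat2 = c(1, 2, NA, 0),  # non-zero, non-NA in 2 rows
  feat3 = c(0, NA, 0, 0)   # non-zero, non-NA in 0 rows
)

threshold <- 1  # same meaning as prefilter_threshold = 1

# Feature columns only
feats <- toy_data[, setdiff(names(toy_data), "outcome"), drop = FALSE]

# Count rows per feature with a non-zero, non-NA value
# (NA comparisons yield NA and are dropped by na.rm = TRUE)
n_informative <- colSums(feats != 0, na.rm = TRUE)

# Features present in N rows or fewer would be removed; the rest are kept
keep <- names(n_informative)[n_informative > threshold]
print(keep)
```

With a threshold of 1, only feat2 survives in this toy case, because feat1 and feat3 carry information in at most one row.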

Value

Named list including:

  • dat_transformed: Preprocessed data.

  • grp_feats: If features were grouped together, a named list of the features corresponding to each group.

  • removed_feats: Any features that were removed during preprocessing (e.g. because there was zero variance or near-zero variance for those features).

If the progressr package is installed, a progress bar with time elapsed and estimated time to completion can be displayed.
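
As a minimal sketch of working with the returned list, the snippet below inspects the three documented elements. It assumes the mikropml package and its otu_small data set are available; the variable names preproc and dat are illustrative only.

```r
# Guarded so the snippet does nothing when mikropml is not installed
if (requireNamespace("mikropml", quietly = TRUE)) {
  preproc <- mikropml::preprocess_data(mikropml::otu_small, "dx")

  # Names of the elements in the returned list
  print(names(preproc))

  # Preprocessed data, intended as input for run_ml()
  dat <- preproc$dat_transformed
  print(dim(dat))

  # Features removed during preprocessing
  str(preproc$removed_feats)

  # Feature groups, if any features were grouped together
  str(preproc$grp_feats)
}
```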

More details

See the preprocessing vignette for more details.

Note that if any values in outcome_colname contain spaces, they will be converted to underscores for compatibility with caret.
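
The effect of the space-to-underscore conversion can be mimicked in base R as follows. This is only an illustration of the resulting labels on hypothetical outcome values, not the code the package runs.

```r
# Hypothetical outcome labels, one of which contains a space
outcome_values <- c("healthy control", "cancer", "healthy control")

# Replace every space with an underscore
cleaned <- gsub(" ", "_", outcome_values, fixed = TRUE)
print(cleaned)
```

Downstream code that matches on outcome labels should therefore expect the underscore form.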

Author(s)

Zena Lapp, zenalapp@umich.edu

Kelly Sovacool, sovacool@umich.edu

Examples

preprocess_data(mikropml::otu_small, "dx")

# the function can show a progress bar if you have the progressr package installed
## optionally, specify the progress bar format
progressr::handlers(progressr::handler_progress(
  format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
  clear = FALSE,
  show_after = 0
))
## tell progressr to always report progress
## Not run: 
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")

## End(Not run)
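
The documented arguments can also be set to non-default values. The sketch below, which assumes mikropml is installed, turns off normalization, removes only zero-variance features, keeps perfectly correlated features, and disables the prefilter step; the variable name minimal is illustrative.

```r
# Guarded so the snippet does nothing when mikropml is not installed
if (requireNamespace("mikropml", quietly = TRUE)) {
  minimal <- mikropml::preprocess_data(
    mikropml::otu_small,
    outcome_colname = "dx",
    method = NULL,                # no normalization
    remove_var = "zv",            # remove zero-variance features only
    collapse_corr_feats = FALSE,  # keep perfectly correlated features
    prefilter_threshold = -1      # keep all columns at the prefilter step
  )

  # Dimensions of the lightly preprocessed data
  print(dim(minimal$dat_transformed))
}
```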

SchlossLab/mikropml documentation built on Aug. 24, 2023, 9:51 p.m.