preprocess_data: Preprocess data prior to running machine learning
In mikropml: User-Friendly R Package for Supervised Machine Learning Pipelines

preprocess_data

R Documentation

Preprocess data prior to running machine learning

Description

Function to preprocess your data for input into run_ml().

Usage

preprocess_data(dataset, ...)

## S4 method for signature 'TreeSummarizedExperiment'
preprocess_data(
  dataset,
  outcome_colname,
  assay.type = "counts",
  col.var = NULL,
  altexp = NULL,
  name = "preprocessed",
  ...
)

## S4 method for signature 'ANY'
preprocess_data(
  dataset,
  outcome_colname,
  method = c("center", "scale"),
  remove_var = "nzv",
  collapse_corr_feats = TRUE,
  corr_method = "spearman",
  corr_thresh = 1,
  to_numeric = TRUE,
  group_neg_corr = TRUE,
  prefilter_threshold = 1,
  ...
)

Arguments

`dataset`	Data frame with an outcome variable and other columns as features. Alternatively, the input can be in `TreeSummarizedExperiment` format.
`...`	All additional arguments are passed on to `caret::train()`, such as case weights via the `weights` argument or `ntree` for `rf` models. See the `caret::train()` docs for more details.
`outcome_colname`	Column name as a string of the outcome variable (default `NULL`; the first column will be chosen automatically).
`assay.type`	The name of the assay in `dataset` to use as input when the object is in `TreeSummarizedExperiment` format.
`col.var`	The names of sample metadata variables from the `colData` slot of `dataset` when the object is in `TreeSummarizedExperiment` format. These variables are used as predictors.
`altexp`	The name of the alternative experiment (`altExp`) in `dataset` when the object is in `TreeSummarizedExperiment` format. This can be used to select an experiment for the input.
`name`	Name under which the results are stored when the input is a `TreeSummarizedExperiment`. The same name is used for both `assay` and `altExp`.
`method`	Methods to preprocess the data, described in `caret::preProcess()` (default: `c("center","scale")`, use `NULL` for no normalization).
`remove_var`	Whether to remove variables with near-zero variance (`'nzv'`; default), zero variance (`'zv'`), or none (`NULL`).
`collapse_corr_feats`	Whether to keep only one feature from each group of correlated features (see `corr_method` and `corr_thresh`).
`corr_method`	Correlation method. Options are the same as those supported by `stats::cor`: spearman, pearson, kendall. (default: spearman)
`corr_thresh`	Group features whose correlation is greater than or equal to `corr_thresh` (range `0` to `1`; default: `1`).
`to_numeric`	Whether to change features to numeric where possible.
`group_neg_corr`	Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)).
`prefilter_threshold`	Remove features which only have non-zero & non-NA values in N rows or fewer (default: 1). Set this to -1 to keep all columns at this step. This step will also be skipped if `to_numeric` is set to `FALSE`.
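The filtering and grouping arguments above can be illustrated with a small base-R sketch. This is not mikropml's implementation: the toy data, the variable names, the tolerance `tol`, and the use of an absolute value to mimic `group_neg_corr` are all assumptions made for illustration. The first part counts, per feature, the rows with non-zero, non-NA values and drops features at or below the threshold; the second part shows why the `c(0,1)` / `c(1,0)` pair only meets a threshold of `1` when the sign of the correlation is ignored.

```r
# --- Part 1: the idea behind prefilter_threshold (toy data, hypothetical) ---
feats <- data.frame(
  rare   = c(0, 0, 0, 0, 3),   # non-zero in 1 row
  sparse = c(1, NA, 0, 2, 0),  # non-zero and non-NA in 2 rows
  common = c(4, 5, 6, 7, 8)    # non-zero in all 5 rows
)
prefilter_threshold <- 1

# Count rows per feature that are both non-NA and non-zero
n_present <- colSums(!is.na(feats) & feats != 0)

# Keep only features present in more than `prefilter_threshold` rows
kept <- feats[, n_present > prefilter_threshold, drop = FALSE]
names(kept)

# --- Part 2: the idea behind group_neg_corr (assumed: compare |rho|) ---
x <- c(0, 1, 0, 1)
y <- c(1, 0, 1, 0)
rho <- cor(x, y, method = "spearman")  # perfectly negatively correlated

corr_thresh <- 1
tol <- 1e-8  # small tolerance for floating-point error (sketch assumption)

grouped_signed <- rho >= corr_thresh - tol       # sign respected
grouped_abs    <- abs(rho) >= corr_thresh - tol  # sign ignored
c(signed = grouped_signed, absolute = grouped_abs)
```

In this sketch `rare` is dropped because it has a value in only one row, and the two mirrored vectors are grouped only under the absolute-value comparison.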

Value

Named list including:

dat_transformed: Preprocessed data.
grp_feats: If features were grouped together, a named list of the features corresponding to each group.
removed_feats: Any features that were removed during preprocessing (e.g. because there was zero variance or near-zero variance for those features).

If the input is a TreeSummarizedExperiment, the output is added to the input object as additional data. If the set of features in the output matches the input, the results are stored directly in the assay slot. If they do not match, the output is stored in the altExp slot of the object.
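For data frame input, the returned list can be inspected element by element. The sketch below assumes the mikropml package is installed and uses its bundled `otu_small` dataset, as in the Examples section; the object name `preproc` is arbitrary, and the call is guarded so the code does nothing if the package is unavailable.

```r
# Guarded so the sketch is a no-op when mikropml is not installed
if (requireNamespace("mikropml", quietly = TRUE)) {
  preproc <- mikropml::preprocess_data(mikropml::otu_small, "dx")

  # Element names documented in the Value section
  print(names(preproc))

  # Preprocessed data, ready to pass on to run_ml()
  print(dim(preproc$dat_transformed))

  # Features dropped during preprocessing, if any
  print(preproc$removed_feats)

  # Groups of collapsed features, if any were grouped
  str(preproc$grp_feats, max.level = 1)
}
```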

If the progressr package is installed, a progress bar with time elapsed and estimated time to completion can be displayed.

More details

See the preprocessing vignette for more details.

Note that if any values in outcome_colname contain spaces, they will be converted to underscores for compatibility with caret.
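The effect of that conversion can be previewed with base R. The snippet below is only an illustration of the described behaviour, not the code mikropml runs internally; the outcome labels are made up.

```r
# Hypothetical outcome labels that contain spaces
outcome <- c("healthy control", "colorectal cancer", "healthy control")

# Replace each space with an underscore, as described in the note above
outcome_clean <- gsub(" ", "_", outcome, fixed = TRUE)
outcome_clean
```

Checking labels this way before preprocessing makes it easier to match the converted outcome values in downstream results.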

Author(s)

Zena Lapp, zenalapp@umich.edu

Kelly Sovacool, sovacool@umich.edu

Examples

preprocess_data(mikropml::otu_small, "dx")

# the function can show a progress bar if you have the progressr package installed
## optionally, specify the progress bar format
progressr::handlers(progressr::handler_progress(
  format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
  clear = FALSE,
  show_after = 0
))
## tell progressor to always report progress
## Not run: 
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")

# Create TreeSE object
library(TreeSummarizedExperiment)
df <- mikropml::otu_small
assay <- df[, !colnames(df) %in% c("dx"), drop = FALSE] |> t() |> as.matrix()
tse <- TreeSummarizedExperiment(assays = SimpleList(counts = assay))
colData(tse)[["dx"]] <- df[["dx"]]

# Preprocess
tse <- preprocess_data(
  dataset = tse,
  assay.type = "counts",
  outcome_colname = "dx"
)
# The result is in the assay slot
tse

## End(Not run)

mikropml documentation built on Dec. 1, 2025, 9:08 a.m.