numero.prepare: Prepare datasets for analysis
In Numero: Statistical Framework to Define Subgroups in Complex Datasets

Description

Prepare training data by mitigating confounding factors and standardizing values.

Usage

numero.prepare(data, variables = NULL, confounders = NULL,
               batch = NULL, method = "standard", clip = 5.0,
               pipeline = NULL, undo = FALSE)

Arguments

`data`	A matrix or a data frame.
`variables`	A character vector of column names, see Details.
`confounders`	Names of columns that contain confounder data.
`batch`	The name of the column that contains batch labels.
`method`	Method to standardize values, see `nroPreprocess()`.
`clip`	Threshold for clipping extreme values, in multiples of the standard deviation.
`pipeline`	Processing parameters from a previous use of the function.
`undo`	If TRUE, standardization is reversed after adjusting for batches and confounders.

Details

We recommend first applying numero.clean() to the full dataset, then selecting a subset for training with the `variables` argument. This preserves any attributes that may be used in Numero functions.

If a previous pipeline is available, it overrides all processing parameters irrespective of other input arguments.
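
As a sketch of how a stored pipeline might be reused, the code below prepares the training data once, reads the 'pipeline' attribute from the result, and passes it back through the `pipeline` argument. It assumes that the Numero package is installed and that the stored pipeline also carries the variable selection, which is why `variables` is left unset in the second call. That behaviour is inferred from the description above and has not been verified against the package source.

```r
# Sketch only: the body runs when the Numero package is available.
if (requireNamespace("Numero", quietly = TRUE)) {
  library(Numero)

  # Import and clean the example data shipped with the package.
  fname <- system.file("extdata", "finndiane.txt", package = "Numero")
  dataset <- numero.clean(read.delim(file = fname), identity = "INDEX")

  # First pass: define the processing parameters.
  trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
  trdata <- numero.prepare(data = dataset, variables = trvars,
                           method = "tapered")

  # Retrieve the stored parameters from the output.
  params <- attr(trdata, "pipeline")

  # Second pass: the stored pipeline is expected to take precedence
  # over the remaining arguments, which are left at their defaults.
  again <- numero.prepare(data = dataset, pipeline = params)
  print(summary(again))
}
```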

Due to safeguards against numerical instability, the standardized values may deviate slightly from the expected range (an error below 0.1 percent is typical).

Clipping of extreme values is applied only during the first round of standardization before adjustments for confounders. Therefore, the final output may contain values that exceed the threshold.
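
The effect described above can be illustrated with a small base R sketch. It is not Numero's internal code: the simulated variables `age` and `x`, the use of lm() as the confounder adjustment, and the final rescaling step are all assumptions made for illustration. The point is only that a value clipped in the first round can end up beyond the threshold once the data are adjusted and rescaled.

```r
# Hypothetical illustration in base R, not the Numero implementation.
set.seed(1)
n <- 200
age <- rnorm(n)              # simulated confounder
x <- 0.5 * age + rnorm(n)    # simulated variable that depends on age
x[1] <- 40                   # one extreme observation

# First round: standardize and clip at +/- 5 standard deviations.
clip <- 5.0
z <- (x - mean(x)) / sd(x)
z <- pmin(pmax(z, -clip), clip)

# Adjust for the confounder by taking regression residuals.
res <- residuals(lm(z ~ age))

# Second round: rescale to unit variance, this time without clipping.
out <- res / sd(res)

# The clipped observation now lies beyond the original threshold.
print(max(abs(z)))
print(max(abs(out)))
```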

Value

A matrix with two attributes: 'pipeline', which contains the processing parameters, and 'subsets', which contains the row names divided into batches if batch correction was applied.

Examples

# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Set identities and manage missing data.
dataset <- numero.clean(dataset, identity = "INDEX")

# Prepare training variables using default standardization.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- numero.prepare(data = dataset, variables = trvars)
print(summary(trdata))

# Prepare training values adjusted for age and sex and
# standardized by a rank-based method.
trdata <- numero.prepare(data = dataset, variables = trvars,
                         batch = "MALE", confounders = "AGE",
                         method = "tapered")
print(summary(trdata))
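
A further example, given as a sketch, shows the `undo` argument. Based on the argument description, the output should hold values adjusted for batch and confounder effects, with the standardization reversed. The example assumes that the Numero package is installed and repeats the import steps so that it can run on its own.

```r
# Sketch only: the body runs when the Numero package is available.
if (requireNamespace("Numero", quietly = TRUE)) {
  library(Numero)

  # Import and clean the example data shipped with the package.
  fname <- system.file("extdata", "finndiane.txt", package = "Numero")
  dataset <- numero.clean(read.delim(file = fname), identity = "INDEX")

  # Adjust for sex and age, then reverse the standardization.
  trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
  adjusted <- numero.prepare(data = dataset, variables = trvars,
                             batch = "MALE", confounders = "AGE",
                             undo = TRUE)
  print(summary(adjusted))
}
```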

Numero documentation built on Sept. 17, 2024, 5:09 p.m.