nroPreprocess: Data cleaning and standardization
In Numero: Statistical Framework to Define Subgroups in Complex Datasets

nroPreprocess

R Documentation

Data cleaning and standardization

Description

Convert to numerical values, remove unusable rows and columns, and standardize scale of each variable.

Usage

nroPreprocess(data, method = "standard", clip = 5.0,
    resolution = 100, trim = FALSE)

Arguments

`data`	A matrix or a data frame.
`method`	Method for standardizing scale and location, see details below.
`clip`	Range for truncating extreme values in multiples of standard deviations.
`resolution`	Maximum number of sampling points to capture distribution shape.
`trim`	if TRUE, empty rows and columns are removed.

Details

Standardization methods include empty string for no action, "standard" for centering by mean and division by standard deviation, "uniform" for normalized ranks between -1 and 1, "tapered" for a version of the rank-based method that puts more samples around zero and "normal" for quantile-based mapping to standard normal distribution.

The standard method also checks if the distribution is skewed and applies logarithm if it makes the distribution closer to the normal curve.

Value

A matrix of numerical values. A value mapping model is stored in the attribute 'mapping'. The names of binary columns are stored in the attribute 'binary'.

Author(s)

Ville-Petteri Makinen

Examples

# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Show original data characteristics.
print(summary(dataset))

# Detect binary columns.
ds <- nroPreprocess(dataset, method = "")
print(attr(ds,"binary"))

# Centering and scaling cholesterol.
ds <- nroPreprocess(dataset$CHOL)
print(summary(ds))

# Centering and scaling.
ds <- nroPreprocess(dataset)
print(summary(ds))

# Tapered ranks.
ds <- nroPreprocess(dataset, method = "tapered")
print(summary(ds))

# Standard normal ranks.
ds <- nroPreprocess(dataset, method = "normal")
print(summary(ds))

Numero documentation built on Sept. 13, 2025, 1:09 a.m.