preprocess_: Data preprocessing (in-place)

View source: R/preprocess_.R

preprocess_ R Documentation

Data preprocessing (in-place)

Description

Prepare data for analysis and visualization

Usage

preprocess_(
  x,
  removeFeatures.thres = NULL,
  missingness = FALSE,
  integer2factor = FALSE,
  integer2numeric = FALSE,
  logical2factor = FALSE,
  logical2numeric = FALSE,
  numeric2factor = FALSE,
  numeric2factor.levels = NULL,
  len2factor = 0,
  character2factor = FALSE,
  factorNA2missing = FALSE,
  factorNA2missing.level = "missing",
  scale = FALSE,
  center = scale,
  removeConstants = FALSE,
  oneHot = FALSE,
  exclude = NULL,
  verbose = TRUE
)

Arguments

x

data.frame or data.table to be preprocessed. If data.frame, will be converted to data.table in-place.

removeFeatures.thres

Float (0, 1): Remove features that have missing values in at least this fraction of cases (>=).

missingness

Logical: If TRUE, generate new boolean columns for each feature with missing values, indicating which cases were missing data.

integer2factor

Logical: If TRUE, convert all integers to factors

integer2numeric

Logical: If TRUE, convert all integers to numeric (will only work if integer2factor = FALSE)

logical2factor

Logical: If TRUE, convert all logical variables to factors

logical2numeric

Logical: If TRUE, convert all logical variables to numeric

numeric2factor

Logical: If TRUE, convert all numeric variables to factors

numeric2factor.levels

Character vector: Optional - If numeric2factor = TRUE, use these levels for all numeric variables.

len2factor

Integer (>=2): Convert all numeric variables with less than or equal to this number of unique values to factors. For example, if binary variables are encoded with 1, 2, you could use len2factor = 2 to convert them to factors. If race is encoded with 6 integers, you can use 6.

character2factor

Logical: If TRUE, convert all character variables to factors

factorNA2missing

Logical: If TRUE, assign NA values in factors to the level factorNA2missing.level. In many cases this is the preferred way to handle missing data in categorical variables. Note that preprocess_ does not currently support imputation (see Details); in preprocess, this step is performed before imputation, so you can handle missing data in categorical variables and impute numeric variables in the same preprocess call.

factorNA2missing.level

Character: Name of level if factorNA2missing = TRUE.

scale

Logical: If TRUE, scale columns of x

center

Logical: If TRUE, center columns of x

removeConstants

Logical: If TRUE, remove constant columns.

oneHot

Logical: If TRUE, convert all factors using one-hot encoding

exclude

Integer vector: Exclude these columns from preprocessing.

verbose

Logical: If TRUE, write messages to console.
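
The missing-data arguments above (removeFeatures.thres, missingness, factorNA2missing) can be illustrated with a minimal base R sketch. This is not the rtemis implementation: the toy data, the "_missing" column suffix, and the order of the steps are assumptions made for illustration only.

```r
# Toy data with missing values (hypothetical, for illustration only)
dat <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 4),
  g = factor(c("x", NA, "y", "x"))
)

# Fraction of missing values per feature: a = 0.25, b = 0.75, g = 0.25
na_frac <- colMeans(is.na(dat))

# removeFeatures.thres-style filter: drop features missing in >= 50% of cases
thres <- 0.5
kept <- dat[, na_frac < thres, drop = FALSE]

# missingness-style indicators: one logical column per feature that has NAs
# (the "_missing" suffix is an assumption, not necessarily the rtemis naming)
has_na <- names(kept)[colSums(is.na(kept)) > 0]
for (nm in has_na) {
  kept[[paste0(nm, "_missing")]] <- is.na(kept[[nm]])
}

# factorNA2missing-style recoding: replace NA in a factor with an explicit level
g_chr <- as.character(kept$g)
g_chr[is.na(g_chr)] <- "missing"
kept$g <- factor(g_chr)

print(kept)
```

Column b is dropped because 75% of its values are missing, while a and g are kept and each gains an indicator column.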

Details

This function (ending in "_") performs operations in-place and returns the preprocessed data.table invisibly (e.g. for piping). Imputation is not currently supported; use preprocess for imputation.
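
As a rough illustration of what "in-place" means here, the sketch below uses data.table's setDT() and := to modify an object by reference. The helper to_numeric_ is hypothetical and is not part of rtemis; it only mimics the convention of a function ending in "_" that changes its argument and returns it invisibly.

```r
library(data.table)

df <- data.frame(id = 1:3, flag = c(TRUE, FALSE, TRUE))

# setDT() converts the data.frame to a data.table by reference (no copy)
setDT(df)

# Hypothetical helper: converts one column to numeric by reference and
# returns the data.table invisibly, so nothing is printed at the console
to_numeric_ <- function(x, col) {
  x[, (col) := lapply(.SD, as.numeric), .SDcols = col]
  invisible(x)
}

# No assignment is needed: df itself is modified
to_numeric_(df, "flag")
str(df)
```

Because the change happens by reference, the caller's object is altered even though the result is never assigned. Call data.table::copy() first if the original must be preserved.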

Operations are applied in the following order:

  • keep complete cases only

  • remove duplicates

  • remove cases by missingness threshold

  • remove features by missingness threshold

  • integer to factor

  • integer to numeric

  • logical to factor

  • logical to numeric

  • numeric to factor

  • numeric with N or fewer unique values to factor

  • character to factor

  • factor NA to named level

  • add missingness column

  • scale and/or center

  • remove constants

  • one-hot encoding
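
A few of the later steps in this list (numeric to factor by number of unique values, scale and center, remove constants, one-hot encoding) can be sketched in base R as follows. This is an illustration under assumed toy data, not the code rtemis runs, and the output column names come from model.matrix rather than from rtemis.

```r
# Toy data (hypothetical)
dat <- data.frame(
  grp = c(1, 2, 1, 2),       # numeric with 2 unique values
  val = c(10, 20, 30, 40),   # numeric with 4 unique values
  const = "k"                # constant column
)

# len2factor-style step: numeric columns with <= len unique values become factors
len <- 2
for (nm in names(dat)) {
  col <- dat[[nm]]
  if (is.numeric(col) && length(unique(col)) <= len) {
    dat[[nm]] <- factor(col)
  }
}

# scale and center the remaining numeric columns
num <- vapply(dat, is.numeric, logical(1))
dat[num] <- lapply(dat[num], function(v) as.numeric(scale(v)))

# remove constant columns
is_const <- vapply(dat, function(v) length(unique(v)) <= 1, logical(1))
dat <- dat[, !is_const, drop = FALSE]

# one-hot encode the factor: one 0/1 column per level (no intercept)
onehot <- model.matrix(~ grp - 1, data = dat)
result <- cbind(dat["val"], onehot)

print(result)
```

Running the conversion to factor before scaling means that low-cardinality numeric codes such as grp are treated as categories and are left unscaled.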

Author(s)

E.D. Gennatas

Examples

## Not run: 
library(rtemis)
library(data.table)
x <- data.table(
  a = sample(1:3, 30, replace = TRUE),
  b = rnorm(30, 12),
  c = rnorm(30, 200),
  d = sample(21:22, 30, replace = TRUE),
  e = rnorm(30, -100),
  f = rnorm(30, 950),
  g = rnorm(30),
  h = rnorm(30)
)
## add duplicates
x <- rbind(x, x[c(1, 3), ])
## add constant
x[, z := 99]
## preprocess in-place with default settings
preprocess_(x)

## End(Not run)

egenn/rtemis documentation built on Dec. 17, 2024, 6:16 p.m.