preprocess: Data preprocessing

preprocess    R Documentation

Data preprocessing

Description

Prepare data for analysis and visualization

Usage

preprocess(
  x,
  completeCases = FALSE,
  removeCases.thres = NULL,
  removeFeatures.thres = NULL,
  missingness = FALSE,
  impute = FALSE,
  impute.type = c("missRanger", "micePMM", "meanMode"),
  impute.missRanger.params = list(pmm.k = 3, maxiter = 10, num.trees = 500),
  impute.discrete = get_mode,
  impute.numeric = mean,
  integer2factor = FALSE,
  integer2numeric = FALSE,
  logical2factor = FALSE,
  logical2numeric = FALSE,
  numeric2factor = FALSE,
  numeric2factor.levels = NULL,
  numeric.cut.n = 0,
  numeric.cut.labels = FALSE,
  numeric.quant.n = 0,
  numeric.quant.NAonly = FALSE,
  len2factor = 0,
  character2factor = FALSE,
  factorNA2missing = FALSE,
  factorNA2missing.level = "missing",
  factor2integer = FALSE,
  factor2integer_startat0 = TRUE,
  scale = FALSE,
  center = scale,
  removeConstants = FALSE,
  removeConstants.skipMissing = TRUE,
  removeDuplicates = FALSE,
  oneHot = FALSE,
  add_date_features = FALSE,
  date_features = c("weekday", "month", "year"),
  add_holidays = FALSE,
  exclude = NULL,
  xname = NULL,
  verbose = TRUE
)

Arguments

x

data.frame to be preprocessed

completeCases

Logical: If TRUE, only retain complete cases (no missing data). Default = FALSE

removeCases.thres

Float (0, 1): Remove cases whose fraction of missing features is greater than or equal to this value.

removeFeatures.thres

Float (0, 1): Remove features whose fraction of cases with missing values is greater than or equal to this value.
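
The two thresholds can be illustrated with a minimal base R sketch. This is not the rtemis implementation; the toy data frame and the variable names (thres, case_na_frac, feature_na_frac) are assumptions made for illustration only.

```r
# Toy data: 4 cases, 3 features, with some missing values
x <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, 2, 4),
  c = c(10, NA, 30, 40)
)

thres <- 0.5 # hypothetical threshold, used for both steps here

# Fraction of missing features in each case (row)
case_na_frac <- rowMeans(is.na(x))
# Drop cases with a missing fraction >= thres
x_cases <- x[case_na_frac < thres, , drop = FALSE]

# Fraction of cases with a missing value in each feature (column)
feature_na_frac <- colMeans(is.na(x))
# Drop features with a missing fraction >= thres
x_features <- x[, feature_na_frac < thres, drop = FALSE]
```

Because the comparison is "greater than or equal to", a feature that is missing in exactly half of the cases is removed when the threshold is 0.5.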

missingness

Logical: If TRUE, generate new boolean columns for each feature with missing values, indicating which cases were missing data.
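
As a rough illustration of what missingness indicator columns look like, the base R sketch below adds one logical column per feature that contains missing values. The "_missing" suffix is a hypothetical naming choice and may differ from the column names preprocess actually produces.

```r
x <- data.frame(
  age = c(34, NA, 51),
  sex = factor(c("f", "m", NA)),
  id = 1:3
)

# Features that contain at least one missing value
has_na <- vapply(x, anyNA, logical(1))

# Add one logical indicator column per such feature
# (the "_missing" suffix is an assumption for this sketch)
for (nm in names(x)[has_na]) {
  x[[paste0(nm, "_missing")]] <- is.na(x[[nm]])
}
```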

impute

Logical: If TRUE, impute missing values. See impute.type, impute.discrete, and impute.numeric for how imputation is performed.

impute.type

Character: How to impute data. "missRanger" uses the package of the same name to impute by iterative random forest regression. "meanMode" uses the mean and mode by default, or any custom functions defined in impute.numeric and impute.discrete. Default = "missRanger". "missForest" (same approach as "missRanger" but much slower, included for compatibility with older pipelines) and "rfImpute" (uses randomForest::rfImpute; see its documentation) are also described for this argument, but are not among the choices shown in the usage above.

impute.missRanger.params

Named list with elements "pmm.k", "maxiter", and "num.trees", which are passed to missRanger::missRanger. pmm.k greater than 0 results in predictive mean matching. Default: pmm.k = 3, maxiter = 10, num.trees = 500. Reduce num.trees for faster imputation, especially in large datasets. Set pmm.k = 0 to disable predictive mean matching.

impute.discrete

Function that returns single value: How to impute discrete variables for impute.type = "meanMode". Default = get_mode

impute.numeric

Function that returns single value: How to impute continuous variables for impute.type = "meanMode". Default = mean
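
The idea behind "meanMode" imputation can be sketched in base R as follows. This is an illustrative approximation, not the rtemis code: mode_of is a hypothetical stand-in for get_mode, and impute_column is a helper invented for this example.

```r
# Hypothetical stand-in for get_mode: most frequent non-missing value
mode_of <- function(v) {
  v <- v[!is.na(v)]
  counts <- table(v)
  names(counts)[which.max(counts)]
}

# Replace NAs in one column using a numeric or a discrete summary function
impute_column <- function(v, numeric_fn = mean, discrete_fn = mode_of) {
  idx <- is.na(v)
  if (!any(idx)) {
    return(v)
  }
  if (is.numeric(v)) {
    v[idx] <- numeric_fn(v, na.rm = TRUE)
  } else {
    v[idx] <- discrete_fn(v)
  }
  v
}

x <- data.frame(
  height = c(170, NA, 180, 175),
  group = factor(c("a", "b", NA, "b"))
)

x_imputed <- as.data.frame(lapply(x, impute_column))
```

Passing a different function, for example numeric_fn = median, mirrors the role that impute.numeric plays in the argument list above.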

integer2factor

Logical: If TRUE, convert all integers to factors. This includes bit64::integer64 columns

integer2numeric

Logical: If TRUE, convert all integers to numeric (will only work if integer2factor = FALSE). This includes bit64::integer64 columns.

logical2factor

Logical: If TRUE, convert all logical variables to factors

logical2numeric

Logical: If TRUE, convert all logical variables to numeric

numeric2factor

Logical: If TRUE, convert all numeric variables to factors

numeric2factor.levels

Character vector: Optional. Passed to the levels argument of factor() if numeric2factor = TRUE. Intended for advanced or specific use cases: you need to know the unique values of the numeric vector(s), and all numeric variables must share the same unique values.

numeric.cut.n

Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number

numeric.cut.labels

Logical: The labels argument of base::cut
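
For reference, this is how base::cut behaves with an integer number of breaks and with labels = FALSE. The vector v is toy data chosen for illustration.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Three equal-width bins, returned as a factor with interval labels
binned_labeled <- cut(v, breaks = 3)

# Same bins, returned as integer codes instead of a factor
binned_codes <- cut(v, breaks = 3, labels = FALSE)
```

With labels = FALSE, cut returns plain integer bin codes rather than a factor, which matters if later steps expect factor columns.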

numeric.quant.n

Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number of quantiles produced using stats::quantile
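
Quantile-based binning can be sketched by passing the output of stats::quantile as the breaks of base::cut. The sketch assumes n bins defined by n + 1 evenly spaced probabilities; how preprocess maps numeric.quant.n to quantile probabilities may differ.

```r
v <- c(2, 4, 4, 5, 7, 9, 12, 15, 21, 30)
n <- 4 # number of bins in this sketch (an assumption)

# n + 1 break points at evenly spaced probabilities from 0 to 1
breaks <- quantile(v, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value inside the first bin
binned <- cut(v, breaks = breaks, include.lowest = TRUE)
```

Unlike equal-width bins, quantile bins hold roughly equal numbers of cases. If the data contain many ties, the quantile breaks may not be unique, and cut will then raise an error.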

numeric.quant.NAonly

Logical: If TRUE, only bin numeric variables with missing values

len2factor

Integer (>=2): Convert all variables with this number of unique values or fewer to factors. Default = 0. For example, if binary variables are encoded with 1, 2, you could use len2factor = 2 to convert them to factors.
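
The binary-variable example above can be sketched in base R as follows. This is an illustration under assumed toy data, not the rtemis implementation; len, n_unique, and to_convert are names invented for the sketch.

```r
x <- data.frame(
  treated = c(1, 2, 2, 1, 1),         # binary variable coded as 1, 2
  dose = c(0.5, 1.2, 3.4, 2.2, 0.9)   # continuous variable
)

len <- 2 # plays the role of len2factor in this sketch

# Number of unique values in each column
n_unique <- vapply(x, function(v) length(unique(v)), integer(1))

# Convert columns with len or fewer unique values to factors
to_convert <- names(x)[n_unique <= len]
x[to_convert] <- lapply(x[to_convert], factor)
```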

character2factor

Logical: If TRUE, convert all character variables to factors

factorNA2missing

Logical: If TRUE, assign NA values in factors to the level named by factorNA2missing.level. In many cases this is the preferred way to handle missing data in categorical variables. Note that since this step is performed before imputation, you can use this option to handle missing data in categorical variables and impute numeric variables in the same preprocess call.

factorNA2missing.level

Character: Name of level if factorNA2missing = TRUE. Default = "missing"
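
Turning NA values in a factor into an explicit level can be sketched in base R like this. The variable names are assumptions for illustration, and preprocess may implement the step differently.

```r
f <- factor(c("low", NA, "high", "low", NA))
level_name <- "missing" # plays the role of factorNA2missing.level

# Add the new level first, then assign it to the NA positions
f_filled <- factor(f, levels = c(levels(f), level_name))
f_filled[is.na(f_filled)] <- level_name
```

The new level must be added before assignment. Assigning a value that is not an existing level of a factor produces NA with a warning.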

factor2integer

Logical: If TRUE, convert all factors to integers

factor2integer_startat0

Logical: If TRUE, start integer coding at 0
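
The difference between integer coding that starts at 1 and at 0 is shown below in base R. as.integer on a factor returns 1-based level codes, so subtracting 1 gives 0-based codes. This sketch only illustrates the two conventions.

```r
f <- factor(c("b", "a", "c", "a")) # levels are sorted: "a", "b", "c"

codes_from_1 <- as.integer(f)      # 2, 1, 3, 1
codes_from_0 <- as.integer(f) - 1L # 1, 0, 2, 0
```

Zero-based codes are often what external libraries expect for class labels, while 1-based codes match R's own factor representation.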

scale

Logical: If TRUE, scale columns of x

center

Logical: If TRUE, center columns of x. Note that by default it is the same as scale
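
Centering and scaling follow the usual meaning in base::scale, sketched below on toy data. Whether preprocess calls base::scale internally is not stated here, so treat this as an illustration of the transformation only.

```r
x <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Subtract column means (center) and divide by column standard deviations (scale)
x_scaled <- as.data.frame(scale(x, center = TRUE, scale = TRUE))
```

After this transformation each column has mean 0 and standard deviation 1, so features measured on different scales become comparable.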

removeConstants

Logical: If TRUE, remove constant columns.

removeConstants.skipMissing

Logical: If TRUE, skip missing values before checking whether a feature is constant.
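
The effect of skipping missing values when checking for constant features can be sketched in base R. is_constant is a hypothetical helper written for this example, not an rtemis function.

```r
x <- data.frame(
  const = c(5, 5, 5, 5),
  const_na = c(7, NA, 7, 7), # constant apart from one missing value
  varying = c(1, 2, 3, 4)
)

# Hypothetical helper: TRUE if the column holds at most one distinct value
is_constant <- function(v, skip_missing = TRUE) {
  if (skip_missing) v <- v[!is.na(v)]
  length(unique(v)) <= 1
}

keep_skip <- !vapply(x, is_constant, logical(1), skip_missing = TRUE)
keep_noskip <- !vapply(x, is_constant, logical(1), skip_missing = FALSE)

x_skip <- x[, keep_skip, drop = FALSE]     # only "varying" remains
x_noskip <- x[, keep_noskip, drop = FALSE] # "const_na" and "varying" remain
```

When missing values are not skipped, NA counts as a distinct value, so a column like const_na is not treated as constant.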

removeDuplicates

Logical: If TRUE, remove duplicate cases.

oneHot

Logical: If TRUE, convert all factors using one-hot encoding.
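
One-hot encoding of a factor can be sketched with stats::model.matrix, as below. This is one common base R approach; the column names that preprocess generates may differ.

```r
x <- data.frame(color = factor(c("red", "green", "blue", "green")))

# Removing the intercept (- 1) yields one indicator column per level
onehot <- model.matrix(~ color - 1, data = x)
```

Each row contains exactly one 1, in the column that corresponds to that case's factor level.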

add_date_features

Logical: If TRUE, extract date features from date columns.

date_features

Character vector: Features to extract from dates.
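
The default date features ("weekday", "month", "year") correspond to values that base R can extract from a Date vector, as in the sketch below. The output column names and types are assumptions and may not match what preprocess returns.

```r
d <- as.Date(c("2024-12-17", "2025-01-01"))

features <- data.frame(
  weekday = weekdays(d), # weekday name; depends on the current locale
  month = months(d),     # month name; depends on the current locale
  year = as.integer(format(d, "%Y"))
)
```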

add_holidays

Logical: If TRUE, extract holidays from date columns.

exclude

Integer vector: Exclude these columns from preprocessing.

xname

Character: Name of x for messages.

verbose

Logical: If TRUE, write messages to console.

Details

Order of operations (largely reflected by the order of arguments in the usage):

  • keep complete cases only

  • remove constants

  • remove duplicates

  • remove cases by missingness threshold

  • remove features by missingness threshold

  • integer to factor

  • integer to numeric

  • logical to factor

  • logical to numeric

  • numeric to factor

  • cut numeric to n bins

  • cut numeric to n quantiles

  • numeric with N or fewer unique values to factor

  • character to factor

  • factor NA to named level

  • add missingness column

  • impute

  • scale and/or center

  • one-hot encoding

Author(s)

E.D. Gennatas


egenn/rtemis documentation built on Dec. 17, 2024, 6:16 p.m.