setup_Preprocessor: Setup 'PreprocessorParameters'
In egenn/rtemis: Machine Learning and Visualization

setup_Preprocessor

R Documentation

Setup `PreprocessorParameters`

Description

Set up a `PreprocessorParameters` object, which specifies the preprocessing steps to apply in a `preprocess` call.

Usage

setup_Preprocessor(
  complete_cases = FALSE,
  remove_features_thres = NULL,
  remove_cases_thres = NULL,
  missingness = FALSE,
  impute = FALSE,
  impute_type = c("missRanger", "micePMM", "meanMode"),
  impute_missRanger_params = list(pmm.k = 3, maxiter = 10, num.trees = 500),
  impute_discrete = "get_mode",
  impute_continuous = "mean",
  integer2factor = FALSE,
  integer2numeric = FALSE,
  logical2factor = FALSE,
  logical2numeric = FALSE,
  numeric2factor = FALSE,
  numeric2factor_levels = NULL,
  numeric_cut_n = 0,
  numeric_cut_labels = FALSE,
  numeric_quant_n = 0,
  numeric_quant_NAonly = FALSE,
  unique_len2factor = 0,
  character2factor = FALSE,
  factorNA2missing = FALSE,
  factorNA2missing_level = "missing",
  factor2integer = FALSE,
  factor2integer_startat0 = TRUE,
  scale = FALSE,
  center = scale,
  scale_centers = NULL,
  scale_coefficients = NULL,
  remove_constants = FALSE,
  remove_constants_skip_missing = TRUE,
  remove_features = NULL,
  remove_duplicates = FALSE,
  one_hot = FALSE,
  one_hot_levels = NULL,
  add_date_features = FALSE,
  date_features = c("weekday", "month", "year"),
  add_holidays = FALSE,
  exclude = NULL
)
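
As a minimal usage sketch, the call below overrides a handful of the defaults shown in the signature above. It uses only argument names that appear in that signature. It assumes the rtemis package is installed and exports `setup_Preprocessor`; if not, the call is skipped and the hypothetical variable `params` stays `NULL`.

```r
# Minimal sketch: request a few preprocessing steps, leaving all other
# arguments at their defaults. `params` is an illustrative variable name.
params <- NULL

# Guard: only run if rtemis is installed and exports setup_Preprocessor
# (an assumption about the local environment).
if (requireNamespace("rtemis", quietly = TRUE) &&
    "setup_Preprocessor" %in% getNamespaceExports("rtemis")) {
  params <- rtemis::setup_Preprocessor(
    remove_constants = TRUE,   # drop constant columns
    character2factor = TRUE,   # convert character columns to factors
    impute = TRUE,             # impute missing values ...
    impute_type = "meanMode",  # ... using impute_discrete / impute_continuous
    scale = TRUE               # center defaults to the value of scale
  )
}
```

Because `center = scale` by default, setting `scale = TRUE` alone requests both scaling and centering unless `center` is set explicitly.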

Arguments

`complete_cases`	Logical: If TRUE, only retain complete cases (no missing data).
`remove_features_thres`	Float (0, 1): Remove features that have missing values in at least this fraction of cases.
`remove_cases_thres`	Float (0, 1): Remove cases that are missing at least this fraction of features.
`missingness`	Logical: If TRUE, generate new boolean columns for each feature with missing values, indicating which cases were missing data.
`impute`	Logical: If TRUE, impute missing values. See `impute_type`, `impute_discrete`, and `impute_continuous`.
`impute_type`	Character: Imputation method to use: one of "missRanger", "micePMM", or "meanMode".
`impute_missRanger_params`	Named list with elements "pmm.k", "maxiter", and "num.trees", which are passed to `missRanger::missRanger`. A `pmm.k` greater than 0 results in predictive mean matching; set `pmm.k = 0` to disable it. Defaults: `pmm.k = 3`, `maxiter = 10`, `num.trees = 500`. Reduce `num.trees` for faster imputation, especially with large datasets.
`impute_discrete`	Character: Name of a function that returns a single value, used to impute discrete variables when `impute_type = "meanMode"`.
`impute_continuous`	Character: Name of a function that returns a single value, used to impute continuous variables when `impute_type = "meanMode"`.
`integer2factor`	Logical: If TRUE, convert all integers to factors. This includes `bit64::integer64` columns.
`integer2numeric`	Logical: If TRUE, convert all integers to numeric (will only work if `integer2factor = FALSE`). This includes `bit64::integer64` columns.
`logical2factor`	Logical: If TRUE, convert all logical variables to factors.
`logical2numeric`	Logical: If TRUE, convert all logical variables to numeric.
`numeric2factor`	Logical: If TRUE, convert all numeric variables to factors.
`numeric2factor_levels`	Character vector: Optional. Passed to the `levels` argument of `factor()` if `numeric2factor = TRUE`. For advanced or specific use cases: you need to know the unique values of the numeric vector(s), and all numeric variables must share the same unique values.
`numeric_cut_n`	Integer: If > 0, convert all numeric variables to factors by binning using `base::cut` with `breaks` equal to this number.
`numeric_cut_labels`	Logical: The `labels` argument of `base::cut`.
`numeric_quant_n`	Integer: If > 0, convert all numeric variables to factors by binning using `base::cut`, with `breaks` set to this number of quantiles produced using `stats::quantile`.
`numeric_quant_NAonly`	Logical: If TRUE, only bin numeric variables with missing values.
`unique_len2factor`	Integer (>=2): Convert all variables with this many unique values or fewer to factors. For example, if binary variables are encoded as 1 and 2, you could use `unique_len2factor = 2` to convert them to factors.
`character2factor`	Logical: If TRUE, convert all character variables to factors.
`factorNA2missing`	Logical: If TRUE, assign NA values in factors to the level `factorNA2missing_level`. In many cases this is the preferred way to handle missing data in categorical variables. Because this step is performed before imputation, you can use this option to handle missing data in categorical variables and impute numeric variables in the same `preprocess` call.
`factorNA2missing_level`	Character: Name of level if `factorNA2missing = TRUE`.
`factor2integer`	Logical: If TRUE, convert all factors to integers.
`factor2integer_startat0`	Logical: If TRUE, start integer coding at 0.
`scale`	Logical: If TRUE, scale columns of `x`.
`center`	Logical: If TRUE, center columns of `x`. Note that by default it is the same as `scale`.
`scale_centers`	Named vector: Centering values for each feature.
`scale_coefficients`	Named vector: Scaling values for each feature.
`remove_constants`	Logical: If TRUE, remove constant columns.
`remove_constants_skip_missing`	Logical: If TRUE, skip missing values before checking whether a feature is constant.
`remove_features`	Character vector: Features to remove.
`remove_duplicates`	Logical: If TRUE, remove duplicate cases.
`one_hot`	Logical: If TRUE, convert all factors using one-hot encoding.
`one_hot_levels`	List: Named list of the form "feature_name" = "levels". Used when applying one-hot encoding to validation or test data using `Preprocessor`.
`add_date_features`	Logical: If TRUE, extract date features from date columns.
`date_features`	Character vector: Features to extract from dates.
`add_holidays`	Logical: If TRUE, extract holidays from date columns.
`exclude`	Integer vector: Exclude these columns from preprocessing.
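
The difference between the two binning options described above, `numeric_cut_n` and `numeric_quant_n`, can be illustrated in base R. This is a sketch of the underlying idea using `base::cut` and `stats::quantile`, not the package's implementation. The data and the variable names (`x`, `n_bins`, and so on) are made up for illustration, and the way the quantile probabilities are spaced (`n_bins + 1` equally spaced values from 0 to 1) is an assumption.

```r
# Made-up numeric vector with one large outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
n_bins <- 4

# numeric_cut_n-style binning: a single number for `breaks` makes
# base::cut split the range of x into that many equal-width intervals.
# labels = FALSE returns integer bin codes instead of interval labels.
cut_equal_width <- cut(x, breaks = n_bins, labels = FALSE)

# numeric_quant_n-style binning: break points are quantiles of x,
# so bins hold roughly equal numbers of cases.
# Assumption: n_bins + 1 equally spaced probabilities define the breaks.
q_breaks <- stats::quantile(
  x,
  probs = seq(0, 1, length.out = n_bins + 1),
  na.rm = TRUE
)
# include.lowest = TRUE keeps the minimum of x inside the first bin
cut_quantile <- cut(x, breaks = q_breaks, labels = FALSE, include.lowest = TRUE)

# Compare how cases are distributed across bins
table(cut_equal_width)
table(cut_quantile)
```

With equal-width bins the outlier pushes nearly all cases into the first bin, whereas quantile-based breaks spread cases more evenly, which is the usual reason to prefer them for skewed variables.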