design.pipeline: Design a pipeline for data pre-processing


View source: R/design_pipeline.R

Description

Designs a pipeline that states which steps will be taken to pre-process data. The function returns the settings specified when designing the pipeline, together with documentation on how the data will be treated under those settings.

Usage

design.pipeline(pipeline.name = NULL, text.features = TRUE,
  text.threshold = 100, date.features = TRUE, impute.missing = TRUE,
  impute.mode = "auto", impute.tracking = FALSE,
  impute.threshold = 0.1, categorical.encoding = TRUE,
  categorical.mode = "onehot.prop", categorical.tracking = FALSE,
  categorical.max.levels = 10, categorical.min.percent = 0.025,
  categorical.interactions = FALSE,
  categorical.interactions.levels = 2,
  categorical.interaction.feats = 10, outlier.clipping = FALSE,
  outlier.mode = "tukey", outlier.tracking = FALSE,
  outlier.lower.percentile = 0.01, outlier.upper.percentile = 0.99,
  max.scaling = FALSE, kmeans.features = FALSE,
  numeric.transform = FALSE, transform.mode = "log",
  transform.cutoff = 7, numeric.interactions = FALSE,
  numeric.interaction.feats = 10, freq.encode = FALSE, seed = 1)

Arguments

pipeline.name

[character | optional | default=NULL] Name of the pipeline, if NULL then a random name will be generated.

text.features

[logical | optional | default=TRUE] Should text features be engineered and new features derived? Simple engineering is applied, such as character and number counts.

text.threshold

[numeric | optional | default=100] The maximum number of characters counted in a character feature before it is identified as a text feature and assigned to text feature engineering.
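As an illustration of how text.features and text.threshold might interact, the following base R sketch flags a character feature as text when any value exceeds the threshold and then derives simple counts. The helper names is_text_feature and text_counts are hypothetical, and the exact rule used by the package is an assumption.

```r
# Hypothetical helper (not part of the package): treat a character
# feature as "text" when any value is longer than `threshold` characters.
is_text_feature <- function(x, threshold = 100) {
  is.character(x) && any(nchar(x) > threshold, na.rm = TRUE)
}

# Hypothetical helper: simple derived features, here the character
# count and the number of digits per value.
text_counts <- function(x) {
  data.frame(
    n_chars  = nchar(x),
    n_digits = nchar(gsub("[^0-9]", "", x))
  )
}

short <- c("red", "blue")
long  <- c(strrep("a", 150), "order 66 shipped")

is_text_feature(short)
is_text_feature(long)
text_counts(long)
```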

date.features

[logical | optional | default=TRUE] Should date features be engineered and new features derived? Simple engineering is applied, such as extracting the year, month and day.
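A minimal base R sketch of the kind of date engineering described above. The column names are illustrative assumptions and may not match the features the package actually creates.

```r
# Derive year, month and day from a Date vector using base R only.
dates <- as.Date(c("2021-01-16", "2020-12-31"))

date_feats <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)

date_feats
```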

impute.missing

[logical | optional | default=TRUE] Should features be imputed.

impute.mode

[character | optional | default="auto"] Imputation mode, options are auto, encode and median.mode. Auto applies a combination of encoding and median.mode imputation based on the impute.threshold parameter.

impute.tracking

[logical | optional | default=FALSE] Should tracking features be created? These are indicator features, one per feature, that take a value of 1 for every observation where a missing value was found.

impute.threshold

[optional | numeric | default=0.1] Threshold for auto impute.mode to apply encoding or median.mode imputation. All features containing NA values above the specified percentage threshold will be imputed using encoding.
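The auto rule described above can be sketched as a per-feature decision on the share of missing values. The helper choose_impute_mode is hypothetical, and the sketch only selects a strategy; how the package performs the encoding or the median/mode fill is not shown here.

```r
# Hypothetical helper (not part of the package): choose an imputation
# strategy per feature from its share of missing values.
choose_impute_mode <- function(df, threshold = 0.1) {
  na_share <- vapply(df, function(col) mean(is.na(col)), numeric(1))
  ifelse(na_share > threshold, "encode", "median.mode")
}

df <- data.frame(
  a = c(rep(NA, 5), 6:20),  # 25% missing, above the threshold
  b = c(NA, 2:20)           # 5% missing, below the threshold
)

modes <- choose_impute_mode(df, threshold = 0.1)
modes
```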

categorical.encoding

[logical | optional | default=TRUE] Should categorical features be encoded and engineered.

categorical.mode

[optional | character | default="onehot.prop"] Type of mappings to apply. Options are auto, target, proportional, ordinal, onehot, onehot.prop and report, where auto is a combination of onehot and target. Tracking features are created that flag whether a feature contains a low proportional category. The other types of feature engineering include weighted mean noise target encoding, proportional encoding, ordinal proportional encoding, one hot encoding, and low proportional one hot encoding, which flags all low proportional categories as "other". Report cleans up levels so that the data can be represented in reports and charts.

categorical.tracking

[logical | optional | default=FALSE] Should tracking features be created? These are indicator features that take a value of 1 for every observation where a level in a categorical feature was sparse (low proportional).

categorical.max.levels

[optional | integer | default=10] The maximum levels allowed for a categorical feature to create one hot encoded features.

categorical.min.percent

[optional | numeric | default=0.025] The minimum proportion a categorical level is allowed to have before it is flagged as a low proportional level.
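To make the low proportional rule concrete, the sketch below relabels every level whose share falls below categorical.min.percent as "other". The helper collapse_rare_levels is hypothetical and only approximates what the onehot.prop mode is described as doing.

```r
# Hypothetical helper (not part of the package): relabel levels whose
# share of observations is below `min_percent` as "other".
collapse_rare_levels <- function(x, min_percent = 0.025) {
  share <- prop.table(table(x))
  rare  <- names(share)[share < min_percent]
  x[x %in% rare] <- "other"
  x
}

x   <- c(rep("a", 60), rep("b", 39), "c")  # "c" covers 1% of rows
out <- collapse_rare_levels(x, min_percent = 0.025)
table(out)
```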

categorical.interactions

[logical | optional | default=FALSE] Should interaction features be created for categorical features based on n-way combinations. Categories for different features are combined into a new feature.

categorical.interactions.levels

[optional | numeric | default=2] Number of features to interact. Needs to be less than or equal to the number of features provided in x.

categorical.interaction.feats

[optional | numeric | default=10] The number of top important categorical features to be used when creating categorical interaction features.
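A small sketch of n-way categorical interactions, assuming the combined feature is formed by pasting the categories together. The naming scheme and separator are assumptions, and the selection of top important features is omitted.

```r
# Combine the categories of every 2-way feature combination into a new
# feature; `n_way` plays the role of categorical.interactions.levels.
df <- data.frame(
  colour = c("red", "blue", "red"),
  size   = c("S", "S", "L"),
  stringsAsFactors = FALSE
)

n_way  <- 2
combos <- combn(names(df), n_way, simplify = FALSE)

for (feats in combos) {
  new_name <- paste(feats, collapse = "_x_")
  df[[new_name]] <- do.call(paste, c(df[feats], sep = "_"))
}

df
```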

outlier.clipping

[logical | optional | default=FALSE] Should outliers be clipped.

outlier.mode

[optional | character | default="tukey"] Mode used to identify outliers. Options are tukey or percentile.

outlier.tracking

[optional | logical | default=FALSE] Creates tracking features that record which observations had outliers present.

outlier.lower.percentile

[optional | numeric | default=0.01] The lower percentile value to be used when flagging values as outliers.

outlier.upper.percentile

[optional | numeric | default=0.99] The upper percentile value to be used when flagging values as outliers.
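The two outlier modes can be sketched in base R as follows. The helper clip_outliers is hypothetical, and the Tukey variant assumes the conventional fences of 1.5 times the interquartile range, which the package may or may not use.

```r
# Hypothetical helper (not part of the package): clip values to Tukey
# fences or to lower/upper percentiles.
clip_outliers <- function(x, mode = "tukey", lower = 0.01, upper = 0.99) {
  if (mode == "tukey") {
    q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    bounds <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)  # assumed multiplier
  } else {
    bounds <- quantile(x, c(lower, upper), na.rm = TRUE, names = FALSE)
  }
  pmin(pmax(x, bounds[1]), bounds[2])
}

x <- c(1:9, 100)
clip_outliers(x, mode = "tukey")
clip_outliers(x, mode = "percentile")
```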

max.scaling

[optional | logical | default=FALSE] Should features be scaled to be between 0 and 1 by dividing by the maximum value.
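Max scaling as described above amounts to a single division. The sketch below uses a hypothetical helper scale_by_max; note that the result only falls between 0 and 1 when the feature has no negative values.

```r
# Hypothetical helper (not part of the package): divide a feature by
# its maximum value.
scale_by_max <- function(x) x / max(x, na.rm = TRUE)

scaled <- scale_by_max(c(2, 5, 10))
scaled
```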

kmeans.features

[optional | logical | default=FALSE] Should k-means features be created by clustering each feature and calculating the distance to the cluster centre.
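A minimal sketch of a k-means distance feature for a single numeric feature, using stats::kmeans. The number of clusters (2 here) and the use of absolute distance are assumptions; the package may choose these differently.

```r
# Cluster one numeric feature and compute each observation's distance
# to the centre of its assigned cluster.
set.seed(1)
x <- c(1, 2, 3, 10, 11, 12)

fit <- kmeans(x, centers = 2, nstart = 5)  # assumed number of clusters
dist_to_centre <- as.numeric(abs(x - fit$centers[fit$cluster, 1]))

dist_to_centre
```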

numeric.transform

[optional | logical | default=FALSE] Should numerical features with skewed distributions be transformed.

transform.mode

[optional | character | default="log"] Transform type, options are log or sqrt.

transform.cutoff

[optional | numeric | default=7] The skewness statistic cutoff value for features to be transformed.
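The interplay between transform.cutoff and transform.mode can be sketched as below. Both helpers are hypothetical: the moment-based skewness statistic, the use of its absolute value, and log1p (rather than a plain log) are assumptions made so the example runs on data containing small values.

```r
# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical helper: transform a feature only when its skewness
# exceeds the cutoff.
transform_if_skewed <- function(x, mode = "log", cutoff = 7) {
  if (abs(skewness(x)) <= cutoff) return(x)
  if (mode == "log") log1p(x) else sqrt(x)
}

skewed    <- c(rep(1, 99), 1000)  # skewness of roughly 9.8
symmetric <- as.numeric(1:10)     # skewness of 0

head(transform_if_skewed(skewed))
transform_if_skewed(symmetric)
```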

numeric.interactions

[optional | logical | default=FALSE] Should numerical feature interactions be created by adding, subtracting, dividing and multiplying n-way feature combinations? Only the top n numerical features are used to create interaction features, as identified by a random forest permutation-based feature importance.

numeric.interaction.feats

[optional | numeric | default=10] The number of top important numerical features to be used when creating numerical interaction features.
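For a single pair of features, the four arithmetic interactions described above look like this in base R. The column names are illustrative assumptions, and the random forest importance step that selects the top features is omitted.

```r
# Pairwise arithmetic interactions between two numeric features.
a <- c(1, 2, 3)
b <- c(2, 4, 6)

interactions <- data.frame(
  a_plus_b  = a + b,
  a_minus_b = a - b,
  a_times_b = a * b,
  a_div_b   = a / b
)

interactions
```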

freq.encode

[optional | logical | default=FALSE] Should frequency features be created? Each is simply a count of how often each unique value occurs in a feature. These are created before any other feature engineering is done.
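Frequency encoding as described above can be sketched with table(). The helper freq_encode is hypothetical and not the package's implementation.

```r
# Hypothetical helper (not part of the package): replace each value
# with the number of times it occurs in the feature.
freq_encode <- function(x) {
  counts <- table(x)
  as.integer(counts[as.character(x)])
}

x <- c("a", "b", "a", "c", "a")
freq_encode(x)
```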

seed

[optional | numeric | default=1] Random seed for reproducible results.

Value

List containing pipeline information including settings and data pre-processing documentation

Author(s)

Xander Horn

Examples

pl <- design.pipeline(kmeans.features = TRUE)

XanderHorn/lazy documentation built on Jan. 16, 2021, 6:15 p.m.