autoDataprep: Automatic data preparation for ML algorithms

View source: R/autoDataPrep.R

autoDataprep R Documentation

Automatic data preparation for ML algorithms

Description

Performs the final data preparation before fitting ML algorithms. The function returns the prepared data set together with highlights of the data preparation steps.

Usage

autoDataprep(
  data,
  target = NULL,
  missimpute = "default",
  auto_mar = FALSE,
  mar_object = NULL,
  dummyvar = TRUE,
  char_var_limit = 12,
  aucv = 0.02,
  corr = 0.99,
  outlier_flag = FALSE,
  interaction_var = FALSE,
  frequent_var = FALSE,
  uid = NULL,
  onlykeep = NULL,
  drop = NULL,
  verbose = FALSE
)

Arguments

data

[data.frame | Required] dataframe or data.table

target

[character | Required] name of the dependent variable (binary or multiclass)

missimpute

[text | Optional] missing value imputation using the mlr impute function (default is "default"). Please refer to the "Details" section for more information

auto_mar

[logical | Optional] identify whether variables with missing values are missing completely at random or not (default is FALSE). If TRUE, this will call autoMAR()

mar_object

[character | Optional] object created from autoMAR function

dummyvar

[logical | Optional] categorical feature engineering i.e. one hot encoding (default is TRUE)

char_var_limit

[integer | Optional] limit on the number of levels of a categorical variable for dummy variable preparation (default is 12). For example, if a gender variable has the two distinct values "M" and "F", then gender has 2 levels

aucv

[numeric | Optional] cutoff value for AUC-based variable selection (default is 0.02)

corr

[numeric | Optional] cutoff value for correlation-based variable selection (default is 0.99)

outlier_flag

[logical | Optional] to add outlier features (default is FALSE)

interaction_var

[logical | Optional] bulk interactions transformer for numerical features (default is FALSE)

frequent_var

[logical | Optional] frequent transformer for categorical features (default is FALSE)

uid

[character | Optional] unique identifier column if any to keep in the final data set

onlykeep

[character | Optional] only consider selected variables for data preparation

drop

[character | Optional] exclude variables from the dataset

verbose

[logical | Optional] display execution steps on the console (default is FALSE)

Details

Missing value imputation using the impute function from mlr

The mlr package provides multiple methods for imputing missing values. The default imputation uses:

  • mean value for integer variable

  • median value for numeric variable

  • mode value for character or factor variable
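The class-based rules above can be sketched in base R as follows. This is only an illustration and not the DriveML or mlr implementation: impute_default and stat_mode are hypothetical helpers, and the small data frame is made up for the example.

```r
# Hypothetical helper (not part of DriveML or mlr): impute each column
# according to its class, following the three rules listed above.
impute_default <- function(df) {
  # Most frequent non-missing value of a vector
  stat_mode <- function(x) {
    v <- x[!is.na(x)]
    ux <- unique(v)
    ux[which.max(tabulate(match(v, ux)))]
  }
  for (col in names(df)) {
    x <- df[[col]]
    if (!any(is.na(x))) next
    if (is.integer(x)) {
      # integer column: mean (the column becomes double in this sketch)
      x[is.na(x)] <- mean(x, na.rm = TRUE)
    } else if (is.numeric(x)) {
      # numeric column: median
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else if (is.character(x) || is.factor(x)) {
      # character or factor column: mode
      x[is.na(x)] <- as.character(stat_mode(x))
    }
    df[[col]] <- x
  }
  df
}

# Made-up example data with one missing value per column
df <- data.frame(
  visits = c(1L, NA, 3L, 8L),
  chol   = c(200.5, 180, NA, 260),
  sex    = c("M", "F", NA, "M"),
  stringsAsFactors = FALSE
)
out <- impute_default(df)
print(out)
```

Checking is.integer before is.numeric matters here, because is.numeric is also TRUE for integer vectors.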

Optional: missing values can also be imputed with an ML method. To list the algorithms in the mlr package that can handle missing values, run listLearners("classif", check.packages = TRUE, properties = "missings")[c("class", "package")]

Feature engineering

  • variables that are not missing completely at random, identified using the autoMAR function

  • date transformer such as year, month, quarter, week

  • frequent transformer counts each categorical value in the dataset

  • interaction transformer using multiplication

  • one hot dummy coding for categorical value

  • outlier flag and capping variable for numerical value
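A rough base R sketch of several of the transformers listed above is given below. It is not the DriveML code: the data frame and derived column names are invented for illustration, and the 1.5 x IQR outlier rule is an assumption, since this page does not state which rule autoDataprep applies.

```r
# Made-up example data
df <- data.frame(
  visit = as.Date(c("2022-01-15", "2022-04-02", "2022-07-20", "2022-12-02")),
  age   = c(29, 41, 35, 90),
  chol  = c(180, 240, 200, 210),
  sex   = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Date transformer: derive year, month and quarter from a date column
df$visit_year    <- as.integer(format(df$visit, "%Y"))
df$visit_month   <- as.integer(format(df$visit, "%m"))
df$visit_quarter <- quarters(df$visit)

# Frequent transformer: how often each categorical value occurs
df$sex_freq <- ave(seq_along(df$sex), df$sex, FUN = length)

# Interaction transformer: product of two numerical features
df$age_x_chol <- df$age * df$chol

# One hot dummy coding: one indicator column per level
dummies <- model.matrix(~ sex - 1, data = df)

# Outlier flag and capping (assumed rule: 1.5 x IQR beyond the quartiles)
q     <- unname(quantile(df$age, c(0.25, 0.75)))
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])
df$age_outlier <- as.integer(df$age < lower | df$age > upper)
df$age_capped  <- pmin(pmax(df$age, lower), upper)

print(cbind(df, dummies))
```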

Feature reduction

  • zero variance, using the nearZeroVar function from caret

  • Pearson's correlation value

  • AUC with the target variable
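The three reduction steps can be illustrated with the base R sketch below. Several things in it are assumptions rather than DriveML behaviour: a plain constant-column check stands in for caret's nearZeroVar, the second variable of each highly correlated pair is the one dropped, and a variable is kept when its AUC differs from 0.5 by at least the aucv cutoff. The data are made up.

```r
# Made-up binary target and candidate features
y      <- c(0, 0, 0, 0, 1, 1, 1, 1)
signal <- c(1, 2, 3, 5, 4, 6, 7, 8)
x <- data.frame(
  signal   = signal,
  copy     = 2 * signal + 1,             # perfectly correlated with signal
  flat     = c(1, 2, 1, 2, 1, 2, 1, 2),  # unrelated to the target
  constant = 1                           # zero variance
)
corr_cut <- 0.99
aucv     <- 0.02

# 1. Zero variance: drop constant columns (simple stand-in for nearZeroVar)
zero_var <- names(x)[vapply(x, function(v) length(unique(v)) < 2, logical(1))]
x1 <- x[setdiff(names(x), zero_var)]

# 2. Correlation: drop the later variable of each pair above the cutoff
cm <- abs(cor(x1))
cm[lower.tri(cm, diag = TRUE)] <- 0
cor_drop <- colnames(cm)[apply(cm > corr_cut, 2, any)]
x2 <- x1[setdiff(names(x1), cor_drop)]

# 3. AUC of each remaining variable against the target (rank-based formula)
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
aucs     <- vapply(x2, auc, numeric(1), label = y)
auc_keep <- names(aucs)[abs(aucs - 0.5) >= aucv]

print(list(zero_var = zero_var, cor_drop = cor_drop, aucs = aucs, auc_keep = auc_keep))
```

Running the correlation filter before the AUC filter keeps the AUC step from scoring near-duplicate variables twice. The order autoDataprep itself uses is not stated on this page.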

Value

A list containing the following objects:

complete_data

complete dataset including new derived features based on the functional understanding of the dataset

master_data

filtered dataset based on the input parameters

final_var_list

list of master variables

auc_var

list of AUC variables

cor_var

list of correlation variables

overall_var

all variables in the dataset

zerovariance

variables with zero variance in the dataset

See Also

impute

Examples

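# Auto data prep
library(DriveML)

# Prepare the heart data set with "target_var" as the dependent variable
traindata <- autoDataprep(heart, target = "target_var", missimpute = "default",
                          dummyvar = TRUE, aucv = 0.02, corr = 0.98,
                          outlier_flag = TRUE, interaction_var = TRUE,
                          frequent_var = TRUE)

# Extract the filtered data set from the returned list
train <- traindata$master_data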

DriveML documentation built on Dec. 2, 2022, 5:14 p.m.