autoPreProcess: Automated data cleaning and feature engineering for machine...

View source: R/autoPreProcess.R

Description

Automatically cleans and engineers data for machine learning problems. Cleaning involves imputation, outlier clipping, feature formatting, correcting incorrectly formatted missing values and removing duplicate observations. Feature engineering involves creating tracking features, which identify where changes such as imputation were made to the data, as well as transformations of numerical features, categorical feature encodings, unsupervised features that use k-means to calculate the distance from the cluster centre, and feature interactions. A random forest model decides on the best transformations and categorical encodings, and only those survive. Depending on the cleaning and engineering chosen, production code is generated and returned as a data.frame object. If this code frame is then provided to the autoLearn function, the code is altered to cover only the model-specific features.

Usage

autoPreProcess(train, target = NULL, id = NULL, removeDupObs = TRUE,
  downSample = FALSE, correctMissEncode = TRUE, numMissEncode = NULL,
  charMissEncode = c("", " ", "UNKNOWN", "MISS", "MISSING", "UNK", "NA",
  "NULL", "N/A"), formatFeatures = TRUE, trackingFeatures = TRUE,
  clipOutliers = TRUE, outlierMethod = "tukey", lowPercentile = 0.01,
  upPercentile = 0.99, imputeMissing = TRUE,
  categoricalMinPercent = 0.025, catFeatMaxLevels = 7, numChars = 60,
  featureTransformations = TRUE, featureInteractions = TRUE,
  unsupervisedFeatures = TRUE, maxUniques = 100, autoCode = TRUE,
  seed = 1991, saveCode = FALSE, removeIDFeatures = FALSE,
  codePath = NULL, codeFilename = "autoCode", verbose = TRUE)

Arguments

train

[data.frame | Required] Dataset to perform cleaning and engineering on, usually training set but can be the full set as well.

target

[character | Optional] Leave as NULL if the problem is unsupervised; otherwise specify the target feature.

id

[character | Optional] ID features are automatically detected and removed from cleaning and engineering; the dataset is also de-duplicated according to the ID feature(s) specified. A value of auto searches for ID features automatically (the usage above shows a default of NULL). For best performance, specify the ID features or leave as NULL.

removeDupObs

[logical | Optional] Should duplicate observations be removed using the ID features detected or specified. Default of TRUE

downSample

[logical | Optional] Should the dataset be downsampled for faster computation. Default of FALSE

correctMissEncode

[logical | Optional] Should incorrectly formatted missing values be corrected and replaced by NA. Default of TRUE

numMissEncode

[numeric vector | Optional] Vector of numeric values which indicate missing data. Default of NULL

charMissEncode

[character vector | Optional] Vector of character values which indicate missing data. Default of c("", " ", "UNKNOWN", "MISS", "MISSING", "UNK", "NA", "NULL", "N/A")
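
As a rough illustration of what correcting missing-value encodings means, the base R sketch below replaces sentinel values with NA. The helper fix_missing_encoding and the example data are hypothetical and are not part of autoML; exact, case-sensitive matching is an assumption.

```r
# Hypothetical helper, not an autoML function: replaces sentinel values with NA.
fix_missing_encoding <- function(df,
                                 num_miss = NULL,
                                 char_miss = c("", " ", "UNKNOWN", "MISS", "MISSING",
                                               "UNK", "NA", "NULL", "N/A")) {
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x) && !is.null(num_miss)) {
      x[x %in% num_miss] <- NA      # numeric sentinels, e.g. -999
    } else if (is.character(x)) {
      x[x %in% char_miss] <- NA     # exact string match (assumption)
    }
    df[[col]] <- x
  }
  df
}

raw <- data.frame(
  age  = c(34, -999, 51),
  city = c("Paris", "UNKNOWN", ""),
  stringsAsFactors = FALSE
)
clean <- fix_missing_encoding(raw, num_miss = -999)
```

In autoPreProcess itself, the equivalent values would be passed through numMissEncode and charMissEncode.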

formatFeatures

[logical | Optional] Should feature classes be formatted according to a recommended formatting scheme. Default of TRUE

trackingFeatures

[logical | Optional] Should tracking features be created when cleaning the data. Useful for tree based models. Default of TRUE

clipOutliers

[logical | Optional] Should outliers be clipped by the median value. Default of TRUE

outlierMethod

[character | Optional] Which outlier method to use when searching for outliers, options are: tukey, percentile. Default of tukey

lowPercentile

[numeric | Optional] When percentile outlier method is specified any feature with values lower than this percentile will be flagged as outliers. Default of 0.01

upPercentile

[numeric | Optional] When percentile outlier method is specified any feature with values greater than this percentile will be flagged as outliers. Default of 0.99
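
The two outlier methods can be sketched in base R as follows. The helpers flag_tukey, flag_percentile and clip_with_median are hypothetical names; the 1.5 x IQR multiplier is the conventional Tukey fence and is assumed here, as is the use of the column median as the replacement value.

```r
# Hypothetical helpers for illustration; none of these are autoML functions.

# Tukey fences: flag values beyond k * IQR from the quartiles (k = 1.5 assumed).
flag_tukey <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Percentile method: flag values outside the [low, up] quantiles.
flag_percentile <- function(x, low = 0.01, up = 0.99) {
  q <- quantile(x, c(low, up), na.rm = TRUE, names = FALSE)
  !is.na(x) & (x < q[1] | x > q[2])
}

# Replace flagged values with the column median (assumed replacement rule).
clip_with_median <- function(x, is_outlier) {
  x[is_outlier] <- median(x, na.rm = TRUE)
  x
}

x <- c(10, 12, 11, 13, 12, 95)
out_tukey <- flag_tukey(x)
out_pct   <- flag_percentile(x)
clipped   <- clip_with_median(x, out_tukey)
```

Note that on a small sample the percentile method flags the minimum and maximum by construction, which is one reason to prefer the tukey option for short columns.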

imputeMissing

[logical | Optional] Should missing data be imputed. Default of TRUE
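
Imputation and tracking features work together: the tracking column records which rows were changed. A minimal sketch follows, assuming median imputation (the imputation method used by autoPreProcess is not stated here) and a hypothetical helper and column suffix.

```r
# Hypothetical helper: median imputation plus a 0/1 tracking column.
impute_with_tracking <- function(df, col) {
  missing <- is.na(df[[col]])
  # Tracking feature: 1 where a value was imputed, 0 otherwise.
  df[[paste0(col, "_was_imputed")]] <- as.integer(missing)
  df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
  df
}

df <- data.frame(income = c(40, NA, 55, 61))
df <- impute_with_tracking(df, "income")
```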

categoricalMinPercent

[numeric | Optional] Minimum percentage of categorical class proportions allowed to flag a class as minority in nature. Default of 0.025

catFeatMaxLevels

[integer | Optional] Maximum number of categories allowed for a categorical feature to be one-hot encoded. Features with this many categories or fewer are one-hot encoded. Default of 7
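
The level cap can be pictured with the base R sketch below. The helper one_hot_if_small is hypothetical, and returning NULL for features with too many levels is a simplification; how autoPreProcess encodes those features is not shown here.

```r
# Hypothetical helper: one-hot encode only when the level count is small enough.
one_hot_if_small <- function(x, max_levels = 7) {
  x <- factor(x)
  if (nlevels(x) > max_levels) return(NULL)  # too many levels: skipped in this sketch
  m <- model.matrix(~ x - 1)                 # one indicator column per level
  colnames(m) <- levels(x)
  m
}

encoded <- one_hot_if_small(iris$Species)    # 3 levels, so it is encoded
skipped <- one_hot_if_small(letters)         # 26 levels, so it is skipped
```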

numChars

[integer | Optional] Number of characters in a character feature to identify it as a text feature and engineer it accordingly. Default of 60

featureTransformations

[logical | Optional] Should feature transformations be computed for numeric and integer features. Log and square-root transformations are used. Default of TRUE
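
A small sketch of the two transformations on a non-negative numeric vector. Using log1p instead of log is an assumption made here to keep zeros finite; the column names are illustrative only.

```r
# Illustrative only: log and square-root versions of a non-negative feature.
x <- c(0, 1, 9, 99)
transformed <- data.frame(
  x      = x,
  x_log  = log1p(x),  # log(1 + x); assumed variant that avoids -Inf at zero
  x_sqrt = sqrt(x)
)
```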

featureInteractions

[logical | Optional] Should feature interactions be computed for numeric and integer features. Default of TRUE
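
The kind of interaction computed is not specified above. As one common possibility, the sketch below builds pairwise products of the numeric columns; the helper pairwise_products and the "_x_" naming are hypothetical.

```r
# Hypothetical helper: pairwise products of numeric columns (assumed interaction type).
# Requires at least two numeric columns.
pairwise_products <- function(df) {
  num   <- df[vapply(df, is.numeric, logical(1))]
  pairs <- combn(names(num), 2, simplify = FALSE)
  out   <- lapply(pairs, function(p) num[[p[1]]] * num[[p[2]]])
  names(out) <- vapply(pairs, paste, character(1), collapse = "_x_")
  as.data.frame(out)
}

interactions <- pairwise_products(iris)  # 4 numeric columns give 6 pairs
```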

unsupervisedFeatures

[logical | Optional] Should unsupervised features be created for numeric and integer features. Uses k-means to create clusters on a feature and then calculates the distance to the cluster centre, which is the final feature. Default of TRUE
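
The distance-to-centre idea can be sketched with stats::kmeans on a single feature. The helper name, the choice of 3 clusters and the use of absolute distance are assumptions for illustration.

```r
# Hypothetical helper: distance from each value to its assigned k-means centre.
kmeans_distance_feature <- function(x, centers = 3, seed = 1991) {
  set.seed(seed)                         # k-means starts from random centres
  fit <- kmeans(x, centers = centers)
  # Look up each observation's cluster centre, then take the absolute distance.
  unname(abs(x - fit$centers[fit$cluster, 1]))
}

dist_feature <- kmeans_distance_feature(iris$Petal.Length)
```

Values close to zero sit near the middle of their cluster, while larger values sit between clusters or in a cluster's tail.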

maxUniques

[integer | Optional] Maximum number of unique values in the target feature before it is seen as a regression problem. Default of 100, i.e. up to 100 categories to classify
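
The rule can be pictured as follows. The helper guess_problem_type is hypothetical, and treating exactly max_uniques unique values as classification is an assumption about the boundary.

```r
# Hypothetical helper mirroring the maxUniques rule described above.
guess_problem_type <- function(target, max_uniques = 100) {
  if (is.null(target)) return("unsupervised")
  if (length(unique(target)) > max_uniques) "regression" else "classification"
}

type_species <- guess_problem_type(iris$Species)                  # 3 unique values
type_numeric <- guess_problem_type(seq(0, 1, length.out = 500))   # 500 unique values
```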

autoCode

[logical | Optional] Should production code be written and returned whilst cleaning and engineering the dataset. Default of TRUE

seed

[integer | Optional] Random number seed for reproducible results. Default of 1991

saveCode

[logical | Optional] Should the code that is generated be saved locally. Default of FALSE

removeIDFeatures

[logical | Optional] Should ID features be removed from the cleaned and engineered dataset. Default of FALSE

codePath

[character | Optional] Path where the generated code is saved. Default of NULL

codeFilename

[character | Optional] Name of the file in which the code will be saved. Default of autoCode

verbose

[logical | Optional] Chatty function or not. Default of TRUE

Value

List containing a data.frame with the cleaned and engineered features, as well as the generated code when autoCode is TRUE

Author(s)

Xander Horn

Examples

temp <- autoPreProcess(train = iris, target = "Species", removeDupObs = FALSE)

XanderHorn/autoML documentation built on Aug. 5, 2020, 11:45 a.m.