autoPreProcess: Automated data cleaning and feature engineering for machine...
In XanderHorn/autoML: Automated machine learning


Automatically cleans and engineers data for machine learning problems. Cleaning involves imputation, outlier clipping, feature formatting, correction of incorrectly formatted missing values, and removal of duplicate observations. Feature engineering involves creating tracking features, which identify where changes such as imputation were made to the data; transformations of numerical features; categorical feature encodings; unsupervised features that use k-means to calculate the distance from the cluster centre; and feature interactions. A random forest model decides on the best transformations and categorical encodings, and only those survive. Depending on the cleaning and engineering chosen, production code is produced and returned in a data.frame object. If this code frame is then provided to the autoLearn function, the code is altered to match the model-specific features.
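The unsupervised feature described above can be sketched in base R. This is a minimal illustration only, not the package's implementation: the helper name `kmeans_distance`, the choice of `k = 3`, the use of `nstart` and the absolute distance are all assumptions made for the example.

```r
# Hypothetical helper, not part of autoML: distance from each value to the
# centre of the k-means cluster it was assigned to.
kmeans_distance <- function(x, k = 3, seed = 1991) {
  set.seed(seed)
  # stats::kmeans expects a matrix; nstart = 10 tries several random starts
  fit <- stats::kmeans(matrix(x, ncol = 1), centers = k, nstart = 10)
  # Look up the centre of each observation's cluster, then take the distance
  centre <- unname(fit$centers[fit$cluster, 1])
  abs(x - centre)
}

x <- c(1, 2, 3, 10, 11, 12, 50, 51, 52)
dist_feature <- kmeans_distance(x)
```

The result is a single numeric feature of the same length as the input, which a downstream model can use alongside the original feature.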

autoPreProcess(train, target = NULL, id = NULL, removeDupObs = TRUE,
  downSample = FALSE, correctMissEncode = TRUE, numMissEncode = NULL,
  charMissEncode = c("", " ", "UNKNOWN", "MISS", "MISSING", "UNK", "NA",
  "NULL", "N/A"), formatFeatures = TRUE, trackingFeatures = TRUE,
  clipOutliers = TRUE, outlierMethod = "tukey", lowPercentile = 0.01,
  upPercentile = 0.99, imputeMissing = TRUE,
  categoricalMinPercent = 0.025, catFeatMaxLevels = 7, numChars = 60,
  featureTransformations = TRUE, featureInteractions = TRUE,
  unsupervisedFeatures = TRUE, maxUniques = 100, autoCode = TRUE,
  seed = 1991, saveCode = FALSE, removeIDFeatures = FALSE,
  codePath = NULL, codeFilename = "autoCode", verbose = TRUE)
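A call might look like the sketch below. The argument names come from the usage signature; the values, the use of the built-in `iris` data and the assumption that the installed package namespace is `autoML` are illustrative. The call is guarded so that it only runs when the package is available, and the result is inspected with `str()` rather than assumed to have a particular structure.

```r
# Arguments taken from the usage signature; the values are illustrative.
args <- list(
  train = iris,
  target = "Species",
  outlierMethod = "percentile",
  lowPercentile = 0.05,
  upPercentile = 0.95,
  featureInteractions = FALSE,
  verbose = FALSE
)

# Only run the call when the package is installed (namespace name assumed).
if (requireNamespace("autoML", quietly = TRUE)) {
  res <- do.call(autoML::autoPreProcess, args)
  str(res, max.level = 1)
}
```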

`train`	[data.frame \| Required] Dataset to perform cleaning and engineering on, usually the training set but can be the full set as well.
`target`	[character \| Optional] Leave NULL if the problem is unsupervised, otherwise specify the target feature.
`id`	[character \| Optional] ID features are automatically detected and excluded from cleaning and engineering; the dataset is also de-duplicated according to the ID feature(s) specified. Specifying auto searches for ID features automatically. For best performance, specify the ID features or leave as NULL. Default of NULL
`removeDupObs`	[logical \| Optional] Should duplicate observations be removed using the ID features detected or specified. Default of TRUE
`downSample`	[logical \| Optional] Should the dataset be downsampled for faster computation. Default of FALSE
`correctMissEncode`	[logical \| Optional] Should incorrectly formatted missing values be corrected and replaced by NA. Default of TRUE
`numMissEncode`	[numeric vector \| Optional] Vector of numeric values which indicate missing data. Default of NULL
`charMissEncode`	[character vector \| Optional] Vector of character values which indicate missing data. Default of c("", " ", "UNKNOWN", "MISS", "MISSING", "UNK", "NA", "NULL", "N/A")
`formatFeatures`	[logical \| Optional] Should feature classes be formatted according to a recommended formatting scheme. Default of TRUE
`trackingFeatures`	[logical \| Optional] Should tracking features be created when cleaning the data. Useful for tree based models. Default of TRUE
`clipOutliers`	[logical \| Optional] Should outliers be clipped by the median value. Default of TRUE
`outlierMethod`	[character \| Optional] Which outlier method to use when searching for outliers, options are: tukey, percentile. Default of tukey
`lowPercentile`	[numeric \| Optional] When the percentile outlier method is specified, values in a feature that are lower than this percentile are flagged as outliers. Default of 0.01
`upPercentile`	[numeric \| Optional] When the percentile outlier method is specified, values in a feature that are greater than this percentile are flagged as outliers. Default of 0.99
`imputeMissing`	[logical \| Optional] Should missing data be imputed. Default of TRUE
`categoricalMinPercent`	[numeric \| Optional] Minimum percentage of categorical class proportions allowed to flag a class as minority in nature. Default of 0.025
`catFeatMaxLevels`	[integer \| Optional] Maximum number of categories allowed for a categorical feature to be one-hot encoded. Categorical features with this many levels or fewer are one-hot encoded. Default of 7
`numChars`	[integer \| Optional] Number of characters in a character feature to identify it as a text feature and engineer it accordingly. Default of 60
`featureTransformations`	[logical \| Optional] Should feature transformations be computed for numeric and integer features. Log and square-root transformations are used. Default of TRUE
`featureInteractions`	[logical \| Optional] Should feature interactions be computed for numeric and integer features. Default of TRUE
`unsupervisedFeatures`	[logical \| Optional] Should unsupervised features be created for numeric and integer features. Uses k-means to create clusters on a feature and then calculates the distance to the cluster centre, which is the final feature. Default of TRUE
`maxUniques`	[integer \| Optional] Maximum number of unique values in the target feature before it is seen as a regression problem. Default of 100, i.e. 100 categories to classify
`autoCode`	[logical \| Optional] Should production code be written and returned whilst cleaning and engineering the dataset. Default of TRUE
`seed`	[integer \| Optional] Random number seed for reproducible results. Default of 1991
`saveCode`	[logical \| Optional] Should the code that is generated be saved locally. Default of FALSE
`removeIDFeatures`	[logical \| Optional] Should ID features be removed from the cleaned and engineered dataset. Default of FALSE
`codePath`	[character \| Optional] Path dictating where the code is saved. Default of NULL
`codeFilename`	[character \| Optional] Name of the file in which the code will be saved. Default of autoCode
`verbose`	[logical \| Optional] Chatty function or not. Default of TRUE
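
The cleaning arguments above (`numMissEncode`, `charMissEncode`, `outlierMethod`, `lowPercentile`, `upPercentile`, `clipOutliers`, `trackingFeatures`) can be illustrated with a small base R sketch. It is not the package's code: the helper `flag_outliers`, the 1.5 x IQR Tukey fences, the example data and the name of the tracking column are assumptions made for the example.

```r
# Hypothetical helper, not part of autoML: flag outliers in a numeric vector.
flag_outliers <- function(x, method = c("tukey", "percentile"),
                          lowPercentile = 0.01, upPercentile = 0.99) {
  method <- match.arg(method)
  if (method == "tukey") {
    # Conventional Tukey fences: 1.5 * IQR beyond the quartiles (assumed)
    q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    bounds <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
  } else {
    bounds <- stats::quantile(x, c(lowPercentile, upPercentile),
                              na.rm = TRUE, names = FALSE)
  }
  !is.na(x) & (x < bounds[1] | x > bounds[2])
}

# Correct missing-value encodings: sentinel values become NA
num <- c(10, 12, 11, 13, 12, 11, 500, -999, NA)
num[num %in% c(-999)] <- NA                       # numMissEncode-style
chr <- c("a", "UNKNOWN", " ", "b")
chr[chr %in% c("", " ", "UNKNOWN", "MISS", "MISSING",
               "UNK", "NA", "NULL", "N/A")] <- NA  # charMissEncode-style

# Flag outliers, keep a tracking feature, then replace them with the median
is_out <- flag_outliers(num, "tukey")
num_outlier_flag <- as.integer(is_out)            # tracking feature
num_clean <- num
num_clean[is_out] <- stats::median(num, na.rm = TRUE)
```

Keeping the 0/1 tracking column alongside the cleaned feature preserves the information that a value was changed, which is the role the tracking features play for tree-based models.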