explore.pipelines: Machine learning pipeline optimization
In XanderHorn/lazy: Automated data pre-processing and machine learning


Machine learning pipelines consist of various methods for data cleaning and feature engineering.

explore.pipelines(train, valid, id.feats = NULL, x = NULL, y,
  cluster.memory = NULL, max.runtime.mins = 10,
  reduce.dimensionality = TRUE, max.levels = 100, progress = TRUE,
  seed = 1, cluster.shutdown = FALSE)

`train`	[required \| data.frame] Training set before any feature engineering or data cleaning is done.
`valid`	[required \| data.frame] Validation set before any feature engineering or data cleaning is done.
`id.feats`	[optional \| character \| default=NULL] Names of ID features. Used to de-duplicate the training dataset given ID features, if nothing is provided then no de-duplication is done.
`x`	[optional \| character \| default=NULL] Features to include as predictors in the training and validation sets. If NULL then all features in the dataset will be used except for the target feature and ID features.
`y`	[required \| character] The name of the target feature contained in the training and validation sets.
`cluster.memory`	[optional \| integer \| default=NULL] The maximum memory, in GB, allocated to the H2O cluster.
`max.runtime.mins`	[optional \| integer \| default=10] The maximum run time in minutes for the function to identify the best possible pipelines. Recommended to increase for datasets with a large number of columns or multi-class problems.
`reduce.dimensionality`	[optional \| logical \| default=TRUE] Reduces dimensionality by computing feature importances for each feature and only keeping the top 10 numerical and categorical features. All other feature types are kept along with the top performing features. Used to speed up pipeline search. If the number of features in the dataset is greater than 80, dimensionality will be reduced, else the data is used as is.
`max.levels`	[optional \| numeric \| default=100] The maximum number of unique values in the target feature before it is considered a regression problem.
`progress`	[optional \| logical \| default=TRUE] Display a progress bar.
`seed`	[optional \| integer \| default=1] Random number seed for reproducible results.
`cluster.shutdown`	[optional \| logical \| default=FALSE] Shut down the H2O cluster after completion.
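The interaction between `id.feats`, `x`, `y` and `max.levels` can be illustrated with a small base R sketch. This is not the package's internal code: the toy data frame and the helper `guess_problem_type` are hypothetical, and the rule that a target with exactly two unique values counts as binary is an assumption made for illustration.

```r
# Toy training frame with a duplicated ID (illustrative data only)
train <- data.frame(
  id     = c(1, 2, 2, 3),
  age    = c(34, 51, 51, 28),
  target = c("yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

id.feats   <- "id"
x          <- NULL
y          <- "target"
max.levels <- 100

# De-duplicate on the ID features, only when some are supplied
if (!is.null(id.feats)) {
  train <- train[!duplicated(train[, id.feats, drop = FALSE]), , drop = FALSE]
}

# Default predictors: every column except the target and the ID features
if (is.null(x)) {
  x <- setdiff(names(train), c(y, id.feats))
}

# Hypothetical helper: guess the problem type from the target's cardinality.
# Assumption: two unique values means binary, up to max.levels means
# multiclass, and anything above that is treated as regression.
guess_problem_type <- function(target, max.levels = 100) {
  n_unique <- length(unique(target))
  if (n_unique == 2) {
    "binary"
  } else if (n_unique <= max.levels) {
    "multiclass"
  } else {
    "regression"
  }
}

problem <- guess_problem_type(train[[y]], max.levels)
```

With these inputs the duplicated ID row is dropped, `x` falls back to the single remaining predictor, and the target is classed as binary.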

Constructs a grid with all combinations of pipelines, then randomly shuffles the grid and explores pipelines along it. The function runs until the maximum run time in minutes has been reached. Pipelines are evaluated using a random forest and a lasso GLM. The mean performance of the two models is also calculated and returned. For binary classification problems, Gini is used to evaluate pipelines. For multiclass classification, logloss is used. For regression, MSE is used. The training set and the validation set are each down-sampled to a maximum of 40k observations for faster pipeline exploration.
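The search strategy described above (full grid, random shuffle, time-boxed exploration) can be sketched in base R as follows. The pipeline option names, the `downsample` helper and the `score_pipeline` function are hypothetical. The scorer returns random numbers in place of real random forest and GLM metrics, so only the control flow is meaningful here.

```r
# Hypothetical helper: cap a data set at max.rows observations
downsample <- function(df, max.rows = 40000) {
  if (nrow(df) <= max.rows) return(df)
  df[sample(nrow(df), max.rows), , drop = FALSE]
}

# Hypothetical pipeline options; the real grid used by the package is not listed here
grid <- expand.grid(
  impute = c("median", "mean"),
  encode = c("onehot", "ordinal"),
  scale  = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Shuffle the grid so a time-limited run samples it at random
set.seed(1)
grid <- grid[sample(nrow(grid)), , drop = FALSE]

# Stand-in scorer: random numbers instead of real model metrics
score_pipeline <- function(cfg) {
  rf  <- runif(1)
  glm <- runif(1)
  c(rf = rf, glm = glm, mean = mean(c(rf, glm)))
}

max.runtime.mins <- 0.5
started <- Sys.time()
rows <- list()

for (i in seq_len(nrow(grid))) {
  # Stop as soon as the time budget is spent
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "mins"))
  if (elapsed >= max.runtime.mins) break
  rows[[length(rows) + 1]] <- cbind(grid[i, , drop = FALSE],
                                    t(score_pipeline(grid[i, ])))
}

# One row per explored pipeline, with per-model and mean scores
explored <- do.call(rbind, rows)
```

Because the grid is shuffled before the loop starts, stopping early still yields a random sample of pipelines instead of only the first few combinations in grid order.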
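For reference, the three evaluation metrics named above can be written in a few lines of base R. This is a minimal sketch using common textbook definitions, not the package's or H2O's implementation. In particular, it assumes Gini is the normalized form 2 * AUC - 1, and the function names are illustrative.

```r
# AUC via the rank-sum (Mann-Whitney) formulation; actual is coded 0/1
auc <- function(actual, prob) {
  r     <- rank(prob)  # average ranks handle tied predictions
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Assumed definition: Gini = 2 * AUC - 1
gini <- function(actual, prob) 2 * auc(actual, prob) - 1

# Multiclass logloss; actual holds the class index (1-based) of each row,
# prob_matrix has one row per observation and one column per class
logloss <- function(actual, prob_matrix, eps = 1e-15) {
  p <- prob_matrix[cbind(seq_along(actual), actual)]
  p <- pmin(pmax(p, eps), 1 - eps)  # clip to avoid log(0)
  -mean(log(p))
}

# Mean squared error for regression
mse <- function(actual, predicted) mean((actual - predicted)^2)
```

Higher Gini is better, while lower logloss and lower MSE are better, so any ranking of pipelines has to account for the direction of the metric in use.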

List containing the best pipelines, a summary frame, and pipeline plots.

Xander Horn

# Iris dataset used for both train and validation for demonstration purposes only
res <- explore.pipelines(train = iris, valid = iris, y = "Species")