automl: Automated machine learning


View source: R/automl.R

Description

Automated machine learning with automated feature engineering via pipeline exploration and optimization. h2o.automl is used as the modelling engine. Duplicate observations are removed based on the id features provided to the function. Time-sensitive partitioning is performed if a time partitioning feature is provided. The function is time-bound for both pipeline optimization and model optimization. Because the function uses the h2o library, models are saved locally so that they can be loaded into an h2o cluster at a later stage for scoring. Note that stacked ensembles are only trained if cv.folds is set to 3 or above.

Usage

automl(train, y, valid = NULL, test = NULL, x = NULL,
  id.feats = NULL, time.partition.feature = NULL,
  optimization.metric = "AUTO", valid.split = 0.1, test.split = 0.2,
  pipeline.search.max.runtime.mins = 30,
  automl.search.max.runtime.mins = 30, balance.classes = FALSE,
  models = c("DRF", "GLM", "GBM", "XGBoost", "DeepLearning",
  "StackedEnsemble"), cv.folds = 0, max.levels = 100,
  data.leakage.cutoff = 0.65, cluster.memory = NULL,
  min.feature.importance = 0.1, seed = 1, output.path = NULL,
  pipeline = NULL, return.data = TRUE)

Arguments

train

[required | data.frame] Training set. If no validation and test sets are provided, it is treated as the full data set, and the validation and test sets are created from it.

y

[required | character] The name of the target feature contained in the training and validation sets.

valid

[optional | data.frame | default=NULL] Validation set used to optimize model hyperparameters and to evaluate models against.

test

[optional | data.frame | default=NULL] Test set for model validation.

x

[optional | character | default=NULL] A character vector of predictor features to use. If left NULL then all features in the dataset except for the id, target and time partitioning features will be used.

id.feats

[optional | character | default=NULL] The name or names of id features that will be used to de-duplicate the training set.
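As a rough illustration of de-duplicating on id features, the base R sketch below keeps the first row for each combination of ids. The data frame and column names are hypothetical, and this is not necessarily how the package implements the step.

```r
# Hypothetical data: rows 1 and 2 share the same customer_id / order_id pair.
train <- data.frame(
  customer_id = c(1, 1, 2, 3, 3),
  order_id    = c(10, 10, 20, 30, 31),
  amount      = c(5.0, 5.0, 7.5, 2.0, 9.0)
)
id.feats <- c("customer_id", "order_id")

# Flag every row whose id combination has already been seen.
# drop = FALSE keeps a data.frame even when only one id feature is given.
is.dup  <- duplicated(train[, id.feats, drop = FALSE])

# Keep only the first occurrence of each id combination.
deduped <- train[!is.dup, , drop = FALSE]
```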

time.partition.feature

[optional | character | default=NULL] The name of the time partitioning feature that will be used to create time sensitive train, validation and test sets.
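One common way to build time-sensitive partitions is to sort by the time feature and cut the ordered rows into consecutive blocks, so that validation and test rows are always later than training rows. The sketch below assumes that approach and uses a hypothetical event_date column; the package's own partitioning logic may differ.

```r
set.seed(1)

# Hypothetical data with 100 unique dates in shuffled order.
df <- data.frame(
  event_date = as.Date("2020-01-01") + sample(0:99),
  value      = rnorm(100)
)
valid.split <- 0.1
test.split  <- 0.2

# Sort by time so that later rows are more recent.
df <- df[order(df$event_date), , drop = FALSE]

n       <- nrow(df)
n.valid <- round(n * valid.split)
n.test  <- round(n * test.split)
n.train <- n - n.valid - n.test

# Oldest rows go to train, the next block to validation, the newest to test.
train.part <- df[seq_len(n.train), ]
valid.part <- df[n.train + seq_len(n.valid), ]
test.part  <- df[n.train + n.valid + seq_len(n.test), ]
```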

optimization.metric

[optional | character | default="AUTO"] The metric that models should optimize when learning. Options include AUC, logloss, mean_per_class_error, RMSE, MSE, mean_residual_deviance, MAE, RMSLE. When set to AUTO, AUC is used for binary classification, mean_per_class_error for multi-class classification and mean_residual_deviance for regression problems.
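The AUTO rule described above can be written out as a small lookup. The helper name resolve.metric and the problem type labels are hypothetical and only illustrate the mapping stated in this argument's description.

```r
# Hypothetical helper: returns the metric that "AUTO" would resolve to.
resolve.metric <- function(optimization.metric, problem.type) {
  # An explicit metric is passed through unchanged.
  if (optimization.metric != "AUTO") {
    return(optimization.metric)
  }
  switch(problem.type,
    binary     = "AUC",
    multiclass = "mean_per_class_error",
    regression = "mean_residual_deviance",
    stop("unknown problem type: ", problem.type)
  )
}

resolve.metric("AUTO", "binary")
resolve.metric("RMSE", "regression")
```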

valid.split

[optional | numeric | default=0.1] The proportion of data to allocate to the validation set. If no time partitioning is done, stratified random sampling is used.

test.split

[optional | numeric | default=0.2] The proportion of data to allocate to the test set. If no time partitioning is done, stratified random sampling is used.
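Stratified random sampling draws the same fraction of rows from each class, so the class balance of the target is preserved in every partition. A minimal base R sketch on the iris data, assuming Species is the target (the package may stratify differently):

```r
set.seed(1)
test.split <- 0.2

# Group row indices by class of the target.
idx.by.class <- split(seq_len(nrow(iris)), iris$Species)

# Sample the same fraction of rows from every class for the test set.
test.idx <- unlist(lapply(idx.by.class, function(idx) {
  sample(idx, size = round(length(idx) * test.split))
}), use.names = FALSE)

test.part  <- iris[test.idx, ]
train.part <- iris[-test.idx, ]

# Each species contributes the same number of test rows.
table(test.part$Species)
```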

pipeline.search.max.runtime.mins

[optional | integer | default=30] The number of minutes allocated to optimizing pre-processing pipelines.

automl.search.max.runtime.mins

[optional | integer | default=30] The number of minutes allocated to training models on the optimized dataset with h2o.automl.

balance.classes

[optional | logical | default=FALSE] Whether class imbalances should be corrected by either up-sampling minority cases or down-sampling majority cases.

models

[optional | character | default=c("DRF","GLM","GBM","XGBoost","DeepLearning","StackedEnsemble")] The models to fit when running h2o.automl. Note that XGBoost is not available on Windows operating systems.

cv.folds

[optional | integer | default=0] The number of folds to cross validate models on. Any value less than 3 will perform no cross validation.

max.levels

[optional | integer | default=100] The maximum number of unique values in the target feature before it is seen as a regression problem.
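The rule above amounts to counting the unique target values and comparing the count to max.levels. The helper below is a hypothetical sketch of that decision; the split into binary and multi-class below the threshold is an assumption based on the optimization.metric description.

```r
# Hypothetical helper: infer the problem type from the target's cardinality.
guess.problem.type <- function(target, max.levels = 100) {
  n.levels <- length(unique(target[!is.na(target)]))
  if (n.levels > max.levels) {
    "regression"
  } else if (n.levels == 2) {
    "binary"
  } else {
    "multiclass"
  }
}

guess.problem.type(iris$Species)       # 3 unique values
guess.problem.type(seq_len(500) / 7)   # 500 unique values
```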

data.leakage.cutoff

[optional | numeric | default=0.65] The AUC cutoff used to decide whether a feature is predictive of the testing set. Features with an AUC greater than the cutoff are removed from the feature set.
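One plausible reading of this check is a per-feature test of how well a feature separates training rows from testing rows: an AUC near 0.5 means the feature looks the same in both sets, while an AUC near 1 means it drifts between them. The sketch below assumes that reading and uses a rank-based AUC on hypothetical data; the package's actual procedure is not documented here and may differ.

```r
# Rank-based (Mann-Whitney) AUC; label is 1 for test rows and 0 for train rows.
auc.rank <- function(score, label) {
  r     <- rank(score)   # ties receive average ranks
  n.pos <- sum(label == 1)
  n.neg <- sum(label == 0)
  (sum(r[label == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Hypothetical data: "stable" has the same distribution in both sets,
# while "drift" takes strictly larger values in the test set.
train.part <- data.frame(stable = seq_len(200) %% 10, drift = 1:200)
test.part  <- data.frame(stable = seq_len(100) %% 10, drift = 201:300)

combined <- rbind(train.part, test.part)
is.test  <- rep(c(0, 1), times = c(nrow(train.part), nrow(test.part)))

data.leakage.cutoff <- 0.65
aucs <- sapply(names(combined), function(f) {
  a <- auc.rank(combined[[f]], is.test)
  max(a, 1 - a)   # direction of the separation does not matter
})

# Features whose AUC exceeds the cutoff would be dropped.
leaky <- names(aucs)[aucs > data.leakage.cutoff]
```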

cluster.memory

[optional | integer | default=NULL] The maximum memory allocated to the h2o cluster, in gigabytes. If NULL, memory is assigned automatically.

min.feature.importance

[optional | numeric | default=0.1] The minimum scaled feature importance a feature needs in order to be kept. Features below this value are removed from the feature space.
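As a small worked example, assume scaled importance means each importance divided by the largest one, so the top feature scores 1. The importance values below are made up, and whether the package treats the threshold as inclusive is an assumption.

```r
# Hypothetical raw importances from some fitted model.
importance <- c(age = 40, income = 25, zip = 3, noise = 1)

# Scale so that the most important feature has a value of 1.
scaled <- importance / max(importance)

# Keep features at or above the threshold; the rest would be removed.
min.feature.importance <- 0.1
keep <- names(scaled)[scaled >= min.feature.importance]
drop <- setdiff(names(scaled), keep)
```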

seed

[optional | integer | default=1] Random seed value for reproducible results.

output.path

[optional | character | default=NULL] Path where function output will be saved. If NULL, output is saved to the current working directory.

pipeline

[optional | list | default=NULL] A pre-defined pipeline to train models on.

return.data

[optional | logical | default=TRUE] Return the pre-processed train, validation and test sets.

Value

A list of objects, with output also written to the specified path (see output.path).

Author(s)

Xander Horn

Examples

## NOT RUN
res <- automl(train = iris, valid = iris, y = "Species")

XanderHorn/lazy documentation built on Jan. 16, 2021, 6:15 p.m.