startml: startml


Description

startml is designed to run automatic hyperparameter searches for deep learning, gradient boosted machine, and random forest models. It selects the best models and combines or ensembles them, with the aim of producing good predictions from an ensemble or a highly skilled single model using just one function call. Machine learning algorithms are provided by h2o and run on the h2o JVM platform outside of the R workspace. Thus, much of the functionality in startml is scalable. Currently, startml only supports regression and binary classification.

Usage

startml(labeled_data, newdata, y, x = NULL, label_id = NULL, y_type,
  algorithms = c("deeplearning", "randomForest", "gbm"),
  eval_metric = "AUTO", validation_type = "shared_holdout",
  percent_valid_holdout = 10, percent_test_holdout = 10,
  runtime_secs = 10, split_seed = NULL, trim = FALSE,
  number_top_models = NULL, eval_threshold = NULL,
  correlation_threshold = 0, return_dataframe = FALSE, wd = getwd())
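
A minimal call might look like the sketch below. It assumes the h2o and startml packages are installed and that a local H2O cluster can be started. The mtcars data, the 24/8 row split, the choice of "am" as the target, and the helper name run_startml_example are illustrative assumptions, not part of the package documentation.

```r
# Hypothetical end-to-end call, wrapped in a function so nothing runs on load.
run_startml_example <- function() {
  library(h2o)
  library(startml)

  h2o.init()  # start or connect to a local H2O cluster

  # Illustrative data: treat mtcars$am (0/1) as a binary target.
  # Assumption: a factor target is used for classification, as is usual in H2O.
  df <- mtcars
  df$am <- as.factor(df$am)

  labeled <- as.h2o(df[1:24, ])
  # newdata is unlabeled, so the target column is dropped here
  unlabeled <- as.h2o(df[25:32, setdiff(names(df), "am")])

  fit <- startml(
    labeled_data = labeled,
    newdata      = unlabeled,
    y            = "am",
    y_type       = "discrete",
    algorithms   = c("randomForest", "gbm"),
    runtime_secs = 10,
    split_seed   = 1234
  )

  # The Value section lists a "models" slot on the returned mlblob object
  message("Models returned: ", length(fit@models))
  fit
}
```

Calling run_startml_example() would start the searches. With two algorithms and runtime_secs = 10, the grid searches themselves should take roughly 20 seconds, not counting cluster start-up.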

Arguments

labeled_data

H2O frame object containing labeled data for model training. No Default.

newdata

H2O frame object containing unlabeled data for model predictions. No Default.

y

Character object of length 1 identifying the column name of the target variable. No Default.

x

Character object of length 1 or more identifying the column name(s) of the input variables. Default is NULL, which uses all remaining variables in labeled_data as inputs. newdata must contain all of these input column names.

label_id

Character object of length 1 identifying the name of the column of observation IDs in labeled_data. If used, must match column of same name in newdata. startml will ignore this column as an input, but include it as an ID column in prediction outputs.

y_type

Character object of length 1 identifying the type of the target variable. Can be "continuous" or "discrete". "continuous" automatically creates regression models, and "discrete" automatically creates binomial models. Currently, startml only supports regression and binary classification.

algorithms

Character object of length 1 to 3 specifying which algorithms to automatically train. The autotrain function will run a separate grid search for each algorithm type. Choices are "deeplearning", "randomForest", and "gbm", following the naming convention in H2O version 3. Defaults to c("deeplearning", "randomForest", "gbm").

eval_metric

Character object defining the evaluation metric for training. Default is "AUTO", which uses the built-in H2O automatic choice for the target data type.

validation_type

Defines the validation type for training models. Defaults to "shared_holdout", indicating that all models built with all algorithms share the same validation set. Currently, this is the only option in autotrain. Planned types include "random_holdout", where each model will get a unique randomized sample of labeled data for validation, and "xval", in which the cross-validation functionality in H2O will be used for every model.

percent_valid_holdout

Numeric object with a value from 0 to 100. Sets the percent of the labeled data that will be used for holdout validation. Default is 10. Ignored if validation_type = "xval". Currently, startml only supports "shared_holdout" validation.

percent_test_holdout

Numeric object of value 0 to 100. Sets the percent of the labeled data that will be used for test holdout for model selection. Default is 10.

runtime_secs

Numeric object which sets the length of time, in seconds, that each grid search will run. Defaults to 10, so the default total search time is 10 sec * (number of algorithms) = 30 seconds.
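
The runtime arithmetic can be checked with a few lines of base R. This is only a sketch of the search-time budget: it uses the runtime_secs value shown in the usage signature and ignores any overhead for data splitting, model selection, or prediction.

```r
algorithms   <- c("deeplearning", "randomForest", "gbm")
runtime_secs <- 10  # value shown in the usage signature

# One grid search per algorithm, each limited to runtime_secs
total_search_secs <- runtime_secs * length(algorithms)
total_search_secs
```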

split_seed

Random seed for splitting labeled data into train, validation, and test components. Currently, startml only supports random sampling splits; this argument sets the random seed for these splits, making the data set separation process reproducible. Since this is a "naive" random split, labeled data should be shuffled beforehand.
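
To illustrate what a seeded naive random split does, here is a base R sketch on an ordinary data frame. The helper split_labeled and its argument names are hypothetical, and this is not how startml itself is implemented; it only shows how the two holdout percentages and a seed can determine a reproducible train/validation/test partition.

```r
# Hypothetical helper: naive seeded random split of a data frame.
split_labeled <- function(data, percent_valid = 10, percent_test = 10,
                          seed = NULL) {
  stopifnot(percent_valid >= 0, percent_test >= 0,
            percent_valid + percent_test <= 100)
  if (!is.null(seed)) set.seed(seed)  # makes the permutation reproducible

  n       <- nrow(data)
  idx     <- sample.int(n)            # random permutation of row indices
  n_valid <- floor(n * percent_valid / 100)
  n_test  <- floor(n * percent_test / 100)
  n_train <- n - n_valid - n_test

  valid_idx <- idx[seq_len(n_valid)]
  test_idx  <- idx[n_valid + seq_len(n_test)]
  train_idx <- idx[n_valid + n_test + seq_len(n_train)]

  list(
    train = data[train_idx, , drop = FALSE],
    valid = data[valid_idx, , drop = FALSE],
    test  = data[test_idx, , drop = FALSE]
  )
}

parts <- split_labeled(mtcars, percent_valid = 10, percent_test = 10, seed = 42)
sapply(parts, nrow)
```

On an H2O frame, h2o.splitFrame with its ratios and seed arguments offers a comparable seeded split; whether startml uses it internally is not stated here.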

trim

Boolean. When TRUE, output is trimmed with eval_threshold, correlation_threshold, or number_top_models. When FALSE, all models are returned. Default FALSE.

number_top_models

Numeric object indicating the number of top models to return. Default is NULL. If the number entered is greater than the number of models, the whole model list is returned.

eval_threshold

Numeric object defining the performance threshold models must meet to be used in prediction. It is a minimum for metrics that are maximized (e.g., AUC) and a maximum for metrics that are minimized (logloss, MSE, etc.). Default is NULL, which returns models without performance consideration.
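
The direction of the threshold can be sketched in base R as follows. The helper filter_by_threshold, its maximize flag, and the model names and scores are hypothetical and used only to illustrate the minimum-versus-maximum rule; they are not part of the startml API.

```r
# Hypothetical helper: keep models whose metric meets the threshold.
filter_by_threshold <- function(scores, eval_threshold = NULL,
                                maximize = TRUE) {
  if (is.null(eval_threshold)) return(scores)  # no performance filtering
  if (maximize) {
    scores[scores >= eval_threshold]  # threshold acts as a minimum (e.g., AUC)
  } else {
    scores[scores <= eval_threshold]  # threshold acts as a maximum (e.g., MSE)
  }
}

# Made-up scores for three illustrative models
auc <- c(gbm_1 = 0.91, drf_1 = 0.84, dl_1 = 0.78)
mse <- c(gbm_1 = 0.12, drf_1 = 0.20, dl_1 = 0.35)

kept_auc <- filter_by_threshold(auc, eval_threshold = 0.80, maximize = TRUE)
kept_mse <- filter_by_threshold(mse, eval_threshold = 0.25, maximize = FALSE)
names(kept_auc)
names(kept_mse)
```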

correlation_threshold

Numeric object defining the maximum Pearson correlation allowed in the group of resulting models. If two models show high correlation, the one with superior performance will be kept and the other dropped. Value ranges from -1 to 1. Default is 0; because trim defaults to FALSE, models are returned without correlation considered unless trim is set to TRUE.
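
One way to picture correlation-based trimming is a greedy pass over models ranked by performance, sketched below in base R. The helper drop_correlated, its arguments, and the prediction values are hypothetical; the procedure startml actually uses may differ.

```r
# Hypothetical helper: greedily keep the best models whose predictions are
# not too correlated with any model already kept.
drop_correlated <- function(preds, scores, correlation_threshold,
                            maximize = TRUE) {
  # preds:  data frame or matrix, one column of predictions per model
  # scores: named numeric vector, names matching the columns of preds
  ranked <- names(sort(scores, decreasing = maximize))  # best model first
  cors   <- cor(as.matrix(preds), method = "pearson")

  kept <- character(0)
  for (m in ranked) {
    if (length(kept) == 0 || all(cors[m, kept] <= correlation_threshold)) {
      kept <- c(kept, m)
    }
  }
  kept
}

# Made-up predictions: drf_1 is a linear rescaling of gbm_1, dl_1 is not
base  <- c(0.10, 0.40, 0.35, 0.80, 0.90, 0.20)
preds <- data.frame(
  gbm_1 = base,
  drf_1 = base * 0.9 + 0.05,
  dl_1  = c(0.50, 0.10, 0.90, 0.30, 0.20, 0.70)
)
auc <- c(gbm_1 = 0.91, drf_1 = 0.84, dl_1 = 0.78)

kept_models <- drop_correlated(preds, auc, correlation_threshold = 0.95)
kept_models
```

Here drf_1 is dropped because it is almost perfectly correlated with the better-scoring gbm_1, while dl_1 is kept despite its lower score because its predictions are not strongly correlated with gbm_1.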

return_dataframe

Deprecated. Always keep equal to FALSE.

wd

Character object defining the file path where resulting models will be saved. Defaults to the current working directory.

Value

Object of class mlblob, an S4 class. mlblob objects contain all selected models and their predictions on train, validation, test, and new data, and can be plotted using plot() to show a summary of the model group. Slots include: models, a list of h2o model objects, and labeled_data, an h2o frame object equivalent to the labeled_data input object.


andrewsommerlot/startml documentation built on May 5, 2019, 6:58 p.m.