startml is designed to run automatic hyperparameter searches for deep learning, gradient boosted machine, and random forest models. It selects the best models and combines them into an ensemble, with the aim of making good predictions from either an ensemble or a single highly skilled model using just one function call. Machine learning algorithms are provided by h2o and run on the H2O JVM platform outside of the R workspace, so much of the functionality in startml is scalable. Currently, startml only supports regression and binary classification.
startml(labeled_data, newdata, y, x = NULL, label_id = NULL, y_type,
  algorithms = c("deeplearning", "randomForest", "gbm"),
  eval_metric = "AUTO", validation_type = "shared_holdout",
  percent_valid_holdout = 10, percent_test_holdout = 10,
  runtime_secs = 10, split_seed = NULL, trim = FALSE,
  number_top_models = NULL, eval_threshold = NULL,
  correlation_threshold = 0, return_dataframe = FALSE, wd = getwd())
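As a rough illustration of the usage signature, the following sketch wraps a single startml call in a helper function. It assumes the h2o and startml packages are installed, that startml is exported from a package of the same name, and that a local H2O cluster can be started. The function name run_startml_sketch, the file paths, and the target column name are hypothetical.

```r
# Sketch only: assumes h2o and startml are installed and a local H2O cluster
# can be started. The wrapper name, file paths, and column name are hypothetical.
run_startml_sketch <- function(train_csv, new_csv, target = "target") {
  h2o::h2o.init()                             # start or connect to a local H2O JVM
  labeled <- h2o::h2o.importFile(train_csv)   # labeled training data as an H2O frame
  newdata <- h2o::h2o.importFile(new_csv)     # unlabeled data to predict on

  # One call runs the grid searches and returns the selected models.
  startml::startml(
    labeled_data = labeled,
    newdata      = newdata,
    y            = target,
    y_type       = "discrete",                # binary classification
    algorithms   = c("randomForest", "gbm"),  # skip deep learning in this sketch
    runtime_secs = 30,                        # time budget per grid search
    split_seed   = 1234                       # reproducible train/valid/test split
  )
}
```

Nothing runs until the wrapper is called, for example with run_startml_sketch("train.csv", "new.csv"). The total search time would then be roughly 30 seconds for each of the two algorithms.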
labeled_data |
H2O frame object containing labeled data for model training. No Default. |
newdata |
H2O frame object containing unlabeled data for model predictions. No Default. |
y |
Character object of length 1 identifying the column name of the target variable. No Default. |
x |
Character object of length 1 or more identifying the column name(s) of the input variables. Default NULL, which uses all remaining variables in labeled_data as inputs. newdata must contain all of these input column names. |
label_id |
Character object of length 1 identifying the name of the column of observation IDs in labeled_data. If used, must match column of same name in newdata. startml will ignore this column as an input, but include it as an ID column in prediction outputs. |
y_type |
Character object of length 1 identifying the type of data the target variable is. Can be "continuous" or "discrete". Continuous automatically creates regression models, and discrete automatically creates binomial models. Currently, startml only supports regression and binary classification. |
algorithms |
Character object of length 3, 2, or 1, specifying which algorithms to automatically train. The autotrain function will run a separate grid search for each algorithm type. Choices are: "deeplearning", "randomForest", and "gbm", following the naming convention in H2O version 3. Defaults to c("deeplearning", "randomForest", "gbm"). |
eval_metric |
Character object defining the evaluation metric for training. Default is "AUTO", which uses the built-in H2O automatic choice for the target data type. |
validation_type |
Defines the validation type for training models. Defaults to "shared_holdout", indicating that all models built with all algorithms share the same validation set. Currently, this is the only option in autotrain. Planned types include "random_holdout", where each model will get a unique randomized sample of labeled data for validation, and "xval", in which the cross-validation functionality in H2O will be used in every model. |
percent_valid_holdout |
Numeric object of value 0 to 100. Sets the percent of the labeled data that will be used for holdout validation. Default is 10. Ignored if validation_type = "xval". Currently, startml only supports "shared_holdout" validation. |
percent_test_holdout |
Numeric object of value 0 to 100. Sets the percent of the labeled data that will be used for test holdout for model selection. Default is 10. |
runtime_secs |
Numeric object that sets the length of time, in seconds, each grid search will run. The usage signature sets the default to 10, so the default total runtime is 10 sec * (number of algorithms) = 30 seconds. |
split_seed |
Random seed for splitting labeled data into train, validation, and test components. Currently, startml only supports random sampling splits; this argument sets the random seed for these splits, making the data set separation process reproducible. Since this is a "naive" random split, labeled data should be shuffled beforehand. |
trim |
Boolean. When TRUE, output is trimmed with eval_threshold, correlation_threshold, or number_top_models. When FALSE, all models are returned. Default FALSE. |
number_top_models |
Numeric object indicating the number of top models to return. Default is 10. If the number entered is greater than the number of models, the whole model list is returned. |
eval_threshold |
Numeric object defining the performance threshold models must meet to be used in prediction. It is a minimum for metrics that are maximized (e.g., AUC) and a maximum for metrics that are minimized (logloss, MSE, etc.). Default is NULL, which returns models without performance consideration. |
correlation_threshold |
Numeric object defining the maximum Pearson correlation allowed in the group of resulting models. If two models show high correlation, the one with superior performance will be kept and the other dropped. Value ranges from -1 to 1. When NULL, models are returned without correlation considered; the usage signature sets the default to 0. |
return_dataframe |
Deprecated. Always keep equal to FALSE. |
wd |
Character object defining the file path where the resulting models will be saved. Defaults to the current working directory. |
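Two behaviours described in the arguments above, the seeded holdout split (percent_valid_holdout, percent_test_holdout, split_seed) and the trimming of the model list (eval_threshold, correlation_threshold, number_top_models), can be sketched in base R. This is an illustration of the ideas only, not startml's internal code: the function names split_labeled and trim_models, the use of plain data frames and matrices instead of H2O frames, and the order in which the trimming filters are applied are all assumptions.

```r
# Illustrative base-R sketch; not startml's implementation.

# Naive seeded random split into train / validation / test partitions.
split_labeled <- function(df, percent_valid = 10, percent_test = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)          # makes the shuffle reproducible
  n <- nrow(df)
  idx <- sample.int(n)                        # random permutation of row indices
  n_valid <- floor(n * percent_valid / 100)
  n_test  <- floor(n * percent_test / 100)
  pos <- seq_len(n)
  list(
    valid = df[idx[pos <= n_valid], , drop = FALSE],
    test  = df[idx[pos > n_valid & pos <= n_valid + n_test], , drop = FALSE],
    train = df[idx[pos > n_valid + n_test], , drop = FALSE]
  )
}

# Trim a set of candidate models.
# scores: named numeric vector of test metrics, one per model.
# preds:  numeric matrix of test predictions, one named column per model.
trim_models <- function(scores, preds, maximize = TRUE,
                        eval_threshold = NULL,
                        correlation_threshold = NULL,
                        number_top_models = NULL) {
  # Rank model names from best to worst.
  ranked <- names(sort(scores, decreasing = maximize))

  # 1. Performance filter: a minimum when maximizing, a maximum when minimizing.
  if (!is.null(eval_threshold)) {
    ok <- if (maximize) scores[ranked] >= eval_threshold else scores[ranked] <= eval_threshold
    ranked <- ranked[ok]
  }

  # 2. Correlation filter: walk from best to worst and keep a model only if its
  #    predictions are not too correlated with any better model already kept.
  if (!is.null(correlation_threshold) && length(ranked) > 1) {
    kept <- ranked[1]
    for (m in ranked[-1]) {
      r <- cor(preds[, m], preds[, kept, drop = FALSE])  # Pearson by default
      if (all(r <= correlation_threshold)) kept <- c(kept, m)
    }
    ranked <- kept
  }

  # 3. Top-N filter: head() returns everything if N exceeds the list length.
  if (!is.null(number_top_models)) ranked <- head(ranked, number_top_models)
  ranked
}
```

With a fixed seed, split_labeled(df, 10, 10, seed = 42) returns the same partitions on every call. trim_models returns the names of the surviving models, best first.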
Object of class mlblob, an S4 class. mlblob objects contain all selected models and their predictions on train, validation, test, and new data, and can be plotted using plot() to show a summary of the model group. Slots are: models, a list of h2o model objects; labeled_data, an h2o frame object equivalent to the labeled_data input object.
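Since the return value is an S4 object, its slots are read with the @ operator. The helper below is a minimal sketch of inspecting the two slots named above. The function name describe_blob is hypothetical, and it assumes only that the object has models and labeled_data slots as described.

```r
# Sketch: `blob` is assumed to be the S4 object returned by startml(),
# with the `models` and `labeled_data` slots described above.
describe_blob <- function(blob) {
  list(
    n_models   = length(blob@models),        # number of selected h2o models
    data_class = class(blob@labeled_data)    # class of the stored labeled frame
  )
}
```

Calling describe_blob(result) on a startml result gives a quick count of the selected models. plot(result) gives the summary plot mentioned above.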