APML: Develop models with grid search
In APML: An Approach for Machine-Learning Modelling

APML	R Documentation

Develop models with grid search

Description

Develop models with the optimal hyperparameters identified through grid search and return model performance metrics. To run properly, the response must be the first column of the data, and it must be numeric for "gaussian" or a factor for "bernoulli" or "multinomial".
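The layout requirement above can be met with a few lines of base R before calling APML. The following is a minimal sketch and not part of the package: the helper name `response_first` is hypothetical, and the two-class subset of `iris` is used only for illustration.

```r
# Hypothetical helper (not part of APML): move the response column to the
# front and check or coerce its type for the chosen distribution.
response_first <- function(data, response, distribution = "bernoulli") {
  stopifnot(response %in% names(data))
  # Put the response first; keep the other columns in their original order
  data <- data[, c(response, setdiff(names(data), response)), drop = FALSE]
  if (distribution == "gaussian") {
    # A numeric response is expected for regression
    stopifnot(is.numeric(data[[1]]))
  } else if (distribution %in% c("bernoulli", "multinomial")) {
    # droplevels() removes factor levels that no longer occur in the data
    data[[1]] <- droplevels(as.factor(data[[1]]))
  }
  data
}

# Two-class subset of iris; Species is originally the last column
prepared <- response_first(iris[1:100, ], "Species", "bernoulli")
str(prepared[, 1:2])
```

The resulting data frame can then be split into `traindata` and `testdata`, with `xcol` taken from the remaining column names.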

Usage

APML(model, AUC_stopping, xcol, traindata, testdata, hyper,
     distribution, imbalance, sort_by, extra_data, stopping_metric)

Arguments

`model`	The model to be used. Currently, only "gbm" (default) for gradient boosted trees and "rf" for random forest are allowed.
`AUC_stopping`	Logical. If TRUE, the combinations of hyperparameters are searched randomly with AUC-based early stopping. Default: FALSE.
`xcol`	A vector containing the names or indices of the predictors to be used.
`traindata`	The training dataset.
`testdata`	The testing dataset.
`hyper`	List of hyperparameters (e.g., list(ntrees=c(1,2), max_depth=c(5,7))).
`distribution`	Distribution of the outcome: "bernoulli" (default), "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber" or "custom".
`imbalance`	Logical. If TRUE, the class counts in the training data are balanced via over/under-sampling when developing the model. Default: FALSE.
`sort_by`	Metric used to sort the models in the grid and select the best one. Choices include "logloss", "residual_deviance", "mse", "auc", "accuracy", "precision", "recall", "f1", etc.
`extra_data`	Extra dataset for evaluating model performance.
`stopping_metric`	Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing". Defaults to "AUTO".
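To illustrate what the `hyper` argument describes, the sketch below enumerates the combinations implied by a hyperparameter list using base R only. It does not call APML or h2o, and the values are made up for the example. With random search and early stopping (`AUC_stopping = TRUE`), fewer models than this may actually be fitted, so the count is best read as an upper bound.

```r
# Hyperparameter list in the same shape as the `hyper` argument
# (values chosen only for illustration)
hyper <- list(ntrees = c(50, 100), max_depth = c(3, 5, 7))

# expand.grid() builds one row per combination (Cartesian product)
grid <- expand.grid(hyper)
print(grid)

# Number of candidate models in a full Cartesian search
n_models <- nrow(grid)
n_models
```

Checking the grid size this way before a run helps keep the search affordable, since each row corresponds to one model fit.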

Details

This function uses the grid search technique to tune models' parameters and return the optimal model.
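The idea behind grid search can be sketched without h2o. The example below is an illustration under stated assumptions, not APML's implementation: it tunes a single stand-in hyperparameter (the polynomial degree of a linear model) on simulated data, scores each candidate on a held-out test set, and sorts the candidates by MSE, which is analogous to the role of `sort_by`.

```r
set.seed(1)

# Simulated data: a quadratic signal plus noise (illustrative only)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 1.5 * x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(y = y, x = x)

# Random 50/50 split into training and testing data
idx <- sample(n, n / 2)
train <- dat[idx, ]
test <- dat[-idx, ]

# Candidate values for the hyperparameter being tuned
grid <- expand.grid(degree = 1:5)

# Fit one model per grid row and record its test-set MSE
grid$mse <- sapply(grid$degree, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
})

# Sort by the chosen metric; the first row is the selected model
grid <- grid[order(grid$mse), ]
best_degree <- grid$degree[1]
print(grid)
```

For metrics where larger is better, such as AUC, the sort order would be reversed (`decreasing = TRUE`).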

Value

`bestmodel`	Best H2O model found by the grid search.
`train_metrics`	Model performance metrics based on the training data.
`test_metrics`	Model performance metrics based on the testing data.
`summary`	Summary of model performance.
`extra_metrics`	Model performance metrics based on the extra data. Only available when `extra_data` is supplied.

Note

This function is based on the h2o package, so h2o.init() must be called before using it. The response variable should be the first column.

References

LeDell E, Gill N, Aiello S, Fu A, Candel A, Click C, et al. 2019. h2o: R Interface for "H2O."

Zhang W, Du Z, Zhang D, Yu S, Hao Y. 2016a. Boosted regression tree model-based assessment of the impacts of meteorological drivers of hand, foot and mouth disease in Guangdong, China. Sci Total Environ 553; doi:10.1016/j.scitotenv.2016.02.023.

Examples

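library(APML)
library(h2o)
h2o.init()                       # start a local H2O cluster (required by APML)
set.seed(1)                      # make the train/test split reproducible

# Keep the two-class subset and move the response (Species) to column 1
data(iris)
iris <- iris[1:100, c(5, 1:4)]
iris$Species <- droplevels(iris$Species)   # drop the unused third level
idx <- sample(100, 50)
traindata <- iris[idx, ]
testdata <- iris[-idx, ]
xcol <- names(iris)[2:5]         # predictor names

# Grid of candidate values; the best model is chosen by AUC
hyper <- list(ntrees = c(2, 3, 5))
results <- APML(xcol = xcol, hyper = hyper,
                traindata = traindata, testdata = testdata,
                sort_by = 'auc', distribution = 'bernoulli')
h2o.shutdown(prompt = FALSE)
Sys.sleep(2)                     # give the cluster time to shut down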

APML documentation built on May 12, 2022, 9:06 a.m.