get_grid_data: Makes a data.frame that contains the best error rates from a...


View source: R/get_grid_data.R

Description

get_grid_data creates a data.frame that has the datasets in the first column and the best error rate obtained in the grid search in the second column.

Usage

get_grid_data(
  path = ".",
  pattern = NULL,
  dataset = "Data",
  method = NULL,
  model_type = NULL
)

Arguments

path

File path to folder with the files that hold the grid results.

pattern

An optional character vector that can be used to select a subset of files in the folder.

dataset

Name of the dataset for which the grid search was done.

method

A character string that specifies the method used to create the grid. Choices are "svm", "gbm", "en", and "ada". This is added to the datasets to minimize ambiguity in downstream analysis.

model_type

Character string of either "binary" or "regression" that specifies the type of model. This is needed because some of the earlier grid searches had inconsistent loss scales.
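The way pattern narrows the set of files can be sketched with base R's list.files. This is only an illustration: whether get_grid_data calls list.files internally is an assumption, and the folder and file names below are hypothetical.

```r
# Create a throwaway folder holding hypothetical grid-result file names.
grid_dir <- tempfile("grid_results_")
dir.create(grid_dir)
file.create(file.path(
  grid_dir,
  c("sonar_svm_1.csv", "sonar_svm_2.csv", "sonar_gbm_1.csv")
))

# pattern is treated as a regular expression matched against file names.
svm_files <- list.files(path = grid_dir, pattern = "svm")

# pattern = NULL (the default) keeps every file in the folder.
all_files <- list.files(path = grid_dir, pattern = NULL)

print(svm_files)
print(all_files)
```

With files named like this, a pattern such as "svm" would restrict the results to one method, while the default of NULL reads everything in the folder.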

Value

Returns a list of data.frames that can be used to plot the grid surface or to find the best error rates. Some of the datasets in the list may have more observations than their percentage or count suggests, because the grid results contain a substantial number of ties. The datasets in the list are:

data

Complete dataset of grid results.

dat20loss

Dataset containing only the best 20% in terms of loss (classification error, AUC, MSE, or MAE).

dat10loss

Dataset containing only the best 10% in terms of loss.

dat5loss

Dataset containing only the best 5% in terms of loss.

dat1loss

Dataset containing only the best 1% in terms of loss.

dat20time

Dataset containing only the best 20% in terms of computation time.

dat10time

Dataset containing only the best 10% in terms of computation time.

dat5time

Dataset containing only the best 5% in terms of computation time.

dat1time

Dataset containing only the best 1% in terms of computation time.

top20loss

Twenty grid locations with the best loss.

top20time

Twenty grid locations with the best computation times.
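The effect of ties on the size of these subsets can be illustrated with a small base R sketch. The cutoff rule used here (keep every row at or below the empirical quantile) is an assumption and may not match what get_grid_data does internally; the data and the helper best_fraction are hypothetical.

```r
# Mock grid results; column names follow the variables documented on this page.
grid <- data.frame(
  Cost  = rep(c(1, 10, 100, 1000, 10000), times = 2),
  Gamma = rep(c(-5, -1), each = 5),
  Loss  = c(0.10, 0.10, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40),
  Time  = c(1.2, 0.8, 2.5, 0.8, 3.1, 0.5, 4.0, 1.9, 2.2, 0.9)
)

# ASSUMED rule: keep every row whose value is at or below the quantile cutoff.
# Rows tied at the cutoff are all kept, so the subset can exceed the fraction.
best_fraction <- function(dat, column, fraction) {
  cutoff <- quantile(dat[[column]], probs = fraction, names = FALSE, type = 1)
  dat[dat[[column]] <= cutoff, , drop = FALSE]
}

dat20loss <- best_fraction(grid, "Loss", 0.20)

# 20% of 10 rows would be 2, but three rows tie at Loss = 0.10.
print(nrow(dat20loss))
```

The same helper applied to the Time column would give a subset analogous to dat20time.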

Each of the datasets has the following variables:

Data

Name of the dataset used to create the grid.

Method

Method used for EZtune. Should be "svm", "gbm", "en", or "ada". The value is stored exactly as it was entered in the method argument.

Tuning_Variables

These fields contain the tuning variables for the method. For svm they are Cost and Gamma (note that Gamma is really log2(Gamma)), plus Epsilon for regression models; for gbm they are NumTrees, MinNode, Shrinkage, and IntDepth; for en they are Alpha and logLambda; for ada they are Nu, Iter, and MaxDepth.

Loss

The loss measure used in the grid search. It is typically classification error or MSE, but it can be AUC or MAE as well. It is computed using 10-fold cross validation.

LossUCL

A measure of stability for the Loss measure. The test losses from the individual folds of the 10-fold cross validation were used to compute a 95% upper confidence limit for the loss. If it differs substantially from Loss, the results for the model with those tuning parameters are unstable.

Time

Computation time in seconds.
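A minimal sketch of how values like LossUCL and Gamma might be interpreted. The page does not give the formula behind LossUCL, so the one-sided t-based limit below is an assumption, as is treating Loss as the mean of the fold losses; the fold losses and the helper upper_limit are hypothetical.

```r
# Hypothetical per-fold test losses from one 10-fold cross validation run.
fold_loss <- c(0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.10, 0.12, 0.07, 0.08)

# ASSUMPTION: one-sided 95% upper confidence limit based on the t distribution.
upper_limit <- function(x, level = 0.95) {
  n <- length(x)
  mean(x) + qt(level, df = n - 1) * sd(x) / sqrt(n)
}

loss     <- mean(fold_loss)        # assumed definition of Loss
loss_ucl <- upper_limit(fold_loss) # assumed definition of LossUCL

# A large gap between the two would point to unstable tuning parameters.
print(loss_ucl - loss)

# Gamma is documented as log2(Gamma), so the raw value is recovered as 2^Gamma.
gamma_log2 <- -5
gamma_raw  <- 2^gamma_log2
print(gamma_raw)
```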

See Also

grid_search, eztune_table, get_best_grid, grid_plot


jillbo1000/EZtuneTest documentation built on Oct. 5, 2021, 4:16 p.m.