Nested k-fold cross validation with blkbox.

Description

A function that builds upon blkbox to perform nested k-fold cross validation. It provides the votes for each fold, the importance of each feature in the models, and feature importance tables and details for each inner and outer fold run.
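The nesting described above can be pictured with a short base R sketch. It only tracks which samples land in which fold; the helper name nested_cv_sizes is hypothetical and the real blkboxNCV internals may differ.

```r
# Structural sketch only, using base R; not the blkbox implementation.
nested_cv_sizes <- function(n, outerfolds = 5, innerfolds = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign every sample to one outer fold.
  outer_id <- sample(rep_len(seq_len(outerfolds), n))
  out <- list()
  for (o in seq_len(outerfolds)) {
    holdout <- which(outer_id == o)   # kept aside until the inner loop is done
    train   <- which(outer_id != o)
    # Split the remaining samples again for the inner loop.
    inner_id <- sample(rep_len(seq_len(innerfolds), length(train)))
    for (i in seq_len(innerfolds)) {
      inner_test  <- train[inner_id == i]
      inner_train <- train[inner_id != i]
      # Feature selection and model fitting on inner_train would happen here.
      out[[length(out) + 1]] <- data.frame(
        outer = o, inner = i,
        n_inner_train = length(inner_train),
        n_inner_test  = length(inner_test),
        n_holdout     = length(holdout)
      )
    }
  }
  do.call(rbind, out)
}

sizes <- nested_cv_sizes(n = 150)
nrow(sizes)   # 25 inner runs: 5 outer folds x 5 inner folds
```

With 150 samples and the default fold counts, each holdout contains 30 samples and each inner run trains on 96 of the remaining 120.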

Usage

blkboxNCV(data, labels, outerfolds = 5, innerfolds = 5, ntrees, mTry,
  Kernel, Gamma, max.depth, xgtype = "binary:logistic", exclude = c(0),
  inn.exclude, Method = "GLM", AUC = 0.5, metric = c("ERR", "AUROC",
  "ACC", "MCC", "F-1"), seed)

Arguments

data

A data.frame where the columns correspond to features and the rows are samples. The dataframe will be shuffled and split into k folds for downstream analysis.
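As a rough illustration of the shuffle-and-split step, the following base R sketch deals shuffled row indices into k folds. The helper make_folds is hypothetical and is not part of blkbox; the package may assign folds differently.

```r
# Hypothetical helper: assign each row index to one of k folds.
make_folds <- function(n, k, seed = 1) {
  set.seed(seed)
  shuffled <- sample(n)                     # random permutation of 1..n
  split(shuffled, rep_len(seq_len(k), n))   # deal indices into k near-equal folds
}

folds <- make_folds(n = nrow(iris), k = 5)
lengths(folds)   # each fold holds 30 of the 150 rows
```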

labels

A character or numeric vector of the class identifiers that each sample belongs to.

outerfolds

The number of folds in the first (outer) k-fold loop; this determines the number of holdouts. Default is 5.

innerfolds

The number of folds used in the internal feature selection cross validation, which runs before testing on the corresponding holdout. Default is 5.

ntrees

The number of trees used in the ensemble based learners (randomforest, bigrf, party, bartmachine). default = 500.

mTry

The number of features sampled at each node in the trees of ensemble based learners (randomforest, bigrf, party, bartmachine). default = sqrt(number of features).

Kernel

The type of kernel used in the support vector machine algorithm (linear, radial, sigmoid, polynomial). default = "linear".

Gamma

Advanced parameter that defines how far the influence of a single training example reaches. A low gamma will produce an SVM with softer boundaries; as Gamma increases, the boundaries will eventually become restricted to their singular support vector. default is 1/(ncol - 1).
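To make the effect of Gamma concrete, here is a small sketch that assumes the standard radial basis function form exp(-gamma * ||x - y||^2). The function rbf_kernel is written here for illustration only and is not a blkbox function.

```r
# Standard RBF kernel, written out by hand for illustration.
rbf_kernel <- function(x, y, gamma) exp(-gamma * sum((x - y)^2))

x <- c(0, 0)
y <- c(1, 1)   # squared distance between x and y is 2

low  <- rbf_kernel(x, y, gamma = 0.1)   # about 0.82: y still looks similar to x
high <- rbf_kernel(x, y, gamma = 10)    # about 2e-09: influence of x has vanished
```

With a small gamma, distant points still count as similar, which gives smoother boundaries. With a large gamma, similarity drops to almost zero a short distance away from each training example.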

max.depth

The maximum depth of each tree in the xgboost model. default is sqrt(ncol(data)).

xgtype

Either "binary:logistic" or "reg:linear", for logistic regression or linear regression respectively.

exclude

Removes certain algorithms from the analysis. For example, to exclude random forest, set exclude = "randomforest". Each algorithm has its own identifier: randomforest = "randomforest", knn = "kknn", bartmachine = "bartmachine", party = "party", glmnet = "GLM", pam = "PamR", nnet = "nnet", svm = "SVM", xgboost = "xgboost".

inn.exclude

Removes certain algorithms from the analysis performed after feature selection. Similar to 'exclude'. Defaults to excluding all but Method.

Method

The algorithm used for feature selection. The feature importance reported by the algorithm is used to rank features and remove anything below the AUC threshold. Defaults to "GLM", so the inner folds will use "GLM" only unless specified otherwise.

AUC

Area under the curve selection measure. The relative importance of each feature is calculated and ranked, and the features that account for the most importance are kept. The AUC value is the fraction of cumulative importance to retain: 0.5 keeps the highest ranked features responsible for 50 percent of the cumulative importance. default = 0.5. The default changes to 1.0 when Method = "xgboost".
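A minimal sketch of this kind of cumulative importance cut-off is given below. The function name select_by_cumulative_importance is hypothetical, and whether the feature that crosses the threshold is itself kept is an assumption made here; the package may treat that boundary differently.

```r
# Hypothetical helper: keep the top-ranked features that together reach
# the requested share of total importance.
select_by_cumulative_importance <- function(importance, AUC = 0.5) {
  ranked    <- sort(importance, decreasing = TRUE)
  cum_share <- cumsum(ranked) / sum(ranked)
  # Assumption: the feature that first reaches the threshold is kept too.
  n_keep <- which(cum_share >= AUC)[1]
  if (is.na(n_keep)) n_keep <- length(ranked)   # guard against rounding at AUC = 1
  names(ranked)[seq_len(n_keep)]
}

# Made-up importance scores for five features.
imp <- c(f1 = 40, f2 = 25, f3 = 20, f4 = 10, f5 = 5)

select_by_cumulative_importance(imp, AUC = 0.5)   # "f1" "f2"
select_by_cumulative_importance(imp, AUC = 0.9)   # "f1" "f2" "f3" "f4"
```

Here the cumulative shares are 0.40, 0.65, 0.85, 0.95 and 1.00, so a threshold of 0.5 is first reached at the second feature and a threshold of 0.9 at the fourth.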

metric

A character string to determine which performance metric will be passed on to the Performance() function. Refer to Performance() documentation. default = c("ERR", "AUROC", "ACC", "MCC", "F-1")

seed

A single numeric value that will determine all subsequent seeds set in NCV.

Author(s)

Zachary Davies, Boris Guennewig

Examples

blkboxNCV(data = my_data,
         labels = my_labels,
         Method = "randomforest",
         AUC = 0.9)
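The objects my_data and my_labels are not defined in the example above. One way they might be prepared, assuming a two-class problem built from the iris data that ships with R, is sketched here. The choice of dataset is purely illustrative.

```r
# Two-class subset of the built-in iris data, used purely for illustration.
two_class <- droplevels(iris[iris$Species != "setosa", ])

my_data   <- two_class[, 1:4]                  # rows are samples, columns are features
my_labels <- as.character(two_class$Species)   # one class identifier per sample

# The labels must line up with the rows of the data.
stopifnot(nrow(my_data) == length(my_labels))
table(my_labels)
```

This yields 100 samples with 50 in each class, which matches the shapes described under the data and labels arguments.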