# Nested k-fold cross validation with blkbox

### Description

A function that builds upon the blkbox and blkboxCV functions to perform nested k-fold cross validation. It reports the votes for each fold and the importance of each feature in the models, and provides feature importance tables and details for each inner and outer fold run.
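The nesting described above can be illustrated with a minimal base-R sketch. This is not the package's internal code: the sample count, the seed, and all variable names are hypothetical, and only `outerfolds` and `innerfolds` mirror documented arguments. It shows how each outer fold sets aside a holdout while the remaining samples are split again for the inner loop.

```r
# Base-R sketch of nested k-fold splitting; an illustration, not blkbox internals.
set.seed(1)            # hypothetical seed
n <- 20                # hypothetical number of samples
outerfolds <- 5
innerfolds <- 5

# Shuffle the sample indices and assign each one to an outer fold.
shuffled <- sample(n)
outer_id <- rep(seq_len(outerfolds), length.out = n)

for (i in seq_len(outerfolds)) {
  holdout  <- shuffled[outer_id == i]   # reserved for the final test
  training <- shuffled[outer_id != i]   # passed to the inner loop

  # Split the remaining samples again for the inner cross validation.
  inner_id <- rep(seq_len(innerfolds), length.out = length(training))
  for (j in seq_len(innerfolds)) {
    inner_test  <- training[inner_id == j]
    inner_train <- training[inner_id != j]
    # A model would be fitted on inner_train and scored on inner_test here.
    # The holdout never appears in either inner set.
    stopifnot(length(intersect(c(inner_test, inner_train), holdout)) == 0)
  }
}
```

Because the holdout is removed before the inner split, any feature selection done in the inner loop never sees the samples used for the final test of that outer fold.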

### Usage


### Arguments

| Argument | Description |
| --- | --- |
| `data` | A data.frame where the columns correspond to features and the rows to samples. The data.frame is shuffled and split into k folds for downstream analysis. |
| `labels` | A character or numeric vector of the class identifiers to which each sample belongs. |
| `outerfolds` | The number of folds in the outer k-fold loop, which determines the number of holdouts. Default is 5. |
| `innerfolds` | The number of folds in the inner feature selection cross validation that runs before testing on the corresponding holdout. Default is 5. |
| `ntrees` | The number of trees used in the ensemble-based learners (randomforest, bigrf, party, bartmachine). Default is 500. |
| `mTry` | The number of features sampled at each node in the trees of the ensemble-based learners (randomforest, bigrf, party, bartmachine). Default is sqrt(number of features). |
| `Kernel` | The type of kernel used in the support vector machine algorithm (linear, radial, sigmoid, polynomial). Default is "linear". |
| `Gamma` | Advanced parameter that defines how far the influence of a single training example reaches. A low Gamma produces an SVM with softer boundaries; as Gamma increases, the boundaries eventually become restricted to their individual support vectors. Default is 1/(ncol - 1). |
| `max.depth` | The maximum depth of the trees in the xgboost model. Default is sqrt(ncol(data)). |
| `xgtype` | Either "binary:logistic" or "reg:linear", for logistic regression or linear regression respectively. |
| `exclude` | Removes certain algorithms from the analysis. For example, to exclude random forest, set exclude = "randomforest". Each algorithm has its own identifier: randomforest = "randomforest", knn = "kknn", bartmachine = "bartmachine", party = "party", glmnet = "GLM", pam = "PamR", nnet = "nnet", svm = "SVM", xgboost = "xgboost". |
| `inn.exclude` | Removes certain algorithms from the analysis that follows feature selection, in the same way as `exclude`. Defaults to excluding all algorithms except `Method`. |
| `Method` | The algorithm used for feature selection. Its feature importance is used to rank the features and remove those below the AUC threshold. Defaults to "GLM", so the inner folds use only "GLM" unless specified otherwise. |
| `AUC` | Area under the curve selection measure. The relative importance of the features is calculated and ranked, and the AUC value is the fraction of cumulative importance used as the cutoff for keeping features. For example, 0.5 keeps the highest ranked features responsible for 50 percent of the cumulative importance. Default is 0.5; the default changes to 1.0 when Method = "xgboost". |
| `metric` | A character vector that determines which performance metrics are passed on to the Performance() function. Refer to the Performance() documentation. Default is c("ERR", "AUROC", "ACC", "MCC", "F-1"). |
| `seed` | A single numeric value that determines all subsequent seeds set in NCV. |
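The cumulative importance cutoff described for `AUC` can be sketched in base R. The importance scores and feature names below are made up, and the exact cutoff rule (here the feature that crosses the threshold is kept) is an assumption rather than a statement of how blkbox implements it.

```r
# Hypothetical importance scores; in practice these come from the fitted model.
importance <- c(f1 = 8, f2 = 1, f3 = 4, f4 = 2, f5 = 5)
AUC <- 0.5

# Rank features from most to least important and convert to relative shares.
ranked   <- sort(importance, decreasing = TRUE)
relative <- ranked / sum(ranked)

# Keep the top-ranked features until their cumulative share reaches AUC.
# Assumption: the feature that crosses the threshold is included.
n_keep <- which(cumsum(relative) >= AUC)[1]
kept   <- names(ranked)[seq_len(n_keep)]
print(kept)
```

With these scores the two highest ranked features, `f1` and `f5`, together account for 65 percent of the total importance, so they are the ones retained at a cutoff of 0.5.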

### Author(s)

Zachary Davies, Boris Guennewig

### Examples
