blkboxCV: k-fold cross validation with blkbox.
In gboris/blkbox: Data Exploration with Multiple Machine Learning Algorithms


View source: R/blkboxCV.R

Builds on the blkbox function to perform k-fold cross validation, returning the votes for each fold as well as the importance of each feature in the models.


blkboxCV(data, labels, folds = 10, seed, ntrees, mTry, repeats = 1, Kernel,
  Gamma, max.depth, xgtype = "binary:logistic", exclude = c(0),
  Method = "GLM", AUC = "NA")

`data`	A data.frame where the columns correspond to features and the rows are samples. The dataframe will be shuffled and split into k folds for downstream analysis.
`labels`	A character or numeric vector of the class identifiers to which each sample belongs.
`folds`	The number of folds (k) the data set is split into. Each fold holds roughly (number of samples / k) samples; if the division leaves a remainder, the folds will be as close to the same size as possible. Each fold is used once as the holdout portion. default = 10.
`seed`	A numeric value. Defaults to a randomly generated set of seeds that are output when the run starts.
`ntrees`	The number of trees used in the ensemble based learners (randomforest, bigrf, party, bartmachine). default = 500.
`mTry`	The number of features sampled at each node in the trees of ensemble based learners (randomforest, bigrf, party, bartmachine). default = sqrt(number of features).
`repeats`	The number of times to repeat the cross validation process. default = 1.
`Kernel`	The type of kernel used in the support vector machine algorithm (linear, radial, sigmoid, polynomial). default = "linear".
`Gamma`	Advanced parameter that defines how far the influence of a single training example reaches. A low gamma produces an SVM with softer boundaries; as Gamma increases, the boundaries eventually become restricted to their singular support vector. default is 1/(ncol - 1).
`max.depth`	The maximum depth of the tree in the xgboost model. default is sqrt(ncol(data)).
`xgtype`	Either "binary:logistic" or "reg:linear", for logistic regression or linear regression respectively.
`exclude`	Removes certain algorithms from the analysis; for example, to exclude random forest, set exclude = "randomforest". Each algorithm has its own identifier: randomforest = "randomforest", knn = "kknn", bartmachine = "bartmachine", party = "party", glmnet = "GLM", pam = "PamR", nnet = "nnet", svm = "SVM", xgboost = "xgboost".
`Method`	The algorithm used for feature selection. The feature importance from this algorithm is used to rank features and remove anything below the AUC threshold. Default is "GLM".
`AUC`	Area under the curve selection measure. The relative importance of each feature is calculated and ranked, and the AUC value sets the share of cumulative importance to retain: 0.5 keeps the highest ranked features responsible for 50 percent of the cumulative importance. Default is NA, which means features are not selected after CV. Will default to 1.0 if Method is "xgboost".
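
The fold splitting described under `folds` and the cumulative importance cut-off described under `AUC` can be illustrated in base R. This is only a sketch under stated assumptions, not the package's implementation: the helper names `assign_folds` and `select_by_cumulative_importance` are hypothetical, the importance values are made up for illustration, and the feature that crosses the threshold is assumed to be kept.

```r
# Hypothetical helper (not part of blkbox): shuffle the sample indices and
# assign them to k folds whose sizes differ by at most one.
assign_folds <- function(n_samples, folds = 10, seed = 1) {
  set.seed(seed)
  shuffled <- sample(n_samples)                          # random permutation of 1..n
  fold_id <- rep(seq_len(folds), length.out = n_samples) # 1, 2, ..., k, 1, 2, ...
  split(shuffled, fold_id)                               # one holdout index vector per fold
}

# Hypothetical helper (not part of blkbox): rank features by importance and
# keep the top ones that together reach `auc` of the total importance.
# Assumption: the feature that crosses the threshold is included.
select_by_cumulative_importance <- function(importance, auc = 0.5) {
  ranked <- sort(importance, decreasing = TRUE)
  cum_share <- cumsum(ranked) / sum(ranked)
  n_keep <- which(cum_share >= auc)[1]                   # first position reaching the threshold
  names(ranked)[seq_len(n_keep)]
}

# 23 samples in 10 folds: three folds of 3 samples and seven folds of 2.
fold_list <- assign_folds(n_samples = 23, folds = 10)
fold_sizes <- lengths(fold_list)

# Illustrative importance values only.
importance <- c(f1 = 0.40, f2 = 0.25, f3 = 0.20, f4 = 0.10, f5 = 0.05)
kept <- select_by_cumulative_importance(importance, auc = 0.5)
```

With these illustrative values, `kept` holds `f1` and `f2`, since together they account for 65 percent of the total importance and `f1` alone (40 percent) does not reach the 0.5 threshold.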