hyperSMURF.cv: hyperSMURF cross-validation


Description

Automated cross validation of hyperSMURF (hyper-ensemble SMote Undersampled Random Forests)

Usage

hyperSMURF.cv(data, y, kk = 5, n.part = 10, fp = 1, ratio = 1,
              k = 5, ntree = 10, mtry = 5, cutoff = c(0.5, 0.5), thresh = FALSE,
              seed = 0, fold.partition = NULL, file = "")

Arguments

data

a data frame or matrix with the data

y

a factor with the labels. 0:majority class, 1: minority class.

kk

number of folds of the cross-validation (def. 5)

n.part

number of partitions into which the training data are split (def. 10)

fp

multiplicative factor for the SMOTE oversampling of the minority class. If fp < 1 no oversampling is performed (def. 1).

ratio

ratio of the number of majority class examples to the number of minority class examples (#majority/#minority) used when undersampling the majority class (def. 1)

k

number of nearest neighbours used for the SMOTE oversampling (def. 5)

ntree

number of trees of each base random forest (def. 10)

mtry

number of features randomly selected by the decision trees of each base random forest (def. 5)

cutoff

a numeric vector of length 2 with the cutoffs for, respectively, the majority and the minority class. This parameter is meaningful only when the thresholded version of hyperSMURF is used (thresh = TRUE)

thresh

logical. If TRUE the thresholded version of hyperSMURF is executed (def. FALSE)

seed

initialization seed for the random number generator. If set to 0 (def.) no initialization is performed

fold.partition

vector of size nrow(data) with values in the interval [0, kk). The values indicate the fold of the cross-validation to which each example is assigned. If NULL (default) the folds are randomly generated (a sketch of how such a vector can be constructed follows this list).

file

name of the file where the cross-validated hyperSMURF models will be saved. If file=="" (def.) no model is saved.
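
A fold.partition vector can be built by assigning to each example a fold index between 0 and kk - 1. The sketch below is illustrative only (my.folds is a hypothetical variable name) and uses a plain random assignment:

kk <- 5
# one fold index in 0 .. kk-1 per row of the data matrix
my.folds <- sample(0:(kk - 1), nrow(data), replace = TRUE)
# my.folds can then be passed as the fold.partition argument of hyperSMURF.cv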

Details

The cross-validation is performed by randomly constructing the folds (fold.partition = NULL) or by using the set of predefined folds listed in the parameter fold.partition. At each step of the cross-validation the base random forests are trained and tested in sequence: for each training set a separate random forest is trained on each of the n.part partitions of the data, by oversampling the minority class (parameter fp) and undersampling the majority class (parameter ratio). The random forest parameters ntree and mtry are the same for all the random forests of the hyper-ensemble.
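
The overall procedure roughly corresponds to the following sketch. It is illustrative only and assumes that hyperSMURF.train and hyperSMURF.test accept the arguments shown; refer to their help pages for the actual signatures:

# illustrative sketch of the cross-validation loop, not the actual implementation
scores <- numeric(nrow(data))
for (f in 0:(kk - 1)) {
  test.idx  <- which(fold.partition == f)
  train.idx <- setdiff(seq_len(nrow(data)), test.idx)
  # a separate random forest is trained on each of the n.part partitions
  HSmodel <- hyperSMURF.train(data[train.idx, ], y[train.idx], n.part = n.part,
                              fp = fp, ratio = ratio, k = k,
                              ntree = ntree, mtry = mtry)
  # the held-out fold is scored with the trained hyper-ensemble
  scores[test.idx] <- hyperSMURF.test(data[test.idx, ], HSmodel)
}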

Value

a vector with the cross-validated hyperSMURF probabilities (hyperSMURF scores).

References

M. Schubach, M. Re, P.N. Robinson and G. Valentini. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Scientific Reports, Nature Publishing Group, 7:2959, 2017.

See Also

hyperSMURF.train, hyperSMURF.test

Examples

d <- imbalanced.data.generator(n.pos = 10, n.neg = 300, sd = 0.3)
res <- hyperSMURF.cv(d$data, d$labels, kk = 2, n.part = 3, fp = 1, ratio = 1, k = 3,
                     ntree = 7, mtry = 2, seed = 1, fold.partition = NULL)

Example output

Creating new folds
Starting training on Fold  1 ...
Training of ensemble  1 done.
Training of ensemble  2 done.
Training of ensemble  3 done.
Starting test on Fold  1 ...
End test on Fold  1 .
Fold  1  done -----
Starting training on Fold  2 ...
Training of ensemble  1 done.
Training of ensemble  2 done.
Training of ensemble  3 done.
Starting test on Fold  2 ...
End test on Fold  2 .
Fold  2  done -----
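
The cross-validated scores res can be compared with the true labels, for instance through a confusion table at a chosen threshold (0.5 here, purely illustrative):

# confusion table of the hyperSMURF scores at an illustrative 0.5 threshold
table(predicted = ifelse(res > 0.5, 1, 0), true = d$labels)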
