hyperSMURF.cv: hyperSMURF cross-validation


Description

Automated cross validation of hyperSMURF (hyper-ensemble SMote Undersampled Random Forests)

Usage

hyperSMURF.cv(data, y, kk = 5, n.part = 10, fp = 1, ratio = 1,
              k = 5, ntree = 10, mtry = 5, cutoff = c(0.5, 0.5), thresh = FALSE,
              seed = 0, fold.partition = NULL, file = "")

Arguments

data

a data frame or matrix with the data

y

a factor with the labels. 0:majority class, 1: minority class.

kk

number of folds of the cross-validation (def. 5)

n.part

number of partitions into which the training data are split (def. 10)

fp

multiplicative factor for the SMOTE oversampling of the minority class. If fp < 1 no oversampling is performed (def. 1).

ratio

ratio of the number of majority class examples to the number of minority class examples (#majority/#minority) used when undersampling the majority class (def. 1)

k

number of nearest neighbours used for the SMOTE oversampling (def. 5)

ntree

number of trees of each base random forest (def. 10)

mtry

number of features randomly selected by the decision trees of each base random forest (def. 5)

cutoff

a numeric vector of length 2 with the cutoffs for, respectively, the majority and the minority class. This parameter is meaningful only when the thresholded version of hyperSMURF is used (thresh = TRUE)

thresh

logical. If TRUE the thresholded version of hyperSMURF is executed (def. FALSE)

seed

initialization seed for the random number generator. If set to 0 (def.) no initialization is performed

fold.partition

vector of size nrow(data) with values in the interval [0, kk). The values indicate the fold of the cross-validation to which each example is assigned. If NULL (default) the folds are randomly generated (a sketch of how such a vector can be constructed follows this list).

file

name of the file where the cross-validated hyperSMURF models will be saved. If file=="" (def.) no model is saved.
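
A fold.partition vector can be built by assigning to each example a fold index between 0 and kk - 1. The sketch below is illustrative only (my.folds is a hypothetical variable name) and uses a plain random assignment:

kk <- 5
# one fold index in 0 .. kk-1 per row of the data matrix
my.folds <- sample(0:(kk - 1), nrow(data), replace = TRUE)
# my.folds can then be passed as the fold.partition argument of hyperSMURF.cv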

Details

The cross-validation is performed by randomly constructing the folds (fold.partition = NULL) or by using the set of predefined folds listed in the parameter fold.partition. At each step of the cross-validation the base random forests are trained and tested in sequence: for each training set a separate random forest is trained on each of the n.part partitions of the data, by oversampling the minority class (parameter fp) and undersampling the majority class (parameter ratio). The random forest parameters ntree and mtry are the same for all the random forests of the hyper-ensemble.
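
The overall procedure roughly corresponds to the following sketch. It is illustrative only and assumes that hyperSMURF.train and hyperSMURF.test accept the arguments shown; refer to their help pages for the actual signatures:

# illustrative sketch of the cross-validation loop, not the actual implementation
scores <- numeric(nrow(data))
for (f in 0:(kk - 1)) {
  test.idx  <- which(fold.partition == f)
  train.idx <- setdiff(seq_len(nrow(data)), test.idx)
  # a separate random forest is trained on each of the n.part partitions
  HSmodel <- hyperSMURF.train(data[train.idx, ], y[train.idx], n.part = n.part,
                              fp = fp, ratio = ratio, k = k,
                              ntree = ntree, mtry = mtry)
  # the held-out fold is scored with the trained hyper-ensemble
  scores[test.idx] <- hyperSMURF.test(data[test.idx, ], HSmodel)
}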

Value

a vector with the cross-validated hyperSMURF probabilities (hyperSMURF scores).

References

M. Schubach, M. Re, P.N. Robinson and G. Valentini. Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Scientific Reports, Nature Publishing Group, 7:2959, 2017.

See Also

hyperSMURF.train, hyperSMURF.test

Examples

d <- imbalanced.data.generator(n.pos = 10, n.neg = 300, sd = 0.3)
res <- hyperSMURF.cv(d$data, d$labels, kk = 2, n.part = 3, fp = 1, ratio = 1, k = 3,
                     ntree = 7, mtry = 2, seed = 1, fold.partition = NULL)

Example output

Creating new folds
Starting training on Fold  1 ...
Training of ensemble  1 done.
Training of ensemble  2 done.
Training of ensemble  3 done.
Starting test on Fold  1 ...
End test on Fold  1 .
Fold  1  done -----
Starting training on Fold  2 ...
Training of ensemble  1 done.
Training of ensemble  2 done.
Training of ensemble  3 done.
Starting test on Fold  2 ...
End test on Fold  2 .
Fold  2  done -----
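
The cross-validated scores res can be compared with the true labels, for instance through a confusion table at a chosen threshold (0.5 here, purely illustrative):

# confusion table of the hyperSMURF scores at an illustrative 0.5 threshold
table(predicted = ifelse(res > 0.5, 1, 0), true = d$labels)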
