The package provides functions to deal with binary classification problems in the presence of imbalanced classes. Synthetic balanced samples are generated according to ROSE (Menardi and Torelli, 2014). Functions that implement more traditional remedies for class imbalance are also provided, as well as different metrics to evaluate a learner's accuracy. These metrics are estimated by holdout, bootstrap or cross-validation methods.
The package pivots on function
ROSE, which generates synthetic balanced
samples and thus helps to strengthen the subsequent estimation of any binary classifier.
ROSE (Random Over-Sampling Examples) is a bootstrap-based technique
which aids the task of binary classification in the presence of rare classes.
It handles both continuous and categorical data by generating synthetic examples from
a conditional density estimate of the two classes.
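The idea of drawing synthetic examples from a class-conditional density estimate can be sketched as a smoothed bootstrap in base R. This is a minimal illustration, not the package's implementation: rose_sketch() is a hypothetical helper, it handles continuous predictors only, and the Gaussian kernel with a normal-reference bandwidth is an assumption of the sketch.

```r
# Simplified ROSE-style generator for continuous predictors only.
# rose_sketch() is a hypothetical helper, not part of the ROSE package.
rose_sketch <- function(X, y, n_new = nrow(X), p = 0.5) {
  X <- as.matrix(X)
  y <- factor(y)
  stopifnot(nlevels(y) == 2, nrow(X) == length(y), all(table(y) >= 2))
  d <- ncol(X)

  # 1. Choose the class of each synthetic example;
  #    the second factor level is drawn with probability p.
  new_y <- sample(levels(y), n_new, replace = TRUE, prob = c(1 - p, p))
  new_X <- matrix(NA_real_, n_new, d, dimnames = list(NULL, colnames(X)))

  for (k in levels(y)) {
    idx_new <- which(new_y == k)
    if (length(idx_new) == 0) next
    Xk  <- X[y == k, , drop = FALSE]
    n_k <- nrow(Xk)

    # 2. Per-variable bandwidth (normal-reference rule, assumed here).
    h <- (4 / ((d + 2) * n_k))^(1 / (d + 4)) * apply(Xk, 2, sd)

    # 3. Bootstrap rows of class k and perturb them with Gaussian noise,
    #    i.e. draw from a kernel density estimate centred on those rows.
    pick  <- sample.int(n_k, length(idx_new), replace = TRUE)
    noise <- matrix(rnorm(length(idx_new) * d), length(idx_new), d)
    new_X[idx_new, ] <- Xk[pick, , drop = FALSE] + sweep(noise, 2, h, "*")
  }

  data.frame(new_X, cls = factor(new_y, levels = levels(y)))
}

# Toy imbalanced data: 190 majority and 10 minority observations.
set.seed(1)
X <- cbind(x1 = rnorm(200), x2 = rnorm(200))
y <- rep(c(0, 1), times = c(190, 10))

synth <- rose_sketch(X, y, n_new = 200)
table(synth$cls)
```

With p = 0.5 the two classes appear in roughly equal proportion in the synthetic sample, whatever the imbalance of the original data.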
Different metrics to evaluate a learner's accuracy are supplied by functions roc.curve and accuracy.meas.
Holdout, bootstrap or cross-validation estimators of these accuracy metrics are
computed by means of ROSE and provided by function
ROSE.eval, to be used in
conjunction with virtually any binary classifier.
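To make the notion of an accuracy metric for imbalanced data concrete, the sketch below computes precision, recall and the F measure from predicted scores in base R. The helper precision_recall_f() and the 0.5 threshold are assumptions made for illustration; in practice the package's own functions should be preferred.

```r
# Hypothetical helper for illustration; not a function of the ROSE package.
precision_recall_f <- function(response, predicted,
                               threshold = 0.5, positive = 1) {
  pred_pos <- predicted > threshold   # predicted as the rare class
  is_pos   <- response == positive    # truly in the rare class

  tp <- sum(pred_pos & is_pos)        # true positives
  fp <- sum(pred_pos & !is_pos)       # false positives
  fn <- sum(!pred_pos & is_pos)       # false negatives

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f         <- 2 * precision * recall / (precision + recall)

  c(precision = precision, recall = recall, F = f)
}

# Toy example: 3 true positives, 2 false positives, 1 false negative.
response  <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
predicted <- c(0.1, 0.2, 0.7, 0.3, 0.4, 0.6, 0.8, 0.9, 0.35, 0.55)

m <- precision_recall_f(response, predicted)
print(m)
```

Unlike the overall error rate, these quantities focus on the rare class, which is why they are informative when classes are imbalanced.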
ovun.sample implements more traditional remedies
to the class imbalance, such as over-sampling the minority class, under-sampling the majority
class, or a combination of over- and under-sampling.
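The three resampling remedies can be tried on the hacide data shipped with the package. The following is a minimal sketch that assumes the ROSE package is installed; the target sizes passed to N are illustrative choices, not recommendations.

```r
library(ROSE)
data(hacide)  # provides hacide.train and hacide.test

# Class sizes in the imbalanced training set.
n_major <- max(table(hacide.train$cls))
n_minor <- min(table(hacide.train$cls))

# Over-sampling: resample the minority class until the data set has N rows.
over <- ovun.sample(cls ~ ., data = hacide.train, method = "over",
                    N = 2 * n_major, seed = 1)$data

# Under-sampling: drop majority examples until the data set has N rows.
under <- ovun.sample(cls ~ ., data = hacide.train, method = "under",
                     N = 2 * n_minor, seed = 1)$data

# Combination: N rows in total, minority class drawn with probability p.
both <- ovun.sample(cls ~ ., data = hacide.train, method = "both",
                    p = 0.5, N = nrow(hacide.train), seed = 1)$data

table(over$cls)
table(under$cls)
table(both$cls)
```

Over-sampling keeps every original observation but repeats minority examples, whereas under-sampling discards majority examples, so the two differ in how much information is retained.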
Nicola Lunardon, Giovanna Menardi, Nicola Torelli
Maintainer: Nicola Lunardon <firstname.lastname@example.org>
Lunardon, N., Menardi, G., and Torelli, N. (2014). ROSE: a Package for Binary Imbalanced Learning. R Journal, 6:82–92.
Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28:92–122.
# load the package
library(ROSE)

# loading data
data(hacide)

# check imbalance
table(hacide.train$cls)

# train logistic regression on imbalanced data
log.reg.imb <- glm(cls ~ ., data=hacide.train, family=binomial)

# use the trained model to predict test data
pred.log.reg.imb <- predict(log.reg.imb, newdata=hacide.test,
                            type="response")

# generate new balanced data by ROSE
hacide.rose <- ROSE(cls ~ ., data=hacide.train, seed=123)$data

# check (im)balance of new data
table(hacide.rose$cls)

# train logistic regression on balanced data
log.reg.bal <- glm(cls ~ ., data=hacide.rose, family=binomial)

# use the trained model to predict test data
pred.log.reg.bal <- predict(log.reg.bal, newdata=hacide.test,
                            type="response")

# check accuracy of the two learners by measuring auc
roc.curve(hacide.test$cls, pred.log.reg.imb)
roc.curve(hacide.test$cls, pred.log.reg.bal, add.roc=TRUE, col=2)

# determine bootstrap distribution of the AUC of logit models
# trained on ROSE balanced samples
# B has been reduced from 100 to 10 solely to save time
boot.auc.bal <- ROSE.eval(cls ~ ., data=hacide.train, learner=glm,
                          method.assess="BOOT",
                          control.learner=list(family=binomial),
                          trace=TRUE, B=10)

summary(boot.auc.bal)