Functions to deal with binary classification problems in the presence of imbalanced classes. Synthetic balanced samples are generated according to ROSE (Menardi and Torelli, 2014). Functions that implement more traditional remedies to the class imbalance are also provided, as well as different metrics to evaluate a learner's accuracy. These metrics are estimated by holdout, bootstrap or cross-validation methods.
The package pivots on function ROSE, which generates synthetic balanced
samples and thus helps to strengthen the subsequent estimation of any binary classifier.
ROSE (Random Over-Sampling Examples) is a bootstrap-based technique
which aids the task of binary classification in the presence of rare classes.
It handles both continuous and categorical data by generating synthetic examples from
a conditional density estimate of the two classes.
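The idea can be illustrated with a minimal base-R sketch of a smoothed bootstrap, restricted to continuous predictors. This is not the package's implementation: the helper name rose.sketch, the Gaussian kernel and the per-class normal-reference bandwidth are assumptions made for illustration, and categorical variables are ignored.

```r
# Hypothetical helper (not part of ROSE): smoothed bootstrap for
# continuous predictors x and a two-class label y.
rose.sketch <- function(x, y, n.new = nrow(x), p = 0.5) {
  x <- as.matrix(x)
  d <- ncol(x)
  classes <- sort(unique(y))
  # Draw the class of each synthetic example; the second class has probability p.
  y.new <- sample(classes, n.new, replace = TRUE, prob = c(1 - p, p))
  x.new <- matrix(NA_real_, n.new, d, dimnames = list(NULL, colnames(x)))
  for (k in classes) {
    idx.k <- which(y == k)
    n.k <- length(idx.k)
    new.k <- which(y.new == k)
    if (length(new.k) == 0) next
    # Assumed bandwidth: normal-reference rule, one value per variable.
    h <- (4 / ((d + 2) * n.k))^(1 / (d + 4)) *
      apply(x[idx.k, , drop = FALSE], 2, sd)
    # Pick observed examples of class k at random, with replacement.
    seeds <- idx.k[sample.int(n.k, length(new.k), replace = TRUE)]
    # Perturb each picked example with Gaussian noise scaled by h.
    noise <- matrix(rnorm(length(new.k) * d), ncol = d) %*% diag(h, d)
    x.new[new.k, ] <- x[seeds, , drop = FALSE] + noise
  }
  data.frame(x.new, cls = y.new)
}

# Toy imbalanced data: 190 common cases and 10 rare ones.
set.seed(1)
x <- rbind(matrix(rnorm(190 * 2), ncol = 2),
           matrix(rnorm(10 * 2, mean = 2), ncol = 2))
colnames(x) <- c("x1", "x2")
y <- rep(c(0, 1), c(190, 10))

synth <- rose.sketch(x, y)
table(synth$cls)
```

With p = 0.5 the two classes appear in roughly equal numbers in the synthetic sample, whatever their proportion in the original data.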
Different metrics to evaluate a learner's accuracy are supplied by
functions roc.curve and accuracy.meas.
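As a rough guide to what such metrics measure, the following base-R sketch computes precision, recall, the F measure and the AUC by hand on made-up scores. The toy data, the strict "greater than threshold" rule and the rank-based AUC formula are assumptions of this sketch; the package functions may differ in detail, for instance in how thresholds and ties are handled.

```r
# Toy data: true labels (1 = rare class) and predicted scores.
response  <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 1)
predicted <- c(0.1, 0.2, 0.3, 0.6, 0.2, 0.4, 0.8, 0.4, 0.1, 0.7)

# Label a case as positive when its score exceeds the threshold.
threshold  <- 0.5
pred.class <- as.integer(predicted > threshold)

tp <- sum(pred.class == 1 & response == 1)  # true positives
fp <- sum(pred.class == 1 & response == 0)  # false positives
fn <- sum(pred.class == 0 & response == 1)  # false negatives

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f.measure <- 2 * precision * recall / (precision + recall)

# AUC from the ranks of the positive cases (Mann-Whitney statistic);
# tied scores count as one half.
n.pos <- sum(response == 1)
n.neg <- sum(response == 0)
auc <- (sum(rank(predicted)[response == 1]) - n.pos * (n.pos + 1) / 2) /
  (n.pos * n.neg)

c(precision = precision, recall = recall, F = f.measure, auc = auc)
```

Unlike overall accuracy, these quantities focus on the rare class, so they do not reward a learner that simply predicts the majority class everywhere.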
Holdout, bootstrap or cross-validation estimators of these accuracy metrics are
computed by means of ROSE and provided by function ROSE.eval, to be used in
conjunction with virtually any binary classifier.
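A cross-validation estimate might be requested along the following lines. This is a sketch that assumes the arguments method.assess = "LKOCV", K, subset and seed behave as described in ?ROSE.eval; the subset of rows is arbitrary and only keeps the run short.

```r
library(ROSE)
data(hacide)

# Leave-K-out cross-validation estimate of the AUC of a logit model
# trained on ROSE-balanced samples, using a subset of the training rows.
cv.auc <- ROSE.eval(cls ~ ., data = hacide.train, learner = glm,
                    method.assess = "LKOCV", K = 5,
                    subset = c(1:50, 981:1000),
                    control.learner = list(family = binomial),
                    seed = 1)
cv.auc
```

Any learner with a formula interface and a predict method could stand in for glm here, with its own arguments passed through control.learner.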
Additionally, function ovun.sample implements more traditional remedies
to the class imbalance, such as over-sampling the minority class, under-sampling the majority
class, or a combination of over- and under-sampling.
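The three remedies might be tried on the hacide training data as follows. The values of N are chosen from the class counts of that data set (980 common and 20 rare cases) and are illustrative only; see ?ovun.sample for the exact meaning of N and p.

```r
library(ROSE)
data(hacide)

# Over-sample the minority class until the sample has N rows in total.
over <- ovun.sample(cls ~ ., data = hacide.train, method = "over",
                    N = 1960, seed = 1)$data
table(over$cls)

# Under-sample the majority class until the sample has N rows in total.
under <- ovun.sample(cls ~ ., data = hacide.train, method = "under",
                     N = 40, seed = 1)$data
table(under$cls)

# Combine both: keep the original size and aim for a share p of rare cases.
both <- ovun.sample(cls ~ ., data = hacide.train, method = "both",
                    N = nrow(hacide.train), p = 0.5, seed = 1)$data
table(both$cls)
```

Over-sampling only duplicates existing rare cases and under-sampling discards common ones, whereas ROSE generates new synthetic examples.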
Nicola Lunardon, Giovanna Menardi, Nicola Torelli
Maintainer: Nicola Lunardon <lunardon@stat.unipd.it>
Lunardon, N., Menardi, G., and Torelli, N. (2014). ROSE: a Package for Binary Imbalanced Learning. R Journal, 6:82–92.
Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28:92–122.
# loading the package and the data
library(ROSE)
data(hacide)
# check imbalance
table(hacide.train$cls)
# train logistic regression on imbalanced data
log.reg.imb <- glm(cls ~ ., data=hacide.train, family=binomial)
# use the trained model to predict test data
pred.log.reg.imb <- predict(log.reg.imb, newdata=hacide.test,
                            type="response")
# generate new balanced data by ROSE
hacide.rose <- ROSE(cls ~ ., data=hacide.train, seed=123)$data
# check (im)balance of new data
table(hacide.rose$cls)
# train logistic regression on balanced data
log.reg.bal <- glm(cls ~ ., data=hacide.rose, family=binomial)
# use the trained model to predict test data
pred.log.reg.bal <- predict(log.reg.bal, newdata=hacide.test,
                            type="response")
# check accuracy of the two learners by measuring auc
roc.curve(hacide.test$cls, pred.log.reg.imb)
roc.curve(hacide.test$cls, pred.log.reg.bal, add.roc=TRUE, col=2)
# determine bootstrap distribution of the AUC of logit models
# trained on ROSE balanced samples
# B has been reduced from 100 to 10 solely to save time
boot.auc.bal <- ROSE.eval(cls ~ ., data=hacide.train, learner=glm,
                          method.assess="BOOT",
                          control.learner=list(family=binomial),
                          trace=TRUE, B=10)
summary(boot.auc.bal)
Loaded ROSE 0.0-3
  0   1
980  20

  0   1
507 493
Area under the curve (AUC): 0.803
Area under the curve (AUC): 0.915
Iteration:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Call:
ROSE.eval(formula = cls ~ ., data = hacide.train, learner = glm,
    method.assess = "BOOT", B = 10, control.learner = list(family = binomial),
    trace = TRUE)
Summary of bootstrap distribution of auc:
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.9274  0.9277  0.9278  0.9280  0.9282  0.9291