View source: R/data_balancing_funcs.R
Creates a sample of synthetic data by enlarging the feature space of minority and majority class examples. Operationally, the new examples are drawn from a conditional kernel density estimate of the two classes, as described in Menardi and Torelli (2014).
formula |
An object of class |
data |
An optional data frame, list or environment (or object
coercible to a data frame by |
N |
The desired sample size of the resulting data set generated by ROSE. If missing,
it is set equal to the length of the response variable in |
p |
The probability of the minority class examples in the resulting data set generated by ROSE. |
hmult.majo |
Optional shrink factor to be multiplied by the smoothing parameters to estimate the conditional kernel density of the majority class. See “References” and “Details”. |
hmult.mino |
Optional shrink factor to be multiplied by the smoothing parameters to estimate the conditional kernel density of the minority class. See “References” and “Details”. |
subset |
An optional vector specifying a subset of observations to be used in the sampling process.
The default is set by the |
na.action |
A function which indicates what should happen when the data contain 'NA's.
The default is set by the |
seed |
A single value, interpreted as an integer, used to set the random seed. Specifying it is recommended so that the generated sample can be tracked and reproduced. |
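As a minimal sketch of how the arguments listed above might be combined, the call below asks for a sample of 2000 rows with roughly 30% minority examples and halves both smoothing matrices. The specific values (N = 2000, p = 0.3, shrink factors of 0.5, seed = 1) are illustrative assumptions, not recommended settings; the hacide data are the ones used in the examples of this page.

```r
library(ROSE)
data(hacide)

# Illustrative settings only: larger sample, 30% minority, narrower kernels
res <- ROSE(cls ~ ., data = hacide.train,
            N = 2000, p = 0.3,
            hmult.majo = 0.5, hmult.mino = 0.5,
            seed = 1)

# Size of the generated data set and observed class proportions
nrow(res$data)
prop.table(table(res$data$cls))
```

Because class labels are generated with probability p, the observed minority share should be close to, but not necessarily equal to, 0.3.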
ROSE (Random Over-Sampling Examples) aids the task of binary classification in the presence of rare classes. It produces a synthetic, possibly balanced, sample of data simulated according to a smoothed-bootstrap approach.
Denote by y the binary response and by x a vector of numeric predictors observed on n subjects i (i = 1, …, n). Synthetic examples with class label k (k = 0, 1) are generated from a kernel estimate of the conditional density f(x|y = k). The kernel is a Normal product function centered at each of the x_i with diagonal covariance matrix H_k. Here, H_k is the asymptotically optimal smoothing matrix under the assumption of multivariate normality. See “References” below and further references therein.
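For reference, the kernel estimate described in this paragraph can be written as follows. This is a sketch in standard kernel density notation: the symbol n_k (the number of observations with y_i = k) and the notation K_{H_k} (the Normal product kernel with zero mean and diagonal covariance matrix H_k) are introduced here for illustration and do not appear in the text above.

```latex
\hat{f}(x \mid y = k) \;=\; \frac{1}{n_k} \sum_{i \,:\, y_i = k} K_{H_k}(x - x_i),
\qquad k \in \{0, 1\}
```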
Essentially, ROSE selects an observation belonging to the class k and generates new examples in its neighbourhood, where the width of the neighbourhood is determined by H_k. The user is allowed to shrink H_k by varying arguments hmult.majo and hmult.mino. The class balance is regulated by argument p, i.e. the probability of generating examples from class k=1.
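The select-then-perturb mechanism described above can be sketched in base R for a single class. This is not the package's internal code: the helper name rose_like_sample is hypothetical, and the bandwidth formula is the usual normal-reference rule, assumed here as one reading of "asymptotically optimal under multivariate normality", with the bandwidths used as standard deviations of the Gaussian noise.

```r
# Hypothetical helper (not part of ROSE): draw m synthetic points from the
# numeric predictors X of one class by a smoothed bootstrap.
rose_like_sample <- function(X, m, hmult = 1) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)
  # Assumed normal-reference bandwidths, one per predictor, scaled by hmult
  h <- hmult * (4 / ((d + 2) * n))^(1 / (d + 4)) * apply(X, 2, sd)
  # Step 1: select observations of the class at random, with replacement
  idx <- sample.int(n, size = m, replace = TRUE)
  # Step 2: add Gaussian noise whose spread per predictor is given by h
  noise <- matrix(rnorm(m * d), nrow = m, ncol = d) %*% diag(h, nrow = d)
  X[idx, , drop = FALSE] + noise
}

set.seed(1)
X_min <- cbind(x1 = rnorm(20, mean = 2), x2 = rnorm(20, mean = -1))
synth <- rose_like_sample(X_min, m = 200)
dim(synth)

# Class labels for N new examples would be drawn with probability p for class 1
y_new <- rbinom(100, size = 1, prob = 0.5)
```

Setting hmult to a value below 1 narrows the neighbourhood around each selected observation, and hmult = 0 reduces the sketch to plain resampling of the observed points.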
As they stand, kernel-based methods may be applied to continuous data only. However, as ROSE includes a combination of over- and under-sampling as a special case when H_k tends to zero, the assumption of continuity may be circumvented by using a degenerate kernel distribution to draw synthetic categorical examples. Basically, if the j-th component of x_i is categorical, a synthetic clone of x_i will have as its j-th component the same value as the j-th component of x_i.
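The treatment of categorical predictors can be illustrated with a small base R sketch, under the same assumptions as any hand-written approximation: the helper name rose_like_mixed is hypothetical, the bandwidths follow an assumed normal-reference rule, and the code is not taken from the package. Numeric columns receive Gaussian noise, while factor columns are copied unchanged from the selected observation.

```r
# Hypothetical helper (not part of ROSE): smoothed bootstrap for a data frame
# of one class that mixes numeric and categorical predictors.
rose_like_mixed <- function(df, m, hmult = 1) {
  num <- vapply(df, is.numeric, logical(1))
  n <- nrow(df)
  d <- sum(num)
  # Select observations at random; categorical values are cloned as they are
  idx <- sample.int(n, size = m, replace = TRUE)
  out <- df[idx, , drop = FALSE]
  rownames(out) <- NULL
  if (d > 0) {
    # Assumed normal-reference bandwidths for the numeric predictors only
    h <- hmult * (4 / ((d + 2) * n))^(1 / (d + 4)) *
      vapply(df[num], sd, numeric(1))
    for (j in names(h)) {
      out[[j]] <- out[[j]] + rnorm(m, mean = 0, sd = h[[j]])
    }
  }
  out
}

set.seed(2)
df <- data.frame(x1 = rnorm(10),
                 g  = factor(sample(c("a", "b"), 10, replace = TRUE)))
clones <- rose_like_mixed(df, m = 50)
table(clones$g)
```

Every value of g in clones is one that was observed in df, which corresponds to the degenerate kernel described above.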
The value is an object of class ROSE, which has the following components:
Call |
The matched call. |
method |
The method used to balance the sample. The only possible choice is |
data |
An object of class |
The purpose of ROSE is to generate new synthetic examples in the feature space. The use of formula is intended solely to distinguish the response variable from the predictors. Hence, formula must not be confused with the one supplied to fit a classifier, in which the specification of either transformations or interactions among variables may be sensible or necessary. In the current version, ROSE automatically discards any interactions and transformations of predictors specified in formula. The automatic parsing of formula handles virtually all the cases on which it has been tested, but users should take care when specifying entangled functions of predictors. Reports of possible malfunctioning of the parsing mechanism are welcome.
Lunardon, N., Menardi, G., and Torelli, N. (2014). ROSE: a Package for Binary Imbalanced Learning. R Journal, 6:82–92.
Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28:92–122.
# 2-dimensional example
# loading the package and the data
library(ROSE)
data(hacide)
# imbalance on training set
table(hacide.train$cls)
# imbalance on test set
table(hacide.test$cls)
# plot unbalanced data highlighting the majority and
# minority class examples.
par(mfrow=c(1,2))
plot(hacide.train[, 2:3], main="Unbalanced data", xlim=c(-4,4),
ylim=c(-4,4), col=as.numeric(hacide.train$cls), pch=20)
legend("topleft", c("Majority class","Minority class"), pch=20, col=1:2)
# model estimation using logistic regression
fit <- glm(cls~., data=hacide.train, family="binomial")
# prediction using test set
pred <- predict(fit, newdata=hacide.test)
roc.curve(hacide.test$cls, pred,
main="ROC curve \n (Half circle depleted data)")
# generating data according to ROSE: p=0.5 as default
data.rose <- ROSE(cls~., data=hacide.train, seed=3)$data
table(data.rose$cls)
par(mfrow=c(1,2))
# plot new data generated by ROSE highlighting the
# majority and minority class examples.
plot(data.rose[, 2:3], main="Balanced data by ROSE",
xlim=c(-6,6), ylim=c(-6,6), col=as.numeric(data.rose$cls), pch=20)
legend("topleft", c("Majority class","Minority class"), pch=20, col=1:2)
fit.rose <- glm(cls~., data=data.rose, family="binomial")
pred.rose <- predict(fit.rose, newdata=data.rose, type="response")
roc.curve(data.rose$cls, pred.rose,
main="ROC curve \n (Half circle depleted data balanced by ROSE)")
par(mfrow=c(1,1))
Loaded ROSE 0.0-3
0 1
980 20
0 1
245 5
Area under the curve (AUC): 0.804
0 1
480 520
Area under the curve (AUC): 0.901