View source: R/imbalanced.rfsrc.R
imbalanced.rfsrc | R Documentation |
Implements various solutions to the two-class imbalanced problem, including the newly proposed quantile-classifier approach of O'Brien and Ishwaran (2017). Also includes Breiman's balanced random forests undersampling of the majority class. Performance is assesssed using the G-mean, but misclassification error can be requested.
## S3 method for class 'rfsrc'
imbalanced(formula, data, ntree = 3000,
method = c("rfq", "brf", "standard"), splitrule = "auc",
perf.type = NULL, block.size = NULL, fast = FALSE,
ratio = NULL, ...)
formula |
A symbolic description of the model to be fit. |
data |
A data frame containing the two-class y-outcome and x-variables. |
ntree |
Number of trees to grow. |
method |
Method used to fit the classifier. The default is |
splitrule |
Splitting rule used to grow trees. The default is |
perf.type |
Performance metric used for evaluating the classifier and computing downstream quantities such as VIMP. Defaults depend on the method: |
block.size |
Controls how the cumulative error rate is computed. If |
fast |
Logical. If |
ratio |
Optional and experimental. Specifies the proportion (between 0 and 1) of majority class cases to sample during RFQ training. Sampling is without replacement. Ignored for BRF. |
... |
Additional arguments passed to |
Imbalanced data, also known as the minority class problem, refers to two-class classification settings where the majority class significantly outnumbers the minority class. This function supports two approaches to address class imbalance:
The random forests quantile classifier (RFQ) proposed by O'Brien and Ishwaran (2017).
The balanced random forest (BRF) undersampling method of Chen et al. (2004).
By default, the performance metric used is the G-mean (Kubat et al., 1997), which balances sensitivity and specificity.
Handling of missing values: Missing data are not supported for BRF or when the ratio
option is specified. In these cases, records with missing values are removed prior to analysis.
Variable importance: Permutation-based VIMP is used by default in this setting, in contrast to anti-VIMP which is the default for other families. Empirical results suggest that permutation VIMP is more reliable in highly imbalanced settings.
Tree count recommendation: We recommend using a relatively
large value for ntree
in imbalanced problems to ensure stable
performance estimation, especially for G-mean. As a general guideline,
use at least five times the usual number of trees.
Performance metrics: A helper function, get.imbalanced.performance
, is provided for extracting classification performance summaries. The metric names are self-explanatory in most cases. Some key metrics include:
F1
: The harmonic mean of precision and recall.
F1mod
: The harmonic mean of sensitivity, specificity, precision, and negative predictive value.
F1gmean
: The average of F1 and G-mean.
F1modgmean
: The average of F1mod and G-mean.
A two-class random forest fit under the requested method and performance value.
Hemant Ishwaran and Udaya B. Kogalur
Chen, C., Liaw, A. and Breiman, L. (2004). Using random forest to learn imbalanced data. University of California, Berkeley, Technical Report 110.
Kubat, M., Holte, R. and Matwin, S. (1997). Learning when negative examples abound. Machine Learning, ECML-97: 146-153.
O'Brien R. and Ishwaran H. (2019). A random forests quantile classifier for class imbalanced data. Pattern Recognition, 90, 232-249
rfsrc
,
rfsrc.fast
## ------------------------------------------------------------
## use the breast data for illustration
## ------------------------------------------------------------
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
f <- as.formula(status ~ .)
##----------------------------------------------------------------
## default RFQ call
##----------------------------------------------------------------
o.rfq <- imbalanced(f, breast)
print(o.rfq)
## equivalent to:
## rfsrc(f, breast, rfq = TRUE, ntree = 3000,
## perf.type = "gmean", splitrule = "auc")
##----------------------------------------------------------------
## detailed output using customized performance function
##----------------------------------------------------------------
print(get.imbalanced.performance(o.rfq))
##-----------------------------------------------------------------
## RF using misclassification error with gini splitting
## ------------------------------------------------------------
o.std <- imbalanced(f, breast, method = "stand", splitrule = "gini")
##-----------------------------------------------------------------
## RF using G-mean performance with AUC splitting
## ------------------------------------------------------------
o.std <- imbalanced(f, breast, method = "stand", perf.type = "gmean")
## equivalent to:
## rfsrc(f, breast, ntree = 3000, perf.type = "gmean", splitrule = "auc")
##----------------------------------------------------------------
## default BRF call
##----------------------------------------------------------------
o.brf <- imbalanced(f, breast, method = "brf")
## equivalent to:
## imbalanced(f, breast, method = "brf", perf.type = "gmean")
##----------------------------------------------------------------
## BRF call with misclassification performance
##----------------------------------------------------------------
o.brf <- imbalanced(f, breast, method = "brf", perf.type = "misclass")
##----------------------------------------------------------------
## train/test example
##----------------------------------------------------------------
trn <- sample(1:nrow(breast), size = nrow(breast) / 2)
o.trn <- imbalanced(f, breast[trn,], importance = TRUE)
o.tst <- predict(o.trn, breast[-trn,], importance = TRUE)
print(o.trn)
print(o.tst)
print(100 * cbind(o.trn$impo[, 1], o.tst$impo[, 1]))
##----------------------------------------------------------------
##
## illustrates how to optimize threshold on training data
## improves Gmean for RFQ in many situations
##
##----------------------------------------------------------------
if (library("caret", logical.return = TRUE)) {
## experimental settings
n <- 2 * 5000
q <- 20
ir <- 6
f <- as.formula(Class ~ .)
## simulate the data, create minority class data
d <- twoClassSim(n, linearVars = 15, noiseVars = q)
d$Class <- factor(as.numeric(d$Class) - 1)
idx.0 <- which(d$Class == 0)
idx.1 <- sample(which(d$Class == 1), sum(d$Class == 1) / ir , replace = FALSE)
d <- d[c(idx.0,idx.1),, drop = FALSE]
## split data into train and test
trn.pt <- sample(1:nrow(d), size = nrow(d) / 2)
trn <- d[trn.pt, ]
tst <- d[setdiff(1:nrow(d), trn.pt), ]
## run rfq on training data
o <- imbalanced(f, trn)
## (1) default threshold (2) directly optimized gmean threshold
th.1 <- get.imbalanced.performance(o)["threshold"]
th.2 <- get.imbalanced.optimize(o)["threshold"]
## training performance
cat("-------- train performance ---------\n")
print(get.imbalanced.performance(o, thresh=th.1))
print(get.imbalanced.performance(o, thresh=th.2))
## test performance
cat("-------- test performance ---------\n")
pred.o <- predict(o, tst)
print(get.imbalanced.performance(pred.o, thresh=th.1))
print(get.imbalanced.performance(pred.o, thresh=th.2))
}
##----------------------------------------------------------------
## illustrates RFQ with and without SMOTE
##
## - simulation example using the caret R-package
## - creates imbalanced data by randomly sampling the class 1 data
## - use SMOTE from "imbalance" package to oversample the minority
##
##----------------------------------------------------------------
if (library("caret", logical.return = TRUE) &
library("imbalance", logical.return = TRUE)) {
## experimental settings
n <- 5000
q <- 20
ir <- 6
f <- as.formula(Class ~ .)
## simulate the data, create minority class data
d <- twoClassSim(n, linearVars = 15, noiseVars = q)
d$Class <- factor(as.numeric(d$Class) - 1)
idx.0 <- which(d$Class == 0)
idx.1 <- sample(which(d$Class == 1), sum(d$Class == 1) / ir , replace = FALSE)
d <- d[c(idx.0,idx.1),, drop = FALSE]
d <- d[sample(1:nrow(d)), ]
## define train/test split
trn <- sample(1:nrow(d), size = nrow(d) / 2, replace = FALSE)
## now make SMOTE training data
newd.50 <- mwmote(d[trn, ], numInstances = 50, classAttr = "Class")
newd.500 <- mwmote(d[trn, ], numInstances = 500, classAttr = "Class")
## fit RFQ with and without SMOTE
o.with.50 <- imbalanced(f, rbind(d[trn, ], newd.50))
o.with.500 <- imbalanced(f, rbind(d[trn, ], newd.500))
o.without <- imbalanced(f, d[trn, ])
## compare performance on test data
print(predict(o.with.50, d[-trn, ]))
print(predict(o.with.500, d[-trn, ]))
print(predict(o.without, d[-trn, ]))
}
##----------------------------------------------------------------
##
## illustrates effectiveness of blocked VIMP
##
##----------------------------------------------------------------
if (library("caret", logical.return = TRUE)) {
## experimental settings
n <- 1000
q <- 20
ir <- 6
f <- as.formula(Class ~ .)
## simulate the data, create minority class data
d <- twoClassSim(n, linearVars = 15, noiseVars = q)
d$Class <- factor(as.numeric(d$Class) - 1)
idx.0 <- which(d$Class == 0)
idx.1 <- sample(which(d$Class == 1), sum(d$Class == 1) / ir , replace = FALSE)
d <- d[c(idx.0,idx.1),, drop = FALSE]
## permutation VIMP for BRF with and without blocking
## blocked VIMP is a hybrid of Breiman-Cutler/Ishwaran-Kogalur VIMP
brf <- imbalanced(f, d, method = "brf", importance = "permute", block.size = 1)
brfB <- imbalanced(f, d, method = "brf", importance = "permute", block.size = 10)
## permutation VIMP for RFQ with and without blocking
rfq <- imbalanced(f, d, importance = "permute", block.size = 1)
rfqB <- imbalanced(f, d, importance = "permute", block.size = 10)
## compare VIMP values
imp <- 100 * cbind(brf$importance[, 1], brfB$importance[, 1],
rfq$importance[, 1], rfqB$importance[, 1])
legn <- c("BRF", "BRF-block", "RFQ", "RFQ-block")
colr <- rep(4,20+q)
colr[1:20] <- 2
ylim <- range(c(imp))
nms <- 1:(20+q)
par(mfrow=c(2,2))
barplot(imp[,1],col=colr,las=2,main=legn[1],ylim=ylim,names.arg=nms)
barplot(imp[,2],col=colr,las=2,main=legn[2],ylim=ylim,names.arg=nms)
barplot(imp[,3],col=colr,las=2,main=legn[3],ylim=ylim,names.arg=nms)
barplot(imp[,4],col=colr,las=2,main=legn[4],ylim=ylim,names.arg=nms)
}
##----------------------------------------------------------------
##
## confidence intervals for G-mean permutation VIMP using subsampling
##
##----------------------------------------------------------------
if (library("caret", logical.return = TRUE)) {
## experimental settings
n <- 1000
q <- 20
ir <- 6
f <- as.formula(Class ~ .)
## simulate the data, create minority class data
d <- twoClassSim(n, linearVars = 15, noiseVars = q)
d$Class <- factor(as.numeric(d$Class) - 1)
idx.0 <- which(d$Class == 0)
idx.1 <- sample(which(d$Class == 1), sum(d$Class == 1) / ir , replace = FALSE)
d <- d[c(idx.0,idx.1),, drop = FALSE]
## RFQ
o <- imbalanced(Class ~ ., d, importance = "permute", block.size = 10)
## subsample RFQ
smp.o <- subsample(o, B = 100)
plot(smp.o, cex.axis = .7)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.