subsample.rfsrc | R Documentation |
Use subsampling to calculate confidence intervals and standard errors for VIMP (variable importance). Applies to all families.
## S3 method for class 'rfsrc'
subsample(obj,
          B = 100,
          block.size = 1,
          importance,
          subratio = NULL,
          stratify = TRUE,
          performance = FALSE,
          performance.only = FALSE,
          joint = FALSE,
          xvar.names = NULL,
          bootstrap = FALSE,
          verbose = TRUE)
obj |
A forest grow object of class rfsrc. |
B |
Number of subsamples (or bootstrap iterations, if bootstrap = TRUE). |
block.size |
Number of trees in each block used when calculating VIMP. If VIMP is already included in the original grow object, that setting is used instead. |
importance |
Type of variable importance (VIMP) to compute. Choices are "anti", "permute", or "random". |
subratio |
Subsample size as a proportion of the original sample size. The default is approximately the inverse square root of the sample size. |
stratify |
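Logical. If TRUE, stratified subsampling is used (applies to classification, survival, and competing risk families). |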
performance |
Logical. If TRUE, inference is also provided for the OOB prediction error. |
performance.only |
Logical. If TRUE, only the prediction error is analyzed. |
joint |
Logical. If TRUE, joint VIMP is computed for the variables named in xvar.names. |
xvar.names |
Character vector specifying variables to be used for joint VIMP. If omitted, all variables are included. |
bootstrap |
Logical. If TRUE, the double bootstrap is used in place of subsampling. |
verbose |
Logical. If |
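To give a feel for the default subratio described above, the following sketch evaluates the inverse square root rule for a few sample sizes. It assumes the plain formula 1 / sqrt(n); the documentation only says the default is approximately this value, so the package may adjust it.

```r
## hypothetical sample sizes, chosen only for illustration
n <- c(100, 1000, 10000)

## plain inverse square root rule (assumed form of the default)
subratio <- 1 / sqrt(n)

## implied number of cases in each subsample
subsample.size <- round(n * subratio)

print(data.frame(n = n,
                 subratio = round(subratio, 4),
                 subsample.size = subsample.size))
```

Under this rule the subsamples are small relative to the data, which is one reason to raise subratio explicitly when the original sample is modest in size.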
This function applies subsampling (or optional double bootstrapping) to a previously trained forest to estimate standard errors and construct confidence intervals for variable importance (VIMP), as described in Ishwaran and Lu (2019). It also supports inference for the out-of-bag (OOB) prediction error via the performance = TRUE option. Joint VIMP for selected or all variables can be obtained using joint and xvar.names.
If the original forest does not include VIMP, it will be computed prior to subsampling. For repeated calls to subsample, it is recommended that VIMP be requested in the original rfsrc call. This not only avoids redundant computation, but also ensures consistency of the importance type (e.g., anti, permute, or random) and related parameters, which may otherwise be unclear. Note that permutation importance is not the default for most families.
Subsampled forests are constructed using the same tuning parameters as the original forest. While most settings are automatically recovered, certain advanced configurations (e.g., custom sampling schemes) may not be fully supported.
Both subsampled variance estimates (Politis and Romano, 1994) and delete-\(d\) jackknife variance estimates (Shao and Wu, 1989) are returned. The jackknife estimator tends to produce larger standard errors, offering a conservative bias correction, particularly for signal variables.
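The relationship between the two estimators can be illustrated with a toy base R sketch. It uses a sample mean as a stand-in for VIMP and assumes one common form of each estimator: the subsampling estimator rescales the spread of the subsampled statistic by b / n, and the delete-d jackknife (with d = n - b) measures spread about the full-sample estimate and rescales by b / (n - b). The scaling used internally by the package may differ; n, b, and K are arbitrary illustration values.

```r
## Toy illustration: the statistic is a sample mean standing in for VIMP
set.seed(1)
n <- 200                      # full sample size
b <- 50                       # subsample size (hypothetical choice)
K <- 500                      # number of subsamples
x <- rnorm(n, mean = 2, sd = 3)

theta.n <- mean(x)            # estimate from the full sample

## statistic recomputed on K subsamples drawn without replacement
theta.b <- replicate(K, mean(x[sample(n, b, replace = FALSE)]))

## subsampling variance: spread about the subsample average, rescaled by b / n
v.sub <- (b / n) * mean((theta.b - mean(theta.b))^2)

## delete-d jackknife: spread about the full-sample estimate,
## rescaled by b / (n - b)
v.jk <- (b / (n - b)) * mean((theta.b - theta.n)^2)

se <- c(subsample = sqrt(v.sub),
        jackknife = sqrt(v.jk),
        textbook  = sd(x) / sqrt(n))
print(round(se, 4))
```

With these definitions the jackknife variance can never be smaller than the subsampling variance: it adds the squared gap between the subsample average and the full-sample estimate, and it uses the larger scaling factor. This matches the statement that the jackknife tends to give larger standard errors.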
By default, stratified subsampling is used for classification, survival, and competing risk families:
For classification, strata correspond to class labels.
For survival and competing risks, strata include event type and censoring.
Stratification helps ensure representation of key outcome types and is especially important for small sample sizes. Overriding this behavior is discouraged. Note that stratification is not available for multivariate families, and caution should be exercised when subsampling in that context.
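A minimal base R sketch of stratified subsampling for a classification outcome is given below. The helper name stratified.idx, the class labels, and the subratio are hypothetical and chosen for illustration; this is not the package's internal sampler.

```r
set.seed(2)

## hypothetical imbalanced class labels: 180 of class "a", 20 of class "b"
y <- factor(rep(c("a", "b"), times = c(180, 20)))
subratio <- 0.25

## helper (hypothetical name): draw the same fraction within each stratum,
## keeping at least one case per stratum
stratified.idx <- function(strata, ratio) {
  by.stratum <- split(seq_along(strata), strata)
  unlist(lapply(by.stratum, function(idx) {
    size <- max(1, round(ratio * length(idx)))
    idx[sample.int(length(idx), size)]
  }), use.names = FALSE)
}

idx <- stratified.idx(y, subratio)
print(table(y[idx]))
```

Each class keeps one quarter of its cases (45 and 5 here), so the minority class is guaranteed to appear in every subsample. Simple random subsampling at a small subratio offers no such guarantee.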
The function extract.subsample can be used to retrieve detailed information from the subsample object. By default, returned VIMP values are standardized: for regression families, VIMP is divided by the variance of the response; for other families, no transformation is applied. To obtain raw (unstandardized) values, set standardize = FALSE. For expert users, the option raw = TRUE returns detailed internal output, including VIMP from each individual subsampled forest (constructed on a smaller sample size), which is used internally by plot.subsample.rfsrc to generate confidence intervals.
Printed and plotted outputs also standardize VIMP by default. This behavior can be disabled via standardize. The alpha option controls the confidence level and is preset in wrapper functions but can be adjusted by the user.
A list with the following key components:
rf |
Original forest grow object. |
vmp |
Variable importance values for grow forest. |
vmpS |
Variable importance subsampled values. |
subratio |
Subratio used. |
Hemant Ishwaran and Udaya B. Kogalur
Ishwaran, H. and Lu, M. (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine, 38:558-582.
Politis, D.N. and Romano, J.P. (1994). Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22(4):2031-2050.
Shao, J. and Wu, C.J. (1989). A general theory for jackknife variance estimation. The Annals of Statistics, 17(3):1176-1197.
holdout.vimp.rfsrc, plot.subsample.rfsrc, rfsrc, vimp.rfsrc
## ------------------------------------------------------------
## regression
## ------------------------------------------------------------
## training the forest
reg.o <- rfsrc(Ozone ~ ., airquality)
## default subsample call
reg.smp.o <- subsample(reg.o)
## plot confidence regions
plot.subsample(reg.smp.o)
## summary of results
print(reg.smp.o)
## joint vimp and confidence region for generalization error
reg.smp.o2 <- subsample(reg.o, performance = TRUE,
                        joint = TRUE, xvar.names = c("Day", "Month"))
plot.subsample(reg.smp.o2)
## now try the double bootstrap (slower)
reg.dbs.o <- subsample(reg.o, B = 25, bootstrap = TRUE)
print(reg.dbs.o)
plot.subsample(reg.dbs.o)
## standard error and confidence region for generalization error only
gerror <- subsample(reg.o, performance.only = TRUE)
plot.subsample(gerror)
## ------------------------------------------------------------
## classification
## ------------------------------------------------------------
## 3 non-linear, 15 linear, and 5 noise variables
if (library("caret", logical.return = TRUE)) {
  d <- twoClassSim(1000, linearVars = 15, noiseVars = 5)
  ## VIMP based on (default) misclassification error
  cls.o <- rfsrc(Class ~ ., d)
  cls.smp.o <- subsample(cls.o, B = 100)
  plot.subsample(cls.smp.o, cex.axis = .7)
  ## same as above, but with VIMP defined using normalized Brier score
  cls.o2 <- rfsrc(Class ~ ., d, perf.type = "brier")
  cls.smp.o2 <- subsample(cls.o2, B = 100)
  plot.subsample(cls.smp.o2, cex.axis = .7)
}
## ------------------------------------------------------------
## class-imbalanced data using RFQ classifier with G-mean VIMP
## ------------------------------------------------------------
if (library("caret", logical.return = TRUE)) {
  ## experimental settings
  n <- 1000
  q <- 20
  ir <- 6
  f <- as.formula(Class ~ .)
  ## simulate the data, create minority class data
  d <- twoClassSim(n, linearVars = 15, noiseVars = q)
  d$Class <- factor(as.numeric(d$Class) - 1)
  idx.0 <- which(d$Class == 0)
  idx.1 <- sample(which(d$Class == 1), sum(d$Class == 1) / ir, replace = FALSE)
  d <- d[c(idx.0, idx.1), , drop = FALSE]
  ## RFQ classifier
  oq <- imbalanced(Class ~ ., d, importance = TRUE, block.size = 10)
  ## subsample the RFQ-classifier
  smp.oq <- subsample(oq, B = 100)
  plot.subsample(smp.oq, cex.axis = .7)
}
## ------------------------------------------------------------
## survival
## ------------------------------------------------------------
data(pbc, package = "randomForestSRC")
srv.o <- rfsrc(Surv(days, status) ~ ., pbc)
srv.smp.o <- subsample(srv.o, B = 100)
plot(srv.smp.o)
## ------------------------------------------------------------
## competing risks
## target event is death (event = 2)
## ------------------------------------------------------------
if (library("survival", logical.return = TRUE)) {
  data(pbc, package = "survival")
  pbc$id <- NULL
  cr.o <- rfsrc(Surv(time, status) ~ ., pbc, splitrule = "logrankCR", cause = 2)
  cr.smp.o <- subsample(cr.o, B = 100)
  plot.subsample(cr.smp.o, target = 2)
}
## ------------------------------------------------------------
## multivariate
## ------------------------------------------------------------
if (library("mlbench", logical.return = TRUE)) {
  ## load the data and convert rm to a factor for a mixed multivariate outcome
  data(BostonHousing)
  bh <- BostonHousing
  bh$rm <- factor(round(bh$rm))
  o <- rfsrc(cbind(medv, rm) ~ ., bh)
  so <- subsample(o)
  plot.subsample(so)
  plot.subsample(so, m.target = "rm")
  ## generalization error
  gerror <- subsample(o, performance.only = TRUE)
  plot.subsample(gerror, m.target = "medv")
  plot.subsample(gerror, m.target = "rm")
}
## ------------------------------------------------------------
## largish data example - use rfsrc.fast for fast forests
## ------------------------------------------------------------
if (library("caret", logical.return = TRUE)) {
  ## largish data set
  d <- twoClassSim(1000, linearVars = 15, noiseVars = 5)
  ## use a subsampled forest with Brier score performance
  ## remember to set forest=TRUE for rfsrc.fast
  o <- rfsrc.fast(Class ~ ., d, ntree = 100,
                  forest = TRUE, perf.type = "brier")
  so <- subsample(o, B = 100)
  plot.subsample(so, cex.axis = .7)
}