rfsrc: Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC)
Fast OpenMP-parallel implementation of random forests (Breiman, 2001) for regression, classification, survival analysis (Ishwaran et al., 2008), competing risks (Ishwaran et al., 2014), multivariate outcomes (Segal and Xiao, 2011), unsupervised learning (Mantero and Ishwaran, 2021), quantile regression (Meinshausen, 2006; Zhang et al., 2019; Greenwald and Khanna, 2001), and imbalanced q-classification (O'Brien and Ishwaran, 2019).
The package supports both deterministic and randomized splitting rules (Geurts et al., 2006; Ishwaran, 2015) across all families. Multiple types of variable importance (VIMP) are available, including holdout VIMP and confidence regions (Ishwaran and Lu, 2019), for both individual and grouped variables. Variable selection can be performed using minimal depth (Ishwaran et al., 2010, 2011). Fast interfaces for missing data imputation are provided using several forest-based algorithms (Tang and Ishwaran, 2017).
Highlighted updates:
For variable selection, we recommend using VarPro, an R package for model-independent variable selection using rule-based variable priority. It supports regression, classification, survival analysis, and includes a new mode for unsupervised learning. See https://www.varprotools.org for more information.
For computational speed, the default VIMP method has changed from "permute" (Breiman-Cutler permutation) to "anti" (importance = "anti" or importance = TRUE). While faster, this may be less accurate in settings such as highly imbalanced classification. To revert to permutation VIMP, use importance = "permute".
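For example (a minimal sketch using the iris data):
## default: fast anti-split VIMP (importance = TRUE is the same as "anti")
print(rfsrc(Species ~ ., data = iris, importance = TRUE)$importance)
## revert to Breiman-Cutler permutation VIMP
print(rfsrc(Species ~ ., data = iris, importance = "permute")$importance)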
This is the main entry point to the randomForestSRC package. For more information on OpenMP support and the package as a whole, see package?randomForestSRC.
rfsrc(formula, data, ntree = 500,
mtry = NULL, ytry = NULL,
nodesize = NULL, nodedepth = NULL,
splitrule = NULL, nsplit = NULL,
importance = c(FALSE, TRUE, "none", "anti", "permute", "random"),
block.size = if (any(is.element(as.character(importance),
c("none", "FALSE")))) NULL else 10,
bootstrap = c("by.root", "none", "by.user"),
samptype = c("swor", "swr"), samp = NULL, membership = FALSE,
sampsize = if (samptype == "swor") function(x){x * .632} else function(x){x},
na.action = c("na.omit", "na.impute"), nimpute = 1,
ntime = 150, cause,
perf.type = NULL,
proximity = FALSE, distance = FALSE, forest.wt = FALSE,
xvar.wt = NULL, yvar.wt = NULL, split.wt = NULL, case.wt = NULL,
case.depth = FALSE,
forest = TRUE,
save.memory = FALSE,
var.used = c(FALSE, "all.trees", "by.tree"),
split.depth = c(FALSE, "all.trees", "by.tree"),
seed = NULL,
do.trace = FALSE,
statistics = FALSE,
...)
## convenient interface for growing a CART tree
rfsrc.cart(formula, data, ntree = 1, mtry = ncol(data), bootstrap = "none", ...)
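For example, a single CART-like tree (all variables tried at each split, no bootstrap) can be grown and displayed as follows (a minimal sketch using mtcars):
cart.obj <- rfsrc.cart(mpg ~ ., data = mtcars)
plot(get.tree(cart.obj, 1))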
Arguments
formula: A formula describing the model to fit. Interaction terms are not supported. If missing, unsupervised splitting is used.
data: Data frame containing the response and predictor variables.
ntree: Number of trees to grow.
mtry: Number of candidate variables randomly selected at each split. Defaults: regression uses p/3; all other families use sqrt(p), where p is the number of variables (values are rounded up).
ytry: Number of pseudo-response variables randomly selected for unsupervised splitting. Default is 1.
nodesize: Minimum terminal node size. Defaults: survival/competing risks (15), regression (5), classification (1), mixed/unsupervised (3).
nodedepth: Maximum tree depth. Ignored by default.
splitrule: Splitting rule. See Details.
nsplit: Number of random split points per variable. A value of 0 invokes deterministic splitting.
importance: Variable importance (VIMP) method. Choices: FALSE (default), TRUE (same as "anti"), "none", "anti", "permute", "random".
block.size: Controls frequency of cumulative error/VIMP updates. Default is 10, or NULL when importance is disabled.
bootstrap: Bootstrap method. Options: "by.root" (default), "none", "by.user".
samptype: Sampling type for by.root bootstrapping: "swor" (without replacement, default) or "swr" (with replacement).
samp: Bootstrap weights (only for bootstrap = "by.user").
membership: Return inbag and terminal node membership?
sampsize: Bootstrap sample size (used when bootstrap = "by.root"). Defaults to 63.2 percent of the sample size for "swor" and the full sample size for "swr"; may be a function of sample size or a number.
na.action: Missing data handling. "na.omit" (default) removes records with missing values; "na.impute" imputes the data.
nimpute: Number of iterations for internal imputation. If >1, OOB error rates may be optimistic.
ntime: For survival models: number or grid of time points used in ensemble estimation. If a vector is supplied, it is used as the time grid; if NULL or 0, all observed event times are used.
cause: For competing risks: event of interest (1 to J, the number of event types), or a vector of non-negative weights of length J over the events.
perf.type: Optional performance metric for prediction, VIMP, and error. Defaults to the family-specific metric.
proximity: Compute proximity matrix? Options: "inbag", "oob", "all", TRUE (equivalent to "inbag"), or FALSE.
distance: Compute pairwise distances between cases? Similar options as proximity.
forest.wt: Return forest weight matrix? Same options as proximity.
xvar.wt: Optional weights on x-variables for sampling at splits. Does not need to sum to 1. Defaults to uniform.
yvar.wt: Weights on response variables (for multivariate regression). Used when randomly selecting a subset of ytry responses for splitting.
split.wt: Weights applied to each variable's split statistic. Higher weight increases likelihood of splitting.
case.wt: Sampling weights for cases in the bootstrap. Higher values increase selection probability. See the class imbalance example.
case.depth: Return matrix recording depth of first split for each case? Default is FALSE.
forest: Save forest object for future prediction? Set to FALSE to save memory when prediction is not needed.
save.memory: Reduce memory usage by avoiding storage of prediction quantities. Recommended for large survival or competing risk forests.
var.used: Return variable usage statistics? Options: FALSE, "all.trees", "by.tree".
split.depth: Return minimal depth of splits for each variable? Options: FALSE, "all.trees", "by.tree".
seed: Integer seed for reproducibility (negative values only).
do.trace: Print progress updates approximately every do.trace seconds.
statistics: Return raw split statistics? Parse with stat.split.
...: Additional arguments passed to or from other methods.
Types of forests
The type of forest is automatically inferred from the outcome and formula. Supported forest types include:
Regression forests for continuous outcomes.
Classification forests for factor outcomes.
Multivariate forests for continuous, categorical, or mixed outcomes.
Unsupervised forests when no outcome is specified.
Survival forests for right-censored time-to-event data.
Competing risk forests for multi-event survival settings.
Splitting
Splitting rules are set using the splitrule option. Random splitting is invoked via splitrule = "random". Use nsplit to enable randomized splitting and improve speed; see Improving computational speed below.
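For example (a small sketch using mtcars; nsplit = 0 invokes deterministic splitting):
## deterministic splitting: all possible split points considered
o.det <- rfsrc(mpg ~ ., data = mtcars, nsplit = 0)
## randomized splitting: 5 candidate split points per variable
o.rand <- rfsrc(mpg ~ ., data = mtcars, nsplit = 5)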
Available splitting rules
Regression
"mse" (default): weighted mean squared error (Breiman et al., 1984).
"quantile.regr": quantile regression via check-loss; see quantreg.rfsrc.
"la.quantile.regr": local adaptive quantile regression.
Classification
"gini" (default): Gini index.
"auc": AUC-based splitting; appropriate for imbalanced data.
"entropy": entropy-based splitting.
Survival
"logrank" (default): log-rank splitting.
"bs.gradient": Brier score gradient splitting. Uses the 90th percentile of observed times by default, or set prob.
"logrankscore": log-rank score splitting.
Competing risks (see Ishwaran et al., 2014)
"logrankCR" (default): Gray's test-based weighted log-rank splitting.
"logrank": cause-specific weighted log-rank; use cause to target specific events.
Multivariate
Default: normalized composite splitting (Tang and Ishwaran, 2017).
"mahalanobis": Mahalanobis splitting with optional covariance matrix; for multivariate regression.
Unsupervised
Splitting uses pseudo-outcomes and the composite rule. See sidClustering for advanced unsupervised analysis.
Custom splitting
Custom rules can be defined in splitCustom.c. Up to 16 rules per family are allowed; invoke them with splitrule = "custom", "custom1", and so on. Recompiling the package is required.
Improving computational speed
See rfsrc.fast. Strategies include the following (a combined sketch follows the list):
Increase nodesize.
Set save.memory = TRUE for large survival or competing risk models.
Set block.size = NULL to avoid repeated cumulative error computation.
Use perf.type = "none" to disable VIMP and C-index calculations.
Set nsplit to a small integer (e.g., 1-10).
Reduce the bootstrap size with sampsize and samptype.
Set ntime to a coarse grid (e.g., 50) for survival models.
Pre-filter variables; use max.subtree for fast variable selection.
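For instance, several of these strategies can be combined in one call (illustrative settings only; tune values for your own data):
data(pbc, package = "randomForestSRC")
o.fast <- rfsrc(Surv(days, status) ~ ., pbc,
                ntree = 100, nodesize = 50, nsplit = 5, ntime = 50,
                perf.type = "none", block.size = NULL, save.memory = TRUE)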
Prediction Error
Error is computed using OOB data:
Regression: mean squared error.
Classification: misclassification rate, Brier score, AUC.
Survival: C-error = 1 - Harrell's concordance index.
If bootstrap = "none", OOB error is unavailable. Use predict.rfsrc for cross-validation error instead.
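For example (an illustrative sketch using a simple train/test split on mtcars):
## no bootstrap, so no OOB error is available from the grow call
set.seed(1)
train <- sample(nrow(mtcars), floor(2 * nrow(mtcars) / 3))
o <- rfsrc(mpg ~ ., mtcars[train, ], bootstrap = "none")
## test-set error computed by predict.rfsrc
print(predict(o, mtcars[-train, ])$err.rate)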
Variable Importance (VIMP)
VIMP methods:
"permute": permutation VIMP (Breiman-Cutler).
"random": randomized left/right assignment.
"anti" (default): anti-split assignment.
The block.size option controls granularity. For confidence intervals, see the subsample function; also see holdout.vimp for a more conservative variant.
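For example (a small sketch using the airquality data):
## permutation VIMP with per-tree cumulative error
o <- rfsrc(Ozone ~ ., data = airquality, importance = "permute", block.size = 1)
print(o$importance)
## subsampling-based confidence intervals for VIMP
plot(subsample(o))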
Multivariate Forests
Use rfsrc(Multivar(y1, ..., yd) ~ ., data) or rfsrc(cbind(y1, ..., yd) ~ ., data). Use get.mv.formula, get.mv.predicted, and get.mv.error for multivariate extraction.
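For example (a minimal bivariate regression sketch using mtcars):
mt.mv <- rfsrc(Multivar(mpg, hp) ~ ., data = mtcars)
## standardized OOB error for both responses
print(get.mv.error(mt.mv, standardize = TRUE))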
Unsupervised Forests
Use rfsrc(data = X) or rfsrc(Unsupervised(ytry) ~ ., data = X). Random subsets of ytry pseudo-responses are used for each of the mtry candidate variables. No performance metrics are computed.
Survival, Competing Risks
Survival: use Surv(time, status) ~ ., where status must be 0 (censored) or 1 (event).
Competing risks: status is 0 (censored) or 1 to J (event types). Use cause to target specific events.
Larger nodesize is typically needed for competing risks.
Missing data imputation
Use na.action = "na.impute". Setting nimpute > 1 iterates the algorithm, replacing missing values using OOB predictions. Observations or variables with all values missing are removed.
Allowable data types and factors
Variables must be numeric, integer, factor, or logical. Non-factors are coerced to numeric. For unordered factors, all complementary subsets are considered for splits.
Factor levels are mapped to ensure consistency across training/test data. Consider converting factors to numeric for high-dimensional settings.
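For example, one generic way to convert all factors in a data frame to numeric (illustrated on iris, whose Species column is a factor):
iris2 <- data.frame(lapply(iris, function(x) if (is.factor(x)) as.numeric(x) else x))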
Value
An object of class (rfsrc, grow) with the following components:
call: The original call to rfsrc.
family: The family used in the analysis.
n: Sample size after applying na.action.
ntree: Number of trees grown.
mtry: Number of variables randomly selected at each node.
nodesize: Minimum terminal node size.
nodedepth: Maximum depth allowed for each tree.
splitrule: Splitting rule used.
nsplit: Number of random split points.
yvar: Response values.
yvar.names: Character vector of response variable names.
xvar: Data frame of predictor variables.
xvar.names: Character vector of predictor variable names.
xvar.wt: Non-negative weights specifying the selection probability of each variable.
split.wt: Non-negative weights adjusting each variable's split statistic.
cause.wt: Weights for composite competing risk splitting.
leaf.count: Number of terminal nodes per tree. A value of 0 indicates a rejected tree (may occur with missing data); a value of 1 indicates a stump.
proximity: Proximity matrix indicating how often case pairs fall in the same terminal node.
forest: Forest object, returned if forest = TRUE. Required for prediction and most wrappers.
forest.wt: Forest weight matrix.
membership: Terminal node membership matrix (rows: trees; columns: cases).
inbag: Inbag count matrix (rows: trees; columns: cases).
var.used: Number of times each variable is used to split a node.
imputed.indv: Indices of individuals with missing values.
imputed.data: Imputed dataset with responses followed by predictors.
split.depth: Matrix or array recording minimal split depth of variables by case and tree.
node.stats: Split statistics, returned if statistics = TRUE; parse with stat.split.
err.rate: Cumulative OOB error rate.
err.block.rate: Cumulative error per ensemble block (size defined by block.size). If block.size = 1, error per tree.
importance: Variable importance (VIMP) for each predictor.
predicted: In-bag predicted values.
predicted.oob: Out-of-bag (OOB) predicted values.
class: (Classification) In-bag predicted class labels.
class.oob: (Classification) OOB predicted class labels.
regrOutput: (Multivariate) List of performance results for continuous outcomes.
clasOutput: (Multivariate) List of performance results for categorical outcomes.
survival: (Survival) In-bag survival functions.
survival.oob: (Survival) OOB survival functions.
chf: (Survival or competing risks) In-bag cumulative hazard function (CHF).
chf.oob: (Survival or competing risks) OOB cumulative hazard function.
time.interest: (Survival or competing risks) Unique sorted event times.
ndead: (Survival or competing risks) Total number of observed events.
cif: (Competing risks) In-bag cumulative incidence function (CIF) by cause.
cif.oob: (Competing risks) OOB cumulative incidence function by cause.
Values returned by the forest depend on the family:
Regression: predicted and predicted.oob are vectors of predicted values.
Classification: predicted and predicted.oob are matrices of class probabilities. VIMP and performance metrics are returned as a matrix with J+1 columns (J = number of classes); the first column ("all") gives unconditional results, and the remaining columns give class-conditional results.
Survival: predicted contains mortality estimates (Ishwaran et al., 2008), calibrated to the number of events expected under identical covariate profiles. Matrices of the survival function and CHF for each individual over time.interest are also returned.
Competing risks: predicted contains expected life years lost by cause (Ishwaran et al., 2014). Three-dimensional arrays for the CIF and cause-specific CHF, indexed by case, time, and event type, are also returned.
Multivariate: predictions, VIMP, and error rates are returned in regrOutput and clasOutput. Use get.mv.predicted, get.mv.vimp, and get.mv.error to extract results.
Author(s)
Hemant Ishwaran and Udaya B. Kogalur
References
Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. (1984). Classification and Regression Trees. Wadsworth, Belmont, California.
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Cutler A. and Zhao G. (2001). PERT-Perfect random tree ensembles. Comp. Sci. Statist., 33: 490-497.
Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40, 139-157.
Gray R.J. (1988). A class of k-sample tests for comparing the cumulative incidence of a competing risk, Ann. Statist., 16: 1141-1154.
Geurts, P., Ernst, D. and Wehenkel, L., (2006). Extremely randomized trees. Machine learning, 63(1):3-42.
Greenwald M. and Khanna S. (2001). Space-efficient online computation of quantile summaries. Proceedings of ACM SIGMOD, 30(2):58-66.
Harrell F.E. et al. (1982). Evaluating the yield of medical tests, J. Amer. Med. Assoc., 247:2543-2546.
Hothorn T. and Lausen B. (2003). On the exact distribution of maximally selected rank statistics, Comp. Statist. Data Anal., 43:121-137.
Ishwaran H. (2007). Variable importance in binary regression trees and forests, Electronic J. Statist., 1:519-537.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.
Ishwaran H., Kogalur U.B., Gorodeski E.Z, Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival forests for high-dimensional data. Stat. Anal. Data Mining, 4:115-132.
Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics, 15(4):757-773.
Ishwaran H. and Malley J.D. (2014). Synthetic learning machines. BioData Mining, 7:28.
Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.
Ishwaran H. and Lu M. (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine, 38, 558-582.
LeBlanc M. and Crowley J. (1993). Survival trees by goodness of split, J. Amer. Statist. Assoc., 88:457-467.
Lin Y. and Jeon Y. (2006). Random forests and adaptive nearest neighbors. J. Amer. Statist. Assoc., 101(474), 578-590.
Loh W.-Y and Shih Y.-S (1997). Split selection methods for classification trees, Statist. Sinica, 7:815-840.
Lu M., Sadiq S., Feaster D.J. and Ishwaran H. (2018). Estimating individual treatment effect in observational data using random forest methods. J. Comp. Graph. Statist., 27(1), 209-219.
Mantero A. and Ishwaran H. (2021). Unsupervised random forests. Statistical Analysis and Data Mining, 14(2):144-167.
Meinshausen N. (2006) Quantile regression forests, Journal of Machine Learning Research, 7:983-999.
Mogensen, U.B, Ishwaran H. and Gerds T.A. (2012). Evaluating random forests for survival analysis using prediction error curves, J. Statist. Software, 50(11): 1-23.
O'Brien R. and Ishwaran H. (2019). A random forests quantile classifier for class imbalanced data. Pattern Recognition, 90, 232-249.
Segal M.R. (1988). Regression trees for censored data, Biometrics, 44:35-47.
Segal M.R. and Xiao Y. (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):80-87.
Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363-377.
Zhang H., Zimmerman J., Nettleton D. and Nordman D.J. (2019). Random forest prediction intervals. The American Statistician. 4:1-5.
See Also
find.interaction.rfsrc, get.tree.rfsrc, holdout.vimp.rfsrc, imbalanced.rfsrc, impute.rfsrc, max.subtree.rfsrc, partial.rfsrc, plot.competing.risk.rfsrc, plot.rfsrc, plot.survival.rfsrc, plot.variable.rfsrc, predict.rfsrc, print.rfsrc, quantreg.rfsrc, rfsrc, rfsrc.anonymous, rfsrc.cart, rfsrc.fast, sidClustering.rfsrc, stat.split.rfsrc, subsample.rfsrc, synthetic.rfsrc, tune.rfsrc, vimp.rfsrc
Examples
##------------------------------------------------------------
## survival analysis
##------------------------------------------------------------
## veteran data
## randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran, block.size = 1)
## plot tree number 3
plot(get.tree(v.obj, 3))
## print results of trained forest
print(v.obj)
## plot results of trained forest
plot(v.obj)
## plot survival curves for first 10 individuals -- direct way
matplot(v.obj$time.interest, 100 * t(v.obj$survival.oob[1:10, ]),
xlab = "Time", ylab = "Survival", type = "l", lty = 1)
## plot survival curves for first 10 individuals
## using function "plot.survival"
plot.survival(v.obj, subset = 1:10)
## obtain Brier score using KM and RSF censoring distribution estimators
bs.km <- get.brier.survival(v.obj, cens.model = "km")$brier.score
bs.rsf <- get.brier.survival(v.obj, cens.model = "rfsrc")$brier.score
## plot the brier score
plot(bs.km, type = "s", col = 2)
lines(bs.rsf, type ="s", col = 4)
legend("topright", legend = c("cens.model = km", "cens.model = rfsrc"), fill = c(2,4))
## plot CRPS (continuous rank probability score) as function of time
## here's how to calculate the CRPS for every time point
trapz <- randomForestSRC:::trapz
time <- v.obj$time.interest
crps.km <- sapply(1:length(time), function(j) {
trapz(time[1:j], bs.km[1:j, 2] / diff(range(time[1:j])))
})
crps.rsf <- sapply(1:length(time), function(j) {
trapz(time[1:j], bs.rsf[1:j, 2] / diff(range(time[1:j])))
})
plot(time, crps.km, ylab = "CRPS", type = "s", col = 2)
lines(time, crps.rsf, type ="s", col = 4)
legend("bottomright", legend=c("cens.model = km", "cens.model = rfsrc"), fill=c(2,4))
## fast nodesize optimization for veteran data
## optimal nodesize in survival is larger than other families
## see the function "tune" for more examples
tune.nodesize(Surv(time,status) ~ ., veteran)
## Primary biliary cirrhosis (PBC) of the liver
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
print(pbc.obj)
## save.memory example for survival
## growing many deep trees creates memory issue without this option!
data(pbc, package = "randomForestSRC")
print(rfsrc(Surv(days, status) ~ ., pbc, splitrule = "random",
ntree = 25000, nodesize = 1, save.memory = TRUE))
##------------------------------------------------------------
## trees can be plotted for any family
## see get.tree for details and more examples
##------------------------------------------------------------
## survival where factors have many levels
data(veteran, package = "randomForestSRC")
vd <- veteran
vd$celltype=factor(vd$celltype)
vd$diagtime=factor(vd$diagtime)
vd.obj <- rfsrc(Surv(time,status)~., vd, ntree = 100, nodesize = 5)
plot(get.tree(vd.obj, 3))
## classification
iris.obj <- rfsrc(Species ~., data = iris)
plot(get.tree(iris.obj, 25, class.type = "bayes"))
plot(get.tree(iris.obj, 25, target = "setosa"))
plot(get.tree(iris.obj, 25, target = "versicolor"))
plot(get.tree(iris.obj, 25, target = "virginica"))
## ------------------------------------------------------------
## simple example of VIMP using iris classification
## ------------------------------------------------------------
## directly from trained forest
print(rfsrc(Species~.,iris,importance=TRUE)$importance)
## VIMP (and performance) use misclassification error by default
## but brier prediction error can be requested
print(rfsrc(Species~.,iris,importance=TRUE,perf.type="brier")$importance)
## example using vimp function (see vimp help file for details)
iris.obj <- rfsrc(Species ~., data = iris)
print(vimp(iris.obj)$importance)
print(vimp(iris.obj,perf.type="brier")$importance)
## example using hold out vimp (see holdout.vimp help file for details)
print(holdout.vimp(Species~.,iris)$importance)
print(holdout.vimp(Species~.,iris,perf.type="brier")$importance)
## ------------------------------------------------------------
## confidence interval for vimp using subsampling
## compare with holdout vimp
## ------------------------------------------------------------
## New York air quality measurements
o <- rfsrc(Ozone ~ ., data = airquality)
so <- subsample(o)
plot(so)
## compare with holdout vimp
print(holdout.vimp(Ozone ~ ., data = airquality)$importance)
##------------------------------------------------------------
## example of imputation in survival analysis
##------------------------------------------------------------
data(pbc, package = "randomForestSRC")
pbc.obj2 <- rfsrc(Surv(days, status) ~ ., pbc, na.action = "na.impute")
## same as above but iterate the missing data algorithm
pbc.obj3 <- rfsrc(Surv(days, status) ~ ., pbc,
na.action = "na.impute", nimpute = 3)
## fast way to impute data (no inference is done)
## see impute for more details
pbc.imp <- impute(Surv(days, status) ~ ., pbc, splitrule = "random")
##------------------------------------------------------------
## compare RF-SRC to Cox regression
## Illustrates C-error and Brier score measures of performance
## assumes "pec" and "survival" libraries are loaded
##------------------------------------------------------------
if (library("survival", logical.return = TRUE)
& library("pec", logical.return = TRUE)
& library("prodlim", logical.return = TRUE))
{
##prediction function required for pec
predictSurvProb.rfsrc <- function(object, newdata, times, ...){
ptemp <- predict(object,newdata=newdata,...)$survival
pos <- sindex(jump.times = object$time.interest, eval.times = times)
p <- cbind(1,ptemp)[, pos + 1]
if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
stop("Prediction failed")
p
}
## data, formula specifications
data(pbc, package = "randomForestSRC")
pbc.na <- na.omit(pbc) ##remove NA's
surv.f <- as.formula(Surv(days, status) ~ .)
pec.f <- as.formula(Hist(days,status) ~ 1)
## run cox/rfsrc models
## for illustration we use a small number of trees
cox.obj <- coxph(surv.f, data = pbc.na, x = TRUE)
rfsrc.obj <- rfsrc(surv.f, pbc.na, ntree = 150)
## compute bootstrap cross-validation estimate of expected Brier score
## see Mogensen, Ishwaran and Gerds (2012) Journal of Statistical Software
set.seed(17743)
prederror.pbc <- pec(list(cox.obj,rfsrc.obj), data = pbc.na, formula = pec.f,
splitMethod = "bootcv", B = 50)
print(prederror.pbc)
plot(prederror.pbc)
## compute out-of-bag C-error for cox regression and compare to rfsrc
rfsrc.obj <- rfsrc(surv.f, pbc.na)
cat("out-of-bag Cox Analysis ...", "\n")
cox.err <- sapply(1:100, function(b) {
if (b%%10 == 0) cat("cox bootstrap:", b, "\n")
train <- sample(1:nrow(pbc.na), nrow(pbc.na), replace = TRUE)
cox.obj <- tryCatch({coxph(surv.f, pbc.na[train, ])}, error=function(ex){NULL})
if (!is.null(cox.obj)) {
get.cindex(pbc.na$days[-train], pbc.na$status[-train], predict(cox.obj, pbc.na[-train, ]))
} else NA
})
cat("\n\tOOB error rates\n\n")
cat("\tRSF : ", rfsrc.obj$err.rate[rfsrc.obj$ntree], "\n")
cat("\tCox regression : ", mean(cox.err, na.rm = TRUE), "\n")
}
##------------------------------------------------------------
## competing risks
##------------------------------------------------------------
## WIHS analysis
## cumulative incidence function (CIF) for HAART and AIDS stratified by IDU
data(wihs, package = "randomForestSRC")
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3, ntree = 100)
plot.competing.risk(wihs.obj)
cif <- wihs.obj$cif.oob
Time <- wihs.obj$time.interest
idu <- wihs$idu
cif.haart <- cbind(apply(cif[,,1][idu == 0,], 2, mean),
apply(cif[,,1][idu == 1,], 2, mean))
cif.aids <- cbind(apply(cif[,,2][idu == 0,], 2, mean),
apply(cif[,,2][idu == 1,], 2, mean))
matplot(Time, cbind(cif.haart, cif.aids), type = "l",
lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3,
ylab = "Cumulative Incidence")
legend("topleft",
legend = c("HAART (Non-IDU)", "HAART (IDU)", "AIDS (Non-IDU)", "AIDS (IDU)"),
lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3, cex = 1.5)
## illustrates the various splitting rules
## illustrates event specific and non-event specific variable selection
if (library("survival", logical.return = TRUE)) {
## use the pbc data from the survival package
## events are transplant (1) and death (2)
data(pbc, package = "survival")
pbc$id <- NULL
## modified Gray's weighted log-rank splitting
## (equivalent to cause=c(1,1) and splitrule="logrankCR")
pbc.cr <- rfsrc(Surv(time, status) ~ ., pbc)
## log-rank cause-1 specific splitting and targeted VIMP for cause 1
pbc.log1 <- rfsrc(Surv(time, status) ~ ., pbc,
splitrule = "logrankCR", cause = c(1,0), importance = TRUE)
## log-rank cause-2 specific splitting and targeted VIMP for cause 2
pbc.log2 <- rfsrc(Surv(time, status) ~ ., pbc,
splitrule = "logrankCR", cause = c(0,1), importance = TRUE)
## extract VIMP from the log-rank forests: event-specific
## extract minimal depth from the Gray log-rank forest: non-event specific
var.perf <- data.frame(md = max.subtree(pbc.cr)$order[, 1],
vimp1 = 100 * pbc.log1$importance[ ,1],
vimp2 = 100 * pbc.log2$importance[ ,2])
print(var.perf[order(var.perf$md), ], digits = 2)
}
## ------------------------------------------------------------
## regression analysis
## ------------------------------------------------------------
## New York air quality measurements
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
## partial plot of variables (see plot.variable for more details)
plot.variable(airq.obj, partial = TRUE, smooth.lines = TRUE)
## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars)
## ------------------------------------------------------------
## regression with custom bootstrap
## ------------------------------------------------------------
ntree <- 25
n <- nrow(mtcars)
s.size <- n / 2
swr <- TRUE
samp <- randomForestSRC:::make.sample(ntree, n, s.size, swr)
o <- rfsrc(mpg ~ ., mtcars, bootstrap = "by.user", samp = samp)
## ------------------------------------------------------------
## classification analysis
## ------------------------------------------------------------
## iris data
iris.obj <- rfsrc(Species ~., data = iris)
## wisconsin prognostic breast cancer data
data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast, block.size=1)
plot(breast.obj)
## ------------------------------------------------------------
## big data set, reduce number of variables using simple method
## ------------------------------------------------------------
## use Iowa housing data set
data(housing, package = "randomForestSRC")
## original data contains lots of missing data, use fast imputation
## however see impute for other methods
housing2 <- impute(data = housing, fast = TRUE)
## run shallow trees to find variables that split any tree
xvar.used <- rfsrc(SalePrice ~., housing2, ntree = 250, nodedepth = 4,
var.used="all.trees", mtry = Inf, nsplit = 100)$var.used
## now fit forest using filtered variables
xvar.keep <- names(xvar.used)[xvar.used >= 1]
o <- rfsrc(SalePrice~., housing2[, c("SalePrice", xvar.keep)])
print(o)
## ------------------------------------------------------------
## imbalanced classification data
## see the "imbalanced" function for further details
##
## (a) use balanced random forests with undersampling of the majority class
## Specifically let n0, n1 be sample sizes for majority, minority
## cases. We sample 2 x n1 cases with majority, minority cases chosen
## with probabilities n1/n, n0/n where n=n0+n1
##
## (b) balanced random forests using "imbalanced"
##
## (c) q-classifier (RFQ) using "imbalanced"
##
## ------------------------------------------------------------
## Wisconsin breast cancer example
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
## balanced random forests - brute force
y <- breast$status
obdirect <- rfsrc(status ~ ., data = breast, nsplit = 10,
case.wt = randomForestSRC:::make.wt(y),
sampsize = randomForestSRC:::make.size(y))
print(obdirect)
print(get.imbalanced.performance(obdirect))
## balanced random forests - using "imbalanced"
ob <- imbalanced(status ~ ., data = breast, method = "brf")
print(ob)
print(get.imbalanced.performance(ob))
## q-classifier (RFQ) - using "imbalanced"
oq <- imbalanced(status ~ ., data = breast)
print(oq)
print(get.imbalanced.performance(oq))
## q-classifier (RFQ) - with auc splitting
oqauc <- imbalanced(status ~ ., data = breast, splitrule = "auc")
print(oqauc)
print(get.imbalanced.performance(oqauc))
## ------------------------------------------------------------
## unsupervised analysis
## ------------------------------------------------------------
## two equivalent ways to implement unsupervised forests
mtcars.unspv <- rfsrc(Unsupervised() ~., data = mtcars)
mtcars2.unspv <- rfsrc(data = mtcars)
## illustration of sidClustering for the mtcars data
## see sidClustering for more details
mtcars.sid <- sidClustering(mtcars, k = 1:10)
print(split(mtcars, mtcars.sid$cl[, 3]))
print(split(mtcars, mtcars.sid$cl[, 10]))
## ------------------------------------------------------------
## bivariate regression using Mahalanobis splitting
## also illustrates user specified covariance matrix
## ------------------------------------------------------------
if (library("mlbench", logical.return = TRUE)) {
## load boston housing data, specify the bivariate regression
data(BostonHousing)
f <- formula("Multivar(lstat, nox) ~.")
## Mahalanobis splitting
bh.mreg <- rfsrc(f, BostonHousing, importance = TRUE, splitrule = "mahalanobis")
## performance error and vimp
vmp <- get.mv.vimp(bh.mreg)
pred <- get.mv.predicted(bh.mreg)
## standardized error and vimp
err.std <- get.mv.error(bh.mreg, standardize = TRUE)
vmp.std <- get.mv.vimp(bh.mreg, standardize = TRUE)
## same analysis, but with user specified covariance matrix
sigma <- cov(BostonHousing[, c("lstat","nox")])
bh.mreg2 <- rfsrc(f, BostonHousing, splitrule = "mahalanobis", sigma = sigma)
}
## ------------------------------------------------------------
## multivariate mixed forests (nutrigenomic study)
## study effects of diet, lipids and gene expression for mice
## diet, genotype and lipids used as the multivariate y
## genes used for the x features
## ------------------------------------------------------------
## load the data (data is a list)
data(nutrigenomic, package = "randomForestSRC")
## assemble the multivariate y data
ydta <- data.frame(diet = nutrigenomic$diet,
genotype = nutrigenomic$genotype,
nutrigenomic$lipids)
## multivariate mixed forest call
## uses "get.mv.formula" for conveniently setting formula
mv.obj <- rfsrc(get.mv.formula(colnames(ydta)),
data.frame(ydta, nutrigenomic$genes),
importance=TRUE, nsplit = 10)
## print results for diet and genotype y values
print(mv.obj, outcome.target = "diet")
print(mv.obj, outcome.target = "genotype")
## extract standardized VIMP
svimp <- get.mv.vimp(mv.obj, standardize = TRUE)
## plot standardized VIMP for diet, genotype and lipid for each gene
boxplot(t(svimp), col = "bisque", cex.axis = .7, las = 2,
outline = FALSE,
ylab = "standardized VIMP",
main = "diet/genotype/lipid VIMP for each gene")
## ------------------------------------------------------------
## illustrates yvar.wt which sets the probability of selecting
## the response variables in multivariate regression
## ------------------------------------------------------------
## use mtcars: add fake responses
mult.mtcars <- cbind(mtcars, mtcars$mpg, mtcars$mpg)
names(mult.mtcars) = c(names(mtcars), "mpg2", "mpg3")
## noise up the fake responses
mult.mtcars$mpg2 <- sample(mtcars$mpg)
mult.mtcars$mpg3 <- sample(mtcars$mpg)
formula = as.formula(Multivar(mpg, mpg2, mpg3) ~ .)
## select 2 of the 3 responses randomly at each split with an associated weight vector.
## choose the noisy y responses which should degrade performance
yvar.wt = c(0.000001, 0.5, 0.5)
ytry = 2
mult.grow <- rfsrc(formula = formula, data = mult.mtcars, ytry = ytry, yvar.wt = yvar.wt)
print(mult.grow)
print(get.mv.error(mult.grow))
## Also, compare the following two results, as they should be similar:
yvar.wt = c(1.0, 0.00001, 0.00001)
ytry = 1
result1 = rfsrc(formula = formula, data = mult.mtcars, ytry = ytry, yvar.wt = yvar.wt)
result2 = rfsrc(mpg ~ ., mtcars)
print(get.mv.error(result1))
print(get.mv.error(result2))
## ------------------------------------------------------------
## custom splitting using the pre-coded examples
## ------------------------------------------------------------
## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars, splitrule = "custom")
## iris analysis
iris.obj <- rfsrc(Species ~., data = iris, splitrule = "custom1")
## WIHS analysis
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3,
ntree = 100, splitrule = "custom1")