Fast OpenMP parallel processing for a unified treatment of Breiman's
random forests (Breiman 2001) in a variety of data settings. Applies
when the y-response is numeric or categorical, yielding Breiman
regression and classification forests, while random survival forests
(Ishwaran et al. 2008, 2012) are grown for right-censored survival and
competing risk data. Multivariate regression and classification
responses, as well as mixed regression/classification responses, are
also handled. Also includes unsupervised forests and quantile
regression forests, quantileReg.
Different splitting rules invoked under deterministic or random
splitting are available for all families. Variable predictiveness can
be assessed using variable importance (VIMP) measures for single, as
well as grouped, variables. Missing data can be imputed on both
training and test data; see impute. The forest object, informally
referred to as an RFSRC object, contains many useful values which can
be directly extracted by the user and/or parsed using additional
functions (see the examples below).
This is the main entry point to the randomForestSRC package. Also see
rfsrcFast for a fast implementation of rfsrc.
For more information about this package and OpenMP, use the command
package?randomForestSRC.
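As a minimal sketch of the interface (assuming the package is installed; mtcars is used purely as a convenient built-in data set, not one from this page):

```r
## a minimal regression forest sketch; mtcars is only illustrative
library(randomForestSRC)

o <- rfsrc(mpg ~ ., data = mtcars)

## the RFSRC object exposes its components directly
print(o$family)         ## "regr" for a numeric response
print(o$ntree)          ## number of trees grown
head(o$predicted.oob)   ## OOB predicted values
```

The same call pattern applies to classification (factor response) and survival (Surv response) settings; the family is detected automatically.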
rfsrc(formula, data, ntree = 1000,
mtry = NULL, ytry = NULL,
nodesize = NULL, nodedepth = NULL,
splitrule = NULL, nsplit = 10,
importance = c(FALSE, TRUE, "none", "permute", "random", "anti"),
block.size = if (importance == "none" || as.character(importance) == "FALSE") NULL
else 10,
ensemble = c("all", "oob", "inbag"),
bootstrap = c("by.root", "by.node", "none", "by.user"),
samptype = c("swr", "swor"), sampsize = NULL, samp = NULL, membership = FALSE,
na.action = c("na.omit", "na.impute"), nimpute = 1,
ntime, cause,
proximity = FALSE, distance = FALSE, forest.wt = FALSE,
xvar.wt = NULL, yvar.wt = NULL, split.wt = NULL, case.wt = NULL,
forest = TRUE,
var.used = c(FALSE, "all.trees", "by.tree"),
split.depth = c(FALSE, "all.trees", "by.tree"),
seed = NULL,
do.trace = FALSE,
statistics = FALSE,
...)

formula 
A symbolic description of the model to be fit. If missing, unsupervised splitting is implemented. 
data 
Data frame containing the y-outcome and x-variables. 
ntree 
Number of trees in the forest. 
mtry 
Number of variables randomly selected as candidates for
splitting a node. The default is 
ytry 
For unsupervised forests, sets the number of randomly
selected pseudo-responses (see below for more details).
The default is 
nodesize 
Forest average number of unique cases (data points) in
a terminal node. The defaults are: survival (3), competing risk
(6), regression (5), classification (1), mixed outcomes (3),
unsupervised (3). It is recommended to experiment with different

nodedepth 
Maximum depth to which a tree should be grown. The default behaviour is that this parameter is ignored. 
splitrule 
Splitting rule used to grow trees. See below for details. 
nsplit 
Non-negative integer value. When zero or NULL, deterministic splitting for an x-variable is in effect. When non-zero, a maximum of nsplit split points are randomly chosen among the possible split points for the x-variable. This significantly increases speed. 
importance 
Method for computing variable importance (VIMP).
Because VIMP is computationally expensive, the default action is

block.size 
Should the cumulative error rate be calculated on
every tree? When 
ensemble 
Specifies the type of ensemble. By default both out-of-bag (OOB) and in-bag ensembles are returned. Always use OOB values for inference on the training data. 
bootstrap 
Bootstrap protocol. The default is 
samptype 
Type of bootstrap when 
sampsize 
Requested size of bootstrap when 
samp 
Bootstrap specification when 
membership 
Should terminal node membership and in-bag information be returned? 
na.action 
Action taken if the data contains 
nimpute 
Number of iterations of the missing data algorithm.
Performance measures such as out-of-bag (OOB) error rates tend
to become optimistic if 
ntime 
Integer value used for survival to constrain ensemble
calculations to a grid of 
cause 
Integer value between 1 and 
proximity 
Proximity of cases as measured by the frequency of
sharing the same terminal node. This is an 
distance 
Distance between cases as measured by the ratio of the
sum of the count of edges from each case to their immediate common
ancestor node to the sum of the count of edges from each case to the
root node. If the cases are co-terminal for a tree, this measure is
zero and reduces to 1 minus the proximity measure for these cases in a
tree. This is an 
forest.wt 
Should the forest weight matrix be
calculated? Creates an 
xvar.wt 
Vector of nonnegative weights where entry

yvar.wt 
NOT YET IMPLEMENTED: Vector of nonnegative weights where entry

split.wt 
Vector of nonnegative weights where entry

case.wt 
Vector of nonnegative weights where entry

forest 
Should the forest object be returned? Used for prediction on new data and required by many of the functions used to parse the RFSRC object. It is recommended not to change the default setting. 
var.used 
Return variables used for splitting? Default is

split.depth 
Records the minimal depth for each variable.
Default is 
seed 
Negative integer specifying seed for the random number generator. 
do.trace 
Number of seconds between updates to the user on approximate time to completion. 
statistics 
Should split statistics be returned? Values can be
parsed using 
... 
Further arguments passed to or from other methods. 
Families
The family is *not* set by the user: the package automatically determines the underlying random forest family from the type of response and the formula supplied. There are eight possible scenarios:
regr
, regr+
, class
, class+
, mix+
,
unsupv
, surv
, and surv-CR
.
Regression forests (regr
) for continuous responses.
Multivariate regression forests (regr+
)
for multivariate continuous responses.
Classification forests (class
) for factor responses.
Multivariate classification forests (class+
)
for multivariate factor responses.
Multivariate mixed forests (mix+
) for mixed continuous
and factor responses.
Unsupervised forests (unsupv
) when there is no response.
Survival forest (surv
) for right-censored survival settings.
Competing risk survival forests (surv-CR
)
for competing risk scenarios.
See below for how to code the response in the two different survival scenarios and for specifying a multivariate forest formula.
Splitrules
Splitrules are set according to the option splitrule
as follows:
Regression analysis:
The default rule is weighted mean-squared error splitting,
mse (Breiman et al. 1984, Chapter 8.4).
Unweighted and heavy-weighted mean-squared error splitting
rules can be invoked using splitrules mse.unwt and
mse.hvwt. Generally mse works best, but see
Ishwaran (2015) for details.
Multivariate regression analysis: For multivariate regression responses, a composite normalized mean-squared error splitting rule is used.
Classification analysis:
The default rule is Gini index splitting gini
(Breiman
et al. 1984, Chapter 4.3).
Unweighted and heavy-weighted Gini index splitting rules can
be invoked using splitrules gini.unwt
and
gini.hvwt
. Generally gini
works best, but see
Ishwaran (2015) for details.
Multivariate classification analysis: For multivariate classification responses, a composite normalized Gini index splitting rule is used.
Mixed outcomes analysis: When both regression and classification responses are detected, a multivariate normalized composite split rule of mean-squared error and Gini index splitting is invoked. See Tang and Ishwaran (2017) for details.
Unsupervised analysis: In settings where there is no outcome, unsupervised splitting is invoked. In this case, the mixed outcome splitting rule (above) is applied. See Mantero and Ishwaran (2017) for details.
Survival analysis:
The default rule is logrank
which implements
logrank splitting (Segal, 1988; LeBlanc and Crowley, 1993).
logrankscore
implements logrank score splitting
(Hothorn and Lausen, 2003).
Competing risk analysis:
The default rule is logrankCR
which implements a
modified weighted logrank splitting rule modeled after Gray's
test (Gray, 1988).
logrank implements weighted logrank splitting where
each event type is treated as the event of interest and all
other events are treated as censored. The split rule is the
weighted value of each of the logrank statistics, standardized by
the variance. For more details see Ishwaran et al. (2014).
Custom splitting: All families except unsupervised are
available for user defined custom splitting. Some basic
C programming skills are required. The harness for defining these
rules is in splitCustom.c
. In this file we give examples
of how to code rules for regression, classification, survival, and
competing risk. Each family can support up to sixteen custom
split rules. Specifying splitrule="custom"
or
splitrule="custom1"
will trigger the first split rule for
the family defined by the training data set. Multivariate families
will need a custom split rule for both regression and
classification. In the examples, we demonstrate how the user is
presented with the node specific membership. The task is then to
define a split statistic based on that membership. Take note of
the instructions in splitCustom.c
on how to register the
custom split rules. It is suggested that the existing custom split
rules be kept in place for reference and that the user proceed
to develop splitrule="custom2"
and so on. The package must
be recompiled and installed for the custom split rules to become
available.
Random splitting. For all families, pure random splitting
can be invoked by setting splitrule="random"
. See
below for more details regarding randomized splitting rules.
Allowable data types
Data types must be real-valued, integer, factor or logical;
however all except factors are coerced and treated as if real-valued.
For ordered x-variable factors, splits are similar to real-valued
variables. If the x-variable factor is unordered, a split
will move a subset of the levels in the parent node to the left
daughter, and the complementary subset to the right daughter. All
possible complementary pairs are considered and apply to factors
with an unlimited number of levels. However, there is an
optimization check to ensure that the number of splits attempted is
not greater than the number of cases in a node (this internal check
will override the nsplit
value in random splitting mode if
nsplit
is large enough; see below for information about
nsplit
).
Improving computational speed
See the function rfsrcFast
for a fast
implementation of rfsrc
. In general, the key methods for
increasing speed are as follows:
Randomized splitting rules
Trees tend to favor splits on continuous variables and factors
with large numbers of levels (Loh and Shih, 1997). To mitigate
this bias and improve speed, randomized splitting can be invoked
using the option nsplit
. If nsplit
is set to a
nonzero positive integer, then a maximum of nsplit
split
points are chosen randomly for each of the mtry
variables
within a node and only these points are used to determine the best
split. Pure random splitting can be invoked by setting
splitrule="random"
. In this case, a variable is randomly
selected and the node is split using a random split point (Cutler
and Zhao, 2001; Lin and Jeon, 2006). Note when pure random
splitting is in effect, nsplit
is set to one.
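The modes above can be sketched side by side; the veteran data set from the package is used, and ntree is kept small purely for speed:

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

## deterministic splitting: all possible split points are tried
o.det  <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 50, nsplit = 0)

## randomized splitting: at most 5 random candidate split points per variable
o.rand <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 50, nsplit = 5)

## pure random splitting: random variable, one random split point
o.pure <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 50,
                splitrule = "random")
```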
Subsampling
Subsampling can be used to reduce the size of the in-sample data
used to grow a tree and therefore can greatly reduce computational
load. Subsampling is implemented using options sampsize
and samptype
.
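A possible subsampling sketch (the 0.632 fraction below is an arbitrary illustrative choice, not a package default):

```r
library(randomForestSRC)
data(pbc, package = "randomForestSRC")

## grow each tree on a without-replacement subsample of ~63% of the data
o <- rfsrc(Surv(days, status) ~ ., pbc,
           ntree = 100,
           samptype = "swor",
           sampsize = round(0.632 * nrow(pbc)))
```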
Unique time points
For large survival problems, users should consider setting
ntime
to a reasonably small value (such as 50 or 100).
This constrains ensemble calculations such as survival functions
to a restricted grid of time points of length no more than
ntime
and considerably reduces computational times.
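For example (50 here is an illustrative choice):

```r
library(randomForestSRC)
data(pbc, package = "randomForestSRC")

## restrict ensemble survival/CHF calculations to at most 50 time points
o <- rfsrc(Surv(days, status) ~ ., pbc, ntree = 100, ntime = 50)
length(o$time.interest)   ## no more than 50
```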
Large number of variables
Use the default setting of importance="none"
which turns
off variable importance (VIMP) calculations and considerably
reduces computational times when there are a large number of
variables (see below for more details about variable importance).
Variable importance calculations can always be recovered later
using functions vimp
or predict
. Also
consider using the function max.subtree
which calculates
minimal depth, a measure of the depth that a variable splits, and
yields fast variable selection (Ishwaran, 2010).
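A sketch of this workflow, growing first without VIMP (the default) and recovering it afterwards with vimp:

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

## grow without VIMP, then compute permutation VIMP post hoc
o <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 100,
           importance = "none")
v <- vimp(o)
head(v$importance)
```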
Factors
For coherence, an immutable map is applied to each factor that ensures that factor levels in the training data set are consistent with the factor levels in any subsequent test data set. This map is applied to each factor before and after the native C library is executed. Because of this, if x-variables are all factors, then computational times may become very long in high dimensional problems. Consider converting factors to real values if this is the case.
Prediction Error
Prediction error is calculated using OOB data. Performance is measured in terms of mean-squared error for regression, and misclassification error for classification. A normalized Brier score (relative to a coin-toss) is also provided upon printing a classification forest.
For survival, prediction error is measured by 1-C, where C is Harrell's (Harrell et al., 1982) concordance index. Prediction error is between 0 and 1, and measures how well the predictor correctly ranks (classifies) two random individuals in terms of survival. A value of 0.5 is no better than random guessing. A value of 0 is perfect.
When bootstrapping is by.node
or none
, a coherent OOB
subset is not available to assess prediction error. Thus, all outputs
dependent on this are suppressed. In such cases, prediction error is
only available via classical cross-validation (the user will need to
use the predict.rfsrc
function).
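Extracting the OOB error rate from a grow object can be sketched as follows (block.size = 1 is used so the cumulative per-tree error curve is available):

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

o <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 100, block.size = 1)

## final OOB error (1 - Harrell's C for survival)
o$err.rate[o$ntree]

## error as a function of the number of trees
plot(o$err.rate, type = "l", xlab = "trees", ylab = "OOB error")
```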
Variable Importance (VIMP)
To calculate VIMP, use the option importance
. Classical
permutation VIMP is implemented when permute
or TRUE
is selected. In this case, OOB cases for a variable x are
randomly permuted and dropped down a tree. VIMP is calculated by
comparing OOB prediction performance for the permuted predictor to
the original predictor.
The exact calculation for VIMP depends upon block.size
(an
integer value between 1
and ntree
) specifying the
number of trees in a block used to determine VIMP. When the value
is 1
, VIMP is calculated by tree (blocks of size 1).
Specifically, the difference between prediction error under the
perturbed predictor and the original predictor is calculated for
each tree and averaged over the forest. This yields the original
Breiman-Cutler VIMP (Breiman 2001).
When block.size
is set to ntree
, VIMP is calculated by
comparing the error rate for the perturbed OOB forest ensemble
(using all trees) to the unperturbed OOB forest ensemble (using all
trees). Thus, unlike Breiman-Cutler VIMP, ensemble VIMP does not
measure the tree average effect of x, but rather its overall
forest effect. This is called Ishwaran-Kogalur VIMP (Ishwaran et
al. 2008).
A useful compromise between Breiman-Cutler (BC) and Ishwaran-Kogalur
(IK) VIMP can be obtained by setting block.size to a value
to a value
between 1
and ntree
. Smaller values are closer to BC
and larger values closer to IK. Smaller generally gives better
accuracy, however computational times will be higher because VIMP is
calculated over more blocks.
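A sketch contrasting the two extremes (ntree is reduced for speed; with ntree = 100, block.size = 100 corresponds to IK VIMP):

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")

## Breiman-Cutler VIMP: blocks of one tree
o.bc <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 100,
              importance = TRUE, block.size = 1)

## Ishwaran-Kogalur VIMP: one block spanning the whole forest
o.ik <- rfsrc(Surv(time, status) ~ ., veteran, ntree = 100,
              importance = TRUE, block.size = 100)

## compare the two VIMP flavors side by side
cbind(BC = o.bc$importance, IK = o.ik$importance)
```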
The option importance
permits different ways to perturb a
variable. If random
is specified, then instead of permuting
x, OOB cases are assigned a daughter node randomly whenever a
split on x is encountered. If anti
is specified,
x is assigned to the opposite node whenever a split on
x is encountered.
Note that the option none
turns VIMP off entirely.
Note that the function vimp
provides a friendly user
interface for extracting VIMP.
Multivariate Forests
Multivariate forests are specified by using the multivariate formula interface. Such a call takes one of two forms:
rfsrc(Multivar(y1, y2, ..., yd) ~ . , my.data, ...)
rfsrc(cbind(y1, y2, ..., yd) ~ . , my.data, ...)
A multivariate normalized composite splitting rule is used to split nodes. The nature of the outcomes will inform the code as to the type of multivariate forest to be grown; i.e. whether it is realvalued, categorical, or a combination of both (mixed). Note that performance measures (when requested) are returned for all outcomes.
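A sketch of a mixed multivariate call; mtcars with cyl converted to a factor is only an illustrative choice, and the get.mv.* helpers used are those listed later on this page:

```r
library(randomForestSRC)

## mixed multivariate outcome: mpg (continuous) and cyl (factor)
mtcars2 <- mtcars
mtcars2$cyl <- factor(mtcars2$cyl)

o <- rfsrc(Multivar(mpg, cyl) ~ ., data = mtcars2)

## parse performance stored in regrOutput/clasOutput
get.mv.error(o)
head(get.mv.predicted(o))
```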
Unsupervised Forests
In the case where no y-outcomes are present, unsupervised forests can be invoked by one of two means:
rfsrc(data = my.data)
rfsrc(Unsupervised() ~ ., data = my.data)
To split a tree node, a random subset of ytry variables are
selected from the available features, and these variables function
as "pseudo-responses" to be split on. Thus, in unsupervised mode,
the features take turns acting as both target y-outcomes and
x-variables for splitting.
More precisely, as in supervised forests, mtry x-variables
are randomly selected from the set of p features for
splitting the node. Then on each mtry loop, ytry
variables are selected from the p-1 remaining features to act
as the target pseudo-responses to be split on (there are p-1
possibilities because we exclude the currently selected x-variable
for the current mtry loop; also, only pseudo-responses
that pass purity checks are used). The split-statistic for
splitting the node on the pseudo-responses using the x-variable is
calculated. The best split over the mtry pairs is used to
split the node.
The default value of ytry
is 1 but can be increased by the
ytry
option. A value larger than 1 initiates multivariate
splitting. As illustration, consider the call:
rfsrc(data = my.data, ytry = 5, mtry = 10)
This is equivalent to the call:
rfsrc(Unsupervised(5) ~ ., my.data, mtry = 10)
In the above, a node will be split by selecting mtry=10
x-variables, and for each of these a random subset of 5 features
will be selected as the multivariate pseudo-responses. The
split-statistic is a multivariate normalized composite splitting
rule which is applied to each of the 10 multivariate regression
problems. The node is split on the variable leading to the best
split.
Note that all performance values (error rates, VIMP, prediction) are turned off in unsupervised mode.
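For instance (iris with the species column dropped is an arbitrary illustrative choice):

```r
library(randomForestSRC)

## unsupervised forest on the iris features (species label dropped)
o <- rfsrc(data = iris[, -5])
print(o$family)   ## "unsupv"

## proximity can still be requested, e.g. for clustering
o2 <- rfsrc(Unsupervised() ~ ., data = iris[, -5], proximity = TRUE)
```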
Survival, Competing Risks
Survival settings require a time and censoring variable which
should be identified in the formula as the response using the standard
Surv formula specification. A typical formula call looks like:
Surv(my.time, my.status) ~ .
where my.time and my.status are the variable names for
the event time and status variable in the user's data set.
For survival forests (Ishwaran et al. 2008), the censoring variable must be coded as a nonnegative integer with 0 reserved for censoring and (usually) 1=death (event). The event time must be nonnegative.
For competing risk forests (Ishwaran et al., 2013), the implementation is similar to survival, but with the following caveats:
Censoring must be coded as a non-negative integer, where 0 indicates right-censoring, and non-zero values indicate different event types. While 0, 1, 2, ..., J is standard, and recommended, events can be coded non-sequentially, although 0 must always be used for censoring.
Setting the splitting rule to logrankscore
will result
in a survival analysis in which all events are treated as if they
are the same type (indeed, they will be coerced as such).
Generally, competing risks requires a larger nodesize
than
survival settings.
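The status coding above can be checked directly on the wihs data set used in the examples below; ntree is kept small for speed:

```r
library(randomForestSRC)
data(wihs, package = "randomForestSRC")

## status: 0 = right-censored, 1 and 2 = competing event types
table(wihs$status)

## the family is detected as "surv-CR" from the multi-level status variable
o <- rfsrc(Surv(time, status) ~ ., wihs, ntree = 100)
print(o$family)   ## "surv-CR"
```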
Missing data imputation
Setting na.action="na.impute"
imputes missing data (both x-variables
and y-variables) using a modification of the missing data algorithm
of Ishwaran et al. (2008). See also Tang and Ishwaran (2017).
Split statistics are calculated using non-missing data only. If a
node splits on a variable with missing data, the variable's missing
data is imputed by randomly drawing values from non-missing in-bag
data. The purpose of this is to make it possible to assign cases to
daughter nodes based on the split. Following a node split, imputed
data are reset to missing and the process is repeated until terminal
nodes are reached. Missing data in terminal nodes are imputed using
in-bag non-missing terminal node data. For integer-valued variables
and censoring indicators, imputation uses a maximal class rule,
whereas continuous variables and survival time use a mean rule.
The missing data algorithm can be iterated by setting nimpute
to a positive integer greater than 1. Only a few iterations are
needed to improve accuracy. When the algorithm is iterated, at the
completion of each iteration, missing data is imputed using OOB
non-missing terminal node data which is then used as input to grow a
new forest. Note that when the algorithm is iterated, a side effect
is that missing values in the returned objects xvar
,
yvar
are replaced by imputed values. Further, imputed objects
such as imputed.data
are set to NULL
. Also, keep in
mind that if the algorithm is iterated, performance measures such as
error rates and VIMP become optimistically biased.
Finally, records in which all outcome and x-variable information are missing are removed from the forest analysis. Variables having all missing values are also removed.
See the function impute
for a fast impute interface.
An object of class (rfsrc, grow)
with the following
components:
call 
The original call to 
family 
The family used in the analysis. 
n 
Sample size of the data (depends upon 
ntree 
Number of trees grown. 
mtry 
Number of variables randomly selected for splitting at each node. 
nodesize 
Minimum size of terminal nodes. 
nodedepth 
Maximum depth allowed for a tree. 
splitrule 
Splitting rule used. 
nsplit 
Number of randomly selected split points. 
yvar 
y-outcome values. 
yvar.names 
A character vector of the y-outcome names. 
xvar 
Data frame of x-variables. 
xvar.names 
A character vector of the x-variable names. 
xvar.wt 
Vector of nonnegative weights specifying the probability used to select a variable for splitting a node. 
split.wt 
Vector of nonnegative weights where entry

cause.wt 
Vector of weights used for the composite competing risk splitting rule. 
leaf.count 
Number of terminal nodes for each tree in the
forest. Vector of length 
proximity 
Proximity matrix recording the frequency with which pairs of data points occur within the same terminal node. 
forest 
If 
forest.wt 
Forest weight matrix. 
membership 
Matrix recording terminal node membership where each column contains the node number that a case falls in for that tree. 
splitrule 
Splitting rule used. 
inbag 
Matrix recording in-bag membership where each column contains the number of times that a case appears in the bootstrap sample for that tree. 
var.used 
Count of the number of times a variable is used in growing the forest. 
imputed.indv 
Vector of indices for cases with missing values. 
imputed.data 
Data frame of the imputed data. The first column(s) are reserved for the y-responses, after which the x-variables are listed. 
split.depth 
Matrix [i][j] or array [i][j][k] recording the minimal depth for variable [j] for case [i], either averaged over the forest, or by tree [k]. 
node.stats 
Split statistics returned when

err.rate 
Tree cumulative OOB error rate. 
importance 
Variable importance (VIMP) for each x-variable. 
predicted 
In-bag predicted value. 
predicted.oob 
OOB predicted value. 
++++++++ 
for classification settings, additionally ++++++++ 
class 
In-bag predicted class labels. 
class.oob 
OOB predicted class labels. 
++++++++ 
for multivariate settings, additionally ++++++++ 
regrOutput 
List containing performance values for multivariate regression responses (applies only in multivariate settings). 
clasOutput 
List containing performance values for multivariate categorical (factor) responses (applies only in multivariate settings). 
++++++++ 
for survival settings, additionally ++++++++ 
survival 
In-bag survival function. 
survival.oob 
OOB survival function. 
chf 
In-bag cumulative hazard function (CHF). 
chf.oob 
OOB CHF. 
time.interest 
Ordered unique death times. 
ndead 
Number of deaths. 
++++++++ 
for competing risks, additionally ++++++++ 
chf 
In-bag cause-specific cumulative hazard function (CSCHF) for each event. 
chf.oob 
OOB CSCHF. 
cif 
In-bag cumulative incidence function (CIF) for each event. 
cif.oob 
OOB CIF. 
time.interest 
Ordered unique event times. 
ndead 
Number of events. 
Values returned depend heavily on the family. In particular,
predicted
and predicted.oob
are the following values
calculated using in-bag and OOB data:
For regression, a vector of predicted y-responses.
For classification, a matrix with columns containing the estimated class probability for each class. Performance values and VIMP for classification are reported as a matrix with J+1 columns where J is the number of classes. The first column "all" is the unconditional value for performance or VIMP, while the remaining columns are performance and VIMP conditioned on cases corresponding to that class label.
For survival, a vector of mortality values (Ishwaran et al.,
2008) representing estimated risk for each individual calibrated
to the scale of the number of events (as a specific example, if
i has a mortality value of 100, then if all individuals had
the same xvalues as i, we would expect an average of 100
events). Also returned are matrices containing
the CHF and survival function. Each row corresponds to an
individual's ensemble CHF or survival function evaluated at each
time point in time.interest
.
For competing risks, a matrix with one column for each event
recording the expected number of life years lost due to the event
specific cause up to the maximum follow up (Ishwaran et al.,
2013). Also returned are the causespecific cumulative hazard
function (CSCHF) and the cumulative incidence function (CIF) for
each event type. These are encoded as a threedimensional array,
with the third dimension used for the event type, each time point
in time.interest
making up the second dimension (columns),
and the case (individual) being the first dimension (rows).
For multivariate families, predicted values (and other
performance values such as VIMP and error rates) are stored in
the lists regrOutput
and clasOutput
which can be
parsed using the functions get.mv.error
,
get.mv.predicted
and get.mv.vimp
.
Hemant Ishwaran and Udaya B. Kogalur
Breiman L., Friedman J.H., Olshen R.A. and Stone C.J. (1984). Classification and Regression Trees, Belmont, California.
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Cutler A. and Zhao G. (2001). PERT-Perfect random tree ensembles. Comp. Sci. Statist., 33:490-497.
Gray R.J. (1988). A class of k-sample tests for comparing the cumulative incidence of a competing risk, Ann. Statist., 16:1141-1154.
Harrell F.E. et al. (1982). Evaluating the yield of medical tests, J. Amer. Med. Assoc., 247:2543-2546.
Hothorn T. and Lausen B. (2003). On the exact distribution of maximally selected rank statistics, Comp. Statist. Data Anal., 43:121-137.
Ishwaran H. (2007). Variable importance in binary regression trees and forests, Electronic J. Statist., 1:519-537.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.
Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival forests for high-dimensional data. Stat. Anal. Data Mining, 4:115-132.
Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics, 15(4):757-773.
Ishwaran H. and Malley J.D. (2014). Synthetic learning machines. BioData Mining, 7:28.
Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.
Ishwaran H. and Lu M. (2018). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Statistics in Medicine (in press).
Lin Y. and Jeon Y. (2006). Random forests and adaptive nearest neighbors, J. Amer. Statist. Assoc., 101:578-590.
LeBlanc M. and Crowley J. (1993). Survival trees by goodness of split, J. Amer. Statist. Assoc., 88:457-467.
Loh W.-Y. and Shih Y.-S. (1997). Split selection methods for classification trees, Statist. Sinica, 7:815-840.
Mantero A. and Ishwaran H. (2017). Unsupervised random forests.
Mogensen U.B., Ishwaran H. and Gerds T.A. (2012). Evaluating random forests for survival analysis using prediction error curves, J. Statist. Software, 50(11):1-23.
O'Brien R. and Ishwaran H. (2017). A random forests quantile classifier for class imbalanced data.
Segal M.R. (1988). Regression trees for censored data, Biometrics, 44:35-47.
Tang F. and Ishwaran H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10:363-377.
plot.competing.risk
,
plot.rfsrc
,
plot.survival
,
plot.variable
,
predict.rfsrc
,
print.rfsrc
,
quantileReg
,
rfsrcFast
,
rfsrcSyn
,
stat.split
,
tune
,
var.select
,
vimp
##
## Survival analysis
##
## veteran data
## randomized trial of two treatment regimens for lung cancer
data(veteran, package = "randomForestSRC")
v.obj <- rfsrc(Surv(time, status) ~ ., data = veteran,
               ntree = 100, block.size = 1)
## print and plot the grow object
print(v.obj)
plot(v.obj)
## plot survival curves for first 10 individuals: direct way
matplot(v.obj$time.interest, 100 * t(v.obj$survival.oob[1:10, ]),
xlab = "Time", ylab = "Survival", type = "l", lty = 1)
## plot survival curves for first 10 individuals: use wrapper
plot.survival(v.obj, subset = 1:10)
## Primary biliary cirrhosis (PBC) of the liver
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc)
print(pbc.obj)
##
## Example of imputation in survival analysis
##
data(pbc, package = "randomForestSRC")
pbc.obj2 <- rfsrc(Surv(days, status) ~ ., pbc,
                  nsplit = 10, na.action = "na.impute")
## same as above but we iterate the missing data algorithm
pbc.obj3 <- rfsrc(Surv(days, status) ~ ., pbc,
                  na.action = "na.impute", nimpute = 3)
## fast way to impute the data (no inference is done)
## see impute for more details
pbc.imp <- impute(Surv(days, status) ~ ., pbc, splitrule = "random")
##
## Compare RFSRC to Cox regression
## Illustrates C-index and Brier score measures of performance
## assumes "pec" and "survival" libraries are loaded
##
if (library("survival", logical.return = TRUE)
& library("pec", logical.return = TRUE)
& library("prodlim", logical.return = TRUE))
{
##prediction function required for pec
predictSurvProb.rfsrc <- function(object, newdata, times, ...){
  ptemp <- predict(object, newdata = newdata, ...)$survival
  pos <- sindex(jump.times = object$time.interest, eval.times = times)
  p <- cbind(1, ptemp)[, pos + 1]
  if (NROW(p) != NROW(newdata) || NCOL(p) != length(times))
    stop("Prediction failed")
  p
}
## data, formula specifications
data(pbc, package = "randomForestSRC")
pbc.na <- na.omit(pbc)  ## remove NAs
surv.f <- as.formula(Surv(days, status) ~ .)
pec.f <- as.formula(Hist(days, status) ~ 1)
## run cox/rfsrc models
## for illustration we use a small number of trees
cox.obj <- coxph(surv.f, data = pbc.na)
rfsrc.obj <- rfsrc(surv.f, pbc.na, ntree = 150)
## compute bootstrap cross-validation estimate of expected Brier score
## see Mogensen, Ishwaran and Gerds (2012) Journal of Statistical Software
set.seed(17743)
prederror.pbc <- pec(list(cox.obj, rfsrc.obj), data = pbc.na, formula = pec.f,
splitMethod = "bootcv", B = 50)
print(prederror.pbc)
plot(prederror.pbc)
## compute out-of-bag C-index for Cox regression and compare to rfsrc
rfsrc.obj <- rfsrc(surv.f, pbc.na)
cat("out-of-bag Cox Analysis ...", "\n")
cox.err <- sapply(1:100, function(b) {
if (b%%10 == 0) cat("cox bootstrap:", b, "\n")
train <- sample(1:nrow(pbc.na), nrow(pbc.na), replace = TRUE)
cox.obj <- tryCatch({coxph(surv.f, pbc.na[train, ])}, error = function(ex){NULL})
if (is.list(cox.obj)) {
randomForestSRC:::cindex(pbc.na$days[-train],
pbc.na$status[-train],
predict(cox.obj, pbc.na[-train, ]))
} else NA
})
cat("\n\tOOB error rates\n\n")
cat("\tRSF : ", rfsrc.obj$err.rate[rfsrc.obj$ntree], "\n")
cat("\tCox regression : ", mean(cox.err, na.rm = TRUE), "\n")
}
##
## Competing risks
##
## WIHS analysis
## cumulative incidence function (CIF) for HAART and AIDS stratified by IDU
data(wihs, package = "randomForestSRC")
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3, ntree = 100)
plot.competing.risk(wihs.obj)
cif <- wihs.obj$cif.oob
Time <- wihs.obj$time.interest
idu <- wihs$idu
cif.haart <- cbind(apply(cif[,,1][idu == 0,], 2, mean),
apply(cif[,,1][idu == 1,], 2, mean))
cif.aids <- cbind(apply(cif[,,2][idu == 0,], 2, mean),
apply(cif[,,2][idu == 1,], 2, mean))
matplot(Time, cbind(cif.haart, cif.aids), type = "l",
lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3,
ylab = "Cumulative Incidence")
legend("topleft",
legend = c("HAART (Non-IDU)", "HAART (IDU)", "AIDS (Non-IDU)", "AIDS (IDU)"),
lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3, cex = 1.5)
## illustrates the various splitting rules
## illustrates event-specific and non-event-specific variable selection
if (library("survival", logical.return = TRUE)) {
## use the pbc data from the survival package
## events are transplant (1) and death (2)
data(pbc, package = "survival")
pbc$id <- NULL
## modified Gray's weighted log-rank splitting
pbc.cr <- rfsrc(Surv(time, status) ~ ., pbc)
## log-rank event-one specific splitting
pbc.log1 <- rfsrc(Surv(time, status) ~ ., pbc,
splitrule = "logrank", cause = c(1,0), importance = TRUE)
## log-rank event-two specific splitting
pbc.log2 <- rfsrc(Surv(time, status) ~ ., pbc,
splitrule = "logrank", cause = c(0,1), importance = TRUE)
## extract VIMP from the log-rank forests: event-specific
## extract minimal depth from the Gray log-rank forest: non-event-specific
var.perf <- data.frame(md = max.subtree(pbc.cr)$order[, 1],
vimp1 = 100 * pbc.log1$importance[ ,1],
vimp2 = 100 * pbc.log2$importance[ ,2])
print(var.perf[order(var.perf$md), ])
}
## 
## Regression analysis
## 
## New York air quality measurements
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
## partial plot of variables (see plot.variable for more details)
plot.variable(airq.obj, partial = TRUE, smooth.lines = TRUE)
## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars)
## 
## Classification analysis
## 
## Edgar Anderson's iris data
iris.obj <- rfsrc(Species ~ ., data = iris)
## Wisconsin prognostic breast cancer data
data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast, block.size = 1)
plot(breast.obj)
## 
## Classification analysis with class imbalanced data
## 
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
o <- rfsrc(status ~ ., data = breast)
print(o)
## The data is imbalanced so we use balanced random forests
## with undersampling of the majority class
##
## Specifically, let n0 and n1 be the sample sizes of the majority and
## minority classes. We sample 2 x n1 cases, with majority and minority
## cases chosen with probabilities n1/n and n0/n, where n = n0 + n1
y <- breast$status
o <- rfsrc(status ~ ., data = breast,
case.wt = randomForestSRC:::make.wt(y),
sampsize = randomForestSRC:::make.size(y))
print(o)
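The internal helpers make.wt and make.size encapsulate the scheme described above; a by-hand sketch of that scheme (an assumption about what those helpers compute, not the package's exact internals) might look like:

```r
## by-hand version of the balanced sampling scheme described above
## (assumed to mirror make.wt/make.size; not the exact internals)
frq <- table(breast$status)
n1 <- min(frq)   ## minority class size
n0 <- max(frq)   ## majority class size
n  <- n0 + n1
## minority cases weighted n0/n, majority cases n1/n
wt <- ifelse(breast$status == names(frq)[which.min(frq)],
             n0 / n, n1 / n)
## draw 2 x n1 cases per tree using these weights
o.byhand <- rfsrc(status ~ ., data = breast,
                  case.wt = wt, sampsize = 2 * n1)
print(o.byhand)
```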
## 
## Unsupervised analysis
## 
## two equivalent ways to implement unsupervised forests
mtcars.unspv <- rfsrc(Unsupervised() ~ ., data = mtcars)
mtcars2.unspv <- rfsrc(data = mtcars)
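A common follow-up (not shown in the original examples) is to request the proximity matrix, a documented rfsrc argument, and cluster on it; the hclust step here is an illustrative choice, not a package recommendation.

```r
## (sketch) pair an unsupervised forest with its proximity matrix
mtcars.prox <- rfsrc(Unsupervised() ~ ., data = mtcars, proximity = TRUE)
## hierarchical clustering on the dissimilarity 1 - proximity
hc <- hclust(as.dist(1 - mtcars.prox$proximity))
plot(hc)
```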
## 
## Multivariate regression analysis
## 
mtcars.mreg <- rfsrc(Multivar(mpg, cyl) ~ ., data = mtcars,
block.size = 1, importance = TRUE)
## extract error rates, vimp, and OOB predicted values for all targets
err <- get.mv.error(mtcars.mreg)
vmp <- get.mv.vimp(mtcars.mreg)
pred <- get.mv.predicted(mtcars.mreg)
## standardized error and vimp
err.std <- get.mv.error(mtcars.mreg, standardize = TRUE)
vmp.std <- get.mv.vimp(mtcars.mreg, standardize = TRUE)
## 
## Mixed outcomes analysis
## 
mtcars.new <- mtcars
mtcars.new$cyl <- factor(mtcars.new$cyl)
mtcars.new$carb <- factor(mtcars.new$carb, ordered = TRUE)
mtcars.mix <- rfsrc(cbind(carb, mpg, cyl) ~ ., data = mtcars.new, block.size = 1)
print(mtcars.mix, outcome.target = "mpg")
print(mtcars.mix, outcome.target = "cyl")
plot(mtcars.mix, outcome.target = "mpg")
plot(mtcars.mix, outcome.target = "cyl")
## 
## Custom splitting using the pre-coded examples
## 
## motor trend cars
mtcars.obj <- rfsrc(mpg ~ ., data = mtcars, splitrule = "custom")
## iris analysis
iris.obj <- rfsrc(Species ~ ., data = iris, splitrule = "custom1")
## WIHS analysis
wihs.obj <- rfsrc(Surv(time, status) ~ ., wihs, nsplit = 3,
ntree = 100, splitrule = "custom1")