View source: R/var.select.rfsrc.R
Variable selection using minimal depth.
## S3 method for class 'rfsrc'
var.select(formula,
data,
object,
cause,
m.target,
method = c("md", "vh", "vh.vimp"),
conservative = c("medium", "low", "high"),
ntree = (if (method == "md") 1000 else 500),
mvars = (if (method != "md") ceiling(ncol(data)/5) else NULL),
mtry = (if (method == "md") ceiling(ncol(data)/3) else NULL),
nodesize = 2, splitrule = NULL, nsplit = 10, xvar.wt = NULL,
refit = (method != "md"), fast = FALSE,
na.action = c("na.omit", "na.impute"),
always.use = NULL, nrep = 50, K = 5, nstep = 1,
prefit = list(action = (method != "md"), ntree = 100,
mtry = 500, nodesize = 3, nsplit = 1),
verbose = TRUE, ...)
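
Several defaults in the usage above depend on method. The sketch below restates those default expressions in a small helper so they can be inspected for a given method. The helper resolve.defaults and its argument p (standing in for ncol(data)) are hypothetical names introduced for illustration; they are not part of the package.

```r
# Hypothetical helper (not part of randomForestSRC): mirrors the default
# expressions for ntree, mvars, mtry and refit shown in the usage.
# 'p' stands in for ncol(data).
resolve.defaults <- function(method = c("md", "vh", "vh.vimp"), p) {
  method <- match.arg(method)
  list(
    ntree = if (method == "md") 1000 else 500,
    mvars = if (method != "md") ceiling(p / 5) else NULL,
    mtry  = if (method == "md") ceiling(p / 3) else NULL,
    refit = (method != "md")
  )
}

md.defaults <- resolve.defaults("md", p = 20)
vh.defaults <- resolve.defaults("vh", p = 20)
str(md.defaults)
str(vh.defaults)
```

With minimal depth, mtry is set and mvars is NULL; with either variable hunting method the reverse holds and a refit is requested.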

formula 
A symbolic description of the model to be fit.
Must be specified unless object is given.
data 
Data frame containing the y-outcome and x-variables in
the model. Must be specified unless object is given.
object 
An object of class (rfsrc, grow).
cause 
Integer value between 1 and the number of event types, specifying the event of interest for competing risk data.
m.target 
Character value for multivariate families specifying the target outcome to be used. If left unspecified, the algorithm will choose a default target. 
method 
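Variable selection method: "md" (minimal depth, the default), "vh" (variable hunting), or "vh.vimp" (variable hunting with variables ordered by VIMP).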

conservative 
Level of conservativeness of the thresholding rule used in minimal depth selection: one of "medium" (the default), "low", or "high".

ntree 
Number of trees to grow. 
mvars 
Number of randomly selected variables used in the variable hunting algorithm (ignored when method="md"). 
mtry 
The mtry value used. 
nodesize 
Forest average terminal node size. 
splitrule 
Splitting rule used. 
nsplit 
If nonzero, the specified tree splitting rule is randomized, which can significantly increase speed.
xvar.wt 
Vector of nonnegative weights specifying the
probability of selecting a variable for splitting a node. Must be of
dimension equal to the number of variables. The default is NULL.
refit 
Should a forest be refit using the selected variables? 
fast 
Speeds up the cross-validation used for variable hunting for a faster analysis. See Miscellanea below.
na.action 
Action to be taken if the data contains NA values: "na.omit" (the default) or "na.impute".
always.use 
Character vector of variable names to always be included in the model selection procedure and in the final selected model. 
nrep 
Number of Monte Carlo iterations of the variable hunting algorithm. 
K 
Integer value specifying the K-fold size used in the variable hunting algorithm (ignored when method="md").
nstep 
Integer value controlling the step size used in the forward selection process of the variable hunting algorithm. Increasing this will encourage more variables to be selected. 
prefit 
List containing parameters used in preliminary forest analysis for determining weight selection of variables. Users can set all or some of the following parameters: action, ntree, mtry, nodesize, nsplit.

verbose 
Set to TRUE for verbose output.
... 
Further arguments passed to forest grow call. 
This function implements random forest variable selection using tree minimal depth methodology (Ishwaran et al., 2010). The option method allows for two different approaches:
method="md"
Invokes minimal depth variable selection, which uses all data and
all variables simultaneously. This is basically a front-end to the
max.subtree wrapper. Users should consult the max.subtree help file
for details.
Set mtry to larger values in high-dimensional problems.
method="vh" or method="vh.vimp"
Invokes variable hunting. Variable hunting is used for problems where the number of variables is substantially larger than the sample size (e.g., p/n is greater than 10). Using method="md" is generally preferred, but variable hunting may be the better choice when the goal is to find more variables or when the computational cost is high.
When method="vh": Using training data from a stratified
K-fold subsampling (stratification based on the y-outcomes), a
forest is fit using mvars randomly selected variables
(variables are chosen with probability proportional to weights
determined using an initial forest fit; see below for more
details). The mvars variables are ordered by increasing
minimal depth and added sequentially (starting from an initial
model determined using minimal depth selection) until joint VIMP
no longer increases (signifying the final model). A forest is
refit to the final model and applied to test data to estimate
prediction error. The process is repeated nrep times.
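
The sequential step (add ordered variables until joint VIMP stops increasing) can be sketched with a toy score. Everything below is an illustrative assumption: toy.joint.vimp is a hypothetical stand-in for joint VIMP, which in practice comes from a fitted forest, and forward.select is not the package's implementation. Variables are added one at a time here, so the role of nstep is not modelled.

```r
# Toy stand-in for joint VIMP (hypothetical): total signal of the model
# minus a small cost per variable, so that weak variables lower the score.
toy.joint.vimp <- function(vars, signal) {
  sum(signal[vars]) - 0.05 * length(vars)
}

# Add candidates in the given order until the score no longer increases.
forward.select <- function(ordered.vars, init, signal) {
  model <- init
  best <- toy.joint.vimp(model, signal)
  for (v in setdiff(ordered.vars, init)) {
    score <- toy.joint.vimp(c(model, v), signal)
    if (score <= best) break   # no improvement: stop, keep current model
    model <- c(model, v)
    best <- score
  }
  model
}

# Hypothetical per-variable signal; names are assumed to be already
# ordered by increasing minimal depth.
signal <- c(x1 = 0.9, x2 = 0.5, x3 = 0.2, x4 = 0.01, x5 = 0.0)
selected <- forward.select(names(signal), init = "x1", signal = signal)
print(selected)
```

In this toy run the search stops at x4, whose contribution is smaller than the per-variable cost.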
Final selected variables are the top P ranked variables, where P
is the average model size (rounded up to the nearest integer) and
variables are ranked by frequency of occurrence.
The same algorithm is used when method="vh.vimp", but variables are ordered using VIMP. This is faster, but not as accurate.
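
The final ranking rule (top P variables by frequency, with P the average model size rounded up) can be sketched in base R. The selections list below is hypothetical data standing in for the models chosen across nrep iterations; it is not output from the package.

```r
# Hypothetical models selected in nrep = 4 Monte Carlo iterations.
selections <- list(
  c("x1", "x2", "x3"),
  c("x1", "x2"),
  c("x1", "x4", "x2"),
  c("x1", "x3")
)

# P: average model size, rounded up to the nearest integer.
P <- ceiling(mean(lengths(selections)))

# Rank variables by how often they were selected, most frequent first.
freq <- sort(table(unlist(selections)), decreasing = TRUE)

# Final selected variables: the top P by frequency.
topvars <- names(freq)[seq_len(P)]
print(P)
print(topvars)
```

How ties in frequency are broken is not specified in the text above; this sketch simply takes whatever order sort() returns.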
Miscellanea
When variable hunting is used, a preliminary forest is run
and its VIMP is used to define the probability of selecting a
variable for splitting a node. Thus, instead of selecting mvars
variables uniformly at random, variables are selected with
probability proportional to their VIMP (the probability is zero
if VIMP is negative). A preliminary forest is run once prior
to the analysis if prefit$action=TRUE, otherwise it is
run prior to each iteration (this latter scenario can be slow).
When method="md", a preliminary forest is fit only if
prefit$action=TRUE. Then instead of selecting mtry variables
uniformly at random, mtry variables are selected with
probability proportional to their VIMP. In all cases, the
entire option is overridden if xvar.wt is non-null.
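
A minimal sketch of VIMP-proportional selection, assuming a hypothetical named vector of VIMP values from a preliminary forest. It only illustrates the weighting rule described above (negative VIMP gives probability zero) and is not the package's internal code.

```r
set.seed(1)

# Hypothetical VIMP values from a preliminary forest.
vimp <- c(x1 = 0.30, x2 = 0.10, x3 = -0.02, x4 = 0.05, x5 = 0.01, x6 = 0.15)

# Negative VIMP is truncated to zero, then weights are normalized.
wts <- pmax(vimp, 0)
prob <- wts / sum(wts)

# Draw mvars variables without replacement, with probability
# proportional to VIMP; x3 can never be drawn.
mvars <- 3
picked <- sample(names(vimp), size = mvars, prob = prob)
print(picked)
```

Supplying a user-defined xvar.wt replaces weights of this kind entirely, as stated above.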
If object is supplied and method="md", the grow forest from
object is parsed for minimal depth information. While this
avoids fitting another forest, thus saving computational time,
certain options no longer apply. In particular, the value of
cause plays no role in the final selected variables as minimal
depth is extracted from the grow forest, which has already been
grown under a preselected cause specification. Users wishing to
specify cause should instead use the formula and data interface.
Also, if the user requests a prefitted forest via
prefit$action=TRUE, then object is not used and a
refitted forest is used in its place for variable selection.
Thus, the effort spent to construct the original grow forest is
not used in this case.
If fast=TRUE, and variable hunting is used, the training data is chosen to be of size n/K, where n=sample size (i.e., the size of the training data is swapped with the test data). This speeds up the algorithm. Increasing K also helps.
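
The effect of fast=TRUE on training size can be sketched with a stratified K-fold split in base R. The outcome vector y and the fold assignment scheme are assumptions made for illustration; the package's own subsampling may differ in detail.

```r
set.seed(1)
n <- 100
K <- 5
y <- rep(c(0, 1), times = c(70, 30))  # hypothetical binary outcome

# Stratified fold assignment: within each outcome level, fold labels
# 1..K are spread as evenly as possible and then shuffled.
fold <- integer(n)
for (lev in unique(y)) {
  idx <- which(y == lev)
  fold[idx] <- sample(rep_len(seq_len(K), length(idx)))
}

held <- which(fold == 1)

# Default: train on the other K - 1 folds, test on the held-out fold.
train.default <- setdiff(seq_len(n), held)

# fast = TRUE: the roles are swapped, so training uses only n/K cases.
train.fast <- held

print(c(default = length(train.default), fast = length(train.fast)))
```

With n = 100 and K = 5, the training set shrinks from 80 to 20 cases, which is why a larger K speeds things up further.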
Can be used for competing risk data. When
method="vh.vimp", variable selection based on VIMP is
confined to an event-specific cause specified by cause.
However, this can be unreliable as not all y-outcomes can be
guaranteed to appear when subsampling (this is true even when
stratified subsampling is used, as done here).
Invisibly, a list with the following components:
err.rate 
Prediction error for the forest (a vector of
length nrep if variable hunting is used).
modelsize 
Number of variables selected. 
topvars 
Character vector of names of the final selected variables. 
varselect 
Useful output summarizing the final selected variables. 
rfsrc.refit.obj 
Refitted forest using the final set of selected variables (requires refit=TRUE). 
md.obj 
Minimal depth object. 
Hemant Ishwaran and Udaya B. Kogalur
Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Kogalur U.B., Chen X. and Minn A.J. (2011). Random survival forests for high-dimensional data. Statist. Anal. Data Mining, 4:115-132.
find.interaction, max.subtree, vimp
##
## Minimal depth variable selection
## survival analysis
## use larger node size which is better for minimal depth
## 
data(pbc, package = "randomForestSRC")
pbc.obj <- rfsrc(Surv(days, status) ~ ., pbc, nodesize = 20, importance = TRUE)
# default call corresponds to minimal depth selection
vs.pbc <- var.select(object = pbc.obj)
topvars <- vs.pbc$topvars
# the above is equivalent to
max.subtree(pbc.obj)$topvars
# different levels of conservativeness
var.select(object = pbc.obj, conservative = "low")
var.select(object = pbc.obj, conservative = "medium")
var.select(object = pbc.obj, conservative = "high")
## 
## Minimal depth variable selection
## competing risk analysis
## use larger node size which is better for minimal depth
## 
## competing risk data set involving AIDS in women
data(wihs, package = "randomForestSRC")
vs.wihs <- var.select(Surv(time, status) ~ ., wihs, nsplit = 3,
                      nodesize = 20, ntree = 100, importance = TRUE)
## competing risk analysis of pbc data from survival package
## implement cause-specific variable selection
if (library("survival", logical.return = TRUE)) {
  data(pbc, package = "survival")
  pbc$id <- NULL
  var.select(Surv(time, status) ~ ., pbc, cause = 1)
  var.select(Surv(time, status) ~ ., pbc, cause = 2)
}
## 
## Minimal depth variable selection
## classification analysis
## 
vs.iris <- var.select(Species ~ ., iris)
## 
## Variable hunting high-dimensional example
## van de Vijver microarray breast cancer survival data
## nrep is small for illustration; typical values are nrep = 100
## 
data(vdv, package = "randomForestSRC")
vh.breast <- var.select(Surv(Time, Censoring) ~ ., vdv,
                        method = "vh", nrep = 10, nstep = 5)
# plot top 10 variables
plot.variable(vh.breast$rfsrc.refit.obj,
              xvar.names = vh.breast$topvars[1:10])
plot.variable(vh.breast$rfsrc.refit.obj,
              xvar.names = vh.breast$topvars[1:10], partial = TRUE)
## similar analysis, but using weights from univariate cox p-values
if (library("survival", logical.return = TRUE))
{
  cox.weights <- function(rfsrc.f, rfsrc.data) {
    event.names <- all.vars(rfsrc.f)[1:2]
    p <- ncol(rfsrc.data) - 2
    event.pt <- match(event.names, names(rfsrc.data))
    xvar.pt <- setdiff(1:ncol(rfsrc.data), event.pt)
    ## one univariate cox fit per x-variable; weight = 1/p-value
    sapply(1:p, function(j) {
      cox.out <- coxph(rfsrc.f, rfsrc.data[, c(event.pt, xvar.pt[j])])
      pvalue <- summary(cox.out)$coef[5]
      if (is.na(pvalue)) 1.0 else 1/(pvalue + 1e-100)
    })
  }
  data(vdv, package = "randomForestSRC")
  rfsrc.f <- as.formula(Surv(Time, Censoring) ~ .)
  cox.wts <- cox.weights(rfsrc.f, vdv)
  vh.breast.cox <- var.select(rfsrc.f, vdv, method = "vh", nstep = 5,
                              nrep = 10, xvar.wt = cox.wts)
}
