Description OpenMP Parallel Processing – Installation OpenMP Parallel Processing – Setting the Number of CPUs R-side Parallel Processing – Setting the Number of CPUs Example: Setting the Number of CPUs CAUTIONARY NOTE Package Overview Author(s) References See Also
This package provides a unified treatment of Breiman's random forests (Breiman 2001) for a variety of data settings. Regression and classification forests are grown when the response is numeric or categorical (factor), while survival and competing risk forests (Ishwaran et al. 2008, 2012) are grown for right-censored survival data. Multivariate regression and classification responses as well as mixed outcomes (regression/classification responses) are also handled as are unsupervised forests. Different splitting rules invoked under deterministic or random splitting are available for all families. Variable predictiveness can be assessed using variable importance (VIMP) measures for single, as well as grouped variables. Variable selection is implemented using minimal depth variable selection (Ishwaran et al. 2010). Missing data (for x-variables and y-outcomes) can be imputed on both training and test data. The underlying code is based on Ishwaran and Kogalur's now retired randomSurvivalForest package (Ishwaran and Kogalur 2007), and has been significantly refactored for improved computational speed.
This package implements OpenMP shared-memory parallel programming. However, the default installation will only execute serially. To utilize OpenMP, the target architecture and operating system must first support it.
To install the package with OpenMP parallel processing enabled, on most non-Windows systems, do the following:
Download the package source code randomForestSRC_X.x.x.tar.gz from CRAN (do not download the binary).
Open a console, navigate to the directory containing the
tarball, and untar it using the command
tar -xvf randomForestSRC_X.x.x.tar.gz
This will create a directory structure with the root directory
of the package named randomForestSRC
. Change into the root
directory of the package using the command cd randomForestSRC
Run autoconf using the command autoconf
Change back to your working directory using the command
cd ..
Run R CMD INSTALL randomForestSRC
on the modified
package. Ensure that you do not target the unmodified tarball, but
instead act on the directory structure you just modified.
To install the package with OpenMP parallel processing enabled, on most Windows systems, do the following:
Download the Windows binary file randomForestSRC_X.x.x.zip from http://www.ccs.miami.edu/~hishwaran/rfsrc.html
If you are using the R GUI, start the GUI. From the menu click on
Packages > Install package(s) from local zip files
Then navigate to the directory where you downloaded the zip file and click on it.
There are several ways to control the number of CPU cores that the
package accesses during OpenMP parallel execution. First, you will
need to determine the number of cores on your local machine. Do this
by starting an R session and issuing the command
detectCores()
.
Then you can do the following:
At the start of every R session, you can set the number of cores
accessed during OpenMP parallel execution by issuing the command
options(rf.cores = x)
, where x
is the number of
cores. If x
is a negative number, the package will access
the maximum number of cores on your machine. The options command can
also be placed in the users .Rprofile file for convenience. You can,
alternatively, initialize the environment variable RF_CORES
in your shell environment.
The default value for rf.cores is -1 (-1L), if left unspecified, which uses all available cores, with a minimum of two.
The package also implements R-side parallel processing by replacing
the R function lapply
with mclapply
found in the
parallel package. You can set the number of cores accessed by
mclapply
by issuing the command options(mc.cores =
x)
, where x
is the number of cores. The options command
can also be placed in the users .Rprofile file for convenience. You
can, alternatively, initialize the environment variable
MC_CORES
in your shell environment. See the help files in
parallel for more information.
The default value for mclapply
on non-Windows systems is
two (2L) cores. On Windows systems, the default value is one (1L)
core.
As an example, issuing the following options command uses all available cores for both OpenMP and R-side processing:
options(rf.cores=detectCores(), mc.cores=detectCores())
As stated above, this option command can be placed in the users .Rprofile file.
Regarding C-side threading (accessed via OpenMP compilation) versus
R-side forking (accessed via mclapply
in package
parallel).
Once the package has been compiled with OpenMP enabled, trees
will be grown in parallel using the rf.cores
option.
Independently of this, we also utilize mclapply
to
parallelize loops in R-side pre-processing and post-processing
of the forest. This is always available and independent of
whether the user chooses to compile the package with the OpenMP
option enabled.
It is important NOT to write programs that fork R processes
containing OpenMP threads. That is, one should not use
mclapply
around the functions rfsrc
,
predict.rfsrc
, vimp.rfsc
,
var.select.rfsrc
, and
find.interaction.rfsrc
. In such a scenario, program
execution is not guaranteed.
Note that options(rf.cores=0)
disables C-side
threading, and options(mc.cores=1)
disables R-side
forking. Therefore, setting options(rf.cores=0)
, is
one means to wrap mclapply
around the functions
listed above in 2.
This package contains many useful functions and users should read the help file in its entirety for details. However, we briefly mention several key functions that may make it easier to navigate and understand the layout of the package.
rfsrc
This is the main entry point to the package. It grows a random forest
using user supplied training data. We refer to the resulting object
as a RF-SRC grow object. Formally, the resulting object has class
(rfsrc, grow)
.
predict.rfsrc
(predict
)
Used for prediction. Predicted values are obtained by dropping the
user supplied test data down the grow forest. The resulting object
has class (rfsrc, predict)
.
max.subtree
, var.select
Used for variable selection. The function max.subtree
extracts maximal subtree information from a RF-SRC object which is
used for selecting variables by making use of minimal depth variable
selection. The function var.select
provides
an extensive set of variable selection options and is a wrapper to
max.subtree
.
impute.rfsrc
Fast imputation mode for RF-SRC. Both rfsrc
and
predict.rfsrc
are capable of imputing missing data.
However, for users whose only interest is imputing data, this function
provides an efficient and fast interface for doing so.
Hemant Ishwaran and Udaya B. Kogalur
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.
Ishwaran H., Kogalur U.B., Gorodeski E.Z, Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.
Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics, 15(4):757-773.
Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.
find.interaction
,
impute.rfsrc
,
max.subtree
,
plot.competing.risk
,
plot.rfsrc
,
plot.survival
,
plot.variable
,
predict.rfsrc
,
print.rfsrc
,
rf2rfz
,
rfsrcSyn
,
stat.split
var.select
,
vimp
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.