randomForestSRC_package: Random Forests for Survival, Regression and Classification...

Description

This package provides a unified treatment of Breiman's random forests (Breiman 2001) for a variety of data settings. Regression and classification forests are grown when the response is numeric or categorical (factor), while survival and competing risk forests (Ishwaran et al. 2008, 2014) are grown for right-censored survival data. Multivariate regression and classification responses, as well as mixed outcomes (regression/classification responses), are also handled, as are unsupervised forests. Different splitting rules, invoked under deterministic or random splitting, are available for all families. Variable predictiveness can be assessed using variable importance (VIMP) measures for single as well as grouped variables. Variable selection is implemented using minimal depth variable selection (Ishwaran et al. 2010). Missing data (for x-variables and y-outcomes) can be imputed on both training and test data. The underlying code is based on Ishwaran and Kogalur's now retired randomSurvivalForest package (Ishwaran and Kogalur 2007), and has been significantly refactored for improved computational speed.

OpenMP Parallel Processing – Installation

This package implements OpenMP shared-memory parallel programming. However, the default installation will only execute serially. To utilize OpenMP, the target architecture and operating system must first support it.

To install the package with OpenMP parallel processing enabled, on most non-Windows systems, do the following:

  1. Download the package source code randomForestSRC_X.x.x.tar.gz from CRAN (do not download the binary).

  2. Open a console, navigate to the directory containing the tarball, and untar it using the command tar -xvf randomForestSRC_X.x.x.tar.gz

  3. This will create a directory structure with the root directory of the package named randomForestSRC. Change into the root directory of the package using the command cd randomForestSRC

  4. Run autoconf using the command autoconf

  5. Change back to your working directory using the command cd ..

  6. Run R CMD INSTALL randomForestSRC on the modified package. Ensure that you do not target the unmodified tarball, but instead act on the directory structure you just modified.
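The non-Windows steps above can also be driven from within an R session. The following is only a sketch: the helper name install_rfsrc_openmp and its arguments are hypothetical and not part of the package, and it assumes autoconf is available on the system path and that the tarball unpacks to a directory named randomForestSRC, as described in step 3.

```r
# Hypothetical helper (not part of randomForestSRC): mirrors steps 2-6 above.
install_rfsrc_openmp <- function(tarball, workdir = tempdir()) {
  stopifnot(file.exists(tarball))

  # Steps 2-3: unpack the source tarball; the package root is "randomForestSRC".
  untar(tarball, exdir = workdir)
  pkg_dir <- file.path(workdir, "randomForestSRC")

  # Step 4: run autoconf inside the package root.
  old_wd <- setwd(pkg_dir)
  on.exit(setwd(old_wd), add = TRUE)
  status <- system2("autoconf")
  if (status != 0) stop("autoconf failed with status ", status)

  # Steps 5-6: change back and install from the modified directory,
  # not from the unmodified tarball.
  setwd(old_wd)
  install.packages(pkg_dir, repos = NULL, type = "source")
}
```

A call would pass the path to the downloaded source tarball as the tarball argument.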

To install the package with OpenMP parallel processing enabled, on most Windows systems, do the following:

  1. Download the Windows binary file randomForestSRC_X.x.x.zip from http://www.ccs.miami.edu/~hishwaran/rfsrc.html

  2. If you are using the R GUI, start the GUI. From the menu click on

    Packages > Install package(s) from local zip files

    Then navigate to the directory where you downloaded the zip file and click on it.

OpenMP Parallel Processing – Setting the Number of CPUs

There are several ways to control the number of CPU cores that the package accesses during OpenMP parallel execution. First, you will need to determine the number of cores on your local machine. Do this by starting an R session, loading the parallel package, and issuing the command detectCores().
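As a minimal sketch of the core-count check, the following uses detectCores() from the parallel package that ships with R. The variable names are illustrative only, and detectCores() may return NA on platforms where the count cannot be determined.

```r
library(parallel)

# Number of logical cores visible to R (may be NA on some platforms).
n_cores <- detectCores()

# Number of physical cores, where the platform can report it.
n_physical <- detectCores(logical = FALSE)

print(n_cores)
print(n_physical)
```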

Then you can do the following:

At the start of every R session, you can set the number of cores accessed during OpenMP parallel execution by issuing the command options(rf.cores = x), where x is the number of cores. If x is a negative number, the package will access the maximum number of cores on your machine. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, you can initialize the environment variable RF_CORES in your shell environment.

If left unspecified, rf.cores defaults to -1 (-1L), which uses all available cores, with a minimum of two.
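A short sketch of setting and inspecting the rf.cores option is shown here. The value 4L is an arbitrary illustration, not a recommendation, and the variable names are hypothetical.

```r
# Request four cores for OpenMP execution (4L is an arbitrary example value).
old <- options(rf.cores = 4L)
current <- getOption("rf.cores")
print(current)

# A negative value asks the package to use the maximum number of cores.
options(rf.cores = -1L)

# Restore whatever was set before this sketch ran.
options(old)
```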

R-side Parallel Processing – Setting the Number of CPUs

The package also implements R-side parallel processing by replacing the R function lapply with mclapply found in the parallel package. You can set the number of cores accessed by mclapply by issuing the command options(mc.cores = x), where x is the number of cores. The options command can also be placed in the user's .Rprofile file for convenience. Alternatively, you can initialize the environment variable MC_CORES in your shell environment. See the help files in parallel for more information.

The default value for mclapply on non-Windows systems is two (2L) cores. On Windows systems, the default value is one (1L) core.
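The sketch below illustrates how mc.cores feeds into mclapply. The toy squaring task and the variable names are illustrative only. Because forking is unavailable on Windows, the sketch falls back to a single worker there.

```r
library(parallel)

# Set the R-side core count for this session (2L is an example value).
old <- options(mc.cores = 2L)

# mclapply cannot fork on Windows, so use one worker on that platform.
workers <- if (.Platform$OS.type == "windows") 1L else getOption("mc.cores", 2L)

# Toy task: square each element in parallel.
squares <- mclapply(1:4, function(i) i^2, mc.cores = workers)
print(unlist(squares))

# Restore the previous setting.
options(old)
```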

Example: Setting the Number of CPUs

As an example, issuing the following options command uses all available cores for both OpenMP and R-side processing:

options(rf.cores = parallel::detectCores(), mc.cores = parallel::detectCores())

As stated above, this options command can be placed in the user's .Rprofile file.

CAUTIONARY NOTE

The following notes concern C-side threading (accessed via OpenMP compilation) versus R-side forking (accessed via mclapply in package parallel).

  1. Once the package has been compiled with OpenMP enabled, trees will be grown in parallel using the rf.cores option. Independently of this, we also utilize mclapply to parallelize loops in R-side pre-processing and post-processing of the forest. This is always available and independent of whether the user chooses to compile the package with the OpenMP option enabled.

  2. It is important NOT to write programs that fork R processes containing OpenMP threads. That is, one should not use mclapply around the functions rfsrc, predict.rfsrc, vimp.rfsrc, var.select.rfsrc, and find.interaction.rfsrc. In such a scenario, correct program execution is not guaranteed.

  3. Note that options(rf.cores=0) disables C-side threading, and options(mc.cores=1) disables R-side forking. Therefore, setting options(rf.cores=0) is one way to wrap mclapply around the functions listed in item 2.
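As a hedged illustration of item 3, the sketch below disables C-side threading before wrapping mclapply around rfsrc. The mtcars data, the ntree values, and the variable names are assumptions chosen for illustration. The forest-growing part runs only if randomForestSRC is installed, and a single worker is used on Windows, where forking is unavailable.

```r
library(parallel)

# Disable C-side OpenMP threading before forking R processes (item 3).
old <- options(rf.cores = 0)

fits <- NULL
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  # Forking is unavailable on Windows, so fall back to one worker there.
  workers <- if (.Platform$OS.type == "windows") 1L else 2L

  # Grow two small regression forests, one per forked worker.
  fits <- mclapply(c(50, 100), function(nt) {
    randomForestSRC::rfsrc(mpg ~ ., data = mtcars, ntree = nt)
  }, mc.cores = workers)
}

# Restore the previous rf.cores setting.
options(old)
```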

Package Overview

This package contains many useful functions and users should read the help file in its entirety for details. However, we briefly mention several key functions that may make it easier to navigate and understand the layout of the package.

  1. rfsrc

    This is the main entry point to the package. It grows a random forest using user-supplied training data. We refer to the resulting object as an RF-SRC grow object. Formally, the resulting object has class (rfsrc, grow).

  2. predict.rfsrc (predict)

    Used for prediction. Predicted values are obtained by dropping the user-supplied test data down the grow forest. The resulting object has class (rfsrc, predict).

  3. max.subtree, var.select

    Used for variable selection. The function max.subtree extracts maximal subtree information from an RF-SRC object, which is then used for minimal depth variable selection. The function var.select provides an extensive set of variable selection options and is a wrapper to max.subtree.

  4. impute.rfsrc

    Fast imputation mode for RF-SRC. Both rfsrc and predict.rfsrc are capable of imputing missing data. However, for users whose only interest is imputing data, this function provides an efficient and fast interface for doing so.
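The functions listed above can be combined into a short workflow. The following is only a sketch: the mtcars and airquality data sets, the train/test split, and the ntree values are assumptions chosen for illustration, and the code runs only if randomForestSRC is installed.

```r
results <- NULL
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  library(randomForestSRC)

  # Illustrative train/test split of a built-in data set.
  set.seed(1)
  train_idx <- sample(nrow(mtcars), 24)
  train <- mtcars[train_idx, ]
  test <- mtcars[-train_idx, ]

  # 1. Grow a regression forest; the result is an RF-SRC grow object.
  fit <- rfsrc(mpg ~ ., data = train, ntree = 100)

  # 2. Drop the test data down the grown forest.
  pred <- predict(fit, newdata = test)

  # 3. Minimal depth variable selection via the var.select wrapper.
  vs <- var.select(object = fit, verbose = FALSE)

  # 4. Impute missing values when imputation is the only goal.
  imputed <- impute.rfsrc(Ozone ~ ., data = airquality, ntree = 100)

  results <- list(fit = fit, pred = pred, vs = vs, imputed = imputed)
}
```

Inspecting class(fit) and class(pred) shows the grow and predict classes described in items 1 and 2.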

Author(s)

Hemant Ishwaran and Udaya B. Kogalur

References

Breiman L. (2001). Random forests. Machine Learning, 45:5-32.

Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R. R News, 7(2):25-31.

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests. Ann. Appl. Statist., 2:841-860.

Ishwaran H., Kogalur U.B., Gorodeski E.Z., Minn A.J. and Lauer M.S. (2010). High-dimensional variable selection for survival data. J. Amer. Statist. Assoc., 105:205-217.

Ishwaran H., Gerds T.A., Kogalur U.B., Moore R.D., Gange S.J. and Lau B.M. (2014). Random survival forests for competing risks. Biostatistics, 15(4):757-773.

Ishwaran H. (2015). The effect of splitting on random forests. Machine Learning, 99:75-118.

See Also

find.interaction, impute.rfsrc, max.subtree, plot.competing.risk, plot.rfsrc, plot.survival, plot.variable, predict.rfsrc, print.rfsrc, rf2rfz, rfsrcSyn, stat.split, var.select, vimp


ehrlinger/randomForestSRC documentation built on May 16, 2019, 1:20 a.m.