rsf.main: Univariate Minimal Depth of a Maximal Subtree (MDMS)


View source: R/IRSF.r

Description

Ranks the main effects of individual and noise variables by univariate Minimal Depth of a Maximal Subtree (MDMS).

Usage

    rsf.main(X,
             ntree = 1000,
             method = "mdms",
             splitrule = "logrank",
             importance = "random",
             B,
             ci = 90,
             parallel = FALSE,
             conf = NULL,
             verbose = TRUE,
             seed = NULL)

Arguments

X

data.frame or numeric matrix of input covariates. Dataset X assumes that:

  • all variables are in columns

  • the observed times to event and censoring variables are in the first two columns: "stime", a numeric vector of observed times, and "status", a numeric vector of the observed status (censoring) indicator variable

  • each variable has a unique name, excluding the word "noise"

ntree

Number of trees in the forest. Defaults to 1000.

method

Method for ranking of individual and noise variables. character string "mdms" (default) that stands for Univariate Minimal Depth of a Maximal Subtree (MDMS).

splitrule

Splitting rule used to grow trees. For time-to-event analysis, use "logrank" (default), which implements log-rank splitting (Segal, 1988; Leblanc and Crowley, 1993).

importance

Method for computing variable importance. Character string; defaults to "random". See details below.

B

Positive integer of the number of replications of the cross-validation procedure.

ci

Confidence level (in percent) of the confidence intervals used for inferences on individual and noise variables. Numeric scalar between 50 and 100. Defaults to 90.

parallel

logical. Is parallel computing to be performed? Defaults to FALSE.

conf

list of 5 fields containing the parameters values needed for creating the parallel backend (cluster configuration). See details below for usage. Optional, defaults to NULL, but all fields are required if used:

  • type : character vector specifying the cluster type ("SOCKET", "MPI").

  • spec : A specification (character vector or integer scalar) appropriate to the type of cluster.

  • homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.

  • verbose : logical scalar to be set to FALSE for quiet mode.

  • outfile : character vector of an output log file name to direct the stdout and stderr connection output from the worker nodes. "" indicates no redirection.

verbose

logical scalar. Is the output to be verbose? Optional, defaults to TRUE.

seed

Positive integer scalar of the user seed to reproduce the results. Defaults to NULL.
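As a rough illustration of how B and ci could interact, the sketch below forms a two-sided percentile interval at level ci from B replicated values. This is an assumption made for illustration only: this page does not state how rsf.main builds its intervals, and the vector reps is a hypothetical stand-in for a statistic recomputed over the B replications.

```r
# Illustrative only: a percentile interval at level 'ci' from 'B' replicates.
# 'reps' is simulated here; it is NOT output from rsf.main.
set.seed(42)
B  <- 1000
ci <- 90
reps <- rnorm(B, mean = 3, sd = 1)      # hypothetical replicated statistic

alpha  <- 1 - ci / 100                  # total tail probability
bounds <- quantile(reps, probs = c(alpha / 2, 1 - alpha / 2))
print(bounds)                           # lower and upper interval bounds
```

With ci = 90, the interval spans the 5th to the 95th percentile of the replicated values; a larger ci widens it.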

Details

The option importance allows several ways to calculate Variable Importance (VIMP). In rsf.main, the default is "random". If "permute" is used, Breiman-Cutler permutation VIMP is returned, as described in Breiman (2001): for each tree, the prediction error on the out-of-bag (OOB) data is recorded. Then, for a given variable x, OOB cases are randomly permuted in x and the prediction error is recorded again. The VIMP for x is defined as the difference between the perturbed and unperturbed error rates, averaged over all trees. If "random" is used, then x is not permuted; instead, an OOB case is assigned to a daughter node at random whenever a split on x is encountered in the in-bag tree. If "anti" is used, then an OOB case is assigned to the opposite node whenever a split on x is encountered in the in-bag tree.
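The permutation idea behind "permute" can be sketched in base R. The sketch below is a simplified stand-in, not the forest computation: it uses a linear model instead of trees and a held-out half of the data instead of per-tree OOB cases, and all names (dat, fit, mse, perm_vimp) are hypothetical. It only shows the "perturbed minus unperturbed error" definition at work.

```r
# Simplified illustration of permutation importance (not the RSF algorithm):
# a linear model stands in for the forest, a held-out set for the OOB data.
set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n, sd = 0.5)     # only x1 carries signal

train   <- dat[1:100, ]
heldout <- dat[101:200, ]
fit <- lm(y ~ x1 + x2, data = train)

# Prediction error (mean squared error) on a data set
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)

# Importance of 'var': error after permuting 'var' minus the original error
perm_vimp <- function(var) {
  perturbed <- heldout
  perturbed[[var]] <- sample(perturbed[[var]])
  mse(perturbed) - mse(heldout)
}

vimp <- sapply(c("x1", "x2"), perm_vimp)
print(vimp)
```

Permuting the informative variable x1 degrades the predictions substantially, while permuting the uninformative x2 barely changes the error, so x1 receives the larger importance.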

The function rsf.main relies on the R package parallel to create a parallel backend within an R session. This gives access to a cluster of compute cores and/or nodes on local and/or remote machines, so that execution scales with the number of CPU cores available. To run a procedure in parallel (with parallel RNG), argument parallel is to be set to TRUE and argument conf is to be specified (i.e. non NULL). Argument conf uses the options described in function makeCluster of the R packages parallel and snow.

IRSF supports two types of communication mechanisms between master and worker processes: 'Socket' or 'Message-Passing Interface' ('MPI'). In IRSF, parallel 'Socket' clusters use socket communication mechanisms only (no forking) and are therefore available on all platforms, including Windows. Parallel 'MPI' clusters use high-speed interconnect mechanisms in networks of computers (with distributed memory) and are therefore available only on these architectures. A parallel 'MPI' cluster also requires the R package Rmpi to be installed.

Value type is used to set up a cluster of type 'Socket' ("SOCKET") or 'MPI' ("MPI"), respectively. The value of spec depends on this type: in the examples below, a 'Socket' cluster takes a character vector of host names (one entry per worker process), and an 'MPI' cluster takes the total number of requested CPU cores.

The actual creation of the cluster, its initialization, and closing are all done internally. For more details, see the reference manual of R package snow and examples below.

When random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see the vignette of R package parallel. Using a seed makes the results reproducible within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session. It will not necessarily give exactly the same results (up to sampling variability) between a non-parallelized and a parallelized session, because the seed is managed differently in the two cases (see parallel RNG and value of returned seed below).
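The seed behavior described above can be observed directly with the parallel package. The following is a minimal sketch, assuming two local socket workers; the helper draw_once is hypothetical and only mimics the create, seed, compute, close cycle that rsf.main is said to perform internally.

```r
# Illustration of per-worker RNG streams with the 'parallel' package.
# 'draw_once' is a hypothetical helper, not part of IRSF.
library(parallel)

seed <- 1234567

draw_once <- function(seed, workers = 2) {
  cl <- makeCluster(workers, type = "PSOCK")   # local socket cluster
  on.exit(stopCluster(cl))                     # always close the cluster
  clusterSetRNGStream(cl, iseed = seed)        # one RNG stream per worker
  parSapply(cl, 1:4, function(i) runif(1))
}

par1 <- draw_once(seed)
par2 <- draw_once(seed)
same_parallel <- identical(par1, par2)         # same seed, same session type

set.seed(seed)
seq1 <- sapply(1:4, function(i) runif(1))      # non-parallel session
same_across <- identical(par1, seq1)

print(c(parallel_vs_parallel = same_parallel,
        parallel_vs_sequential = same_across))
```

Two parallel runs with the same seed agree, whereas the sequential run draws from a differently managed generator and is not expected to match the parallel draws.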

Value

data.frame containing the following columns:

Acknowledgments

This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. We are thankful to Ms. Janet Schollenberger, Senior Project Coordinator, CAMACS, as well as Dr. Jeremy J. Martinson, Sudhir Penugonda, Shehnaz K. Hussain, Jay H. Bream, and Priya Duggal, for providing us the data related to the samples analyzed in the present study. Data in this manuscript were collected by the Multicenter AIDS Cohort Study (MACS) at (http://www.statepi.jhsph.edu/macs/macs.html) with centers at Baltimore, Chicago, Los Angeles, Pittsburgh, and the Data Coordinating Center: The Johns Hopkins University Bloomberg School of Public Health. The MACS is funded primarily by the National Institute of Allergy and Infectious Diseases (NIAID), with additional co-funding from the National Cancer Institute (NCI), the National Heart, Lung, and Blood Institute (NHLBI), and the National Institute on Deafness and Communication Disorders (NIDCD). MACS data collection is also supported by Johns Hopkins University CTSA. This study was supported by two grants from the National Institute of Health: NIDCR P01DE019759 (Aaron Weinberg, Peter Zimmerman, Richard J. Jurevic, Mark Chance) and NCI R01CA163739 (Hemant Ishwaran). The work was also partly supported by the National Science Foundation grant DMS 1148991 (Hemant Ishwaran) and the Center for AIDS Research grant P30AI036219 (Mark Chance).

Author(s)

Jean-Eudes Dazard <[email protected]>

Maintainer: Jean-Eudes Dazard <[email protected]>

References

See Also

randomForestSRC

Examples

   ## Not run: 
   #===================================================
   # Loading the library and its dependencies
   #===================================================
   library("IRSF")

   #==========================================================================================#
   # Continuous case:
   # All variables xj, j in {1,...,p}, are iid from a multivariate uniform distribution
   # with parameters a=1, b=5, i.e. on [1, 5].
   # rho = 0.50
   # Regression model: X1 + X5
   #==========================================================================================#
   seed <- 1234567
   set.seed(seed)
   n <- 200
   p <- 5
   x <- matrix(data=runif(n=n*p, min=1, max=5),
               nrow=n, ncol=p, byrow=FALSE,
               dimnames=list(1:n, paste("X", 1:p, sep="")))
   beta <- c(1,0,0,0,1)
   covar <- x
   eta <- covar %*% beta                     # linear predictor: X1 + X5

   seed <- 1234567
   set.seed(seed)
   lambda0 <- 1
   lambda <- lambda0 * exp(eta - mean(eta))  # subject-specific hazard rates
   tt <- rexp(n=n, rate=lambda)              # true (uncensored) event times
   tc <- runif(n=n, min=0, max=1.50)         # censoring times
   stime <- pmin(tt, tc)                     # observed event times
   status <- 1 * (tt <= tc)                  # observed event indicator
   X <- data.frame(stime, status, x)

   #===================================================
   # Examples of parallel backend parametrization
   #===================================================
   if (require("parallel")) {
      cat("'parallel' is attached correctly \n")
   } else {
      stop("'parallel' must be attached first \n")
   }

   #===================================================
   # Example #1 - Quad core PC
   # Running WINDOWS with SOCKET communication
   #===================================================
   cpus <- detectCores(logical = TRUE)
   conf <- list("spec" = rep("localhost", cpus),
                "type" = "SOCKET",
                "homo" = TRUE,
                "verbose" = TRUE,
                "outfile" = "")

   #===================================================
   # Example #2 - Master node + 3 Worker nodes cluster
   # Running LINUX with SOCKET communication
   # All nodes equipped with identical setups of
   # multicores (8 core CPUs per machine for a total of 32)
   #===================================================
   masterhost <- Sys.getenv("HOSTNAME")
   slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
   nodes <- length(slavehosts) + 1
   cpus <- 8
   conf <- list("spec" = c(rep(masterhost, cpus),
                           rep(slavehosts, cpus)),
                "type" = "SOCKET",
                "homo" = TRUE,
                "verbose" = TRUE,
                "outfile" = "")

   #===================================================
   # Example #3 - Multinode of multicore per node cluster
   # Running LINUX with SLURM scheduler and MPI communication
   # Below, variable 'cpus' is the total number
   # of requested core CPUs, which is specified from
   # within a SLURM script.
   #===================================================
   if (require("Rmpi")) {
      cat("'Rmpi' is attached correctly \n")
   } else {
      stop("'Rmpi' must be attached first \n")
   }
   cpus <- as.numeric(Sys.getenv("SLURM_NTASKS"))
   conf <- list("spec" = cpus,
                "type" = "MPI",
                "homo" = TRUE,
                "verbose" = TRUE,
                "outfile" = "")

   main.mdms <- rsf.main(X=X,
                         ntree=1000,
                         method="mdms",
                         splitrule="logrank",
                         importance="random",
                         B=1000,
                         ci=90,
                         parallel=FALSE,
                         conf=NULL,
                         verbose=TRUE,
                         seed=seed)
   
## End(Not run)

jedazard/IRSF documentation built on Oct. 19, 2017, 11:49 p.m.