Cross-Validated Survival Bump Hunting
Description
Main end-user function for fitting a cross-validated Survival Bump Hunting (SBH) model.
Returns a cross-validated PRSP object, as generated by our Patient Recursive Survival Peeling (PRSP) algorithm,
containing cross-validated estimates of the end-point statistics of interest.
Usage
sbh(dataset,
B = 10, K = 5, A = 1000,
vs = TRUE, cpv = FALSE, decimals = 2,
cvtype = c("combined", "averaged", "none", NULL),
cvcriterion = c("lrt", "cer", "lhr", NULL),
arg = "beta=0.05,alpha=0.05,minn=5,L=NULL,peelcriterion=\"lr\"",
probval = NULL, timeval = NULL,
parallel = FALSE, conf = NULL, seed = NULL)

Arguments
dataset 

B 
Positive 
K 

A 
Positive 
vs 

cpv 

decimals 

cvtype 

cvcriterion 

arg 
Note that the parameters in 
probval 

timeval 

parallel 

conf 

seed 
Positive 
Details
At this point, the main function sbh performs the search of the first box of the recursive coverage (outer) loop of our
Patient Recursive Survival Peeling (PRSP) algorithm. It relies on an optional variable pre-selection procedure that is run before
the PRSP algorithm. Currently, this is done by Elastic-Net (EN) penalization of the partial likelihood, where both the mixing (alpha)
and overall shrinkage (lambda) parameters are simultaneously estimated by cross-validation using
the glmnet::cv.glmnet function of the R package glmnet.
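The pre-selection idea can be sketched as follows. This is only an illustration under stated assumptions, not the code that sbh runs internally: the simulated data, the alpha grid, the shared fold assignment and all object names are hypothetical, and the joint tuning of alpha and lambda is approximated here by a simple grid over alpha with cv.glmnet choosing lambda for each value.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 5

# Hypothetical survival data: only the first covariate affects the hazard
x      <- matrix(rnorm(n * p), n, p)
times  <- rexp(n, rate = exp(0.8 * x[, 1]))
status <- rbinom(n, 1, 0.7)
y      <- cbind(time = times, status = status)

# Share the same folds across alpha values so the CV errors are comparable
foldid <- sample(rep(1:5, length.out = n))
alphas <- c(0.25, 0.5, 0.75, 1)

# One cross-validated Cox elastic-net path per alpha (assumed grid)
fits <- lapply(alphas, function(a) {
  cv.glmnet(x, y, family = "cox", alpha = a, foldid = foldid)
})

# Keep the (alpha, lambda) pair with the smallest cross-validated deviance
cv_min      <- sapply(fits, function(f) min(f$cvm))
best        <- which.min(cv_min)
best_alpha  <- alphas[best]
best_lambda <- fits[[best]]$lambda.min

# Covariates with non-zero coefficients are the "pre-selected" ones
beta     <- as.matrix(coef(fits[[best]], s = "lambda.min"))[, 1]
selected <- which(beta != 0)
```

Reusing one foldid vector across the alpha grid is a common way to make the comparison between mixing parameters fair; whether sbh does the same internally is not stated here.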
The returned S3-class PRSP object contains cross-validated estimates of all the decision rules of pre-selected covariates and
all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm).
This enables the graphical display of profiling curves for model tuning, peeling trajectories, covariate traces and
survival distributions (see plotting functions for more details).
The function offers a number of options: the number of cross-validation replicates to be performed (B); the type of cross-validation desired, K-fold (replicated) averaged or combined; the peeling and optimization criteria chosen for model tuning; and a few more parameters for the PRSP algorithm.
In case replicated cross-validations are performed, a "summary" of the outputs is done over the B replicates, which requires some explanation:
Even though the PRSP algorithm uses only one covariate at a time at each peeling step, the reported matrix of "Replicated CV" box decision rules may show several covariates being used in a given step, simply because these decision rules are averaged over the B replicates (see equation #21 in Dazard et al. 2015). This is also reflected in the reported "Replicated CV" importance and usage plots of covariate traces.
Likewise, the output matrix of "Replicated CV" box membership indicator does not necessarily match exactly the output vector of "Replicated CV" box support (and corresponding box sample size) for all peeling steps. The reason is that the reported "Replicated CV" box membership indicators are computed (at each peeling step) as the pointwise majority vote over the B replicates (see equation #22 in Dazard et al. 2015), whereas the "Replicated CV" box support vector (and corresponding box sample size) is averaged (at each peeling step) over the B replicates.
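The two effects described above can be reproduced on toy numbers. The sketch below is illustrative only: the values, the object names and the B = 3 setting are hypothetical, and the code is not taken from the package nor a transcription of equations #21 and #22.

```r
# Toy inputs for ONE peeling step with B = 3 replicates (hypothetical values)

# (1) Decision rules (lower bounds): rows = replicates, columns = covariates.
#     Each replicate peeled a single covariate; the others stay at 0.
rules <- rbind(
  rep1 = c(x1 = 0.20, x2 = 0.00, x3 = 0),
  rep2 = c(x1 = 0.00, x2 = 0.30, x3 = 0),
  rep3 = c(x1 = 0.25, x2 = 0.00, x3 = 0)
)
avg_rule <- colMeans(rules)              # averaged over the B replicates
used     <- names(avg_rule)[avg_rule != 0]
used                                     # "x1" "x2": two covariates appear in use

# (2) Box membership: rows = replicates, columns = observations
membership <- rbind(
  c(TRUE,  TRUE,  FALSE, TRUE,  FALSE),
  c(TRUE,  FALSE, FALSE, TRUE,  FALSE),
  c(TRUE,  TRUE,  TRUE,  FALSE, FALSE)
)

# Pointwise majority vote over the replicates
voted         <- colMeans(membership) > 0.5
support_voted <- mean(voted)                  # 0.6

# Box support computed per replicate, then averaged over the replicates
support_avg   <- mean(rowMeans(membership))   # 0.5333333
```

The voted membership implies a support of 3/5, while the averaged support is 8/15, which is why the two reported quantities need not agree at a given step.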
The function takes advantage of the R package parallel, which allows users to create a cluster of workstations on a local and/or remote machine(s), enabling scaling up with the number of CPU cores specified and efficient parallel execution.
If the computation of permutation p-values is desired, then running with the parallelization option is strongly advised as it may take a while. In the case of large (p > n) or very large (p >> n) datasets, it is also required to use the parallelization option.
To run a parallel session (and parallel RNG) of the PRIMsrc procedures (parallel=TRUE), argument conf
must be specified (i.e. non-NULL). It must list the specifications of the following parameters for cluster configuration:
"names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster
of the R package parallel. All fields are required to properly configure the cluster, except for "names" and "cpus",
which are used alternatively: "names" in the case of a cluster of type "SOCK" (socket), and "cpus" in the case of a cluster
of any other type. See examples below.
"names": names : character vector specifying the host names on which to run the job. Could default to a unique local machine, in which case one may use the unique host name "localhost". Each host name can potentially be repeated up to the number of CPU cores available on the corresponding machine.
"cpus": spec : integer scalar specifying the total number of CPU cores to be used across the network of available nodes, counting the worker nodes and the master node.
"type": type : character vector specifying the cluster type ("SOCK", "PVM", "MPI").
"homo": homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.
"verbose": verbose : logical scalar to be set to FALSE for quiet mode.
"outfile": outfile : character vector of the output log file name for the worker nodes.
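As a rough aid, the field requirements listed above can be checked before calling sbh. The helper below, check_conf, is hypothetical (it is not part of PRIMsrc) and encodes one reading of the rules: four fields are always needed, plus "names" for a "SOCK" cluster or "cpus" for any other type.

```r
# Hypothetical helper, not a PRIMsrc function: validates a 'conf' list
check_conf <- function(conf) {
  stopifnot(is.list(conf))
  always  <- c("type", "homo", "verbose", "outfile")
  # "names" is needed for SOCK clusters, "cpus" for every other type
  special <- if (identical(conf$type, "SOCK")) "names" else "cpus"
  missing <- setdiff(c(always, special), names(conf))
  if (length(missing) > 0) {
    stop("conf is missing field(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# A single local machine with two socket workers (assumed setup)
conf_local <- list("names"   = rep("localhost", 2),
                   "cpus"    = 2,
                   "type"    = "SOCK",
                   "homo"    = TRUE,
                   "verbose" = FALSE,
                   "outfile" = "")
ok <- check_conf(conf_local)

# An MPI configuration without "cpus" is rejected
bad <- try(check_conf(list("type" = "MPI", "homo" = TRUE,
                           "verbose" = FALSE, "outfile" = "")),
           silent = TRUE)
```

Note that supplying "cpus" even for a "SOCK" cluster, as the examples on this page do, is harmless under this check and keeps conf$cpus available.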
Note that argument B is internally reset to conf$cpus * ceiling(B / conf$cpus) in case
parallelization is used (i.e. conf is non-NULL), where conf$cpus denotes the total number of CPUs to be
used (see above). The argument A is similarly reset.
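The reset rule rounds the requested count up to the next multiple of the number of CPUs. A minimal sketch, where the helper name reset_to_cpus and the choice of 8 CPUs are purely illustrative:

```r
# Hypothetical helper mirroring the formula cpus * ceiling(n / cpus)
reset_to_cpus <- function(n, cpus) cpus * ceiling(n / cpus)

B_reset  <- reset_to_cpus(10, 8)     # 16: 10 replicates rounded up to 2 per CPU
A_reset  <- reset_to_cpus(1000, 8)   # 1000: already a multiple of 8
A_reset2 <- reset_to_cpus(1024, 8)   # 1024: already a multiple of 8
```

Choosing B and A as multiples of the CPU count therefore leaves them unchanged.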
The actual creation of the cluster, its initialization, and closing are all done internally.
In addition, when random number generation is needed, the creation of separate streams of parallel RNG per node
is done internally by distributing the stream states to the nodes (for more details, see function makeCluster
(R package parallel) and/or http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html).
The use of a seed makes the results reproducible within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and a parallelized session, because the seed is managed differently in the two (see parallel RNG and value of returned seed below).
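The following base-R sketch illustrates why this happens in general. It assumes, for illustration only, a serial generator of kind "Mersenne-Twister" and parallel streams of kind "L'Ecuyer-CMRG" derived with parallel::nextRNGStream; the exact mechanism used inside PRIMsrc is not shown here, and the helper draw_from is hypothetical.

```r
library(parallel)

old_kind <- RNGkind()[1]   # remember the current generator

# Serial session: the same seed gives the same draws
RNGkind("Mersenne-Twister")
set.seed(123); serial_1 <- runif(3)
set.seed(123); serial_2 <- runif(3)

# Parallel-style session: the seed initializes a master stream,
# and each worker draws from its own derived stream
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
master_stream <- .Random.seed
worker_stream <- nextRNGStream(master_stream)

# Hypothetical helper: draw three uniforms starting from a given stream state
draw_from <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  runif(3)
}
worker_1 <- draw_from(worker_stream)
worker_2 <- draw_from(worker_stream)

RNGkind(old_kind)          # restore the original generator

same_serial   <- identical(serial_1, serial_2)   # reproducible within serial
same_parallel <- identical(worker_1, worker_2)   # reproducible within parallel
same_across   <- identical(serial_1, worker_1)   # not expected to match
```

Both session types are reproducible on their own, but the numbers drawn differ between them, so downstream results agree only up to sampling variability.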
Value
Object of class PRSP (Patient Recursive Survival Peeling): list containing the following 20 fields:
x 

times 

status 

B 
positive 
K 
positive 
A 
positive 
vs 

cpv 

decimals 

cvtype 

cvcriterion 

arg 

probval 

timeval 

cvfit 

cvprofiles 

cvmeanprofiles 

plot 

config 

seed 
User seed(s) used:

Note
Unique end-user function for fitting the Survival Bump Hunting model.
Author(s)
"Jean-Eudes Dazard, Ph.D." jxd101@case.edu
"Michael Choe, M.D." mjc206@case.edu
"Michael LeBlanc, Ph.D." mleblanc@fhcrc.org
"Alberto Santana, MBA." ahs4@case.edu
Maintainer: "Jean-Eudes Dazard, Ph.D." jxd101@case.edu
Acknowledgments: This project was partially funded by the National Institutes of Health NIH - National Cancer Institute (R01-CA160593) to J-E. Dazard and J.S. Rao.
References
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods." Statistical Analysis and Data Mining (in press).
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014). "Cross-Validation of Survival Bump Hunting by Recursive Peeling Methods." In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA. American Statistical Association IMS - JSM, p. 3366-3380.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification." In JSM Proceedings, Statistical Programmers and Analysts Section. Seattle, WA, USA. American Statistical Association IMS - JSM, (in press).
Dazard J-E. and J.S. Rao (2010). "Local Sparse Bump Hunting." J. Comp Graph. Statistics, 19(4):900-92.
See Also

makeCluster (R package parallel)
cv.glmnet (R package glmnet)
glmnet (R package glmnet)
Examples
#===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")
#===================================================
# Package news
# Package citation
#===================================================
PRIMsrc.news()
citation("PRIMsrc")
#===================================================
# Demo with a synthetic dataset
# Use help for descriptions
#===================================================
data("Synthetic.1", package="PRIMsrc")
?Synthetic.1
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Non-Replicated Combined Cross-Validation (RCCV)
# Peeling criterion = LRT
# Optimization criterion = LRT
# Without parallelization
# Without computation of permutation p-values
#===================================================
CVCOMB.synt1 <- sbh(dataset = Synthetic.1,
cvtype = "combined", cvcriterion = "lrt",
B = 1, K = 5,
vs = TRUE, cpv = FALSE,
decimals = 2, probval = 0.5,
arg = "beta=0.05,
alpha=0.05,
minn=5,
L=NULL,
peelcriterion=\"lr\"",
parallel = FALSE, conf = NULL, seed = 123)
## Not run:
#===================================================
# Examples of parallel backend parametrization
#===================================================
# Example #1 - 1 Quad (4-core double-threaded) PC
# Running WINDOWS
# With SOCKET communication
#===================================================
if (.Platform$OS.type == "windows") {
    # Number of logical cores on the local machine
    cpus <- parallel::detectCores()
    conf <- list("names" = rep("localhost", cpus),
                 "cpus" = cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Example #2 - 1 master node + 3 worker nodes cluster
# All nodes equipped with identical setups and multi-cores
# Running LINUX
# With SOCKET communication
#===================================================
if (.Platform$OS.type == "unix") {
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    # Each host name is repeated once per CPU core used on that host
    conf <- list("names" = c(rep(masterhost, cpus),
                             rep(slavehosts, cpus)),
                 "cpus" = nodes * cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Example #3 - Multi-node, multi-core per node cluster
# Running LINUX
# with MPI communication
# Here, a file named ".nodes" (e.g. in the home directory)
# contains the list of nodes of the cluster
#===================================================
if (.Platform$OS.type == "unix") {
    # One line per available CPU slot, so host names may repeat
    hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""),
                  what="",
                  sep="\n")
    hostnames <- unique(hosts)
    nodes <- length(hostnames)
    cpus <- length(hosts)/length(hostnames)
    conf <- list("cpus" = nodes * cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Replicated Combined Cross-Validation (RCCV)
# Peeling criterion = LRT
# Optimization criterion = LRT
# With parallelization
# With computation of permutation p-values
#===================================================
CVCOMBREP.synt1 <- sbh(dataset = Synthetic.1,
cvtype = "combined", cvcriterion = "lrt",
B = 10, K = 5, A = 1024,
vs = TRUE, cpv = TRUE,
decimals = 2, probval = 0.5,
arg = "beta=0.05,
alpha=0.05,
minn=5,
L=NULL,
peelcriterion=\"lr\"",
parallel = TRUE, conf = conf, seed = 123)
## End(Not run)
