Description Usage Arguments Details Value Acknowledgments Note Author(s) References See Also Examples
Main enduser function for fitting a crossvalidated Survival Bump Hunting (SBH) model.
Returns a crossvalidated sbh
object, as generated by our
Patient Recursive Survival Peeling (PRSP) algorithm, containing crossvalidated
estimates of endpoints statistics of interest.
Generates an object of class sbh
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  sbh(X,
y,
delta,
B = 30,
K = 5,
A = 1000,
vs = TRUE,
vstype = "ppl",
vsarg = "alpha=1,
nalpha=1,
nlambda=100,
vscons=0.5",
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.01,
beta=0.05,
minn=5,
L=NULL,
peelcriterion=\"lrt\",
cvcriterion=\"cer\"",
pv = FALSE,
decimals = 2,
onese = FALSE,
probval = NULL,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = FALSE,
parallel.pv = FALSE,
conf = NULL,
verbose = TRUE,
seed = NULL)

X 
(n x p) 
y 
n 
delta 
n 
B 
Postitive 
K 
Postitive

A 
Positive 
vs 

vstype 

vsarg 
PCQR:
PPL:
SPCA:

cv 

cvtype 

cvarg 

pv 

decimals 
Positive 
onese 

probval 

timeval 

parallel.vs 

parallel.rep 

parallel.pv 

conf 

verbose 

seed 
Positive 
At this point, the main function sbh
relies on an optional variable screening (preselection) procedure that is run
before the variable usage (selection) procedure is done by our PRSP algorithm. User can choose between four possible procedures:
Patient Recursive Survival Peeling (PRSP) (by univariate screening of our algorithm)
Penalized Censored Quantile Regression (PCQR) (by Semismooth Newton Coordinate Descent fiting algorithm adapted from package hqreg)
Penalized Partial Likelihood (PPL) (by Elasticnet Regularization adapted from package glmnet)
Supervised Principal Component Analysis (SPCA) (by Supervised Principal Component adapted from package superpc)
The recommendation is to use PPL or SPCA because of computational efficiency. Variable screening (preselection) is done
by computing occurrence frequencies of topranking variables over the crossvalidation folds and replicates.
The conservativeness of the procedure is controled by the argument vscons
. Example of calls for preselection are as follows:
'1.0' represents a presence in all the folds (unanimity vote)
'0.5' represents a presence in at least half of the folds (majority vote)
'1/K
' represents a presence in at least one of the folds (minority vote)
Although any value in the interval [1/K
,1] is accepted, we recommand using the interval [1/K
,1/2] to avoid excessive
conservativeness. Final variable usage (selection) is done after running our PRSP algorithm on previously screened variables
by collecting those variables that have the maximum occurrence frequency in each peeling step over crossvalidation folds and replicates.
In the PRSP variable selection procedure (vsarg
). Setting S
to a value within the allowable range [1,floor
(p)]
will override the crossvalidation procedure that determines the optimal size (cardinal) of subset of topscreened variables,
but will also reduce computational burden. Setting S
to NULL
will automaticaly generate a vector of values within
the restricted range [1,floor
(p/5)] that will be used to determine the optimal size (cardinal) of subset of topscreened variables
by crossvalidation. Setting L
to a single value within the allowable range [1,ceiling
(log
(1/n) / log
(1  (1/n)))]
will override the internal peeling stopping rule of the PRSP algorithm and arbitrarily fix the maximal number of peeling steps to that value.
Note that the upper bound of the allowable range is further restricted in case of crossvalidation and when any of the metaparameters
{alpha
, beta
} takes on the smallest possible fraction of the training data i.e. \frac{1}{n^t},
where n^t is the training sample size:
ceiling
(log
(beta
) / log
(1  alpha
)) : alpha
and beta
fixed by user
ceiling
(log
(1/n^t) / log
(1  alpha
)) : alpha
fixed by user and beta
fixed by data
ceiling
(log
(beta
) / log
(1  (1/n^t))) : alpha
fixed by data and beta
fixed by user
ceiling
(log
(1/n^t) / log
(1  (1/n^t))) : alpha
and beta
fixed by data
Setting L
to NULL
will let the maximal number of peeling steps be internally determined by the peeling stopping rule of
the PRSP algorithm.
In the PRSP algorithm itself (cvarg
), setting L
to a nonNULL
value (within the allowable range)
will override the crossvalidation procedure, but will also reduce computational burden. Setting L
to NULL
will let
the optimal peeling sequence length be determined by crossvalidation.
If a crossvalidation is requested, the function performs a supervised (stratified) random splitting of the data based on the outcome,
which is in that case the delta
argument. This is because it is desireable that the data splitting balances the class
distributions of the outcome (events) within the crossvalidation splits. For each screening method and for building (by PRSP algorithm)
the final Survival Bump Hunting (SBH) model, all model tuning parameters are simultaneously estimated by crossvalidation.
The function offers a number of options for the crossvalidation to be perfomed: the number of replications B
; the type of technique;
the peeling criterion; and the optimization criterion.
The returned S3class sbh
object contains crossvalidated estimates of all the decisionrules of used (selected) covariates and
all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm).
This enables the graphical display of results of profiling curves for model tuning, peeling trajectories, covariate traces and
survival distributions (see plotting functions for more details).
In case replicated crossvalidations are performed, a "summary report" of the outputs is done over the B
replicates as follows:
Even thought the PRSP algorithm uses only one covariate at a time at each peeling step, the reported matrix of
"Replicated CV" box decision rules may show more than one covariate being used in a given step, because these decision rules
are averaged over the B
replicates (see equation #21 in Dazard et al. 2016).
However, the reported "Replicated CV" trace values are computed (at each peeling step) as a single modal trace value of
covariate usage over the B
replicates. This is also reflected in the reported "Replicated CV" importance and usage plots
of covariate traces.
The reported "Replicated CV" box membership indicators are computed (at each peeling step) as the pointwise majority vote over
the B
replicates (righthand side of equation #22 in Dazard et al. 2016).
The reported "Replicated CV" box support vector and corresponding box sample size are computed (at each peeling step) based on the above "Replicated CV" box membership indicators (i.e. not as equation #23 in Dazard et al. 2016).
If the computation of logrank pvalues is desired, then running with the parallelization option is strongly advised as it may take a while. In case of large (p > n) or very large (p >> n) datasets, it is also highly recommended to use the parallelization option.
The function sbh
relies on the R package parallel to create a parallel backend within an R session, enabling access to a cluster
of compute cores and/or nodes on a local and/or remote machine(s) and scalingup with the number of CPU cores available and efficient parallel
execution. To run a procedure in parallel (with parallel RNG), argument parallel
is to be set to TRUE
and argument conf
is to be specified (i.e. non NULL
). Argument conf
uses the options described in function makeCluster
of the R packages
parallel and snow. PRIMsrc supports two types of communication mechanisms between master and worker processes:
'Socket' or 'MessagePassing Interface' ('MPI'). In PRIMsrc, parallel 'Socket' clusters use sockets communication mechanisms only
(no forking) and are therefore available on all platforms, including Windows, while parallel 'MPI' clusters use highspeed interconnects
mechanism in networks of computers (with distributed memory) and are therefore available only in these architectures. A parallel 'MPI'
cluster also requires R package Rmpi to be installed. Value type
is used to setup a cluster of type 'Socket' ("SOCKET")
or 'MPI' ("MPI"), respectively. Depending on this type, values of spec
are to be used alternatively:
For 'Socket' clusters (conf$type="SOCKET"
), spec
should be a character
vector
naming the hosts on which
to run the job; it can default to a unique local machine, in which case, one may use the unique host name "localhost".
Each host name can potentially be repeated to the number of CPU cores available on the local machine.
It can also be an integer
scalar specifying the number of processes to spawn on the local machine;
or a list of machine specifications if you have ssh installed (a character value named host specifying the name or address of the host to use).
For 'MPI' clusters (conf$type="MPI"
), spec
should be an integer
scalar
specifying the total number of processes to be spawned across the network of available nodes, counting the workernodes and masternode.
The actual creation of the cluster, its initialization, and closing are all done internally. For more details, see the reference manual of R package snow and examples below.
When random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see the vignette of R package parallel. The use of a seed allows to reproduce the results within the same type of session: the same seed will reproduce the same results within a nonparallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a nonparallelized and parallelized session due to the difference of management of the seed between the two (see parallel RNG and value of returned seed below).
Object of class
sbh
(Patient Recursive Survival Peeling)
list
containing the following 21 fields:
X 

y 

delta 

B 
positive 
K 
positive 
A 
positive 
vs 

vstype 

vsarg 

cv 

cvtype 

cvarg 

pv 

onese 

decimals 

probval 

timeval 

cvprofiles 

cvfit 

success 

seed 
User seed. An 
This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. This project was partially funded by the National Institutes of Health NIH  National Cancer Institute (R01CA160593) to JE. Dazard and J.S. Rao.
Unique enduser function for fitting the Survival Bump Hunting model.
"JeanEudes Dazard, Ph.D." jeaneudes.dazard@case.edu
"Michael Choe, M.D." mjc206@case.edu
"Michael LeBlanc, Ph.D." mleblanc@fhcrc.org
"Alberto Santana, MBA." ahs4@case.edu
Maintainer: "JeanEudes Dazard, Ph.D." jeaneudes.dazard@case.edu
Dazard JE. and Rao J.S. (2017). "Variable Selection Strategies for HighDimensional Survival Bump Hunting using Recursive Peeling Methods." (in prep).
DiazPachon D.A., Dazard JE. and Rao J.S. (2017). "Unsupervised Bump Hunting Using Principal Components." In: Ahmed SE, editor. Big and Complex Data Analysis: Methodologies and Applications. Contributions to Statistics, vol. Edited Refereed Volume. Springer International Publishing, Cham Switzerland, p. 325345.
Yi C. and Huang J. (2016). "Semismooth Newton Coordinate Descent Algorithm for ElasticNet Penalized Huber Loss Regression and Quantile Regression." J. Comp Graph. Statistics, DOI: 10.1080/10618600.2016.1256816.
Dazard JE., Choe M., LeBlanc M. and Rao J.S. (2016). "Crossvalidation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods." Statistical Analysis and Data Mining, 9(1):1242.
Dazard JE., Choe M., LeBlanc M. and Rao J.S. (2015). "R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification." In JSM Proceedings, Statistical Programmers and Analysts Section. Seattle, WA, USA. American Statistical Association IMS  JSM, p. 650664.
Dazard JE., Choe M., LeBlanc M. and Rao J.S. (2014). "CrossValidation of Survival Bump Hunting by Recursive Peeling Methods." In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA. American Statistical Association IMS  JSM, p. 33663380.
Dazard JE. and J.S. Rao (2010). "Local Sparse Bump Hunting." J. Comp Graph. Statistics, 19(4):90092.
makeCluster
(R package parallel)
glmnet
, cv.glmnet
(R package glmnet)
hqreg
, cv.hqreg
(R package hqreg)
superpc.cv
(R package superpc)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207  #===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")
## Not run:
#===================================================
# PRIMsrc Package news
#===================================================
PRIMsrc.news()
#===================================================
# PRIMsrc Package citation
#===================================================
citation("PRIMsrc")
#===================================================
# Demo with a synthetic dataset
# Use help for descriptions
#===================================================
data("Synthetic.1", package="PRIMsrc")
?Synthetic.1
## End(Not run)
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Peeling criterion = LRT
# CrossValidation criterion = LRT
# With Combined CrossValidation (RCCV)
# Without Replications (B = 1)
# Without variable screening (preselection)
# Without computation of logrank \eqn{p}values
# Without parallelization
#===================================================
synt1 < sbh(X = Synthetic.1[ , c(1,2), drop=FALSE],
y = Synthetic.1[ ,1, drop=TRUE],
delta = Synthetic.1[ ,2, drop=TRUE],
B = 1,
K = 3,
vs = FALSE,
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.10,
beta=0.05,
minn=5,
L=NULL,
peelcriterion=\"lrt\",
cvcriterion=\"lrt\"",
pv = FALSE,
decimals = 2,
onese = FALSE,
probval = 0.5,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = FALSE,
parallel.pv = FALSE,
conf = NULL,
verbose = FALSE,
seed = 123)
summary(object = synt1)
print(x = synt1)
n < 100
p < length(synt1$cvfit$cv.used)
x < matrix(data = runif(n = n*p, min = 0, max = 1),
nrow = n, ncol = p, byrow = FALSE,
dimnames=list(1:n, paste("X", 1:p, sep="")))
synt1.pred < predict(object = synt1,
newdata = x,
steps = synt1$cvfit$cv.nsteps)
plot(x = synt1,
main = paste("Scatter plot for model #1", sep=""),
proj = c(1,2), splom = TRUE, boxes = TRUE,
steps = synt1$cvfit$cv.nsteps,
pch = 16, cex = 0.5, col = 2,
col.box = 2, lty.box = 2, lwd.box = 1,
add.legend = TRUE, device = NULL)
plot_profile(object = synt1,
main = "Crossvalidated tuning profiles for model #1",
pch=20, col=1, lty=1, lwd=0.5, cex=0.5,
add.sd = TRUE, add.legend = TRUE, add.profiles = TRUE,
device = NULL, file = "Profile Plot", path=getwd(),
horizontal = FALSE, width = 8.5, height = 5.0)
plot_boxtraj(object = synt1,
main = paste("Crossvalidated peeling trajectories for model #1", sep=""),
col=1, lty=1, lwd=0.5, cex=0.5,
toplot = synt1$cvfit$cv.used,
device = NULL, file = "Trajectory Plots", path=getwd(),
horizontal = FALSE, width = 8.5, height = 8.5)
plot_boxtrace(object = synt1,
main = paste("Crossvalidated trace plots for model #1", sep=""),
xlab = "Box Mass", ylab = "Covariate Range (centered)",
col=1, lty=1, lwd=0.5, cex=0.5,
toplot = synt1$cvfit$cv.used,
center = TRUE, scale = FALSE,
device = NULL, file = "Covariate Trace Plots", path=getwd(),
horizontal = FALSE, width = 8.5, height = 8.5)
plot_boxkm(object = synt1,
main = paste("Crossvalidated probability curves for model #1", sep=""),
xlab = "Time", ylab = "Probability",
col=2, lty=1, lwd=0.5, cex=0.5,
device = NULL, file = "Survival Plots", path=getwd(),
horizontal = TRUE, width = 11.5, height = 8.5)
## Not run:
#===================================================
# Examples of parallel backend parametrization
#===================================================
if (require("parallel")) {
print("'parallel' is attached correctly \n")
} else {
stop("'parallel' must be attached first \n")
}
#===================================================
# Example #1  Quad core PC
# Running WINDOWS with SOCKET communication
#===================================================
cpus < parallel::detectCores(logical = TRUE)
conf < list("spec" = rep("localhost", cpus),
"type" = "SOCKET",
"homo" = TRUE,
"verbose" = TRUE,
"outfile" = "")
#===================================================
# Example #2  Master node + 3 Worker nodes cluster
# Running LINUX with SOCKET communication
# All nodes equipped with identical setups of
# multicores (8 core CPUs per machine for a total of 32)
#===================================================
masterhost < Sys.getenv("HOSTNAME")
slavehosts < c("compute00", "compute01", "compute02")
nodes < length(slavehosts) + 1
cpus < 8
conf < list("spec" = c(rep(masterhost, cpus),
rep(slavehosts, cpus)),
"type" = "SOCKET",
"homo" = TRUE,
"verbose" = TRUE,
"outfile" = "")
#===================================================
# Example #3  Multinode of multicore per node cluster
# Running LINUX with SLURM scheduler and MPI communication
# Below, variable 'cpus' is the total number
# of requested core CPUs, which is specified from
# within a SLURM script.
#===================================================
if (require("Rmpi")) {
print("'Rmpi' is attached correctly \n")
} else {
stop("'Rmpi' must be attached first \n")
}
cpus < as.numeric(Sys.getenv("SLURM_NTASKS"))
conf < list("spec" = cpus,
"type" = "MPI",
"homo" = TRUE,
"verbose" = TRUE,
"outfile" = "")
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Peeling criterion = LRT
# CrossValidation criterion = LRT
# With Combined CrossValidation (RCCV)
# With Replications (B = 30)
# With variable screening (preselection) (PPL)
# With computation of logrank \eqn{p}values
# With parallelization
#===================================================
synt1 < sbh(X = Synthetic.1[ , c(1,2), drop=FALSE],
y = Synthetic.1[ ,1, drop=TRUE],
delta = Synthetic.1[ ,2, drop=TRUE],
B = 30,
K = 5,
A = 1000,
vs = TRUE,
vstype = "ppl",
vsarg = "alpha=1,
nalpha=1,
nlambda=100,
vscons=0.5",
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.01,
beta=0.05,
minn=5,
L=NULL,
peelcriterion=\"lrt\",
cvcriterion=\"lrt\"",
pv = TRUE,
decimals = 2,
onese = FALSE,
probval = 0.5,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = TRUE,
parallel.pv = TRUE,
conf = conf,
verbose = TRUE,
seed = 123)
## End(Not run)

Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.