Description Usage Arguments Details Value Acknowledgments Note Author(s) References See Also Examples
Main end-user function for fitting a Survival Bump Hunting (SBH) model
(or Group Survival Bump Hunting (GSBH)). It returns an object of class
sbh
, as generated by our Patient Recursive Survival Peeling (PRSP)
algorithm (or Patient Recursive Group Survival Peeling (PRGSP)), containing
cross-validated estimates of the target region of the input space with
end-points statistics of interest.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | sbh(X,
y,
delta,
B = 30,
K = 5,
A = 1000,
vs = TRUE,
vstype = "ppl",
vsarg = "alpha=1,
nalpha=1,
nlambda=100",
vscons = 0.5,
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.01,
beta=0.10,
peelcriterion=\"lrt\",
cvcriterion=\"cer\"",
groups = NULL,
pv = FALSE,
decimals = 2,
onese = FALSE,
probval = NULL,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = FALSE,
parallel.pv = FALSE,
conf = NULL,
verbose = TRUE,
seed = NULL)
|
X |
(n x p) |
y |
n- |
delta |
n- |
B |
Postitive |
K |
Postitive
|
A |
Positive |
vs |
|
vstype |
|
vsarg |
PCQR:
PPL:
SPCA:
|
vscons |
|
cv |
|
cvtype |
|
cvarg |
if |
groups |
|
pv |
|
decimals |
Positive |
onese |
|
probval |
|
timeval |
|
parallel.vs |
|
parallel.rep |
|
parallel.pv |
|
conf |
|
verbose |
|
seed |
Positive |
The main function sbh
relies on an optional variable screening (pre-selection) procedure that is run
before the actual variable usage (selection) is done at the time of fitting the Survival Bump Hunting (SBH)
or Group Survival Bump Hunting (GSBH) model using our PRSP or PRGSP algorithm, respectively. At this point,
the user can choose between four possible variable screening (pre-selection) procedures:
Patient Recursive Survival Peeling (PRSP) (by univariate screening of our algorithm)
Penalized Censored Quantile Regression (PCQR) (by Semismooth Newton Coordinate Descent fiting algorithm adapted from package hqreg)
Penalized Partial Likelihood (PPL) (by Elasticnet Regularization adapted from package glmnet)
Supervised Principal Component Analysis (SPCA) (by Supervised Principal Component adapted from package superpc)
NA
missing values are not allowed in PRIMsrc, because it depends on R package glmnet, which doesn't handle
missing values. In case of high-dimensional data (p >> n), the recommendation is to use PPL or SPCA because of computational
efficiency. Variable screening (pre-selection) is done by computing occurrence frequencies of top-ranking variables over the
cross-validation folds and replicates. The conservativeness of the procedure is controled by the argument vscons
.
Example of vscons
values for pre-selection are as follows:
'1.0' represents a presence in all the folds (unanimity vote)
'0.5' represents a presence in at least half of the folds (majority vote)
'1/K
' represents a presence in at least one of the folds (minority vote)
Although any value in the interval [1/K
,1] is accepted, we recommand using the interval [1/K
, 1/2] to avoid
excessive conservativeness. Final variable usage (selection) is done at the time of fitting the Survival Bump Hunting (SBH) model
itself using our PRSP algorithm on previously screened variables by collecting those variables that have the maximum occurrence
frequency in each peeling step over cross-validation folds and replicates.
If cross-validation is done (cv = TRUE
, the optimal number of peeling steps (optimal peeling sequence length),
and the optimal model size (cardinal of subset of top-screened variables) will be determined by cross-validation.
If cv = FALSE
, no cross-validation at all will be performed, and the values of K
and vscons
will both
be reset to 1, and traditional log-rank Mantel-Haenszel p-values will be computed (using the Chi-Squared
distribution with 1 df for the null distribution) instead of log-rank permutation p-values (using the permutation
distribution for the null distribution).
The argument groups
is to be specified if the Patient Recursive Group Survival Peeling (PRGSP) algorithm is used.
The PRGSP algorithm is a derivation of our original Patient Recursive Survival Peeling (PRSP) algorithm (Dazard et al. 2016)
to search for (or find an extreme of) outcome difference within existing (fixed) groups of observations. See Rao et al. (2018)
for details and an application in Disparity Subtyping.
In the PRSP variable screening procedure (vsarg
of "prsp"), setting option msize
to a single non-NULL
value
within the allowable range [1,floor
(p)] will override the cross-validation setting within the variable screening
procedure. This could be recommended for high-dimensional data (p >> n) to reduce the computational burden. In this situation,
we suggest an arbitrary value of msize
within [1, floor
(p/5)]. Conversely, setting msize=NULL
will force the cross-validation within the variable screening procedure by automaticaly generating a vector of model sizes
(cardinals of subset of top-screened variables) within the restricted range [1, floor
(p/5)], which will be used to
determine the optimal value of model size.
In fitting the Survival Bump Hunting (SBH) model itself, note that the result contains initial step #0, which corresponds
to the entire set of the (training) data. Also, the number of peeling steps that is within the allowable range
[1,ceiling
(log
(1/n) / log
(1 - (1/n)))] is further restricted when either of the metaparameter
alpha
or beta
takes on values other than the smallest possible fraction of the (training) data, i.e. \frac{1}{n^t},
where n^t is the training sample size:
ceiling
(log
(beta
) / log
(1 - alpha
)) : alpha
and beta
fixed by user
ceiling
(log
(1/n^t) / log
(1 - alpha
)) : alpha
fixed by user and beta
fixed by data
ceiling
(log
(beta
) / log
(1 - (1/n^t))) : alpha
fixed by data and beta
fixed by user
ceiling
(log
(1/n^t) / log
(1 - (1/n^t))) : alpha
and beta
fixed by data
When cross-validation is requested (cv=TRUE
), the function performs a supervised (stratified) random splitting of the observations
accounting for the classes/strata provided by delta
(censoring). This is because it is desireable that the data splitting balances
the class distributions of the outcome within the cross-validation splits. For each screening method and for building the final
Survival Bump Hunting (SBH) model, all model tuning parameters are simultaneously estimated by cross-validation. The function offers a
number of options for the cross-validation to be perfomed: the number of replications B
; the type of technique; the peeling
criterion; and the optimization criterion.
The returned S3-class sbh
object contains cross-validated estimates of all the decision-rules of used (selected) covariates
and all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm).
This enables the graphical display of results of profiling curves for model tuning, peeling trajectories, covariate traces and survival
distributions (see plotting functions for more details).
In case replicated cross-validations are performed, a "summary report" of the outputs is done over the B
replicates as follows:
Even thought the PRSP algorithm uses only one covariate at a time at each peeling step, the reported matrix of
"Replicated CV" box decision rules may show more than one covariate being used in a given step depending on the replication.
In the end, the reported "Replicated CV" trace values are computed (at each peeling step) as a single modal trace value
of covariate usage over the B
replicates. This is also reflected in the "Replicated CV" importance and usage plots of
covariate traces.
Similarly, the reported "Replicated CV" box membership indicators are computed (at each peeling step) as the point-wise modal
membership value, that is majority vote, over the B
replicates (right-hand side of equation #22 in Dazard et al. 2016).
The reported "Replicated CV" box support and corresponding box sample size are computed (at each peeling step) based on the
above "Replicated CV" box membership indicators (i.e. not as equation #23 in Dazard et al. 2016).
All other reported "Replicated CV" box estimates are computed (at each peeling step) as average statistics over the B
replicates (i.e. as equation #21 in Dazard et al. 2016), that is, not as a single box estimate computed from the
"Replicated CV" box membership indicators. This includes the decision rules, the p-values, and all other box statistics.
This may result in some apparent discordance if these estimates are re-computed directly from the reported "Replicated CV"
box membership indicators.
If the computation of log-rank p-values is desired, then running with the parallelization option is strongly advised. In case of large (p > n) or very large (p >> n) datasets, it is also highly recommended to use the parallelization option.
The function sbh
relies on the R package parallel to create a parallel backend within an R session. This enables access to a
cluster of compute cores and/or nodes on a local and/or remote machine(s) and scaling-up with the number of CPU cores available and
efficient parallel execution. To run a procedure in parallel (with parallel RNG), argument parallel
is to be set to TRUE
and argument conf
is to be specified (i.e. non NULL
). Argument conf
uses the options described in function
makeCluster
of the R packages parallel and snow. PRIMsrc supports two types of communication mechanisms
between master and worker processes: 'Socket' or 'Message-Passing Interface' ('MPI'). In PRIMsrc, parallel 'Socket' clusters
use sockets communication mechanisms only (no forking) and are therefore available on all platforms, including Windows, while parallel
'MPI' clusters use high-speed interconnects mechanism in networks of computers (with distributed memory) and are therefore available
only in these architectures. A parallel 'MPI' cluster also requires R package Rmpi to be installed. Value type
is used
to setup a cluster of type 'Socket' ("SOCKET") or 'MPI' ("MPI"), respectively. Depending on this type, values of spec
are to
be used alternatively:
For 'Socket' clusters (conf$type="SOCKET"
), spec
should be a character
vector
naming the hosts on
which to run the job; it can default to a unique local machine, in which case, one may use the unique host name "localhost".
Each host name can potentially be repeated to the number of CPU cores available on the local machine.
It can also be an integer
scalar specifying the number of processes to spawn on the local machine;
or a list of machine specifications if you have ssh installed (a character value named host specifying the name or address
of the host to use).
For 'MPI' clusters (conf$type="MPI"
), spec
should be an integer
scalar
specifying the total number of processes to be spawned across the network of available nodes, counting the workernodes and
masternode.
The actual creation of the cluster, its initialization, and closing are all done internally. For more details, see the reference manual of R package snow and examples below.
When random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see the vignette of R package parallel. The use of a seed allows to reproduce the results within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and parallelized session due to the difference of management of the seed between the two (see parallel RNG and value of returned seed below).
Object of class
sbh
(Patient Recursive Survival Peeling)
list
containing the following 23 fields:
X |
|
y |
|
delta |
|
B |
positive |
K |
positive |
A |
positive |
vs |
|
vstype |
|
vsarg |
|
vscons |
|
cv |
|
cvtype |
|
cvarg |
|
groups |
|
pv |
|
onese |
|
decimals |
|
probval |
|
timeval |
|
cvprofiles |
|
cvfit |
|
success |
|
seed |
User seed. An |
This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. This project was partially funded by the National Institutes of Health NIH - National Cancer Institute (R01-CA160593) to J-E. Dazard and J.S. Rao.
Unique end-user function for fitting the Survival Bump Hunting model.
"Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu
"Michael Choe, M.D." mjc206@case.edu
"Michael LeBlanc, Ph.D." mleblanc@fhcrc.org
"Alberto Santana, MBA." ahs4@case.edu
"J. Sunil Rao, Ph.D." Rao@biostat.med.miami.edu
Maintainer: "Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu
Dazard J-E. and Rao J.S. (2018). "Variable Selection Strategies for High-Dimensional Survival Bump Hunting using Recursive Peeling Methods." (in prep).
Rao J.S., Huilin Y. and Dazard J-E. (2018). "Disparity Subtyping: Bringing Precision Medicine Closer to Disparity Science." (in prep).
Diaz-Pachon D.A., Saenz J.P., Dazard J-E. and Rao J.S. (2018). "Mode Hunting through Active Information." (in press).
Diaz-Pachon D.A., Dazard J-E. and Rao J.S. (2017). "Unsupervised Bump Hunting Using Principal Components." In: Ahmed SE, editor. Big and Complex Data Analysis: Methodologies and Applications. Contributions to Statistics, vol. Edited Refereed Volume. Springer International Publishing, Cham Switzerland, p. 325-345.
Yi C. and Huang J. (2017). "Semismooth Newton Coordinate Descent Algorithm for Elastic-Net Penalized Huber Loss Regression and Quantile Regression." J. Comp Graph. Statistics, 26(3):547-557.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2016). "Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods." Statistical Analysis and Data Mining, 9(1):12-42.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification." In JSM Proceedings, Statistical Programmers and Analysts Section. Seattle, WA, USA. American Statistical Association IMS - JSM, p. 650-664.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014). "Cross-Validation of Survival Bump Hunting by Recursive Peeling Methods." In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA. American Statistical Association IMS - JSM, p. 3366-3380.
Dazard J-E. and J.S. Rao (2010). "Local Sparse Bump Hunting." J. Comp Graph. Statistics, 19(4):900-92.
makeCluster
(R package parallel)
glmnet
, cv.glmnet
(R package glmnet)
hqreg
, cv.hqreg
(R package hqreg)
superpc.cv
(R package superpc)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 | #===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")
## Not run:
#===================================================
# PRIMsrc Package news
#===================================================
PRIMsrc.news()
#===================================================
# PRIMsrc Package citation
#===================================================
citation("PRIMsrc")
#===================================================
# Demo with a synthetic dataset
# Use help for descriptions
#===================================================
data("Synthetic.1", package="PRIMsrc")
?Synthetic.1
## End(Not run)
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Peeling criterion = LRT
# Cross-Validation criterion = LRT
# With Combined Cross-Validation (RCCV)
# Without Replications (B = 1)
# Without variable screening (pre-selection)
# Without computation of log-rank \eqn{p}-values
# Without parallelization
#===================================================
synt1 <- sbh(X = Synthetic.1[ , -c(1,2), drop=FALSE],
y = Synthetic.1[ ,1, drop=TRUE],
delta = Synthetic.1[ ,2, drop=TRUE],
B = 1,
K = 3,
vs = FALSE,
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.10,
beta=0.10,
peelcriterion=\"lrt\",
cvcriterion=\"lrt\"",
groups = NULL,
pv = FALSE,
decimals = 2,
onese = FALSE,
probval = 0.5,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = FALSE,
parallel.pv = FALSE,
conf = NULL,
verbose = FALSE,
seed = 123)
summary(object = synt1)
print(x = synt1)
n <- 100
p <- length(synt1$cvfit$cv.used)
x <- matrix(data = runif(n = n*p, min = 0, max = 1),
nrow = n, ncol = p, byrow = FALSE,
dimnames = list(1:n, paste("X", 1:p, sep="")))
synt1.pred <- predict(object = synt1,
newdata = x,
steps = synt1$cvfit$cv.nsteps)
plot(x = synt1,
main = paste("Scatter plot for model #1", sep=""),
proj = c(1,2),
steps = synt1$cvfit$cv.nsteps,
pch = 16, cex = 0.5, col = c(1,2),
boxes = TRUE,
col.box = 2, lty.box = 2, lwd.box = 1,
add.caption.box = TRUE,
text.caption.box = paste("Step: ", synt1$cvfit$cv.nsteps, sep=""),
device = NULL)
plot_profile(object = synt1,
main = "Cross-validated tuning profiles for model #1",
pch = 20, col = 1, lty = 1, lwd = 0.5, cex = 0.5,
add.sd = TRUE,
add.profiles = TRUE,
add.caption = TRUE,
text.caption = c("Mean","Std. Error"),
device = NULL)
plot_traj(object = synt1,
main = paste("Cross-validated peeling trajectories for model #1", sep=""),
col = 1, lty = 1, lwd = 0.5, cex = 0.5,
toplot = synt1$cvfit$cv.used,
device = NULL)
plot_trace(object = synt1,
main = paste("Cross-validated trace plots for model #1", sep=""),
xlab = "Box Mass", ylab = "Covariate Range (centered)",
col = 1, lty = 1, lwd = 0.5, cex = 0.5,
toplot = synt1$cvfit$cv.used,
center = TRUE, scale = FALSE,
device = NULL)
plot_km(object = synt1,
main = paste("Cross-validated probability curves for model #1", sep=""),
xlab = "Time", ylab = "Probability",
ci = TRUE,
steps = 1:synt1$cvfit$cv.nsteps,
col = c(1,2), lty = 1, lwd = 0.5, cex = 0.5,
add.caption = TRUE,
text.caption = c("outbox","inbox"),
device = NULL)
## Not run:
#===================================================
# Examples of parallel backend parametrization
#===================================================
if (require("parallel")) {
cat("'parallel' is attached correctly \n")
} else {
stop("'parallel' must be attached first \n")
}
#===================================================
# Ex. #1 - Multicore PC
# Running WINDOWS
# SOCKET communication cluster
# Shared memory parallelization
#===================================================
cpus <- parallel::detectCores(logical = TRUE)
conf <- list("spec" = rep("localhost", cpus),
"type" = "SOCKET",
"homo" = TRUE,
"verbose" = TRUE,
"outfile" = "")
#===================================================
# Ex. #2 - Master node + 3 Worker nodes cluster
# All nodes equipped with identical setups of multicores
# (8 core CPUs per machine for a total of 32)
# SOCKET communication cluster
# Distributed memory parallelization
#===================================================
masterhost <- Sys.getenv("HOSTNAME")
slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
nodes <- length(slavehosts) + 1
cpus <- 8
conf <- list("spec" = c(rep(masterhost, cpus),
rep(slavehosts, cpus)),
"type" = "SOCKET",
"homo" = TRUE,
"verbose" = TRUE,
"outfile" = "")
#===================================================
# Ex. #3 - Enterprise Multinode Cluster w/ multicore/node
# Running LINUX with SLURM scheduler
# MPI communication cluster
# Distributed memory parallelization
#==================================================
if (require("Rmpi")) {
cat("'Rmpi' is attached correctly \n")
} else {
stop("'Rmpi' must be attached first \n")
}
# Below, variable 'cpus' is the total number of requested
# taks (threads/CPUs), which is specified from within a
# SLURM script.
cpus <- as.numeric(Sys.getenv("SLURM_NTASKS"))
conf <- list("spec" = cpus,
"type" = "MPI",
"homo" = TRUE,
"verbose" = TRUE,
"outfile" = "")
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Peeling criterion = LRT
# Cross-Validation criterion = LRT
# With Combined Cross-Validation (RCCV)
# With Replications (B = 30)
# With PPL variable screening (pre-selection)
# With computation of log-rank \eqn{p}-values
# With parallelization
#===================================================
synt1 <- sbh(X = Synthetic.1[ , -c(1,2), drop=FALSE],
y = Synthetic.1[ ,1, drop=TRUE],
delta = Synthetic.1[ ,2, drop=TRUE],
B = 30,
K = 5,
A = 1000,
vs = TRUE,
vstype = "ppl",
vsarg = "alpha=1,
nalpha=1,
nlambda=100",
vscons = 0.5,
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.01,
beta=0.10,
peelcriterion=\"lrt\",
cvcriterion=\"lrt\"",
groups = NULL,
pv = TRUE,
decimals = 2,
onese = FALSE,
probval = 0.5,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = TRUE,
parallel.pv = TRUE,
conf = conf,
verbose = TRUE,
seed = 123)
#===================================================
# Simulated dataset #4 (n=100, p=1000)
# Peeling criterion = LRT
# Cross-Validation criterion = CER
# With Combined Cross-Validation (RCCV)
# With Replications (B = 30)
# With PRSP variable screening (pre-selection)
# With computation of log-rank \eqn{p}-values
# With parallelization
#===================================================
synt4 <- sbh(X = Synthetic.4[ , -c(1,2), drop=FALSE],
y = Synthetic.4[ ,1, drop=TRUE],
delta = Synthetic.4[ ,2, drop=TRUE],
B = 30,
K = 5,
A = 1000,
vs = TRUE,
vstype = "prsp",
vsarg = "alpha=0.01,
beta=0.10,
msize=NULL,
peelcriterion=\"lrt\",
cvcriterion=\"cer\"",
vscons = 0.5,
cv = TRUE,
cvtype = "combined",
cvarg = "alpha=0.01,
beta=0.10,
peelcriterion=\"lrt\",
cvcriterion=\"cer\"",
groups = NULL,
pv = TRUE,
decimals = 2,
onese = FALSE,
probval = 0.5,
timeval = NULL,
parallel.vs = FALSE,
parallel.rep = TRUE,
parallel.pv = TRUE,
conf = conf,
verbose = TRUE,
seed = 123)
## End(Not run)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.