sbh: Cross-Validated Survival Bump Hunting
In jedazard/PRIMsrc: PRIM Survival Regression Classification

View source: R/PRIMsrc.r

sbh	R Documentation

Cross-Validated Survival Bump Hunting

Description

Main end-user function for fitting a Survival Bump Hunting (SBH) model (or Group Survival Bump Hunting (GSBH)). It returns an object of class sbh, as generated by our Patient Recursive Survival Peeling (PRSP) algorithm (or Patient Recursive Group Survival Peeling (PRGSP)), containing cross-validated estimates of the target region (bump) of the input space with end-points statistics of interest. See Dazard and Rao (2014, 2015, 2016, 2021a, 2021b) for details, as well as Dazard et al. (2021c) ans Rao et al. (2020) for applications in Patient Survival Subtyping and Survival Disparity Subtyping.

Usage

  sbh(X, 
      y, 
      groups = NULL,
      delta, 
      B = 30, 
      K = 5, 
      A = 1000, 
      vs = TRUE, 
      vstype = "ppl",
      vsarg = "alpha=1,
               nalpha=1,
               nlambda=100",
      cv = TRUE, 
      cvtype = "combined", 
      cvarg = "alpha=0.01,
               beta=0.10,
               peelcriterion=\"lrt\",
               cvcriterion=\"cer\"",
      pv = FALSE,
      control = sbh.control(vscons = 0.5, 
                            decimals = 2, 
                            onese = FALSE, 
                            probval = NULL, 
                            timeval = NULL, 
                            lag = 2, 
                            span = 0.10, 
                            degree = 2), 
      parallel.vs = FALSE,
      parallel.rep = FALSE,
      parallel.pv = FALSE,
      conf = NULL,
      verbose = TRUE, 
      seed = NULL)

Arguments

`X`	A `numeric` (`n` x `p`) `matrix` of `n` observations and `p` input covariates. If input `X` is a `numeric` `data.frame`, it is acceptable and will be coerced to a `numeric` `matrix`. Discrete nominal covariates will be treated as ordinal variables. `NA` missing values are not allowed.
`y`	`n`-`numeric` `vector` of observed times to event. `NA` missing values are not allowed.
`groups`	`character` or `numeric` `vector`, or `factor` of group membership indicator variable, automatically converted into a `factor`. It is of length the sample size, where entries take on group `levels`. Only two groups are allowed at this point (see details). Defaults to `NULL`, i.e. when regular Patient Recursive Survival Peeling (PRSP) is used.
`delta`	`n`-`numeric` `vector` of observed status (censoring) indicator variable.
`B`	Postitive `integer` of the number of replications of the cross-validation procedure. Defaults to 30.
`K`	Postitive `integer` of the number of cross-validation folds (partitions) into which the total number of observations (`n`) should be randomly split. See details.
`A`	Positive `integer` of the number of permutations used for generating the permutation distribution of the null distribution and the computation of log-rank permutation `p`-values. Defaults to 1000. Ignored if `pv=FALSE`.
`vs`	`logical` scalar. Flag for optional variable (covariate) screening (pre-selection). Defaults to `TRUE`.
`vstype`	`character` `vector` in {"ppl", "pcqr", "spca", "prsp"} of one of the four possible variable screening (pre-selection) procedure. See details below. Defaults to "ppl". Ignored if `vs` is `FALSE`.
`vsarg`	`Character` `vector` of parameters of the cross-validated variable screening (pre-selection) procedure. Defaults to parameters values of default variable screening (pre-selection) procedure "ppl": `vsarg="alpha=1,nalpha=1,nlambda=100"`. Note that `vsarg` comes as a character string between double quotes, with comas separated values, and without white spaces. All the following parameters are ignored if `vs` is `FALSE`. PRSP: `alpha` = `numeric` data quantile in (0,1) to peel off at each step of the peeling sequence of the PRSP algorithm. Suggests 0.01. `beta` = `numeric` scalar of minimum box support at the end of the peeling sequence. Suggests 0.10. `msize` = positive `integer` or `NULL` to control the model size, i.e the number of screened variables used for fitting the Survival Bump Hunting model. Use a single non-`NULL` value as the maximum model size (cardinal of subset of top-screened variables) within the allowable range [1,`floor`(`p`)]. Alternatively, use `msize=NULL` to allow the optimal model size to be determined by cross-validation. See below for details. Suggests `NULL`. `peelcriterion` in {"lhr", "lrt", "chs", "grp"} stands for the peeling criterion used in the rate of increase between in-bump vs out-bump Log-Hazard Ratio (LHR), Log-Rank Test (LRT), or Cumulative Hazard Summary (CHS), and between fixed groups (GRP), respectively (see details). LHR, LRT, and CHS are used in the PRSP algorithm for building a SBH model, while GRP is used in the PRGSP algorithm for building a GSBH model. Suggests "lrt" for SBH and "grp" for GSBH. `cvcriterion` in {"lhr", "lrt", "cer"} stands for the cross-validation criterion Log-Hazard Ratio (LHR), Log-Rank Test (LRT), or Concordance Error Rate (CER), respectively, that is used for optimizing the model size (cardinal of subset of top-screened variables) and the optimal number of peeling steps (optimal peeling sequence length) in the PRSP variable screening procedure. Suggests "cer". PCQR: `tau` = `numeric` quantile in [0, 0.5] used in the censored quantile regression model. It is the tuning parameter of the censored quantile loss. It represents the conditional censored quantile of the survival response to be estimated. It includes the absolute loss when `tau` = 0.5. Suggests 0.5. `alpha` = `numeric` elasticnet mixing parameter in [0, 1] that controls the relative contribution from the lasso and the ridge penalty. The penalty is defined as (1-`alpha`)/2\|\|`beta`\|\|_2^2+`alpha`\|\|`beta`\|\|_1. `alpha` = 1 is the lasso penalty, and `alpha` = 0 the ridge penalty. If `alpha` is set to `NULL`, a vector of values of length `nalpha` is used, else `alpha` value is used and `nalpha` is set to 1. Suggests `alpha`=1 (lasso). `nalpha` = positive `integer` of number of `alpha` values to consider in the grid search. Suggests 1 (see above: lasso). `nlambda` = positive `integer` of number of elasticnet penalization `lambda` values to consider in the grid search. Suggests 100. PPL: `alpha` = `numeric` elasticnet mixing parameter in [0, 1] that controls the relative contribution from the lasso and the ridge penalty. See R package glmnet. The penalty is defined as (1-`alpha`)/2\|\|`beta`\|\|_2^2+`alpha`\|\|`beta`\|\|_1. `alpha` = 1 is the lasso penalty, and `alpha` = 0 the ridge penalty. If `alpha` is set to `NULL`, a vector of values of length `nalpha` is used, else `alpha` value is used and `nalpha` is set to 1. Suggests `alpha`=1 (lasso). `nalpha` = positive `integer` of number of `alpha` values to consider in the grid search. Suggests 1 (see above: lasso). `nlambda` = positive `integer` of number of elasticnet penalization `lambda` values to consider in the grid search. Suggests 100. SPCA: `n.thres` = number of thresholds to consider in the grid search. It cannot be less than `n` (sample size). Suggests 20. `n.pcs` = number of cross-validation principal components to use in {1,2,3}. It cannot be less than `n` (sample size) and greater than `p` (dimensionality of covariates); otherwise, it will be reset to `n.pcs` = `p` - 1. Suggests 3. `n.var` = minimum number of variables to include in determining range for threshold. If cannot be greater than `p` (dimensionality of covariates); otherwise, it will be reset to `n.var` = `p` - 1. Suggests 5.
`cv`	`logical` scalar. Flag for optional cross-validation (CV) of variable screening (pre-selection) parameters and Survival Bump Hunting fitting by PRSP algorithm. See below for details. Defaults to `TRUE`.
`cvtype`	`character` `vector` in {"combined", "averaged"} specifying the cross-validation technique. Defaults to "combined". Ignored if `cv` is `FALSE`.
`cvarg`	`character` `vector` describing the parameters used in the PRSP algorithm for fitting the Survival Bump Hunting model. Defaults to: `cvarg="alpha=0.01,beta=0.10,peelcriterion=\"lrt\",cvcriterion=\"cer\""`. Note that `cvarg` comes as a character string between double quotes, with comas separated values, and without white spaces. `alpha` = `numeric` data quantile in (0,1) to peel off at each step of the peeling sequence of the PRSP algorithm. Defaults to 0.01. `beta` = `numeric` scalar of minimum box support at the end of the peeling sequence. Defaults to 0.10. `peelcriterion` in {"lhr", "lrt", "chs", "grp"} stands for the peeling criterion used in the rate of increase between in-bump vs out-bump Log-Hazard Ratio (LHR), Log-Rank Test (LRT), or Cumulative Hazard Summary (CHS), and between groups (GRP), respectively (see details). LHR, LRT, and CHS are used in the PRSP algorithm for building a SBH model, while GRP is used in the PRGSP algorithm for building a GSBH model. Defaults to "lrt" for SBH and "grp" for GSBH. `cvcriterion` in {"lrt", "lhr", "cer"} stands for the cross-validation criterion Log-Hazard Ratio (LHR), Log-Rank Test (LRT), or Concordance Error Rate (CER), respectively, that is used for tuning/optimizing the optimal number of peeling steps. If a SBH model is built (`peelcriterion` in {"lrt", "lhr", "chs"}), all three `cvcriterion` in {"lrt", "lhr", "cer"} are admissible. If a GSBH model is built, only `cvcriterion` in {"cer"} is allowed. Ignored if `cv` is `FALSE`. Defaults to "cer". if `peelcriterion` is in {"lrt", "lhr", "chs"}, `groups` is automatically set to `NULL`.
`pv`	`logical` scalar. Flag for computation of log-rank permutation `p`-values. Defaults to `FALSE`.
`control`	Optional function to set ancillary parameters for fitting the Survival Bump Hunting (SBH) model. See `sbh.control` for details.
`parallel.vs`	`logical`. Is parallelization to be performed for variable screening? Defaults to `FALSE`, because it is not implemented yet.
`parallel.rep`	`logical`. Is parallelization to be performed for replications? Defaults to `FALSE`.
`parallel.pv`	`logical`. Is parallelization to be performed for computation of log-rank `p`-values? Defaults to `FALSE`.
`conf`	`list` of 5 fields containing the parameters values needed for creating the parallel backend (cluster configuration). See details below for usage. Optional, defaults to `NULL`, but all fields are required if used: `type` : `character` `vector` specifying the cluster type ("SOCKET", "MPI"). `spec` : A specification (`character` `vector` or `integer` scalar) appropriate to the type of cluster. `homogeneous` : `logical` scalar to be set to `FALSE` for inhomogeneous clusters. `verbose` : `logical` scalar to be set to `FALSE` for quiet mode. `outfile` : `character` `vector` of an output log file name to direct the stdout and stderr connection output from the workernodes. "" indicates no redirection.
`verbose`	`logical` scalar. Is the output to be verbose? Optional, defaults to `TRUE`.
`seed`	Positive `integer` scalar of the user seed to reproduce all the results. Defaults to `NULL`.

Details

The main function sbh relies on an optional variable screening (pre-selection) procedure that is run before the actual variable usage (selection) is done at the time of fitting the Survival Bump Hunting (SBH) or Group Survival Bump Hunting (GSBH) model using our PRSP or PRGSP algorithm, respectively. The user can choose between four possible variable screening (pre-selection) procedures (see Dazard and Rao (2021a) for details, as well as Dazard et al. (2021c) for an application in Patient Survival Subtyping):

Patient Recursive Survival Peeling (PRSP) (by univariate screening of our algorithm)
Penalized Censored Quantile Regression (PCQR) (by Semismooth Newton Coordinate Descent fiting algorithm adapted from package hqreg)
Penalized Partial Likelihood (PPL) (by Elasticnet Regularization adapted from package glmnet)
Supervised Principal Component Analysis (SPCA) (by Supervised Principal Component adapted from package superpc)

NA missing values are not allowed in PRIMsrc, because it depends on R package glmnet, which doesn't handle missing values. In case of high-dimensional data (p >> n), the recommendation is to use PPL or SPCA because of computational efficiency. Variable screening (pre-selection) is done by computing occurrence frequencies of top-ranking variables over the cross-validation folds and replicates. The conservativeness of the procedure is controled by the argument vscons.

The argument K must be bigger than 2 for a regular K-fold cross-validation procedure to work, and should be no less than 3 for a regular procedure to make sense; K \in {5,...,10} is recommended; defaults to K=5. Setting K also specifies the type of cross-validation to be done:

K = 1 carries no cross-validation out, or set-value when cv = FALSE (see below).
K \in {2,...,n-1} carries out K-fold cross-validation.
K = n carries out leave-one-out cross-validation.

If cross-validation is done (cv = TRUE, the optimal number of peeling steps (optimal peeling sequence length), and the optimal model size (cardinal of subset of top-screened variables) will be determined by cross-validation. If cv = FALSE, no cross-validation at all will be performed, and the values of K and vscons will both be reset to 1.

The Patient Recursive Group Survival Peeling (PRGSP) algorithm is a derivation of our original Patient Recursive Survival Peeling (PRSP) algorithm to search for (or find an extreme of) outcome difference within existing (user-defined) fixed groups of observations. See Dazard and Rao (2021b) for details, as well as Rao et al. (2020) for an application in Survival Disparity Subtyping.

The argument object$cvarg$peelcriterion is the peeling criterion that determines what type of bump hunting is done, that is, either using the PRSP or PRGSP algorithm for building a SBH model or GSBH model, respectively. If a regular hunt of bump difference is done (SBH model, peelcriterion in {"lrt", "lhr", "chs"}), PRSP algorithm is used, and cross-validated bumps are generated between observations from the higher-risk bump (in-bump) versus lower-risk bump (out-bump). If a hunt of (user-specified) fixed group difference is done (GSBH model, peelcriterion = {"grp"}), PRGSP algorithm is used, and cross-validated bumps are generated between interacting subgroup from the fixed groups and bumps (higher-risk (in-bump) versus lower-risk bump (out-bump)), that is, either between both groups within the higher-risk bump (in-bump), or equivalently, between higher-risk (in-bump) versus lower-risk bump (out-bump) within a given group.

The argument groups is to be specified only if a hunt of (user-specified) fixed group difference is to be done, i.e. when option peelcriterion = {"grp"} and PRGSP are used.

In the PRSP variable screening procedure (vsarg of "prsp"), setting option msize to a single non-NULL value within the allowable range [1,floor(p)] will override the cross-validation setting within the variable screening procedure. This could be recommended for high-dimensional data (p >> n) to reduce the computational burden. In this situation, we suggest an arbitrary value of msize within [1, floor(p/5)]. Conversely, setting msize=NULL will force the cross-validation within the variable screening procedure by automaticaly generating a vector of model sizes (cardinals of subset of top-screened variables) within the restricted range [1, floor(p/5)], which will be used to determine the optimal value of model size.

In fitting the Survival Bump Hunting (SBH) model itself, note that the result contains initial step #0, which corresponds to the entire set of the (training) data. Also, the number of peeling steps that is within the allowable range [1,ceiling(log(1/n) / log(1 - (1/n)))] is further restricted when either of the metaparameter alpha or beta takes on values other than the smallest possible fraction of the (training) data, i.e. \frac{1}{n^t}, where n^t is the training sample size:

ceiling(log(beta) / log(1 - alpha)) : alpha and beta fixed by user
ceiling(log(1/n^t) / log(1 - alpha)) : alpha fixed by user and beta fixed by data
ceiling(log(beta) / log(1 - (1/n^t))) : alpha fixed by data and beta fixed by user
ceiling(log(1/n^t) / log(1 - (1/n^t))) : alpha and beta fixed by data

When cross-validation is requested (cv=TRUE), the function performs a supervised (stratified) random splitting of the observations accounting for the classes/strata provided by delta (censoring). This is because it is desireable that the data splitting balances the class distributions of the outcome within the cross-validation splits. For each screening method and for building the final Survival Bump Hunting (SBH) model, all model tuning parameters are simultaneously estimated by cross-validation. The function offers a number of options for the cross-validation to be perfomed: the number of replications B; the type of technique; the peeling criterion; and the optimization criterion.

The returned S3-class sbh object contains cross-validated estimates of all the decision-rules of used (selected) covariates and all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm). This enables the graphical display of results of profiling curves for model tuning, peeling trajectories, covariate traces and survival distributions (see plotting functions for more details).

In case replicated cross-validations are performed, a "summary report" of the outputs is done over the B replicates as follows:

Even thought the PRSP algorithm uses only one covariate at a time at each peeling step, the reported matrix of "Replicated CV" box decision rules may show more than one covariate being used in a given step depending on the replication. In the end, the reported "Replicated CV" trace values are computed (at each peeling step) as a single modal trace value of covariate usage over the B replicates. This is also reflected in the "Replicated CV" importance and usage plots of covariate traces.
Similarly, the reported "Replicated CV" box membership indicators are computed (at each peeling step) as the point-wise modal membership value, that is majority vote, over the B replicates (right-hand side of equation #22 in Dazard et al. 2016). The reported "Replicated CV" box support and corresponding box sample size are computed (at each peeling step) based on the above "Replicated CV" box membership indicators (i.e. not as equation #23 in Dazard et al. 2016).
All other reported "Replicated CV" box estimates are computed (at each peeling step) as average statistics over the B replicates (i.e. as equation #21 in Dazard et al. 2016), that is, not as a single box estimate computed from the "Replicated CV" box membership indicators. This includes the decision rules, the p-values, and all other box statistics. This may result in some apparent discordance if these estimates are re-computed directly from the reported "Replicated CV" box membership indicators.

If the computation of log-rank p-values is desired, then running with the parallelization option is strongly recommended. In case of large (p > n) or very large (p >> n) datasets, it is also highly recommended to use the parallelization option.

The function sbh relies on the R package parallel to create a parallel backend within an R session. This enables access to a cluster of compute cores and/or nodes on a local and/or remote machine(s) and scaling-up with the number of CPU cores available and efficient parallel execution. To run a procedure in parallel (with parallel RNG), argument parallel is to be set to TRUE and argument conf is to be specified (i.e. non NULL). Argument conf uses the options described in function makeCluster of the R packages parallel and snow. PRIMsrc supports two types of communication mechanisms between master and worker processes: 'Socket' or 'Message-Passing Interface' ('MPI'). In PRIMsrc, parallel 'Socket' clusters use sockets communication mechanisms only (no forking) and are therefore available on all platforms, including Windows, while parallel 'MPI' clusters use high-speed interconnects mechanism in networks of computers (with distributed memory) and are therefore available only in these architectures. A parallel 'MPI' cluster also requires R package Rmpi to be installed. Value type is used to setup a cluster of type 'Socket' ("SOCKET") or 'MPI' ("MPI"), respectively. Depending on this type, values of spec are to be used alternatively:

For 'Socket' clusters (conf$type="SOCKET"), spec should be a character vector naming the hosts on which to run the job; it can default to a unique local machine, in which case, one may use the unique host name "localhost". Each host name can potentially be repeated to the number of CPU cores available on the local machine. It can also be an integer scalar specifying the number of processes to spawn on the local machine; or a list of machine specifications if you have ssh installed (a character value named host specifying the name or address of the host to use).
For 'MPI' clusters (conf$type="MPI"), spec should be an integer scalar specifying the total number of processes to be spawned across the network of available nodes, counting the workernodes and masternode.

The actual creation of the cluster, its initialization, and closing are all done internally. For more details, see the reference manual of R package snow and examples below.

When random number generation is needed, the creation of separate streams of parallel RNG per node is done internally by distributing the stream states to the nodes. For more details, see the vignette of R package parallel. The use of a seed allows to reproduce the results within the same type of session: the same seed will reproduce the same results within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability) between a non-parallelized and parallelized session due to the difference of management of the seed between the two (see parallel RNG and value of returned seed below).

Value

Object of class sbh (Patient Recursive Survival Peeling) list containing the following 23 fields:

`X`	`numeric` `matrix` of original dataset.
`y`	`numeric` `vector` of observed failure / survival times.
`groups`	`vector` of group membership if algorithm Patient Recursive Group Survival Peeling (PRGSP) is used.
`delta`	`numeric` `vector` of observed event indicator in {1,0}.
`B`	positive `integer` of the number of replications used in the cross-validation procedure.
`K`	positive `integer` of the number of folds used in the cross-validation procedure.
`A`	positive `integer` of the number of permutations used for the computation of log-rank permutation `p`-values.
`vs`	`logical` scalar of returned flag of optional variable pre-selection.
`vstype`	`character` `vector` of the optional variable pre-selection procdure used.
`vsarg`	`list` of parameters used in the pre-selection procedure.
`cv`	`logical` scalar of returned flag of optional cross-validation.
`cvtype`	`character` `vector` of the cross-validation technique used.
`cvarg`	`list` of parameters used in the Survival Bump Hunting procedure.
`pv`	`logical` scalar of returned flag of optional computation of log-rank permutation `p`-values.
`control`	`list` of 8 fields of secondary (ancillary) parameters used for fitting the Survival Bump Hunting (SBH) model.
`vscons`	`numeric` scalar of conservativeness of the variable screening (pre-selection) procedure.
`onese`	`logical` scalar of returned flag of 1-standard error rule.
`decimals`	`integer` of the number of user-specified significant decimals.
`probval`	`Numeric` scalar of survival probability used.
`timeval`	`Numeric` scalar of survival time used.
`cvprofiles`	`list` of 10 fields of cross-validated tuning profiles and estimates, each of length `B` (one for each replicate): cv.varprofiles: `numeric` `matrix` of cross-validation criterion used for tuning/optimizing the variable screening size in the PRSP variable screening (pre-selection) procedure (`NULL` otherwise). Values are by columns (peeling steps) and replicates (rows). cv.varprofiles.mean: `numeric` `vector` of means (across replicates) of the above cross-validation criterion by peeling steps. cv.varprofiles.se: `numeric` `vector` of standard errors (across replicates) of the above cross-validation criterion by peeling steps. cv.varset.opt: `numeric` scalar of optimal variable screening size according to the extremum. cv.varset.1se: `numeric` scalar of optimal variable screening size according to 1SE rule. cv.stepprofiles: `numeric` `matrix` of cross-validation criterion used for tuning/optimizing the peeling sequence length (i.e. number of peeling steps) in the PRSP algorithm. Values are by columns (peeling steps) and replicates (rows). cv.stepprofiles.mean: `numeric` `vector` of means (across replicates) of the above cross-validation criterion by peeling steps. cv.stepprofiles.se: `numeric` `vector` of standard errors (across replicates) of the above cross-validation criterion by peeling steps. cv.nsteps.opt: `numeric` scalar of optimal number of peeling steps according to the extremum. cv.nsteps.1se: `numeric` scalar of optimal number of peeling steps according to 1SE rule.
`cvfit`	`list` with 12 fields of cross-validated SBH output estimates, each of length `B` (one for each replicate): cv.maxsteps: `numeric` scalar of maximal number of peeling steps (counting step #0 - see Details section). cv.nsteps: `numeric` scalar of optimal number of peeling steps (counting step #0 - see Details section). cv.boxind: `logical` `matrix` in `TRUE`, `FALSE` of individual observation box membership indicator (columns) for all peeling steps (rows). cv.boxind.size: `numeric` `vector` of box sample size for all peeling steps. cv.boxind.support: `numeric` `vector` of box support for all peeling steps. cv.rules: `data.frame` of decision rules on the covariates (columns) for all peeling steps (rows). cv.screened: `numeric` `vector` of screened (pre-selected) covariates, indexed in reference to original index. cv.trace: `numeric` `vector` of the modal trace values of covariate usage for all peeling steps. cv.sign: `numeric` `vector` in {-1,+1} of directions of peeling for all used (selected) covariates. cv.used: `numeric` `vector` of covariates used (selected) for peeling, indexed in reference to original index. cv.stats: `numeric` `matrix` of box endpoint descriptive statistics of interest (columns) at all peeling steps (rows). The output varies according to the peeling criterion (`peelcriterion`) used (only relevant outputs are returned). cv.pval: `list` with 2 fields of two `vectors`. The first `cvfit$pval` is a `numeric` `vector` for log-rank permutation `p`-values. The second `cvfit$seed` is is an `integer` scalar of the seed if parallelization is used, or an `integer` `vector` of `A` values of seeds, one for each permutation, if parallelization is not used.
`success`	`logical` scalar of the returned flag of success at fitting the SBH model.
`seed`	User seed. An `integer` scalar if parallelization is used, or an `integer` `vector` of `B` values, one for each replication, if parallelization is not used.

Acknowledgments

This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University. This project was partially funded by the National Institutes of Health NIH - National Cancer Institute (R01-CA160593) to J-E. Dazard and J.S. Rao.

Note

Main end-user function for fitting the Survival Bump Hunting model.

Author(s)

"Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu
"Michael Choe, M.D." mjc206@case.edu
"Michael LeBlanc, Ph.D." mleblanc@fhcrc.org
"Alberto Santana, MBA." ahs4@case.edu
"J. Sunil Rao, Ph.D." Rao@biostat.med.miami.edu

Maintainer: "Jean-Eudes Dazard, Ph.D." jean-eudes.dazard@case.edu

References

Dazard J-E. and Rao J.S. (2021a). "Variable Selection Strategies for High-Dimensional Recursive Peeling-Based Survival Bump Hunting Models." (in prep).
Dazard J-E. and Rao J.S. (2021b). "Group Bump Hunting by Recursive Peeling-Based Methods: Application to Survival/Risk Predictive Models." (in prep).
Dazard J-E., Choe M., Pawitan Y., and Rao J.S. (2021c). "Identification and Characterization of Informative Prognostic Subgroups by Survival Bump Hunting." (in prep).
Rao J.S., Huilin Y., and Dazard J-E. (2020). "Disparity Subtyping: Bringing Precision Medicine Closer to Disparity Science." Cancer Epidemiology Biomarkers & Prevention, 29(6 Suppl):C018.
Yi C. and Huang J. (2017). "Semismooth Newton Coordinate Descent Algorithm for Elastic-Net Penalized Huber Loss Regression and Quantile Regression." J. Comp Graph. Statistics, 26(3):547-557.
Dazard J-E., Choe M., LeBlanc M., and Rao J.S. (2016). "Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods." Statistical Analysis and Data Mining, 9(1):12-42.
Dazard J-E., Choe M., LeBlanc M., and Rao J.S. (2015). "R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification." In JSM Proceedings, Statistical Programmers and Analysts Section. Seattle, WA, USA. American Statistical Association IMS - JSM, p. 650-664.
Dazard J-E., Choe M., LeBlanc M., and Rao J.S. (2014). "Cross-Validation of Survival Bump Hunting by Recursive Peeling Methods." In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA. American Statistical Association IMS - JSM, p. 3366-3380.

Examples

#===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")

## Not run: 
    #===================================================
    # PRIMsrc Package news
    #===================================================
    PRIMsrc.news()
    
    #===================================================
    # PRIMsrc Package citation
    #===================================================
    citation("PRIMsrc")
    
    #===================================================
    # Demo with a synthetic dataset
    # Use help for descriptions
    #===================================================
    data("Synthetic.1", package="PRIMsrc")
    ?Synthetic.1

## End(Not run)

#===================================================
# Simulated dataset #1 (n=250, p=3)
# Peeling criterion = LRT
# Cross-Validation criterion = LRT
# With Combined Cross-Validation (RCCV)
# Without Replications (B = 1)
# Without variable screening (pre-selection)
# Without computation of log-rank \eqn{p}-values
# Without parallelization
#===================================================
data("Synthetic.1", package="PRIMsrc")
synt1 <- sbh(X = Synthetic.1[ , -c(1,2), drop=FALSE],
             y = Synthetic.1[ ,1, drop=TRUE],
             groups = NULL,
             delta = Synthetic.1[ ,2, drop=TRUE],
             B = 1,
             K = 3,
             vs = FALSE,
             cv = TRUE,
             cvtype = "combined",
             cvarg = "alpha=0.05,
                      beta=0.10,
                      peelcriterion=\"lrt\",
                      cvcriterion=\"lrt\"",
             pv = FALSE,
             control = sbh.control(probval = 0.5), 
             parallel.vs = FALSE,
             parallel.rep = FALSE,
             parallel.pv = FALSE,
             conf = NULL,
             verbose = TRUE,
             seed = 123)

summary(object = synt1)
print(x = synt1)

n <- 100
p <- length(synt1$cvfit$cv.used)
x <- matrix(data = runif(n = n*p, min = 0, max = 1),
            nrow = n, ncol = p, byrow = FALSE,
            dimnames = list(1:n, paste("X", 1:p, sep="")))
synt1.pred <- predict(object = synt1,
                      newdata = x,
                      steps = synt1$cvfit$cv.nsteps)

plot(x = synt1,
     main = paste("Scatter plot for model #1", sep=""),
     proj = synt1$cvfit$cv.used[c(1,2)],
     steps = synt1$cvfit$cv.nsteps,
     pch = 16, cex = 0.5, col = c(1,2),
     boxes = TRUE,
     col.box = 2, lty.box = 2, lwd.box = 1,
     add.caption.box = TRUE,
     text.caption.box = paste("Step: ", synt1$cvfit$cv.nsteps, sep=""),
     device = NULL)

plot_profile(object = synt1,
             main = "Cross-validated tuning profiles for model #1",
             pch = 20, col = 1, lty = 1, lwd = 0.5, cex = 0.5,
             add.sd = TRUE, 
             add.profiles = TRUE,
             add.caption = TRUE, 
             text.caption = c("Mean","Std. Error"),
             device = NULL)

plot_traj(object = synt1,
          main = paste("Cross-validated peeling trajectories for model #1", sep=""),
          col = 1, lty = 1, lwd = 0.5, cex = 0.5,
          toplot = synt1$cvfit$cv.used,
          device = NULL)

plot_trace(object = synt1,
           main = paste("Cross-validated trace plots for model #1", sep=""),
           xlab = "Box Support", ylab = "Covariate Range (centered)",
           col = 1, lty = 1, lwd = 0.5, cex = 0.5,
           toplot = synt1$cvfit$cv.used,
           center = TRUE, scale = FALSE,
           device = NULL)

plot_km(object = synt1,
        main = paste("Cross-validated probability curves for model #1", sep=""),
        xlab = "Time", ylab = "Probability",
        ci = TRUE,
        col = c(1,2), lty = 1, lwd = 0.5, cex = 0.5,
        steps = 1:synt1$cvfit$cv.nsteps,
        plot.type = "bumps",
        bump.reference = "in-bump",
        group.reference = levels(synt1$groups)[1],
        add.caption = TRUE,
        text.caption = c("out-bump","in-bump"), 
        device = NULL)
                                
## Not run: 
    #===================================================
    # Examples of parallel backend parametrization 
    #===================================================
    if (require("parallel")) {
       cat("'parallel' is attached correctly \n")
    } else {
       stop("'parallel' must be attached first \n")
    }
    #===================================================
    # Ex. #1 - Multicore PC
    # Running WINDOWS
    # SOCKET communication cluster
    # Shared memory parallelization
    #===================================================
    cpus <- parallel::detectCores(logical = TRUE)
    conf <- list("spec" = rep("localhost", cpus),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Ex. #2 - Master node + 3 Worker nodes cluster
    # All nodes equipped with identical setups of multicores 
    # (8 core CPUs per machine for a total of 32)
    # SOCKET communication cluster
    # Distributed memory parallelization
    #===================================================
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("spec" = c(rep(masterhost, cpus),
                            rep(slavehosts, cpus)),
                 "type" = "SOCKET",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Ex. #3 - Enterprise Multinode Cluster w/ multicore/node  
    # Running LINUX with SLURM scheduler
    # MPI communication cluster
    # Distributed memory parallelization
    # Below, variable 'cpus' is the total number of requested 
    # tasks (threads/CPUs), which is specified from within a 
    # SLURM script.
    #==================================================
    if (require("Rmpi")) {
        print("Rmpi is loaded correctly \n")
    } else {
        stop("Rmpi must be installed first to use MPI\n")
    }
    cpus <- as.numeric(Sys.getenv("SLURM_NTASKS"))
    conf <- list("spec" = cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
    #===================================================
    # Simulated dataset #1 (n=250, p=3)
    # Peeling criterion = LRT
    # Cross-Validation criterion = LRT
    # With Combined Cross-Validation (RCCV)
    # With Replications (B = 30)
    # With PPL variable screening (pre-selection)
    # With computation of log-rank \eqn{p}-values
    # With parallelization
    #===================================================                         
    data("Synthetic.1", package="PRIMsrc")
    synt1 <- sbh(X = Synthetic.1[ , -c(1,2), drop=FALSE],
                 y = Synthetic.1[ ,1, drop=TRUE],
                 groups = NULL,
                 delta = Synthetic.1[ ,2, drop=TRUE],
                 B = 30,
                 K = 5,
                 A = 1000,
                 vs = TRUE,
                 vstype = "ppl",
                 vsarg = "alpha=1,
                          nalpha=1,
                          nlambda=100",
                 cv = TRUE,
                 cvtype = "combined",
                 cvarg = "alpha=0.01,
                          beta=0.10,
                          peelcriterion=\"lrt\",
                          cvcriterion=\"lrt\"",
                 pv = TRUE,
                 control = sbh.control(probval = 0.5, 
                                       vscons = 0.5), 
                 parallel.vs = FALSE,
                 parallel.rep = TRUE,
                 parallel.pv = TRUE,
                 conf = conf,
                 verbose = TRUE,
                 seed = 123)      
    #===================================================
    # Simulated dataset #4 (n=100, p=1000)
    # Peeling criterion = LRT
    # Cross-Validation criterion = CER
    # With Combined Cross-Validation (RCCV)
    # With Replications (B = 30)
    # With PRSP variable screening (pre-selection)
    # With computation of log-rank \eqn{p}-values
    # With parallelization
    #===================================================                         
    data("Synthetic.4", package="PRIMsrc")
    synt4 <- sbh(X = Synthetic.4[ , -c(1,2), drop=FALSE],
                 y = Synthetic.4[ ,1, drop=TRUE],
                 groups = NULL,
                 delta = Synthetic.4[ ,2, drop=TRUE],
                 B = 30,
                 K = 5,
                 A = 1000,
                 vs = TRUE,
                 vstype = "prsp",
                 vsarg = "alpha=0.01,
                          beta=0.10,
                          msize=NULL,
                          peelcriterion=\"lrt\",
                          cvcriterion=\"cer\"",
                 cv = TRUE,
                 cvtype = "combined",
                 cvarg = "alpha=0.01,
                          beta=0.10,
                          peelcriterion=\"lrt\",
                          cvcriterion=\"cer\"",
                 pv = TRUE,
                 control = sbh.control(probval = 0.5, 
                                       vscons = 0.5), 
                 parallel.vs = FALSE,
                 parallel.rep = TRUE,
                 parallel.pv = TRUE,
                 conf = conf,
                 verbose = TRUE,
                 seed = 123)

## End(Not run)

jedazard/PRIMsrc documentation built on June 13, 2025, 10:13 p.m.