This R package computes non-parametric and semi-parametric estimates of accuracy measures for risk prediction markers from survival data under two phase study designs, namely the case-cohort (cch) and the nested case-control (ncc) study designs.
The accuracy measures that can be estimated include the AUC, TPR(c), FPR(c), PPV(c), and NPV(c) for for a specific timepoint and fixed marker cutoff value c. Standard errors, along with normal approximated confidence intervals are also calculated. Note: standard errors for non-parametric estimates from a cch design are currently not available.
Below is a brief tutorial to get you started. For more information regarding estimation procedures see the references below.
#download the package from github if (!require("devtools")) install.packages("devtools") devtools::install_github("mdbrown/survMarkerTwoPhase")
library(survMarkerTwoPhase) #simulated data for illustration data(SimData) head(SimData)
For the case-cohort design we sample n=150 from entire cohort, and include all participants with observed failures.
#generate a sub-cohort from SimData set.seed(12321) #create a sample index. 1 if sampled, 0 if not N <- nrow(SimData) sampleInd <- rep(0, N) # sample all with observed failure time. (200 individuals) sampleInd[SimData$status==1] <- 1 #sample 150 more observations from the entire data set without replacement sampleInd[sample(1:N, 150)] <- 1 table(sampleInd) #total number of subcohort is 293
To estimate accuracy measures for the case-cohort design, sample weights must be calculated. The sample weights w = 1/Pr(Sampled from cohort) for each observation included in the sub-cohort.
cohortData_cch <- SimData #first calculate the Pr(Sampled from cohort) for each observation sampleProb <- numeric(500) #all non-censored observations were sampled, so their sample probability is 1 sampleProb[cohortData_cch$status==1] <- 1 #all other individuals had a 150/N chance to be sampled sampleProb[cohortData_cch$status==0] <- 150/N #the sample weights are 1/(probability of being sampled) cohortData_cch$weights <- 1/sampleProb #indicator of inclusion in the subcohort cohortData_cch$subcohort = sampleInd #marker data is unavailable for those not in the subcohort cohortData_cch$Y[sampleInd==0] = NA
Here we estimate using non-parametric estimation methods using double inverse probability weighting.
#estimate accuracy measures using non-parametric estimates #by setting estimation.method = "NP" survMTP.cch(time =survTime, event = status, marker = Y, weights = weights, subcoh = subcohort, data = cohortData_cch, estimation.method = "NP", predict.time = 2, marker.cutpoint = 0)
Now estimate measures using semi-parametric estimation methods. These methods assume a Cox proportional hazards model.
#estimate accuracy measures using semi-parametric estimates survMTP.cch(time =survTime, event = status, marker = Y, weights = weights, subcoh = subcohort, data = cohortData_cch, estimation.method = "SP", predict.time = 2, marker.cutpoint = 0)
For more information see ?survMTP.cch
.
For a nested case-control sample, we need to input the full cohort data, with an indicator for inclusion in the subcohort, as well as a dataframe of risk sets, where each row is a case id followed by the corresponding id's that were matched to it.
Generate a nested case-control subcohort using the function ccwc
from the Epi
package.
require('Epi') #2 matched controls for each case nmatch = 2 cohortData_ncc <- SimData cohortData_ncc$id = 1:dim(cohortData_ncc)[1] subcohort_ncc <- ccwc(exit = survTime, fail= status, data = cohortData_ncc, controls = nmatch) # match 2 controls for each case. #indicator for inclusion in the subcohort sampleInd = rep(0, nrow(cohortData_ncc)) #initialize all equal to zero sampleInd[subcohort_ncc$Map] = 1 cohortData_ncc$subcohort = sampleInd table(sampleInd) #subcohort sample size of 352 #marker data is unavailable for those not in the subcohort cohortData_ncc$Y[sampleInd==0] = NA #now we need to build the set matrix, #which will be dimension (# of cases) x (nmatch + 1), so 200x3 here #each row denotes a selected set, with the case id followed by the matched control ids Sets <- matrix(nrow = sum(cohortData_ncc$status), ncol = nmatch +1) for(i in subcohort_ncc$Set){ Sets[i, ] <- unlist(subset(subcohort_ncc, Set==i, select=Map)) }
Now we are ready to calculate the measures using survMTP.ncc
. Standard errors for both SP and NP estimates are calculated using perturbation. NP estimates use kernel smoothing while SP estimates assume a Cox proportional hazards model.
#Nonparametric estimates survMTP.ncc(time = survTime, event = status, marker=Y, subcoh=subcohort, id = id, data = cohortData_ncc, sets = Sets, estimation.method = "NP", predict.time = 2 , marker.cutpoint = 0)
#Semiparametric estimates survMTP.ncc(time = survTime, event = status, marker=Y, subcoh=subcohort, id = id, data = cohortData_ncc, sets = Sets, estimation.method = "SP", predict.time = 2 , marker.cutpoint = 0)
See ?survMTP.ncc
for further help.
Liu D, Cai T, Zheng Y. Evaluating the predictive value of biomarkers with stratified case-cohort design. Biometrics 2012, 4: 1219-1227.
Pepe MS, Zheng Y, Jin Y. Evaluating the ROC performance of markers for future events. Lifetime Data Analysis. 2008, 14: 86-113.
Zheng Y, Cai T, Pepe MS, Levy, W. Time-dependent predictive values of prognostic biomarkers with failure time outcome. JASA 2008, 103: 362-368.
Cai T. and Zheng Y . Resampling Procedures for Making Inference under Nested Case-control Studies. JASA 2013 (in press).
Cai T and Zheng Y/, Evaluating Prognostic Accuracy of Biomarkers under nested case-control studies. Biostatistics* 2012, 13,1, 89-100.
(* equal contributor and corresponding author).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.