Estimate Cox model hazard ratios for covariates with missing data

Share:

Description

nested.coxph fits the Cox model to estimate hazard ratios for covariates that are missing data on some cohort members. All covariates may be continous or categorical. nested.coxph requires knowledge of the variables that missingness depends on, with missingness probability modeled through a glm sampling model. Often, the data is in the form of a case-control sample taken within a cohort. nested.coxph allows cases to have missing data, and can extract efficiency from auxiliary variables by including them in the sampling model. nested.coxph requires coxph from the survival package.

Usage

1
2
3
4
5
nested.coxph(coxformula, samplingmod, data, outputsamplingmod = FALSE,
             glmlink = binomial(link = "logit"),
             glmcontrol = glm.control(epsilon = 1e-10, maxit = 10, trace = FALSE),
             coxphcontrol = coxph.control(eps = 1e-10, iter.max = 50),
             missvarwarn = TRUE, ...)

Arguments

Required arguments:

coxformula

Standard coxph formula

samplingmod

Right side of the formula for the glm sampling model that models the probability of missingness

data

Data Frame that all variables are in

Optional arguments:

outputsamplingmod

Output the sampling model, default is false

glmlink

Sampling model link function, default is logistic regression

glmcontrol

See glm.control

coxphcontrol

See coxph.control

missvarwarn

Warn if there is missing data in the sampling variable. Default is TRUE

...

Any additional arguments to be passed on to glm or coxph

Details

If nested.coxph reports that the sampling model "failed to converge", the sampling model will be returned for your inspection. Note that if some sampling probabilities are estimated at 1, the model technically cannot converge, but you get very close to 1, and nested.coxph will not report non-convergence for this situation.

Note these issues. The data must be in a dataframe and specified in the data statement. No variable can be named 'o.b.s.e.r.v.e.d.' or 'p.i.h.a.t.'. Cases and controls cannot be finely matched on time, but matching on time within large strata is allowed. cluster() statements are not allowed in coxformula. Allows left truncation, staggered entry, open cohorts, and stratified baseline hazards. Must use Breslow Tie-Breaking.

Value

If outputsamplingmod=FALSE, the output are the hazard ratios and the coxph model. Any method for coxph objects will work for this so long as that method only requires consistent estimates of the parameters and their standard errors. If outputsamplingmod=TRUE, then the sampling model is also returned, and the output is a list with components:

coxmod

The Cox model of class coxph

samplingmod

The sampling model of class glm

Note

Requires the MASS library from the VR bundle that is available from the CRAN website.

Author(s)

Hormuzd A. Katki

References

Katki HA, Mark SD. Survival Analysis for Cohorts with Missing Covariate Information. R-News, 8(1) 14-9, 2008. http://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf

Mark, S.D. and Katki, H.A. Specifying and Implementing Nonparametric and Semiparametric Survival Estimators in Two-Stage (sampled) Cohort Studies with Missing Case Data. Journal of the American Statistical Association, 2006, 101, 460-471.

Mark SD, Katki H. Influence function based variance estimation and missing data issues in case-cohort studies. Lifetime Data Analysis, 2001; 7; 329-342

Christian C. Abnet, Barry Lai, You-Lin Qiao, Stefan Vogt, Xian-Mao Luo, Philip R. Taylor, Zhi-Wei Dong, Steven D. Mark, Sanford M. Dawsey. Zinc concentration in esophageal biopsies measured by X-ray fluorescence and cancer risk. To Appear in Journal of the National Cancer Institute.

See Also

See Also: nested.stdsurv, zinc, nested.km, coxph, glm

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
## Simple analysis of zinc and esophageal cancer data:
## We sampled zinc (variable zncent) on a fraction of the subjects, with
## sampling fractions depending on cancer status and baseline histology.
## We observed the confounding variables on almost all subjects.
data(zinc)
coxmod <- nested.coxph(coxformula="Surv(futime01,ec01==1)~
          sex+agepill+smoke+drink+mildysp+moddysp+sevdysp+anyhist+zncent",
          samplingmod="ec01*basehist",data=zinc)
summary(coxmod)

# This is the output:
# Call:
# coxph(formula = as.formula(coxformula), data = data, weights = 1/p.i.h.a.t., 
# subset = TRUE, na.action = na.omit, control = coxphcontrol, 
# x = TRUE, method = "breslow")

# n= 123, number of events= 56 
# (308 observations deleted due to missingness)

#                              coef exp(coef) se(coef)     z Pr(>|z|)    
# sexMale                    0.2953    1.3436   0.5558  0.53   0.5952    
# agepill                    0.0539    1.0554   0.0275  1.96   0.0499 *  
# smokeEver                  0.0145    1.0146   0.5870  0.02   0.9803    
# drinkEver                 -0.8548    0.4254   0.5896 -1.45   0.1471    
# mildyspMild Dysplasia      0.9023    2.4653   0.3937  2.29   0.0219 *  
# moddyspModerate Dysplasia  1.3309    3.7845   0.4212  3.16   0.0016 ** 
# sevdyspSevere Dysplasia    2.1334    8.4439   0.4615  4.62  3.8e-06 ***
# anyhistFamily History      0.0904    1.0946   0.3896  0.23   0.8165    
# zncent                    -0.2498    0.7789   0.1351 -1.85   0.0645 .  

#                           exp(coef) exp(-coef) lower .95 upper .95
# sexMale                       1.344      0.744     0.452      3.99
# agepill                       1.055      0.948     1.000      1.11
# smokeEver                     1.015      0.986     0.321      3.21
# drinkEver                     0.425      2.351     0.134      1.35
# mildyspMild Dysplasia         2.465      0.406     1.140      5.33
# moddyspModerate Dysplasia     3.784      0.264     1.658      8.64
# sevdyspSevere Dysplasia       8.444      0.118     3.417     20.86
# anyhistFamily History         1.095      0.914     0.510      2.35
# zncent                        0.779      1.284     0.598      1.02

# Concordance= NA  (se = NA )
# Rsquare= NA   (max possible= NA )
# Likelihood ratio test= NA  on 9 df,   p=NA
# Wald test            = 65.1  on 9 df,   p=1.36e-10
# Score (logrank) test = NA  on 9 df,   p=NA