# prodlim: Functions for estimating probabilities from right censored... In prodlim: Product-Limit Estimation for Censored Event History Analysis

## Description

Functions for estimating probabilities from right censored data

Nonparametric estimation in event history analysis. Featuring fast algorithms and user friendly syntax adapted from the survival package. The product limit algorithm is used for right censored data; the self-consistency algorithm for interval censored data.

## Usage

 ```1 2 3 4 5``` ```prodlim(formula, data = parent.frame(), subset, na.action = NULL, reverse = FALSE, conf.int = 0.95, bandwidth = NULL, caseweights, discrete.level = 3, x = TRUE, maxiter = 1000, grid, tol = 7, method = c("npmle", "one.step", "impute.midpoint", "impute.right"), exact = TRUE, type) ```

## Arguments

 `formula` A formula whose left hand side is a `Hist` object. In some special cases it can also be a `Surv` response object, see the details section. The right hand side is as usual a linear combination of covariates which may contain at most one continuous factor. Whether or not a covariate is recognized as continuous or discrete depends on its class and on the argument `discrete.level`. The right hand side may also be used to specify clusters, see the details section. `data` A data.frame in which all the variables of `formula` can be interpreted. `subset` Passed as argument `subset` to function `subset` which applied to `data` before the formula is processed. `na.action` All lines in data with any missing values in the variables of formula are removed. `reverse` For right censored data, if reverse=TRUE then the censoring distribution is estimated. `conf.int` The level (between 0 and 1) for two-sided pointwise confidence intervals. Defaults to 0.95. Remark: only plain Wald-type confidence limits are available. `bandwidth` Smoothing parameter for nearest neighborhoods based on the values of a continuous covariate. See function `neighborhood` for details. `caseweights` Weights applied to the contribution of each subject to change the number of events and the number at risk. This can be used for bootstrap and survey analysis. Should be a vector of the same length and the same order as `data`. `discrete.level` Numeric covariates are treated as factors when their number of unique values exceeds not `discrete.level`. Otherwise the product limit method is applied, in overlapping neighborhoods according to the bandwidth. `x` logical value: if `TRUE`, the full covariate matrix with is returned in component `model.matrix`. The reduced matrix contains unique rows of the full covariate matrix and is always returned in component `X`. `maxiter` For interval censored data only. Maximal number of iterations to obtain the nonparametric maximum likelihood estimate. Defaults to 1000. `grid` For interval censored data only. When method=one.step grid for one-step product limit estimate. Defaults to sorted list of unique left and right endpoints of the observed intervals. `tol` For interval censored data only. Numeric value whose negative exponential is used as convergence criterion for finding the nonparametric maximum likelihood estimate. Defaults to 7 meaning exp(-7). `method` For interval censored data only. If equal to `"npmle"` (the default) use the usual Turnbull algorithm, else the product limit version of the self-consistent estimate. `exact` If TRUE the grid of time points used for estimation includes all the L and R endpoints of the observed intervals. `type` In two state models either `"surv"` for the Kaplan-Meier estimate of the survival function or `"cuminc"` for 1-Kaplan-Meier. Default is `"surv"` when `reverse==FALSE` and `"cuminc"` when `reverse==TRUE`. In competing risks models it has to be `"cuminc"` Aalen-Johansen estimate of the cumulative incidence function.

## Details

The response of `formula` (ie the left hand side of the ‘~’ operator) specifies the model.

In two-state models – the classical survival case – the standard Kaplan-Meier method is applied. For this the response can be specified as a `Surv` or as a `Hist` object. The `Hist` function allows you to change the code for censored observations, e.g. `Hist(time,status,cens.code="4")`.

Besides a slight gain of computing efficiency, there are some extensions that are not included in the current version of the survival package:

(0) The Kaplan-Meier estimator for the censoring times `reverse=TRUE` is correctly estimated when there are ties between event and censoring times.

(1) A conditional version of the kernel smoothed Kaplan-Meier estimator for at most one continuous predictors using nearest neighborhoods (Beran 1981, Stute 1984, Akritas 1994).

(2) For cluster-correlated data the right hand side of `formula` may identify a `cluster` variable. In that case Greenwood's variance formula is replaced by the formula of Ying \& Wei (1994).

(3) Competing risk models can be specified via `Hist` response objects in `formula`.

The Aalen-Johansen estimator is applied for estimating the cumulative incidence functions for all causes. The advantage over the function `cuminc` of the cmprsk package are user-friendly model specification via `Hist` and sophisticated print, summary, predict and plot methods.

Under construction:

(U0) Interval censored event times specified via `Hist` are used to find the nonparametric maximum likelihood estimate. Currently this works only for two-state models and the results should match with those from the package ‘Icens’.

(U1) Extensions to more complex multi-states models

(U2) The nonparametric maximum likelihood estimate for interval censored observations of competing risks models.

## Value

Object of class "prodlim". See `print.prodlim`, `predict.prodlim`, predict, `summary.prodlim`, `plot.prodlim`.

## Author(s)

Thomas A. Gerds [email protected]

Thomas A. Gerds <[email protected]>

## References

Andersen, Borgan, Gill, Keiding (1993) Springer 'Statistical Models Based on Counting Processes'

Akritas (1994) The Annals of Statistics 22, 1299-1327 Nearest neighbor estimation of a bivariate distribution under random censoring.

R Beran (1981) http://anson.ucdavis.edu/~beran/paper.html 'Nonparametric regression with randomly censored survival data'

Stute (1984) The Annals of Statistics 12, 917–926 'Asymptotic Normality of Nearest Neighbor Regression Function Estimates'

Ying, Wei (1994) Journal of Multivariate Analysis 50, 17-29 The Kaplan-Meier estimate for dependent failure time observations

`predictSurv`, `predictSurvIndividual`, `predictCuminc`, `Hist`, `neighborhood`, `Surv`, `survfit`, `strata`,
 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95``` ```##---------------------two-state survival model------------ dat <- SimSurv(30) with(dat,plot(Hist(time,status))) fit <- prodlim(Hist(time,status)~1,data=dat) print(fit) plot(fit) summary(fit) quantile(fit) ## Subset fit1a <- prodlim(Hist(time,status)~1,data=dat,subset=dat\$X1==1) fit1b <- prodlim(Hist(time,status)~1,data=dat,subset=dat\$X1==1 & dat\$X2>0) ## --------------------clustered data--------------------- library(survival) cdat <- cbind(SimSurv(30),patnr=sample(1:5,size=30,replace=TRUE)) fit <- prodlim(Hist(time,status)~cluster(patnr),data=cdat) print(fit) plot(fit) summary(fit) ##-----------compare Kaplan-Meier to survival package--------- dat <- SimSurv(30) pfit <- prodlim(Surv(time,status)~1,data=dat) pfit <- prodlim(Hist(time,status)~1,data=dat) ## same thing sfit <- survfit(Surv(time,status)~1,data=dat,conf.type="plain") ## same result for the survival distribution function all(round(pfit\$surv,12)==round(sfit\$surv,12)) summary(pfit,digits=3) summary(sfit,times=quantile(unique(dat\$time))) ##-----------estimating the censoring survival function---------------- rdat <- data.frame(time=c(1,2,3,3,3,4,5,5,6,7),status=c(1,0,0,1,0,1,0,1,1,0)) rpfit <- prodlim(Hist(time,status)~1,data=rdat,reverse=TRUE) rsfit <- survfit(Surv(time,1-status)~1,data=rdat,conf.type="plain") ## When there are ties between times at which events are observed ## times at which subjects are right censored, then the convention ## is that events come first. This is not obeyed by the above call to survfit, ## and hence only prodlim delivers the correct reverse Kaplan-Meier: cbind("Wrong:"=rsfit\$surv,"Correct:"=rpfit\$surv) ##-------------------stratified Kaplan-Meier--------------------- pfit.X2 <- prodlim(Surv(time,status)~X2,data=dat) summary(pfit.X2) summary(pfit.X2,intervals=TRUE) plot(pfit.X2) ##----------continuous covariate: Stone-Beran estimate------------ prodlim(Surv(time,status)~X1,data=dat) ##-------------both discrete and continuous covariates------------ prodlim(Surv(time,status)~X2+X1,data=dat) ##----------------------interval censored data---------------------- dat <- data.frame(L=1:10,R=c(2,3,12,8,9,10,7,12,12,12),status=c(1,1,0,1,1,1,1,0,0,0)) with(dat,Hist(time=list(L,R),event=status)) dat\$event=1 npmle.fitml <- prodlim(Hist(time=list(L,R),event)~1,data=dat) ##-------------competing risks------------------- CompRiskFrame <- data.frame(time=1:100,event=rbinom(100,2,.5),X=rbinom(100,1,.5)) crFit <- prodlim(Hist(time,event)~X,data=CompRiskFrame) summary(crFit) plot(crFit) summary(crFit,cause=2) plot(crFit,cause=2) # Changing the cens.code: dat <- data.frame(time=1:10,status=c(1,2,1,2,5,5,1,1,2,2)) fit <- prodlim(Hist(time,status)~1,data=dat) print(fit\$model.response) fit <- prodlim(Hist(time,status,cens.code="2")~1,data=dat) print(fit\$model.response) plot(fit) plot(fit,cause="5") ##------------delayed entry---------------------- ## left-truncated event times with competing risk endpoint dat <- data.frame(entry=c(7,3,11,12,11,2,1,7,15,17,3),time=10:20,status=c(1,0,2,2,0,0,1,2,0,2,0)) fitd <- prodlim(Hist(time=time,event=status,entry=entry)~1,data=dat) summary(fitd) plot(fitd) ```