dpweib: Dirichlet process mixture/Dependent Dirichlet process model...
In DPWeibull: Dirichlet Process Weibull Mixture Model for Survival Data


Fits a Dirichlet process mixture or dependent Dirichlet process Weibull model to survival data, with or without competing risks. When regression covariates are present, the model is a dependent Dirichlet process model. For competing risks data, only two potential causes of events are considered; the user can combine events of secondary interest into a single cause. In the competing risks regression model, the estimates focus on the primary cause (cause 1); the user can switch the event indicator to obtain estimates for the secondary cause.

dpweib(formula,data, high.pct = NULL, predtime = NULL, comp = FALSE,
alpha = 0.05, simultaneous = FALSE, burnin = 8000, iteration = 2000,
alpha00 = 1.354028, alpha0 = 0.03501257, lambda00 = 7.181247,
alphaalpha = 0.2, alphalambda = 0.1, a = 1, b = 1, gamma0 = 1, 
gamma1 = 1, thin = 10, betasl = 2.5, addgroup = 2)

`formula`	A formula written in the usual y ~ x_1 + x_2 + … + x_p regression format. y is a Surv object for survival data (including interval censored data) and a Hist object for competing risks data. The regression covariates can be continuous or factors. Since the model is flexible enough, interaction terms are not necessary.

`data`	an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which dpweib is called.
`high.pct`	A user-supplied estimate of a high percentile (the 95th) of the data-generating distribution for the average population. If the user does not provide this value, it is derived from the data. If there is no censoring, the 95th percentile of the observed data is used. If censoring accounts for less than 15% of the observations, the maximum observed time is used. If censoring accounts for more than 15%, a scaling parameter is suggested by first finding the time t that corresponds to the observed survival rate at the end of study on the plot of the median of the components (survmedian) generated by the LIO prior on a 0 to 10 scale, and then setting the scaling parameter to the largest observation time multiplied by 10/t.
`predtime`	A vector given by the user to specify the time points where the inferences will be made. If the user does not provide it, we will take 40 time points located evenly from the beginning to the high.pct.
`comp`	A logical value indicating whether this is competing risks data or not. The default is FALSE.
`alpha`	1-α is the probability for constructing credible intervals. The default α is 0.05.
`simultaneous`	A logical value indicating whether to provide simultaneous credible intervals. The default is FALSE.
`burnin`	Number of burn-in iterations. The default is 8000, as in the usage above.
`iteration`	Number of iterations. The default is 2000, as in the usage above.
`alpha00`	Parameter for the base distribution of λ in non-competing risks data model and λ_1, λ_2 in competing risks data model. The default is 1.354028.
`alpha0`	Parameter for the base distribution of λ in non-competing risks data model and λ_1, λ_2 in competing risks data model. The default is 0.03501257.
`lambda00`	Parameter for the base distribution of λ in non-competing risks data model and λ_1, λ_2 in competing risks data model. The default is 7.181247.
`alphaalpha`	Parameter for the base distribution of α in non-competing risks data model and α_1, α_2 in competing risks data model. The default is 0.2.
`alphalambda`	Parameter for the base distribution of α in non-competing risks data model and α_1, α_2 in competing risks data model. The default value is 0.1.
`a`	Parameter for the gamma prior of the concentration parameter of DP. The default is 1.
`b`	Parameter for the gamma prior of the concentration parameter of DP. The default is 1.
`gamma0`	Parameter for the base distribution of p in competing risks data model. The default value is 1.
`gamma1`	Parameter for the base distribution of p in competing risks data model. The default value is 1.
`thin`	Thinning. The default value is 10.
`betasl`	Parameter for the base distribution of the regression coefficients β in non-competing risks data model and β_1 and β_2 in competing risks data model. The default value is 2.5.
`addgroup`	Number of new parameters proposed for each cluster assignment. The default is 2 (suggested by Neal).
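The fallback rules for `high.pct` and `predtime` described above can be sketched in base R. This is only an illustration of the first two branches as read from the argument descriptions, not the package's internal code: the helper name `default_high_pct`, the use of `quantile()`, the reading of "maximum of the observed time" as `max(time)`, and the choice of 0 as "the beginning" are all assumptions.

```r
# Hypothetical helper (not part of DPWeibull): documented fallback for high.pct
# when censoring is absent or light. The heavy-censoring branch, which relies on
# the LIO prior plot, is not reproduced here.
default_high_pct <- function(time, status) {
  cens_frac <- mean(status == 0)          # fraction of censored observations
  if (cens_frac == 0) {
    unname(quantile(time, 0.95))          # no censoring: 95th percentile
  } else if (cens_frac < 0.15) {
    max(time)                             # light censoring: largest observed time
  } else {
    NA_real_                              # heavy censoring: see the description above
  }
}

time   <- c(2, 5, 7, 11, 13, 17, 19, 23, 29, 31)
status <- rep(1, 10)                      # all events observed
hp <- default_high_pct(time, status)

# Assumed reading of the predtime default: 40 evenly spaced points up to high.pct,
# taking 0 as "the beginning".
predtime <- seq(0, hp, length.out = 40)
```

Supplying `high.pct` and `predtime` explicitly avoids depending on any of these defaults.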

When there are no regression covariates and no competing risks, dpweib implements a Dirichlet process Weibull mixture model. The basic form of the model is as follows.

\begin{array}{rl} y_i|α_i,λ_i&\sim Weib(y_i|α_i,λ_i),\quad i=1,...,n\\ (α_i,λ_i)|G&\sim G,\quad i=1,...,n\\ G&\sim DP(G_0,ν)\\ G_0&=Ga(λ|α_0,λ_0) I_{(f(λ),∞)}(α) Ga(α_{α},λ_{α})\\ λ_0&\sim Ga(α_{00},λ_{00})\\ ν&\sim Ga(a,b)\\ \end{array}

where f(λ)=max(0,\log\{\log(20)/λ\}/\log(25)).
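A short calculation shows what the lower bound f(λ) implies under the Weibull parameterization used on this page, whose survival function is S(t)=exp(-λ t^α): the condition α > \log\{\log(20)/λ\}/\log(25) is equivalent to S(25) < 0.05. The base R sketch below checks this numerically; the function names are illustrative and not part of the package.

```r
# Lower truncation bound for alpha as a function of lambda (formula from the text).
f_bound <- function(lambda) pmax(0, log(log(20) / lambda) / log(25))

# Weibull survival function in this page's parameterization: S(t) = exp(-lambda * t^alpha).
surv_weib <- function(t, alpha, lambda) exp(-lambda * t^alpha)

lambda <- c(0.01, 0.1, 1)
bound  <- f_bound(lambda)

# At alpha = f(lambda), with a positive bound, survival at t = 25 equals 1/20;
# any larger alpha pushes it below 0.05.
s_at_bound <- surv_weib(25, bound, lambda)
s_above    <- surv_weib(25, bound + 0.1, lambda)
```

For λ ≥ log(20) the bound is 0, because exp(-λ 25^α) is then already at most 0.05 for every α > 0.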

For regression data without competing risks, the method is a mixture of Cox models.

\begin{array}{rl} y_i|α_i,λ_i,\boldsymbol{β_i}, \mathbf{Z_i}&\sim Weib(y_i|α_i,λ_i\exp(\mathbf{Z_i^T}\boldsymbol{β_i})),\quad i=1,...,n\\ (α_i,λ_i,\boldsymbol{β_i})|G&\sim G,\quad i=1,...,n\\ G&\sim DP(G_0,ν)\\ G_0&=Ga(λ|α_0,λ_0) I_{(f(λ),u)}(α) Ga(α_{α},λ_{α}) q(\boldsymbol{β})\\ λ_0&\sim Ga(α_{00},λ_{00})\\ ν&\sim Ga(a,b)\\ \end{array}

The density function corresponding to this Weibull notation is p(y_i|α_i,λ_i)=λ_iα_i y_i^{α_i-1}e^{-λ_i y_i^{α_i}},\quad y_i>0,\quad α_i>0,\quad λ_i>0. [x]=Ga(α,λ) denotes that the density function of x is \displaystyle\frac{λ^{α}}{Γ(α)}x^{α-1}e^{-λ x}, α>0, λ>0, x>0. q(β) is the base distribution for the regression coefficients. The details of the choice of base distribution are described in our forthcoming paper.
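The Weibull notation above differs from the shape/scale form used by base R. Matching the two densities gives scale = λ^{-1/α}, and Ga(α,λ) as written is the shape/rate form of `dgamma()`. The sketch below checks both correspondences numerically; `dweib_al` is an illustrative name, not a package function.

```r
# Weibull density in this page's (alpha, lambda) parameterization.
dweib_al <- function(y, alpha, lambda) {
  lambda * alpha * y^(alpha - 1) * exp(-lambda * y^alpha)
}

y <- c(0.5, 1, 2, 4)
alpha <- 1.5
lambda <- 0.3

# Base R uses shape/scale; equating the two forms gives scale = lambda^(-1/alpha).
d_page <- dweib_al(y, alpha, lambda)
d_base <- dweibull(y, shape = alpha, scale = lambda^(-1 / alpha))

# Ga(alpha, lambda) as written above is the shape/rate form of dgamma().
g_page <- lambda^alpha / gamma(alpha) * y^(alpha - 1) * exp(-lambda * y)
g_base <- dgamma(y, shape = alpha, rate = lambda)
```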

In competing risks data, the likelihood for each individual can be written as

L=\{f_1(t_i)\}^{I(c_i=1)}\{f_2(t_i)\}^{I(c_i=2)}\{1-F_1(t_i)-F_2(t_i)\}^{I(c_i=0)},

where f_1(\cdot) and f_2(\cdot) are the cause-specific density functions for causes 1 and 2, and the survival function for the ith observation can be expressed as 1-F_1(t_i)-F_2(t_i). To model this, we introduce a parameter p, the value of the cumulative incidence function of the primary cause at ∞, p=F_1(∞). The likelihood can then be written as

L=\{pd_1(t_i)\}^{I(c_i=1)}\{(1-p)d_2(t_i)\}^{I(c_i=2)}\{1-pD_1(t_i)-(1-p)D_2(t_i)\}^{I(c_i=0)} .
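To make the three cases of this likelihood concrete, the sketch below evaluates the log-likelihood for a few observations. It is a simplification under stated assumptions: a single Weibull component stands in for each of D_1 and D_2 (the model itself uses Weibull mixtures), and the function names and parameter values are illustrative only.

```r
# Normalized Weibull CDF and density in this page's parameterization
# (one component per cause, purely for illustration).
D <- function(t, alpha, lambda) 1 - exp(-lambda * t^alpha)
d <- function(t, alpha, lambda) lambda * alpha * t^(alpha - 1) * exp(-lambda * t^alpha)

# Log-likelihood of observations (t, cause), where cause 0 means censored.
loglik_cr <- function(t, cause, p, a1, l1, a2, l2) {
  ll <- ifelse(cause == 1, log(p * d(t, a1, l1)),
        ifelse(cause == 2, log((1 - p) * d(t, a2, l2)),
               log(1 - p * D(t, a1, l1) - (1 - p) * D(t, a2, l2))))
  sum(ll)
}

t     <- c(1.2, 0.7, 3.1, 2.4)
cause <- c(1, 2, 0, 1)
ll <- loglik_cr(t, cause, p = 0.6, a1 = 1.3, l1 = 0.4, a2 = 0.9, l2 = 0.2)
```

The censored term 1-pD_1(t)-(1-p)D_2(t) equals p\{1-D_1(t)\}+(1-p)\{1-D_2(t)\}, a p-weighted average of the two normalized survival functions.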

Here D_{1}, D_{2}, d_{1}, d_{2} are the normalized baseline cumulative incidence functions and cause-specific density functions, modeled with Weibull mixtures as above, while p is the normalizing parameter for the baseline distribution. When regression covariates are present in competing risks data, we modify the above likelihood according to the values of the covariates, such that

F_1(t|\mathbf{Z},\boldsymbol{β_1},p) = 1-(1-pD_{01}(t))^{\exp(\mathbf{Z^T}\boldsymbol{β_1})}.

The cause-specific density function for cause 1 is

f_1(t|\mathbf{Z},\boldsymbol{β_1},p)=\exp(\mathbf{Z^T}\boldsymbol{β_1})[1-pD_{01}(t)]^{\exp(\mathbf{Z^T}\boldsymbol{β_1})-1}pd_{01}(t).

The model for the secondary cause is defined as

F_2(t|\mathbf{Z},\boldsymbol{β_1},\boldsymbol{β_2},p)=(1-p)^{\exp(\mathbf{Z^T}\boldsymbol{β_1})} (1-(1-D_{02}(t))^{\exp(\mathbf{Z^T}\boldsymbol{β_2})}),

which leads to the cause-specific subdensity function for cause 2 as

f_2(t|\mathbf{Z},\boldsymbol{β_1},\boldsymbol{β_2},p)=(1-p)^{\exp(\mathbf{Z^T}\boldsymbol{β_1})}(1-D_{02}(t))^{\exp(\mathbf{Z^T}\boldsymbol{β_2})-1}\exp(\mathbf{Z^T}\boldsymbol{β_2})d_{02}(t).
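Two properties of these formulas can be checked numerically: F_1 and F_2 add up to 1 as t grows, and f_1 is the derivative of F_1. The base R sketch below does so under simplifying assumptions: a single Weibull component replaces each baseline mixture, `eta1` and `eta2` stand for the linear predictors \mathbf{Z^T}\boldsymbol{β_1} and \mathbf{Z^T}\boldsymbol{β_2}, and all names and values are illustrative.

```r
# Baseline (normalized) Weibull CDF and density; one component per cause, for illustration only.
D0 <- function(t, alpha, lambda) 1 - exp(-lambda * t^alpha)
d0 <- function(t, alpha, lambda) lambda * alpha * t^(alpha - 1) * exp(-lambda * t^alpha)

# eta1 and eta2 stand for the linear predictors Z'beta_1 and Z'beta_2.
F1 <- function(t, eta1, p, a1, l1) 1 - (1 - p * D0(t, a1, l1))^exp(eta1)
f1 <- function(t, eta1, p, a1, l1) {
  exp(eta1) * (1 - p * D0(t, a1, l1))^(exp(eta1) - 1) * p * d0(t, a1, l1)
}
F2 <- function(t, eta1, eta2, p, a2, l2) {
  (1 - p)^exp(eta1) * (1 - (1 - D0(t, a2, l2))^exp(eta2))
}

eta1 <- 0.5; eta2 <- -0.3; p <- 0.6

# The limits of the two cumulative incidence functions add up to 1.
total_at_inf <- F1(Inf, eta1, p, 1.3, 0.4) + F2(Inf, eta1, eta2, p, 0.9, 0.2)

# f1 is the derivative of F1: compare with a central finite difference at t = 2.
h <- 1e-5
f1_numeric <- (F1(2 + h, eta1, p, 1.3, 0.4) - F1(2 - h, eta1, p, 1.3, 0.4)) / (2 * h)
f1_exact   <- f1(2, eta1, p, 1.3, 0.4)
```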

This function can generate four different kinds of output, depending on the data set given. All of them share the following components.

`c`	a vector, the cluster assignment in the last iteration, useful for the resumption of MCMC iteration
`nm`	a vector, the number of observations in each cluster from the last iteration, useful for the resumption of MCMC iteration
`emptybasket`	only useful for the resumption of MCMC iteration
`allbaskets`	only useful for the resumption of MCMC iteration
`ngrp`	a vector, the number of clusters in each iteration, useful for the resumption of MCMC iteration
`predtime`	the time points where the inferences are made
`high.pct`	the scaling parameter of observations used in the model
`usertime`	a logical value, whether the user provided the time points for estimation
`alpha`	1-α is the probability for constructing credible intervals
`simultaneous`	a logical value, whether simultaneous credible intervals are provided

For non-competing risks data, dpweib can generate two classes of output, dpm and ddp, for data without and with covariates, respectively. Both have

`alpharec`	a matrix, saved samples of αs, the rows correspond to the iterations saved, the columns correspond to the observations
`lambdarec`	a matrix, saved samples of λs, the rows correspond to the iterations saved, the columns correspond to the observations
`lambda0rec`	a matrix, saved samples of λ_0s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambdascaled`	a matrix, saved samples of λs under 0 to 10 scale, the rows correspond to the iterations saved, the columns correspond to the observations, only useful for the resumption of MCMC iteration
`tl`	the left end point
`tr`	the right end point
`pi`	right censoring indicator
`delta`	exact observation indicator

For dpm output, it has

`S`	a matrix, the estimated survival function for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`Spred`	a vector, the estimated survival function at specified time points
`Spredu`	a vector, the estimated pointwise upper credible interval for survival function at specified time points
`Spredl`	a vector, the estimated pointwise lower credible interval for survival function at specified time points
`d`	a matrix, the estimated density function for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`dpred`	a vector, the estimated density function at specified time points
`dpredu`	a vector, the estimated pointwise upper credible interval for density function at specified time points
`dpredl`	a vector, the estimated pointwise lower credible interval for density function at specified time points
`h`	a matrix, the estimated hazard function for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`hpred`	a vector, the estimated hazard function at specified time points
`hpredu`	a vector, the estimated pointwise upper credible interval for hazard function at specified time points
`hpredl`	a vector, the estimated pointwise lower credible interval for hazard function at specified time points
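The relationship between a per-iteration matrix such as `S` and its pointwise summaries can be sketched in base R. The sketch assumes equal-tailed quantile intervals at level 1-α and a posterior mean as the point estimate; the package may compute `Spred`, `Spredl`, and `Spredu` differently, and the matrix below is simulated stand-in data, not package output.

```r
set.seed(1)
# Stand-in for the S matrix: 200 saved iterations (rows) by 5 time points (columns).
tt   <- c(1, 2, 3, 4, 5)
rate <- rgamma(200, shape = 20, rate = 40)       # simulated draws, illustration only
S    <- outer(rate, tt, function(r, t) exp(-r * t))

alpha <- 0.05
Spred_sketch  <- colMeans(S)                                  # assumed point estimate
Spredl_sketch <- apply(S, 2, quantile, probs = alpha / 2)     # lower pointwise limit
Spredu_sketch <- apply(S, 2, quantile, probs = 1 - alpha / 2) # upper pointwise limit
```

The same column-wise summary would apply to the `d` and `h` matrices.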

When simultaneous is specified TRUE, the function also provides

`Sbandu`	a vector, the estimated simultaneous upper credible interval for survival function at specified time points
`Sbandl`	a vector, the estimated simultaneous lower credible interval for survival function at specified time points
`dbandu`	a vector, the estimated simultaneous upper credible interval for density function at specified time points
`dbandl`	a vector, the estimated simultaneous lower credible interval for density function at specified time points
`hbandu`	a vector, the estimated simultaneous upper credible interval for hazard function at specified time points
`hbandl`	a vector, the estimated simultaneous lower credible interval for hazard function at specified time points

For ddp output, it also has

`betarec`	a matrix, saved samples of βs, consisting of horizontally merged blocks. One block corresponds to one observation. Inside each block, the rows correspond to the iterations saved, the columns correspond to the covariates.
`x`	the covariate matrix
`xmean`	a vector, the mean for each covariate (including created binary dummy covariates)
`xsd`	a vector, the standard deviation for each covariate (including created binary dummy covariates); if the covariate is binary, it is set to 0.5
`xscale`	The matrix used to scale log hazard ratio
`loghr`	a matrix, the estimated log hazard ratio for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`loghr.est`	a vector, the estimated log hazard ratio at specified time points
`loghru`	a vector, the estimated pointwise upper credible interval for log hazard ratio at specified time points
`loghrl`	a vector, the estimated pointwise lower credible interval for log hazard ratio at specified time points
`indicator`	a vector, whether a covariate is binary
`covnames`	a vector, the names of covariates
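The block layout described for `betarec` suggests how the draws for a single observation could be pulled out. The sketch below assumes the blocks are laid out side by side in observation order, each `n_cov` columns wide; this ordering is a reading of the description rather than something confirmed by the package, and `beta_block` is a hypothetical helper applied to a toy matrix.

```r
n_obs <- 3; n_cov <- 2; n_iter <- 4
# Stand-in for betarec: n_iter rows, n_obs blocks of n_cov columns placed side by side.
betarec <- matrix(seq_len(n_iter * n_obs * n_cov), nrow = n_iter)

# Hypothetical helper: draws of beta for observation i (iterations x covariates).
beta_block <- function(betarec, i, n_cov) {
  cols <- ((i - 1) * n_cov + 1):(i * n_cov)
  betarec[, cols, drop = FALSE]
}

b2 <- beta_block(betarec, 2, n_cov)   # columns 3 and 4 of the toy matrix
```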

When simultaneous is specified TRUE, the function also provides

`loghrbandu`	a vector, the estimated simultaneous upper credible interval for log hazard ratio at specified time points
`loghrbandl`	a vector, the estimated simultaneous lower credible interval for log hazard ratio at specified time points

For competing risks data, dpweib can generate two classes of output, dpmcomp and ddpcomp, for data without and with covariates, respectively. Both have

`alpharec1`	a matrix, saved samples of α_1s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambdarec1`	a matrix, saved samples of λ_1s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambda0rec1`	a matrix, saved samples of λ_{01}s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambdascaled1`	a matrix, saved samples of λ_1s under 0 to 10 scale, the rows correspond to the iterations saved, the columns correspond to the observations, only useful for the resumption of MCMC iteration
`alpharec2`	a matrix, saved samples of α_2s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambdarec2`	a matrix, saved samples of λ_2s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambda0rec2`	a matrix, saved samples of λ_{02}s, the rows correspond to the iterations saved, the columns correspond to the observations
`lambdascaled2`	a matrix, saved samples of λ_2s under 0 to 10 scale, the rows correspond to the iterations saved, the columns correspond to the observations, only useful for the resumption of MCMC iteration
`prec`	a matrix, saved samples of p, the rows correspond to the iterations saved, the columns correspond to the observations
`t`	the observed time
`event`	the event indicator

For dpmcomp output, it has

`CIF1`	a matrix, the estimated cumulative incidence function for cause 1 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`CIF1.est`	a vector, the estimated cumulative incidence function of cause 1 at specified time points
`CIF1u`	a vector, the estimated pointwise upper credible interval for cumulative incidence function of cause 1 at specified time points
`CIF1l`	a vector, the estimated pointwise lower credible interval for cumulative incidence function of cause 1 at specified time points
`d1`	a matrix, the estimated cause-specific density function for cause 1 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`d1.est`	a vector, the estimated cause-specific density function of cause 1 at specified time points
`d1u`	a vector, the estimated pointwise upper credible interval for cause-specific density function of cause 1 at specified time points
`d1l`	a vector, the estimated pointwise lower credible interval for cause-specific density function of cause 1 at specified time points
`h1`	a matrix, the estimated subdistribution hazard function for cause 1 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`h1.est`	a vector, the estimated subdistribution hazard function of cause 1 at specified time points
`h1u`	a vector, the estimated pointwise upper credible interval for subdistribution hazard function of cause 1 at specified time points
`h1l`	a vector, the estimated pointwise lower credible interval for subdistribution hazard function of cause 1 at specified time points
`CIF2`	a matrix, the estimated cumulative incidence function for cause 2 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`CIF2.est`	a vector, the estimated cumulative incidence function of cause 2 at specified time points
`CIF2u`	a vector, the estimated pointwise upper credible interval for cumulative incidence function of cause 2 at specified time points
`CIF2l`	a vector, the estimated pointwise lower credible interval for cumulative incidence function of cause 2 at specified time points
`d2`	a matrix, the estimated cause-specific density function for cause 2 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`d2.est`	a vector, the estimated cause-specific density function of cause 2 at specified time points
`d2u`	a vector, the estimated pointwise upper credible interval for cause-specific density function of cause 2 at specified time points
`d2l`	a vector, the estimated pointwise lower credible interval for cause-specific density function of cause 2 at specified time points
`h2`	a matrix, the estimated subdistribution hazard function for cause 2 for each saved iteration, the columns correspond to time points, the rows correspond to saved iterations
`h2.est`	a vector, the estimated subdistribution hazard function of cause 2 at specified time points
`h2u`	a vector, the estimated pointwise upper credible interval for subdistribution hazard function of cause 2 at specified time points
`h2l`	a vector, the estimated pointwise lower credible interval for subdistribution hazard function of cause 2 at specified time points

When simultaneous is specified TRUE, the function also provides

`CIF1bandu`	a vector, the estimated simultaneous upper credible interval for cumulative incidence function of cause 1 at specified time points
`CIF1bandl`	a vector, the estimated simultaneous lower credible interval for cumulative incidence function of cause 1 at specified time points
`d1bandu`	a vector, the estimated simultaneous upper credible interval for cause-specific density function of cause 1 at specified time points
`d1bandl`	a vector, the estimated simultaneous lower credible interval for cause-specific density function of cause 1 at specified time points
`h1bandu`	a vector, the estimated simultaneous upper credible interval for subdistribution hazard function of cause 1 at specified time points
`h1bandl`	a vector, the estimated simultaneous lower credible interval for subdistribution hazard function of cause 1 at specified time points
`CIF2bandu`	a vector, the estimated simultaneous upper credible interval for cumulative incidence function of cause 2 at specified time points
`CIF2bandl`	a vector, the estimated simultaneous lower credible interval for cumulative incidence function of cause 2 at specified time points
`d2bandu`	a vector, the estimated simultaneous upper credible interval for cause-specific density function of cause 2 at specified time points
`d2bandl`	a vector, the estimated simultaneous lower credible interval for cause-specific density function of cause 2 at specified time points
`h2bandu`	a vector, the estimated simultaneous upper credible interval for subdistribution hazard function of cause 2 at specified time points
`h2bandl`	a vector, the estimated simultaneous lower credible interval for subdistribution hazard function of cause 2 at specified time points

For ddpcomp output, it also has

`betarec1`	a matrix, saved samples of β_1s, consisting of horizontally merged blocks. One block corresponds to one observation. Inside each block, the rows correspond to the iterations saved, the columns correspond to the covariates.
`betarec2`	a matrix, saved samples of β_2s, consisting of horizontally merged blocks. One block corresponds to one observation. Inside each block, the rows correspond to the iterations saved, the columns correspond to the covariates.
`xmean`	a vector, the mean for each covariate (including created dummy covariates)
`xsd`	a vector, the standard deviation for each covariate (including created dummy covariates); if the covariate is binary, it is set to 0.5
`x`	the covariate matrix
`xscale`	The matrix used to scale log hazard ratio
`covnames`	a vector, the names of covariates
`loghr.est`	the estimated log subdistribution hazard ratio at specified time points for cause 1
`loghru`	the estimated pointwise upper credible interval for log subdistribution hazard ratio at specified time points for cause 1
`loghrl`	the estimated pointwise lower credible interval for log subdistribution hazard ratio at specified time points for cause 1
`indicator`	a vector, whether a covariate is binary

When simultaneous is specified TRUE, the function also provides

`loghrbandu`	a vector, the estimated simultaneous upper credible interval for log subdistribution hazard ratio at specified time points
`loghrbandl`	a vector, the estimated simultaneous lower credible interval for log subdistribution hazard ratio at specified time points

Gilks, W.R., Best, N.G. and Tan, K.K.C. (1995) Adaptive rejection Metropolis sampling within Gibbs sampling. Applied Statistics, 455-472. doi:10.2307/2986138

Neal, R.M. (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, Num 2, 249-265. doi:10.1080/10618600.2000.10474879

Kottas, A. (2006) Nonparametric Bayesian survival analysis using mixtures of Weibull distributions. Journal of Statistical Planning and Inference 136, Num 3, 578-596. doi:10.1016/j.jspi.2004.08.009

Shi, Y., Martens, M., Banerjee, A. and Laud, P. (2019) Low Information Omnibus (LIO) Priors for Dirichlet Process Mixture Models. Bayesian Analysis 14, Num 3, 677-702. doi:10.1214/18-BA1119. https://projecteuclid.org/euclid.ba/1560240023

Shi, Y., Laud, P. and Neuner, J. (2021) A Dependent Dirichlet Process Model for Survival Data With Competing Risks. Lifetime Data Analysis 27, 156-176. https://doi.org/10.1007/s10985-020-09506-0

## Not run: 
library(survival)
library(DPWeibull)
data(veteran)

DPresult1<-dpweib(Surv(time,status)~1,data=veteran)
summary(DPresult1)
opar<-par(mfrow=c(1,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult1)
par(opar)

DPresult2<-dpweib(Surv(time,status)~factor(trt)+age,data=veteran)
summary(DPresult2)
opar<-par(mfrow=c(1,2),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult2)
par(opar)

# Covariate values at which to predict (two rows built from observed values)
newdata<-data.frame(trt=veteran$trt[c(1,70)],
                    age=veteran$age[c(2,87)])
DPpredict<-predict(DPresult2,newdata)
summary(DPpredict)
opar<-par(mfrow=c(2,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPpredict)
par(opar)

############################################################################
# Competing Risks Data
library(survival)
library(prodlim)
library(riskRegression)
library(DPWeibull)
data(Paquid)

Paquid<-Paquid[1:500,]
DPresult1<-dpweib(Hist(time, status)~1,data=Paquid,
                  predtime = seq(from=min(Paquid$time),to=max(Paquid$time),length.out=200))
opar<-par(mfrow=c(1,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult1)
par(opar)

DPresult2<-continue(DPresult1,simultaneous=TRUE)
summary(DPresult2)

DPresult3<-dpweib(Hist(time, status)~DSST+MMSE,data=Paquid,
                  predtime = seq(from=min(Paquid$time),to=max(Paquid$time),length.out=200))
summary(DPresult3)
opar<-par(mfrow=c(1,2),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPresult3)
par(opar)

# Covariate values at which to predict (two rows built from observed values)
newdata<-data.frame(DSST=Paquid$DSST[c(1,70)],
                    MMSE=Paquid$MMSE[c(2,87)])

DPpredict<-predict(DPresult3,newdata)
summary(DPpredict)
opar<-par(mfrow=c(2,3),
          mar=c(3.1, 3.1, 3.1, 5.1),
          mgp=c(2, 0.5, 0),
          oma=c(0, 0, 0, 4))
plot(DPpredict)
par(opar)

###############################################################

# An example of interval censored data
library(KMsurv)
library(survival)
library(DPWeibull)
data("bcdeter")

DPresult<-dpweib(Surv(lower, upper, type="interval2") ~ treat, data = bcdeter)
summary(DPresult)
plot(DPresult)

## End(Not run)