\fontsize{12}{22} \selectfont \newcommand{\bX}{\textbf{X}} \newcommand{\bY}{\textbf{Y}} \newcommand{\bbeta}{\boldsymbol{\beta}}
Package discfrail (discrete frailty) is
an R
package that provides a novel, flexible model for hierarchical time-to-event data in which a clustering structure of groups is suspected. The random effects that define the group-level frailties follow a nonparametric discrete distribution, and the baseline hazards are left unspecified, as in the Cox model. The package contains a function to fit the Cox model with nonparametric discrete frailty term and two functions for simulating data. Full details of the methods are given in @gasperoni2018nonparam. This vignette gives a brief overview of the methods, and worked examples of using the functions.
Consider a random sample with a hierarchical structure, i.e, where each individual subject, or statistical unit, belongs to one group. Define $T_{ij}^$ as the survival time and $C_{ij}$ as the censoring time of subject $i$, $i = 1,...,n_j$, in the $j$-th group, $j=1,...,J$. Let $\bX_{ij} = (X_{ij1}, . . . , X_{ijp} )^T$ be a vector of covariates, assumed constant over time, for subject $i$ in group $j$. Then, we define $T_{ij} = min(T_{ij}^, C_{ij})$, $t_{ij}$ its realization and $\delta_{ij} = \mathbf{1}{(T{ij}^* \leq C_{ij} )}$.
Let $\tilde{\textbf{w}}$ be a vector of shared random effects, and $\textbf{w}=\exp{\tilde{\textbf{w}}}$, be the corresponding vector of shared frailties that represent unexplained relative hazards between the groups. Conventionally, frailties are assumed to follow a parametric distribution, typically the Gamma. However in discfrail
, the frailties are assumed to have a nonparametric, discrete distribution, with an unknown number of points in the support. This allows an arbitrarily flexible random effects distribution, and gives a method for clustering. In particular, we assume that each group $j$ can belong to one latent population $k$, $k=1,...,K$, with probability $\pi_{k}$. In this case, $w_1,...,w_K$ are the corresponding distinct values for the frailties in each population, where $\mathbb{P}{w=w_k}=\pi_k$.
We introduce also an auxiliary (latent) indicator random variable $z_{jk}$ which is equal to $1$ if the $j$-th group belongs to the $k$-th population, so, $z_{jk} \overset{i.i.d}{\sim} Bern(\pi_{k})$. Since each group belongs to only one population, $\sum_{k=1}^K z_{jk} = 1$ for each $j$.
Therefore, in the nonparametric discrete frailty Cox model of discfrail
, the hazard for individual $i$ in group $j$, conditional on $\textbf{w}$ and on the $z_{jk}$ is:
\begin{equation}
\lambda(t;\bX_{ij},w_k,z_{jk})=\prod_{k=1}^{K}\left[\lambda_0(t) w_k \exp(\bX_{ij}^T\bbeta)\right]^{z_{jk}},
\label{eq:hazardnonparam}
\end{equation}
\noindent where $\lambda_0(t)$ represents the baseline hazard, $\bbeta$ is the vector of regression coefficients and $w_k$ is the frailty term shared among groups of the same latent population $k$. Since the baseline hazard is unspecified, and the remaining parameters are estimated by maximum partial likelihood, model \eqref{eq:hazardnonparam} is an extension of a proportional hazard Cox model.
The vector $\mathbf{z}_{j}$ has a multinomial distribution. Note that there are two levels of grouping. The first level is known, for example, healthcare providers as groups of patients) and we refer to these clusters as groups. The second level is the unknown clustering of groups (e.g. groups of healthcare providers with similar outcomes) that we want to detect, and we refer to these clusters as latent populations.
We assume that censoring is noninformative, thus that $T_{ij}^*$ and $C_{ij}$ are conditionally independent, given $\bX_{ij}$, $w_k$ and $z_{jk}$.
An application of this model to a real dataset is shown in Section 5 of @gasperoni2018nonparam. In this case, the event of interest is the second admission in hospital of patients that suffer from Heart Failure (patients are grouped in healthcare providers).
Two functions in the discfrail
package simulate hierarchical time-to-event data in which there are clusters of groups with the same frailty.
The input parameters of the simulation algorithm are:
Several steps are needed for building the dataset. The simulation procedure uses the method presented by @bender2005generating. We use two probability results to obtain Eq.\eqref{invthm}: the inverse probability method and the fact that if a generic random variable V is distributed as $\mathcal{U}[0,1]$ then $1-V$ is still a $\mathcal{U}[0,1]$.
\begin{equation} U_{ij} = 1-F(t_{ij}; \bX_{ij},w) = \exp{ -\Lambda_0(T_{ij}) w_k \exp{X_{ij}^T \beta}} \sim \mathcal{U}[0,1] \label{invthm} \end{equation}
Then, we are able to compute the survival times with Eq.\eqref{survtime} by inverting Eq.\eqref{invthm}. \begin{equation} T_{ij} = \Lambda_0^{-1}\left(\frac{-\log(U_{ij})}{w_k\exp{\bX_{ij}^T \beta}}\right) \label{survtime} \end{equation} The main difference between equations \eqref{invthm}, \eqref{survtime} and the ones showed in @bender2005generating is the frailty term, $w_k$.
The covariates are randomly generated from a normal distribution, $\bX_{ij} \overset{i.i.d.}{\sim} N(0,1)$.
Exploiting equations \eqref{invthm} and \eqref{survtime} we are able to simulate the event times $T_{ij}$. Then, we generate the censoring times according to a Normal distribution with standard deviation equal to one, independently from the $T_{ij}$. The censoring scheme is reproduced according to @wan2017simulating so that $cens_{perc}$ of the simulated times are censored.
Finally, we obtain the status variable as: $\delta_{ij} = \mathbf{1}{(T{ij}^* \leq C_{ij} )}$.
The package includes two general choices of baseline survival function:
sim_weibdf
, a parametric baseline, specifically a Weibull distribution characterised by two parameters $\lambda$ and $\rho$, with cumulative hazard function $\Lambda_0(t) = \lambda \cdot t^{\rho}$.
sim_npdf
, a generic baseline, where the user can supply $\Lambda_0^{-1}$ as an R function of time.
In the first case, equation \eqref{survtime} becomes: \begin{equation} T_{ij} = \left(\frac{-\log(U_{ij})}{w_k \cdot \lambda \exp{\bX_{ij}^T \beta}}\right)^{1/\rho} \end{equation}
In the following example we simulate data with a Weibull baseline.
library( discfrail ) J <- 100 # total number of groups N <- 40 # number of units per group lambda <- 0.5 # Weibull scale parameter rho <- 1.4 # Weibull shape parameter beta <- c( 1.6, 0.4 ) # log hazard ratios for covariates p <- c( 0.7, 0.3 ) # mixing proportions w_values <- c( 1.2, 2.1 ) # frailty values cens_perc <- 0.1 # percentage of censored events set.seed(1200) data_weib <- sim_weibdf( J, N, lambda, rho, beta, p, w_values, cens_perc) head( data_weib )
The simulated data object contains components that include
family
, the group indicator. In this case there are 40 individuals in each simulated group, and this can be checked here with table(data_weib$family)
.
status
: the survival status $\delta_{ij}$. Here we can check the proportion of censored events:
table(data_weib$status) / sum(table(data_weib$status))
belong
: $\textbf{w}$ the frailty values. Here we can check that the proportion of groups belonging to each latent population agrees with the values we specified for p
:table(data_weib$belong) / sum(table(data_weib$belong))
We can visualize the simulated group-specific survivor curves by Kaplan-Meier estimation, as follows. The curves are coloured according to the latent population to which each group belongs.
fit1 <- coxph( Surv( time, status ) ~ family, data_weib, method = 'breslow') sfit1 <- survfit(Surv( time, status ) ~ family, data_weib) lat_pop = rep( 2, J ) lat_pop[ which( data_weib$belong[ seq( 1, dim(data_weib)[1], N ) ] %in% w_values[2]) ] = 1 plot( sfit1, col = lat_pop, xlab = 'time', ylab = 'Survival probability', lwd = 1.5 ) legend( 'topright', paste('Latent population', 1:2), col = 2:1, lty = rep(1,2), lwd = 1.5, bty = 'n' )
This illustrates the substantial heterogeneity between groups.
The following code simulates the same form of data, but this time with a user-specified inverse cumulative hazard function $\Lambda_0(t)$ supplied as an R function.
J <- 100 N <- 40 Lambda_0_inv = function( t, c=0.01, d=4.6 ) ( t^( 1/d ) )/c beta <- 1.6 p <- c( 0.8, 0.2 ) w_values <- c( 0.8, 1.6 ) cens_perc <- 0.2 data_np <- sim_npdf( J, N, beta, Lambda_0_inv, p, w_values, cens_perc)
And again, the simulated survivor functions can be visualized as follows:
fit1 <- coxph( Surv( time, status ) ~ family, data_np, method = 'breslow') sfit1 <- survfit(Surv( time, status ) ~ family, data_np) lat_pop = rep( 2, J ) lat_pop[ which( data_np$belong[ seq( 1, dim(data_np)[1], N ) ] %in% w_values[2]) ] = 1 plot( sfit1, col = lat_pop, xlab = 'time', ylab = 'Survival probability', lwd = 1.5 ) legend( 'bottomleft', paste('Latent population', 1:2), col = 2:1, lty = rep(1,2), lwd = 1.5, bty = 'n' )
The function npdf_cox
can be used to fit the Cox model with nonparametric discrete frailty. This is based on maximum partial likelihood using an EM algorithm -- see Section 3 of @gasperoni2018nonparam. We illustrate the use of this function to fit this model to the dataset data_weib
that was previously simulated.
The first argument specifies the outcome as a Surv
object, and any covariates, in the standard R formula syntax used by coxph
in the standard survival
package. Another required argument is groups
, which names a variable in the data containing the group membership indicators.
The number of latent populations can be specified in two alternative ways. If npdf_cox
is called with estK=FALSE
, then one model is fitted with the number of latent populations fixed at the number specified by the K
argument to npdf_cox
.
Alternatively, if npdf_cox
is called with estK=TRUE
(which is the default choice if estK
is omitted) then multiple models are fitted with the number of latent groups ranging from 1 to K
. This is done in the following example:
test_weib <- npdf_cox( Surv(time, status) ~ x.1 + x.2, groups = family, data = data_weib, K = 4, eps_conv=10^-4)
Printing the fitted model object shows, firstly, the model comparison statistics for the models fitted. The estimates from the best-fitting model according to the preferred criterion (specified in the criterion
argument to npdf_cox
, by default, this is BIC) are then presented.
test_weib
According to AIC and BIC, the optimal $\hat{K}$ is equal to $2$, which was the true value used to generate the data.
The criterion of @laird1978nonparametric works as follows. At the end of the model fitting algorithm, each group is assigned to a latent population. The number of distinct "fitted" latent populations for each model is displayed in the K_fitted
column of the model comparison table. Then the best-fitting model according to the @laird1978nonparametric criterion is the maximum K
among the fitted models for which K=K_fitted
, that is, for which every latent population has at least one individual assigned to it. In this example, the optimal $\hat{K}$ is 4
.
The estimates $\hat{\boldsymbol{\pi}}$ are $[0.665,0.335]$, close to the true values of $[0.7,0.3]$. The estimated ratio $\hat{w_2}/\hat{w_1} = 1.747$, again close to the true $1.75$ ($2.1/1.2$). Exact standard errors and standard errors according to the method of @louis1982finding are also reported. See the Supplementary Material of Gasperoni et al.[-@gasperoni2018nonparam] for more information about the computation of standard errors. Note that if K_fitted
is less than K
then the standard errors cannot be computed under these methods.
The function npdf_cox
returns a list with the following components, illustrated for this example as follows:
test_weib$model
is a list with one element for each fitted model, in this case $4$ elements. The $i^{th}$ element of the list contains the estimates with $i$ latent populations. We show the output related to $K=3$. test_weib$model[[3]]
test_weib$comparison
is a matrix in which the model comparison information, including the log-likelihood, AIC and BIC, is reported for each value of $K$.test_weib$comparison
test_weib$Kopt
reports the optimal $K$ according to Laird, AIC and BIC.test_weib$Kopt
test_weib$criterion
is the criterion used to choose the optimal $K$.test_weib$criterion
test_weib$mf
is the \emph{model frame}, the data frame used to fit the model.Since the data are simulated, we are also able to check whether the groups are correctly assigned. We note that only $2$ groups are misclassified (42 and 98).
# true distinct frailty for each group w <- data_weib$belong[!duplicated(data_weib$family)] # true latent population for each group real_lat_pop <- match(w, sort(unique(w))) table( test_weib$models[[2]]$belonging, real_lat_pop ) misc_ind = which( test_weib$models[[2]]$belonging != real_lat_pop ) misc_ind test_weib$models[[2]]$alpha[ misc_ind, ]
We can investigate the posterior probabilities of being part of a specific latent population ($\alpha_{jk}$, see Section 3.1 of @gasperoni2018nonparam). For some groups there is great uncertainty about the latent population to which they belong (i.e., group 98 is assigned to latent population 1 and 2 with probabilities of 0.66, 0.34 respectively). The following code identifies those groups that have probabilities between 0.05 and 0.95 of being assigned to latent population one and two (note that this list includes the two "misclassified" groups highlighted above).
#alpha matrix: alpha_{jk} #is the probability that group j is assigned to latent population k. #test_weib$models[[2]]$alpha assign_not_sure1 = which( test_weib$models[[2]]$alpha[ ,1 ] < 0.95 & test_weib$models[[2]]$alpha[ ,1 ] > 0.05 & test_weib$models[[2]]$alpha[ ,2 ] < 0.95 & test_weib$models[[2]]$alpha[ ,2 ] > 0.05 ) assign_not_sure1
We can visualize the fitted survivor curves for each group through the plot
method. The default, shown below, uses Kaplan-Meier estimates. It is also possible to plot Nelson-Aalen estimates (choosing type = 'na'
). These plots automatically colour the group-specific curves according to the group's fitted latent population. This demonstrates how the group-specific frailties are clustered, with the higher-frailty population shown in black.
plot( test_weib, type = 'km', lwd = 1.5 ) legend( 'topright', paste( 'Latent population', c( 1, 2 ) ), lwd = 1.5, col = c( 2, 1 ), lty = rep( 1, 2 ), bty = 'n' )
The following plot highlights in green the survival curves for the three groups in the simulated data that are "misclassified" by the model.
colors_misc = test_weib$models[[2]]$belonging + 1 colors_misc[ colors_misc == 3 ] = 1 colors_misc[ test_weib$models[[2]]$belonging != real_lat_pop ] = 3 plot( test_weib, type = 'km', col = colors_misc, lwd = 1.5 ) legend( 'topright', paste( 'Latent population', c( 1, 2 ) ), lwd = 1.5, col = c( 2, 1 ), lty = rep( 1, 2 ), bty = 'n' )
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.