knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This vignette demonstrates how the function synthesize can be used to make an estimated lava object [@lava, see the next section] based on a formula in R (or a lava object). This object can then be used to simulate data that resemble the data used to estimate it. There are 3 kinds of responses that can be used with synthesize:

- a survival response (possibly with competing risks),
- a binary response,
- a continuous response.

We first give a mathematical specification of the model assumed for the data:

Introduction to the model

Assume first that we are not dealing with survival data. For a general input formula y ~ x1 + ... + xk, with recursive = FALSE, we assume the following model for the data $(Y_i,X_{1i},\dots, X_{ki})_{i=1, \dots, n}$ (the observations are assumed iid). The response variable $Y$ is assumed to have the linear predictor $\eta^T \mathbf{X}$ (either in a logistic regression model or a linear model, depending on whether the response is binary or continuous). The covariates are each assumed to have their own marginal distribution, depending on the type of variable they are. Continuous covariates are assumed to be Gaussian, i.e. if the $j$'th covariate is continuous then $X_j \sim N(\mu_j, \sigma^2_j)$; a similar assumption is made for categorical/binary covariates.
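As a sketch of what this means (the intercept $\alpha$ and the parameters $\mu_j$, $\sigma_j$ are just notation introduced here), the model for a continuous response can be written as
\begin{align}
Y_i \mid \mathbf{X}_i &\sim N(\alpha + \eta^T \mathbf{X}_i, \sigma^2), \\
X_{ji} &\sim N(\mu_j, \sigma_j^2) \quad \text{for continuous covariates } j,
\end{align}
while for a binary response the first line is replaced by the logistic regression model $\operatorname{logit} P(Y_i = 1 \mid \mathbf{X}_i) = \alpha + \eta^T \mathbf{X}_i$.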

In the case of survival outcome data, we change the above setup to the data $(T_i, \Delta_i, X_{1i},\dots, X_{ki})_{i=1, \dots, n}$ (again assumed iid), where $T_i$ is the right-censored lifetime and $\Delta_i$ is the indicator telling us whether $T_i$ is uncensored or censored. The model is then that the uncensored lifetime and the censoring time follow Weibull regression models (this is what allows us to simulate survival data), each with the same type of linear predictor as before. The covariates are modeled under the same assumptions as before. With competing risks, a similar setup is used.
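As a sketch of the structure (writing $\tilde T_i$ for the uncensored lifetime and $C_i$ for the censoring time; the exact Weibull parametrization used by lava is not important here), the observed data are generated as
\begin{align}
T_i &= \min(\tilde T_i, C_i), \qquad \Delta_i = 1\{\tilde T_i \le C_i\}, \\
\lambda_{\tilde T}(t \mid \mathbf{X}_i) &= \lambda_0(t)\exp(\eta^T \mathbf{X}_i), \qquad \lambda_{C}(t \mid \mathbf{X}_i) = \lambda_{0,C}(t)\exp(\eta_C^T \mathbf{X}_i),
\end{align}
where $\lambda_0$ and $\lambda_{0,C}$ are Weibull baseline hazards and $\eta$, $\eta_C$ are separate coefficient vectors.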

With recursive = TRUE, we allow a type of structural equation model in which the relationships between the covariates are also included (written here as the corresponding R formulas):
\begin{align}
X_1 &\sim X_2 + \dots + X_k \\
X_2 &\sim X_3 + \dots + X_k \\
&\;\;\vdots \\
X_{k-1} &\sim X_k
\end{align}
When you specify a lava object rather than a formula, the idea is that you can specify these relationships yourself.

The synthesize function then estimates the parameters of this model by maximum likelihood. We consider a few examples of its use, starting with the case where the response is of survival type.

Survival response

Survival outcome case

We first consider the survival outcome case and load the Melanoma data set. Note that all variables in the data set should be included in the model (otherwise synthesize might not work as intended), so we subset the data to the variables we will use:

library(riskRegression)
library(survival)
library(lava)
data("Melanoma")
Melanoma <- subset(Melanoma, select = c("time","status","thick","sex","age","invasion"))
knitr::kable(head(Melanoma))

The data concerns the survival of patients after an operation. We consider the variables of the data set:

- time meaning the censored lifetime.
- status indicating whether the person died from the disease: 0 means censored, 1 means death by melanoma, and 2 means that the patient died from other causes.
- thick meaning the thickness of the tumor (in mm).
- sex meaning the sex of the patient (M/F).
- age meaning the age of the patient at the time of operation.
- invasion meaning the level of invasion.

There are two kinds of survival responses we can consider: one with a competing risks outcome and one without. We first consider how to use synthesize without competing risks; the competing risk case works the same way, except that the line below is not used. Without competing risks we need to make the event variable binary, i.e.

Melanoma$eventBin <- 1*(Melanoma$status!=0)

corresponding to whether the observation was censored (0) or the patient died (1). Now we can synthesize a lava object. Note that the argument max.levels specifies how many unique values a variable can have and still be considered categorical rather than normal (if not specified, it is 10):

ms <- synthesize(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=Melanoma)
summary(ms)

In this case, a Weibull survival model is used for the censoring time and the uncensored lifetime, each with the given covariates. The model then specifies that the censored survival time and the censoring indicator are obtained deterministically from these. Under "Exogenous variables" we find the distributions of the covariates. Further down, under "Regression parameters", we find the estimated regression parameters (note that the parameters of the covariate distributions are not as easily found in the output).

Let us consider a basic use case scenario where we simulate:

set.seed(15)
d <- sim(ms,10000)
fit <- coxph(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=Melanoma)
fit.s <- coxph(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=d)

res <- as.data.frame(cbind(coef(fit),coef(fit.s)))
colnames(res) <- c("Parameters of the model from original data", "Parameters of the model from simulated data")
knitr::kable(res)

First we simulate from the estimated model, then we fit the corresponding Cox regression model to both the original and the simulated data. The numbers in the two columns should be very similar, and indeed they are.

Competing risk case

Let us now consider how to do the same in the competing risk case. In this case, we do not make a binary event variable:

ms.comp <- synthesize(Surv(time,status)~log(thick)+sex+age+invasion,data=Melanoma)
summary(ms.comp)

This gives the same output as before, except that the survival part of the model has changed: here a Weibull survival model is used for the censoring time, the time to death from melanoma and the time to death from other causes, each with the given covariates.
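As a sketch of the structure (the latent times $T_i^{0}, T_i^{1}, T_i^{2}$ are notation introduced here; they correspond to time.cens, time.death and time.other in the manual specification further below), the observed data are generated as
\begin{align}
T_i &= \min(T_i^{0}, T_i^{1}, T_i^{2}), \\
\text{status}_i &= \begin{cases} 0 & \text{if } T_i = T_i^{0} \text{ (censored)} \\ 1 & \text{if } T_i = T_i^{1} \text{ (death from melanoma)} \\ 2 & \text{if } T_i = T_i^{2} \text{ (death from other causes)} \end{cases}
\end{align}
where each latent time follows its own Weibull regression model on the covariates.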

Let us again consider a basic use case where we simulate. The competing risk model corresponds to several cause-specific Cox regression models; below we check the one modeling the intensity of death from other causes (status == 2):

set.seed(15)
d <- sim(ms.comp,10000)
fit.comp <- coxph(Surv(time,status==2)~log(thick)+sex+age+invasion,data=Melanoma)
fit.comp.s <- coxph(Surv(time,status==2)~log(thick)+sex+age+invasion,data=d)

res <- as.data.frame(cbind(coef(fit.comp),coef(fit.comp.s)))
colnames(res) <- c("Parameters of the model from original data", "Parameters of the model from simulated data")
knitr::kable(res)
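The competing risk model also has a cause-specific component for death from melanoma (status == 1). We can check it in the same way, reusing the data d simulated above (fit.comp1, fit.comp1.s and res1 are names introduced here for illustration):

fit.comp1 <- coxph(Surv(time,status==1)~log(thick)+sex+age+invasion,data=Melanoma)
fit.comp1.s <- coxph(Surv(time,status==1)~log(thick)+sex+age+invasion,data=d)

res1 <- as.data.frame(cbind(coef(fit.comp1),coef(fit.comp1.s)))
colnames(res1) <- c("Parameters of the model from original data", "Parameters of the model from simulated data")
knitr::kable(res1)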

Including recursion (and specifying the model manually)

If we want recursion, we specify recursive = TRUE:

ms.rec <- synthesize(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=Melanoma,recursive = TRUE)
ms.rec

Here we see that (in the sense of R formulas)
\begin{align}
\verb|logthick| \ &\verb|~| \ \verb|sex + age + invasion| \\
\verb|sex| \ &\verb|~| \ \verb|age + invasion| \\
\verb|age| \ &\verb|~| \ \verb|invasion|
\end{align}
in addition to what we had before. We could simulate again:

set.seed(15)
d <- sim(ms.rec,10000)
fit.rec.s <- coxph(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=d)

res <- as.data.frame(cbind(coef(fit),coef(fit.rec.s),coef(fit.s)))
colnames(res) <- c("Parameters of the model from original data", "Parameters of the model from simulated data of the recursive model","Parameters of the model from simulated of the non-recursive model")
knitr::kable(res)

The coefficients of the model fitted to the data simulated from the recursive model are not particularly close to those from the original data. Let us repeat the simulation to see whether these differences are systematic:

library(ggplot2)
#library(gridExtra)

make.boxplot.simulation <- function(object, object.rec, data, sample.sizes, n.sim){
  # Compare to the model fitted to the original data
  coef.compare <- coef(coxph(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=data))

  # make a data frame with one column per parameter plus sample.size and recursive
  df <- data.frame(matrix(ncol = length(coef.compare)+2, nrow = 0))
  column.names <- c(names(coef.compare),"sample.size", "recursive")
  colnames(df) <- column.names

  #simulate
  for (sample.size in sample.sizes){
    for (i in 1:n.sim){
      d <- sim(object, sample.size)
      coeff <- coef(coxph(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=d))
      df[nrow(df)+1,] <- c(coeff,sample.size,"non recursive")

      d.rec <- sim(object.rec, sample.size)
      coeff.rec <- coef(coxph(Surv(time,eventBin)~log(thick)+sex+age+invasion,data=d.rec))
      df[nrow(df)+1,] <- c(coeff.rec,sample.size,"recursive")
    }
  }

  # convert the coefficient columns back to numeric; otherwise ggplot2 cannot plot them
  for (c in names(df)[seq_along(coef.compare)]){
    df[[c]] <- as.numeric(df[[c]])
  }
  df$sample.size <- as.factor(df$sample.size)
  df$recursive <- as.factor(df$recursive)

  ggplot(df, aes(x=sample.size, y=age, fill=recursive)) +geom_boxplot() + geom_hline(yintercept =coef.compare[["age"]])

  #plots <- c()

  #for (c in names(df)[0:length(coef.compare)]){
  #  plots <- c(plots,ggplot(df, aes(x=sample.size, y=c, fill=recursive)) +geom_boxplot() + geom_hline(yintercept =coef.compare[[c]]))
  #}
  #do.call("grid.arrange", c(plots, ncol=length(coef.compare)))
}

make.boxplot.simulation(ms,ms.rec, Melanoma, c(100,200,500,1000),1000)

The boxplots (shown here for the age coefficient) look quite reasonable.

We could also check one of the other relationships in the recursive model, i.e. logthick ~ sex + age + invasion:

set.seed(15)
d <- sim(ms.rec,10000)
fit.rec.alt <- lm(log(thick)~sex+age+invasion,data=Melanoma)
fit.rec.alt.s <- lm(log(thick) ~ sex + age + invasion, data= d)

res <- as.data.frame(cbind(coef(fit.rec.alt),coef(fit.rec.alt.s)))
colnames(res) <- c("Parameters of the model (with formula: logthick ~ sex + age + invasion) from original data", "Parameters of the model (with formula: logthick ~ sex + age + invasion) from simulated data")
knitr::kable(res)

The parameters look to be quite close to each other. In view of the above discussion, we could also have specified only
$$
\begin{align}
\verb|logthick| \ &\verb|~| \ \verb|sex + age + invasion| \\
\verb|age| \ &\verb|~| \ \verb|invasion|
\end{align}
$$
Since this cannot be done with a formula, we have to specify the model as a lava object instead:

library(lava)
u <- lvm()
# specify covariate
distribution(u,~sex) <- binomial.lvm() #specify binary covariate
distribution(u,~age) <- normal.lvm() #specify continuous covariate
distribution(u,~logthick) <- normal.lvm() #note: the variable name is logthick, without parentheses around log
categorical(u,K=3) <- "invasion" #specify categorical covariate
#include event time, so it knows it is the survival case
u <- eventTime(u,time~min(time.cens=0,time.death=1,time.other=2), "status")
#specify regression relationships
lava::regression(u,logthick~sex+age+invasion) <- 1
lava::regression(u,age~invasion) <- 1
#specify regressions for the latent time variables (these could also be changed)
lava::regression(u,time.other~logthick+sex+age+invasion) <- 1
lava::regression(u,time.death~logthick+sex+age+invasion) <- 1
lava::regression(u,time.cens~logthick+sex+age+invasion) <- 1
#finally synthesize object 
u_synt <- synthesize(u,data=Melanoma)

#do a simulation study
d <- sim(u_synt,10000)
fit.rec.alt <- lm(log(thick)~sex+age+invasion,data=Melanoma)
fit.rec.alt.s <- lm(logthick ~ sex + age + invasion, data= d)
res <- as.data.frame(cbind(coef(fit.rec.alt),coef(fit.rec.alt.s)))
colnames(res) <- c("Parameters of the model from original data", "Parameters of the model from simulated data")
knitr::kable(res)

Other parameters of note for the synthesize function

Besides recursive, the synthesize function has a few other arguments of note, for example max.levels, which was mentioned above.
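max.levels controls how many unique values a numeric variable can have and still be treated as categorical (the default is 10). A minimal sketch of its use (the value 15 is chosen arbitrarily for illustration; with the Melanoma variables used above it changes nothing, since the continuous variables have many more distinct values):

ms.lvl <- synthesize(Surv(time,eventBin)~log(thick)+sex+age+invasion,
                     data=Melanoma, max.levels=15)
summary(ms.lvl)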

A linear/logistic regression model

To make a simple linear regression model (the data set cars is included with R), we do:

data(cars)
m.cars <- synthesize(dist ~ speed, data=cars)
summary(m.cars)

The cars data set consists of a number of observations of the speed of cars and their stopping distances. Checking that the model makes sense is another matter, which we do not consider here. This example just shows how simple it is to use synthesize when the response variable is not of survival type, and we can simulate as before:

set.seed(15)
d <- sim(m.cars,10000)
fit <- lm(dist ~ speed, data = cars)
fit.s <- lm(dist ~ speed, data = d)

res <- as.data.frame(cbind(coef(fit),coef(fit.s)))
colnames(res) <- c("Parameters of the model from original data", "Parameters of the model from simulated data")
knitr::kable(res)

Finally, for a binary response we can make a logistic regression model, which is done in the same way as the linear regression. We first find some data online suited for logistic regression:

UCLA.dat <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
knitr::kable(head(UCLA.dat))

The dataset has the variables:

- admit indicating whether a student was admitted to graduate school at UCLA.
- gre indicating the student's GRE score.
- gpa indicating the student's grade point average.
- rank indicating the rank of their undergraduate school.

A logistic regression model could then describe whether students were admitted using the other variables in the dataset:

UCLA.syn <- synthesize(admit ~ gre + gpa + rank, data = UCLA.dat)

We could again make a simulation analysis similar to the ones before:

set.seed(15)
d <- sim(UCLA.syn,10000)
fit <- glm(admit ~ gre + gpa + factor(rank), family = binomial, data = UCLA.dat)
fit.s <- glm(admit ~ gre + gpa + rank, family = binomial, data = d)

res <- as.data.frame(cbind(coef(fit),coef(fit.s)))
colnames(res) <- c("Parameters of the model from original data", "Parameters of the model from simulated data")
knitr::kable(res)

References


