In topycyao/Xsurv: A package used to model survival data from advanced machine learning algorithms

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  fig.width = 6,
  fig.height = 5
)

Survival data

Cox model

In biomedical studies, the Cox proportional hazard model (Cox 1972) is the most commonly used model for the regression analysis of survival outcomes. Let's consider the standard time-to-event(e.g death) data $(t_i,\mathbf{x_i},\delta_i) $ where $t_i$ is the observed time,$\mathbf{x_i}$ is the covariate vector and $\delta_i$ is the censoring indicator. Censoring is a type of missing data for event time which means the observed time may not be the same as the actual event time. There are different reasons for censoring occurence such as the lost of follow-up for the subject ,the termination of study before the event happens and so on. The Cox model gives the expression of hazard function of a subject at time t in terms of a baseline hazard $\lambda_0 (t)$ and an exponential function of $\mathbf{x_i}$:

$$ \lambda (t \mid x) = \lambda_{0}(t) \exp{H(\mathbf{x_i})} $$ where H(.) is a risk score function that relates covariates and regression coefficients. In the standard Cox model, the risk score is a linear term, i.e., $ H(\mathbf{x_i})=\mathbf{x_i}^{T}\beta$ . In particular, Xsurv_sim_data can generate data for linear model.

library(Xsurv)
set.seed(981)
beta=c(0.1,0.5,0.2,1,0.01) #set beta value

sim_dat<-Xsurv_sim_data(size=100,dim=5,lambda=2,vu=1, 
                        beta=beta, c_rate=0.25)

Automatically tunning the parameter with Xsurv.cv function

 X<-sim_dat[,1:5]
 y<-sim_dat[,6:7]

xs_cv<-Xsurv.cv(X,y,option = 'lgb',search = 'rd',top_n = 1) #random search tunning parameter with lightgbm model

# both xgboost and lightgbm can be automatically tunning with Xsurv.cv
# xs_cv<-Xsurv.cv(X,y,option = 'xgb',search = 'grid') 
xmod<-xs_cv$model
cindx<--xs_cv$cindex
xs_cv$SHAP
#The only selected feature is V4 (if you remember the simulation setting beta=c(0.1,0.5,0.2,1,0.01), V4 is the only one not close to 0.)

Simple case study

library(survival)
data(lung) #Using lung data as an example

mydata<-(lung[,-1])
mydata[,2]<-mydata[,2]-1
length(mydata[,1])
names(mydata)<-colnames(mydata)

datay_train<-mydata[1:180,c(1,2)] 
datax_train<-mydata[1:180,-c(1,2)]


datay_test<-mydata[181:228,c(1,2)]
datax_test<-mydata[181:228,-c(1,2)]

#A survival tree plot will be generated. cp here is the pre-specified complexity of tree.
xs<-Xsurv.cv(datax_train,datay_train,top_n=5,cp=0.01)
xm<-xs$model
shap=xs$SHAP
shap

Risk

library(survminer)
risk=xs$risk
#plot the kaplan-meier curve for different risk levels
risk_data=risk$data
fit=survival::survfit(Surv(time,status)~risk,data=risk_data)
ggsurvplot(fit,pval = TRUE,palette = c('coral','burlywood1','cadetblue1'),size=2,
                          legend=c(0.85,0.85),legend.title='',font.x=c(15,"plain","black"),
                           font.y=c(18,"plain","black"))