R/survivalpath.R

Defines functions getonenodePatients treetoexcel df2newick traverse structureResult branchfun isbranch stepbyselectfeature pairwise selectvariables kmselectfeature pvalue survivalpath

Documented in survivalpath

# Some useful keyboard shortcuts for package authoring:
#
#   Build and Reload Package:  'Ctrl + Shift + B'
#   Check Package:             'Ctrl + Shift + E'
#   Test Package:              'Ctrl + Shift + T'



#'@title Build Survival Path Model Using Dynamic Time Series Data (DTSD) object
#'@description Survival Path Mapping for Dynamic Prediction of Cancer Patients Using Time-Series Survival Data.This is the core function that build
#'survival path tree model based on Akaike information criterion (AIC) and self-designed arguments.
#'@usage survivalpath(
#'DTSD,
#'time_slices,
#'treatments=NULL,
#'num_categories=2,
#'p.value=0.05,
#'minsample = 15,
#'degreeofcorrelation=0.7,
#'rates=365
#')
#'
#'@param DTSD A DTSD class object. See function \code{generatorDTSD()} for details.
#'@param time_slices numeric, define the total number of time slices (starting from the front) needed to be included in the survival path model
#'@param treatments A list object, with default value of NULL. This argument is used to specify the intervention measures/exposure taken by
#'the observation at different time slices. The treatment or exposure variables specified will not be utilized in construction of the survival path model
#'@param num_categories Numeric, the default value is 2. The maximum number of branches that each node can divide
#'@param p.value \code{p.value} for hypothesis testing; variables with p value less than p.value in univariate analysis are significant candidate
#'variables and will undergo further feature selection
#'@param minsample Minimum sample size for branching
#'@param degreeofcorrelation default 0.7;When the correlation between variables is greater than this value, the variables are considered
#'to have collinearity. The pair of variables that exceed the correlation coefficient will automatically compare their Akaike information criterion (AIC)
#'values when each of two serve as the only predictor for outcome; the variable with the smaller AIC value will be removed.
#'@param rates Numeric value. Calculate the rate of the outcome for the nodes in the survival path model at the time point of the argument \code{rates}
#'@details After the pre-processing of data, under a user-defined parameters on covariates, significance level, minimum bifurcation sample size and number of time slices
#'for analysis, survival paths can be computed using the main function, which can be visualized as a tree diagram.
#'@return The survivalpath function returns an object, which includes data, tree and df.
#'\item{data}{\code{data} describes the grouping variables and values for each observation at different time slices.}
#'\item{tree}{A \code{treedata} object \code{tree},which facilitate creation of tree diagram and mapping of patients' personalized survival path}
#'\item{df}{A Data.frame object containing the node numbers corresponding to each observation at different time slices in survival path tree model
#'tree. The dataframe added three new columns, the parent_node correspond to the upper node that the observation belongs to, which indicate the group
#'of participants for modeling and feature selection; the sub_node indicates the node that the corresponding observation represent after subdivision
#'from the parent_node, the information of sub_node is used for model evaluation and comparison. The variable_value indicate the reason for transfer
#'from the parent_node to the sub_node.}
#'\item{maxpath}{The longest path length in the survival path model.}
#'
#'@note The idea of developing the SurvivalPath R package stems from our previous exploratory work, in which we attempted to achieve dynamic prognosis
#'prediction by establishing survival paths based on the time-series data of patients with hepatocellular carcinoma (HCC). The survival path approach
#'we proposed provide a potential solution for dynamic prognosis prediction and management of cancer patients by constructing survival path maps using
#'returned key prognostic factors after analysis of structured time-series survival data. More importantly, the survival path model could be easily
#'understood and utilized by clinicians when compared to black-box models.
#'The SurvivalPath R package is a newly developed tool to facilitate fast building of survival path models, with an aim of promoting standardization
#'of this methodology. In this package we optimized the feature selection process. Oneto one collinearity analysis was embedded (as an argument) to
#'screen out noncollinear candidate variables before formal feature selection in the main function to reduces the confounding impact of potential
#'collinearity on feature selection in the Cox model. In addition, the SurvivalPath R package is now compatible with continuous variable. The classifydata
#'function enabling automatic binary classification of continuous variables and their entry into the model.
#'This methodology is still young, and we welcome efforts from all the world to improve it.

#'@author Lujun Shen and Tao Zhang
#' @references Lujun Shen. (2018)
#' \emph{Dynamically prognosticating patients with hepatocellular carcinoma through survival paths mapping based on time-series data},
#' \url{https://www.nature.com/articles/s41467-018-04633-7.pdf}\cr
#' \emph{Nat Commun. 2018 Jun 8;9(1):2230. doi: 10.1038/s41467-018-04633-7. PMID: 29884785; PMCID: PMC5993743.}
#'@export
#'@importFrom dplyr src
#'@importFrom treeio mask
#'@importFrom dplyr summarize
#'@importFrom stats pchisq
#'@importFrom stats drop1
#'@importFrom stats step
#'@importFrom stats cor
#'@importFrom Hmisc rcorr.cens
#'@importFrom stats predict
#'@import survminer
#'@import survivalROC
#'@import treeio
#'@examples
#'library(dplyr)
#'data("DTSDHCC")
#'#Randomly select a proportion of cases for demo
#'id = DTSDHCC$ID[!duplicated(DTSDHCC$ID)]
#'set.seed(123)
#'id = sample(id,500)
#'miniDTSDHCC <- DTSDHCC[DTSDHCC$ID %in% id,]
#'#Convert multiple rows time series data into time-slices data
#'dataset = timedivision(miniDTSDHCC,"ID","Date",period = 90,left_interval = 0.5,right_interval=0.5)
#'#Create DTSD object using time-slices data
#'resu <- generatorDTSD(dataset,periodindex="time_slice",IDindex="ID" ,timeindex="OStime_day",
#'  statusindex="Status_of_death",variable =c( "Age", "Amount.of.Hepatic.Lesions",
#'  "Largest.Diameter.of.Hepatic.Lesions",
#'  "New.Lesion","Vascular.Invasion" ,"Local.Lymph.Node.Metastasis",
#'  "Distant.Metastasis" , "Child_pugh_score" ,"AFP"),predict.time=365*1)
#'#Construction of survival path using this function, takes minutes
#'result <- survivalpath(resu,time_slices =9)
#'
#'#Draw Suvival Path Tree
#'library(ggplot2)
#'library(ggtree)
#'mytree <- result$tree
#'
#'ggtree(mytree, color="black",linetype=1,size=1.2,ladderize = TRUE )+
#'  theme_tree2() +
#'  geom_text2(aes(label=label),hjust=0.6, vjust=-0.6 ,size=3.0)+
#'  geom_text2(aes(label=paste(node,size,mytree@data$survival,mytree@data$survivalrate,sep = "/")),
#'  hjust=0.6, vjust=-1.85 ,size=3.0)+
#'  #geom_point2(aes(shape=isTip, color=isTip), size=mytree1@data$os/40)+
#'  geom_point2(aes(shape=isTip, color=isTip), size=mytree@data$size%/%200+1,show.legend=FALSE)+
#'  #guides(color=guide_legend(title="node name/sample number/Median survival time/Survival rate")) +
#'  labs(size= "Nitrogen",
#'       x = "TimePoints",
#'       y = "Survival",
#'       subtitle = "node_name/sample number/Median survival time/Survival rate",
#'       title = "Survival Tree") +
#'  theme(legend.title=element_blank(),legend.position = c(0.1,0.9))
#'


# data.tree


survivalpath <- function(DTSD,time_slices,treatments=NULL,num_categories=2,
                              p.value=0.05,minsample = 15,degreeofcorrelation=0.7,rates=365) {
  if(!  "DTSD" %in% class(DTSD) ){
    stop('The data type of the input object DTSD is not "DTSD" ')
  }


  time <- DTSD$time
  status <- DTSD$status
  timeslicedata <- DTSD$tsdata
  tspatientid <- DTSD$tsid

  l_time <- length(time)
  l_status <- length(status)
  l_timesl <- length(timeslicedata)
  l_id <- length(tspatientid)


  ######################################################################################################


  if(!is.null(treatments)){
    l_tr <- length(treatments)
    if (!is.list(treatments)) {
      stop('treatments should be a list of data.frame!')
    }

    if (l_time!=l_status | l_time!= l_timesl | l_time != l_id |l_time != l_tr){
      stop(sprintf('The length of time,status,timeslicedata , tspatientid,treatments are:%d, %d, %d, %d,%d,
                                but they should be the same length.',l_time,l_status,l_timesl,l_id,l_tr))
    }
  }


  #A patient's unique coding id required,judgement
  patientID <- tspatientid[[1]]


  patientID_a <-unique(patientID[duplicated(patientID),])

  lenpid <- length(as.numeric(unlist(patientID_a)))
  if (lenpid!=0){
    stop(paste('The patient ID repeat. Every patient ID should be unique:',patientID_a))
  }
  ###############################################################################################################
  #id in patientID


  #Number of time nodes, equal to the length of input data
  #time_slices <- 9

  #all patients ID
  branchID <- list(tspatientid[[1]])

  #init
  ts_var <- vector("list", time_slices)
  ts_num <- vector("list", time_slices)
  ts_id <- vector("list", time_slices)
  ts_value <- vector("list", time_slices)

  alldata <- list()
  for(i in 1:l_time){
    ts_data <- data.frame( tspatientid[[i]], time[[i]],status[[i]],class=1)
    names(ts_data) <- c("ID","time","status","class")
    if(!is.null(treatments[[i]])){
      ts_data <- data.frame(ts_data,treatments[[i]],timeslicedata[[i]])
    }
    else{
      ts_data <- data.frame(ts_data,timeslicedata[[i]])
    }

    alldata$data <- rbind(alldata$data,ts_data)
  }
  alldata$treatmentsnum <- dim(treatments[[i]])[2]


  for (time_slice in 1:time_slices){
    #print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')
    #print(paste('time_slice:',time_slice,sep = ''))

    previousbranchID <- branchID  #list


    #Full sample time at current time node
    current_time <- time[[time_slice]]

    #Full sample status of current time node
    current_status <- status[[time_slice]]

    #Full sample id of current time node
    current_id <- tspatientid[[time_slice]]

    #All sample variables for the current time node: data frames
    current_variables <- timeslicedata[[time_slice]]

    #Print variable name
    #print(paste("datavar:",names(current_variables)))


    #Determine whether the dimensions of the four data are consistent
    if (dim(current_time)[1]!=dim(current_status)[1] | dim(current_time)[1]!= dim(current_variables)[1] |
        dim(current_time)[1] != dim(current_id)[1]){stop(sprintf('The number of the %d-th timeslice of time,status,
                        timeslicedata and tspatientid are:%d, %d, %d, %d, but they should be the same ',time_slice,
                                                                 l_time,l_status,l_timesl,l_id))}


    # id,time,status,varibales
    current_ts_data <- data.frame(current_id,current_time,current_status,current_variables)

    #Dimensional Judgment
    if (dim(current_time)[2]!=1){
      stop(sprintf('the %d-th timeslice of survival time should be unique, but the dim is:%d',2,dim(current_time)[2]))
    }

    if (dim(current_status)[2]!=1){
      stop(sprintf('the %d-th timeslice of survival status should be unique, but the dim is:%d',2,dim(current_status)[2]))
    }

    if (dim(current_id)[2]!=1){
      stop(sprintf('the %d-th timeslice of patient id should be unique, but the dim is:%d',2,dim(current_id)[2]))
    }


    ############################################# Former branch  ##########################################################

    # Number of branches retained at the previous moment previousbranchid-> the id,list() of each branch,data.frame,brnum:number of branches
    brnum <- length(previousbranchID)

    if(brnum<1){
      ts_var = ts_var[-c(time_slice:time_slices)]

      ts_num = ts_num[-c(time_slice:time_slices)]

      ts_id = ts_id[-c(time_slice:time_slices)]

      ts_value = ts_value[-c(time_slice:time_slices)]
      next
    }


    #Current time, variable name of current branch
    ts_br_var <- vector('list',brnum)
    #ts_br_v_names <- vector('list',brnum)


    # The number of branches of the current branch
    ts_br_num <- vector('list',brnum)


    # ??????
    branchID <- list()
    branchID_var_value <- list()

    for (i_id in 1:brnum) {

      #print('########################################')

      #print(paste('current branch:',i_id,sep = ''))


      #ts_br_v_names[[i_id]] <- paste('ts',time_slice,'_br',i_id)

      #id of the previous time node to the current branch
      #print(i_id)


      br_id <- data.frame(previousbranchID[[i_id]])

      nam <- names(current_ts_data)

      names(br_id) <- nam[1]
      #print(names(br_id)[1])

      #Dimensions of current survival data

      #print("pre branch:")
      #print(dim(current_ts_data))


      #To extract current_ts_data data rows in the br_id by name nam[1]"
      current_ts_b_data <- merge(current_ts_data,br_id,by= nam[1],all=FALSE)

      #print("after merge, current branch:")
      #print(dim(current_ts_b_data))


      #id of current time node branches
      current_ts_b_id <- subset(current_ts_b_data, select = nam[1])

      current_names <- isbranch(current_ts_b_id,minsample = minsample) #Empty or stop

      #Judge whether to proceed
      if (!is.null(current_names)) {
        ts_br_var[[i_id]] <- current_names
        ts_br_num[[i_id]] <- 0
        next
      }

      #ts_br_var variable name list, the value is stop, the branch stops the branch

      #Time of branch at current moment
      current_ts_b_time <- current_ts_b_data[,2]

      #Status of branches at the current moment
      current_ts_b_status <- current_ts_b_data[,3]

      #All variables of the current time branch
      current_ts_b_variables <- subset(current_ts_b_data, select = -c(1,2,3))


      #Choose a meaningful variable from all variables, and audition returns cox$variables;cox$pvalue
      cox <- kmselectfeature(current_ts_b_time,current_ts_b_status,current_ts_b_variables,num_categories,p.value)

      #print("cox: ")
      #print(dim(cox$variables))
      #print(length(current_ts_b_time))


      #Determine the number of variables, if the number is 0, the current branch stops branching
      if (dim(cox$variables)[2]==0) {

        ts_br_var[[i_id]] <- "all"

        branch <- list(current_ts_b_id)

        b_varname = "all"

        ts_br_num[[i_id]] <- length(branch)

        branchID <- c(branchID,branch)
        branchID_var_value <- c(branchID_var_value,b_varname)

        next
      }


      #Determine the correlation of variables, group and select the main variables
      ###############################################################################################################
      #print('Determine the correlation of variables, group and select the main variables')

      #Returns the main variable of the current branch

      #degreeofcorrelation=0.7
      #

      mainvar <- selectvariables(current_ts_b_time,current_ts_b_status,cox,degreeofcorrelation)

      #cox.variables <- selectnoncorrvariables(current_ts_b_time,current_ts_b_status,mainvar)

      #print(paste('variables:',names(cox.variables)))

      #Selection of the most important variables

      #print(names(mainvar$variables))
      #print(dim(mainvar$variables))
      #print(names(mainvar$variables))
      #print(dim(mainvar$variables)[2])

      if(dim(mainvar$variables)[2]>1){
        mainvalue <- stepbyselectfeature(current_ts_b_time,current_ts_b_status, mainvar)
        curr_names <- names(mainvalue$variables)
      }else{
        curr_names <- names(mainvar$variables)
      }

      if(length(curr_names)==0 ){
        ts_br_var[[i_id]] <- "all"

        branch <- list(current_ts_b_id)

        b_varname = "all"

        ts_br_num[[i_id]] <- length(branch)

        branchID <- c(branchID,branch)
        branchID_var_value <- c(branchID_var_value,b_varname)
        next()
      }

      #print("curr_names")
      #print(curr_names)

      ts_br_var[[i_id]]  <- curr_names[1]

      #branch
      ts_b_v_status <- subset(cox$variables, select = curr_names[1])


      v_status <- unique(unlist(ts_b_v_status))
      v_status <- sort(v_status)


      #model <- cph(Surv(current_ts_b_time,current_ts_b_status)~ts_b_v_status)

      num = length(v_status)
      #print(paste('current branch num:',num,sep = ''))



      #go to branch
      branch <- lapply(1:num, branchfun,ts_b_v_status,v_status,current_ts_b_id)

      #####################################
      b_varname = list(v_status)
      #####################################

      ts_br_num[[i_id]] <- length(branch)

      branchID <- c(branchID,branch)
      branchID_var_value <- c(branchID_var_value,b_varname)

      #print(paste('current branch variable:',unlist(ts_br_var[[i_id]]),sep = ''))

    }

    #ts_br_var
    ts_var[[time_slice]] <- ts_br_var

    ts_num[[time_slice]] <- ts_br_num

    #print(paste('current branch num:',unlist(ts_br_num),sep = ''))

    ts_id[[time_slice]] <- branchID

    ts_value[[time_slice]]  <- branchID_var_value

  }

  survivalpathresults <- list(ts_var,ts_num,ts_id,ts_value)

  result  <- list()

  ## (result,time,status,tsid){
  result$data <- structureResult(survivalpathresults,time,status,tspatientid)
  dfdata <- result$data
  result$tree <-  df2newick(result$data,innerlabel = TRUE, period=rates)
  dftree <- result$tree
  #result$df <- treetoexcel(dfdata,dftree ,time,status,timeslicedata,tspatientid)
  dfpath <- treetoexcel(dfdata,dftree,DTSD)
  result$df <- dfpath$df
  result$maxpath <- dfpath$maxpath

  return(result)
}

#--------------------------------------------------------------------------------------------------------------
pvalue <- function(x,time,status,variables,v_names,num_categories){



  time = as.numeric(unlist(time))
  status = unlist(status)

  variable <- variables[x]
  #print(names(variable))
  variable <- unlist(variable)
  num = length(unique(variable))
  if (num > num_categories){

    stop(paste('The number of classes in ',x,'-th variable:',v_names[x],
               ' is ',num,',is higher than num_categories:',num_categories,sep = ''))
    return(1000)
  }

  if(num<=1){
    warning(paste('The number of classes in ',x,'-th variable:',v_names[x],
                  ' is ',num,',is lower than',num_categories,sep = ''))
    return(1000)
  }

  else{
    S <-survdiff(Surv(time,status)~variable)
  }

  #print(S)
  p.value <- 1 - pchisq(S$chisq, length(S$n) - 1)
  #print(p.value)

  return(p.value)
}


#------------------------------------------------------------------------------------------------------------

kmselectfeature <- function(time,status,variables,num_categories,p.value){

  #print(paste("*************************kmselectfeature***************************"))

  #Gets the variable name
  v_names <- names(variables)


  #lapply:list  pvalue calculated p values
  var_s <- lapply(1:dim(variables)[2], pvalue, time,status,variables,v_names,num_categories)

  cox_v_pvalue <- unlist(var_s[var_s<p.value])

  #If the variable is num_categories category, the variable is cleared
  cox_v_names <- unlist(v_names[var_s<p.value])

  cox <- list()

  #print(cox$names)

  #Select meaningful variables
  cox$variables <- subset(variables, select = unlist(cox_v_names))

  #p values of meaningful variables
  cox$pvalue <- cox_v_pvalue

  return(cox)

}


#------------------------------------------------------------------------------------------------------------
#Batch cycles, each eliminating a major sample
selectvariables <- function(time,status,coxvar,degreeofcorrelation){

  #print(paste("*************************selectnoncorrvariables***************************"))
  #Calculation of correlation matrix
  r <- cor(as.matrix(coxvar$variables))

  #r correlation coefficient
  #r <- corr$r
  rshape <- dim(r)
  #Remove the polygonal line, its own correlation coefficient
  for(ri in 1:rshape[1]){
    r[(ri-1)*rshape[1]+ri] <- 0
  }

  #Remove null variables
  r[is.na(r)] <- 0
  FLAG <- r[which.max(r)]>= degreeofcorrelation

  # Remove the correlated variables
  if(FLAG) {
    coxvar <- pairwise(time, status,coxvar,degreeofcorrelation)
  }

  return(coxvar)
}

#--------------------------------------------------------------

#The correlation matrix is grouped, the most relevant pair of variables is
# selected, and a main variable is filtered by backward method
pairwise <- function(time, status,coxvar,degreeofcorrelation){

  #Calculation of correlation matrix
  r <- cor(as.matrix(coxvar$variables))

  #r correlation coefficient
  #r <- corr$r

  rshape <- dim(r)

  #Remove the polygonal line, its own correlation coefficient
  for(ri in 1:rshape[1]){
    r[(ri-1)*rshape[1]+ri] <- 0
  }

  #Remove null variables
  r[is.na(r)] <- 0

  wh <- dim(r)


  #Find the largest pair or pairs and return the corresponding id
  FLAG <- r[which.max(r)]>= degreeofcorrelation

  if(!FLAG){
    return(coxvar)
  }

  max_ind <- which(r==r[which.max(r)],arr.ind = TRUE)
  #print(max_ind)

  #The largest, take all the largest front
  max_ind <- max_ind[1,] #integer

  max <- list()

  #Temporary two pairs of related variables
  max$variables <- subset(coxvar$variables, select = max_ind)

  max$pvalue <- coxvar$pvalue[max_ind]

  #Selection of main variables
  maxvar <- stepbyselectfeature(time, status, max)

  #print(maxvar$index)


  lenvar <- 1:length(max_ind)
  lenindex <- lenvar[-maxvar$index]

  index <- max_ind[lenindex]

  coxvar$variables <- subset(coxvar$variables,select = -index)
  coxvar$pvalue <- coxvar$pvalue[-index]

  return(coxvar)
}


#------------------------------------------------------------------------------------------------------------
#Backward selection of variables
stepbyselectfeature <- function(time,status,coxvar){
  #print(paste("*************************stepbyselectfeature***************************"))

  S <- try({
    #var_ <-data.frame(coxvar$variables)
    S<-coxph(Surv(time,status)~.,coxvar$variables,singular.ok = TRUE)
  }
  ,silent=FALSE)


  # Determines whether the expression in the try statement of the current loop is running correctly
  if('try-error' %in% class(S))
  {
    max <- list()
    max$index <- which.min(coxvar$pvalue)
    max$variables <- subset(coxvar$variables,select = max$index)
    return(max)
  }

  S_step <- step(S, direction = 'backward',trace = 0)

  step_aic <- drop1(eval(S_step))


  max <- list()

  row_names <- row.names(step_aic)
  #print(row_names)

  varname <- names(coxvar$variables)
  #print(step_aic[,"AIC"])
  max$index <- which.max(step_aic[,"AIC"])
  #max$variables <- subset(coxvar$variables,select = max$index)
  max$names <- row_names[max$index]

  max$index <- which(varname==max$names)
  #print(max$index)
  max$variables <- subset(coxvar$variables,select = max$index)
  #print("*************************stepbyselectfeature end***************************")


  return(max)

}


#------------------------------------------------------------------------------------------------------------
isbranch <- function(c_ts_b_id,minsample=15){
  if (dim(c_ts_b_id)[1]<minsample){
    current_names <- 'STOP'
  }else{
    return(NULL)
  }
}


#------------------------------------------------------------------------------------------------------------
#branch
branchfun <- function(x,status,value_status,id){
  status_id <- data.frame(status,id)

  index <- value_status[x]
  #print(paste("condition:",x,sep = ""))
  #print(index)

  b <- status_id[status_id[1]==index,2]
  b <- data.frame(b)

  #print(dim(b))
  return(b)
}



# Converts list-structured data to a data format that can be converted to-structured data
structureResult <- function(result,time,status,tsid){

  ts_var <- result[[1]]
  ts_num <- result[[2]]
  ts_id <- result[[3]]
  ts_varname <- result[[4]]

  ID = rbind(ts_id[[1]][[1]],ts_id[[1]][[2]])

  ##get ostime status by id

  new_data <- cbind(tsid[[1]],time[[1]],status[[1]])
  names(new_data) <- c("ID","time","status")

  names(ID) <- "ID"

  loc = match(ID$ID,new_data$ID)
  new_data <- new_data[loc,]

  # All branches and sort

  #Full time node of flashback cycle
  #Number of required categories less than 10

  Class =  rep("0", times=dim(new_data)[1])

  for (ts in length(ts_var):1){

    tp_var = vector("list", 0)
    tp_varname = vector("list", 0)


    #Class[idi] = paste("0",Class[idi],sep = "")

    #Grading each id of each time node
    for(idi in 1:dim(new_data)[1]){
      id = new_data$ID[idi]
      #print(id)
      #print(idi)


      tp_var = c(tp_var,list("None"))
      tp_varname = c(tp_varname,list(list("None")))


      isexist = FALSE
      #br: Each branch variable
      #print(paste("length:",length(ts_var[[ts]])))

      new_ts_var_ = ts_var[[ts]]
      new_ts_var_ = new_ts_var_[new_ts_var_!="STOP"]

      brni = 0
      if (length(new_ts_var_)==0){
        next
      }
      for (br in 1:length(new_ts_var_)){
        #print(paste("br:",br))

        if (isexist){break}

        #brn: Number of branches per branch variable
        #print(paste(ts,br,ts_varname[[ts]][[br]]))
        for (brn in 1:length(ts_varname[[ts]][[br]])){

          brni = brni+1
          #Determine whether id branch
          #print(paste("ts_brn:",ts,brni))
          #print(ts_id[[ts]][[brni]])

          isexist = id %in% ts_id[[ts]][[brni]][,1]
          if (isexist){


            tp_var[idi] <- new_ts_var_[[br]]
            tp_varname[idi] <- ts_varname[[ts]][[br]][[brn]]

            Class[idi] = paste(as.character(brn),substr(Class[idi],2,nchar(Class[idi])),sep = "")

            break

          }

        }

      }
      #print("----------------------------------------")
      #print(tp_var[1:10])
      #print(tp_varname[1:10])
      #print(Class[1:10])

    }
    #print("iiiiiiiiiiiiiiiiii")
    #print(length(tp_var))
    #print(length(tp_varname))
    assign(paste("time_slices_",ts,"_variable"),tp_var)
    assign(paste("time_slices_",ts,"_varvalue"),tp_varname)

    df1 = as.data.frame(matrix(unlist(tp_var),  byrow=1))
    names(df1) = paste("time_slices_",ts,"_variable",sep = "")

    df2 = as.data.frame(matrix(unlist(tp_varname),  byrow=1))
    names(df2) = paste("time_slices_",ts,"_varvalue",sep = "")


    new_data = cbind(new_data,df1,df2)

  }

  new_data = cbind(new_data,Class)
}


#------------------------------------------------------------------------------------------------------------

## recursion function
traverse <- function(data,a,i,innerl,survivaltime,period){
  res = list()

  desc <- NULL
  size <- NULL
  survival <- NULL
  survivalrate <- NULL
  if(i >4){
    df2 = data[which(as.character(paste(data[,i+1],data[,i+2],sep = "_"))==a),]
    alevelinner <- as.character(unique(paste(df2[,i-1],df2[,i],sep = "_")))
    #print(alevelinner)

    #drop NA data,no show
    if ("None_None" %in% alevelinner) {
      alevelinner <- alevelinner[-which(alevelinner=="None_None")]
      #print(alevelinner)
    }

    desc <- NULL
    size <- NULL
    survival <- NULL
    survivalrate <- NULL

    #
    new_df2 = df2
    #new_df2$time = new_df2$time-survivaltime
    if(length(alevelinner) == 1) {
      il <- NULL; if(innerl==TRUE) il <- a
      if ("all_all" %in% alevelinner){
        (newickout <- paste("(all_all:1)",il,":",1,sep=""))

        (size <-  paste("(",dim(df2)[1],":1)",dim(df2)[1],":",1,sep="") )
        fit <- surv_fit(Surv(time,status)~1,data=df2)
        median <- round(surv_median(fit)$median,1)
        #print(surv_median(fit))
        (survival <-  paste("(",median,":1)",median,":",1,sep="") )
        rate <- round(fit$surv[fit$time>period][1],3)
        (survivalrate <-  paste("(",rate,":1)",rate,":",1,sep="") )
      }
      #else if ( "None_None"  %in% alevelinner){(newickout <- paste(a,":",0.1,sep = "")) }
      else{
        r <- traverse(new_df2,alevelinner,i-2,innerl,survivaltime,period)
        (newickout <- r$newickout) #traverse(df2,alevelinner,i-2,innerl)
        (size <- r$size)
        (survival <- r$survival)
        (survivalrate <- r$survivalrate)

      }
    }
    else if(length(alevelinner) > 1) {


      # for  branch
      for(b in alevelinner){
        r <- traverse(new_df2,b,i-2,innerl,survivaltime,period)
        desc <- c(desc,r$newickout) #c(desc,traverse(df2,b,i-2,innerl))
        size <- c(size,r$size)
        survival <- c(survival,r$survival)
        survivalrate <- c(survivalrate,r$survivalrate)
      }
      il <- NULL; if(innerl==TRUE) il <- a
      #(newickout <- paste("(",paste(desc,collapse=","),")",il,":",i*0.05,sep=""))
      (newickout <- paste("(",paste(desc,collapse=","),")",il,":",1,sep=""))

      (size <- paste("(",paste(size,collapse=","),")",dim(df2)[1],":",1,sep=""))

      fit <- surv_fit(Surv(time,status)~1,data=df2)
      median <- round(surv_median(fit)$median,1)
      #print(surv_median(fit))
      (survival <- paste("(",paste(survival,collapse=","),")",median,":",1,sep=""))
      rate <- round(fit$surv[fit$time>period][1],3)
      (survivalrate <- paste("(",paste(survivalrate,collapse=","),")",rate,":",1,sep=""))
      #(size <- c(size,dim(df2)[1]))
    }
    else{
      (newickout <- paste(a,":",1,sep = ""))
      (size <- paste(dim(df2)[1],":",1,sep = ""))
      fit <- surv_fit(Surv(time,status)~1,data=df2)
      median <- round(surv_median(fit)$median,1)
      #print(surv_median(fit))
      (survival <- paste(median,":",1,sep = ""))
      rate <- round(fit$surv[fit$time>period][1],3)
      (survivalrate <- paste(rate,":",1,sep = ""))
      #(size <- c(size,dim(df2)[1]))
    }

  }
  #else { (newickout <- paste(a,":",0.1,sep = "")) }
  else {
    (newickout <- paste(a,":",1,sep = ""))
    (size <- paste(dim(data)[1],":",1,sep = ""))

    fit <- surv_fit(Surv(time,status)~1,data=data)
    median <- round(surv_median(fit)$median,1)
    #print(surv_median(fit))
    (survival <- paste(median,":",1,sep = ""))
    rate <- round(fit$surv[fit$time>period][1],3)
    (survivalrate <- paste(rate,":",1,sep = ""))
    #(size <- c(size,dim(data)[1]))
  }

  (res$newickout <- newickout)
  (res$size <- size)
  (res$survival <- survival)
  (res$survivalrate <- survivalrate)
  res
}


#result$data,innerlabel = T, period=365
## data.frame to newick function
df2newick <- function(df, innerlabel=FALSE,survivaltime=3,period=365){
  df <- data.frame(df)
  len <- dim(df)[2]+1
  res <- list()
  alevel <- as.character(unique(paste(df[,len-3],df[,len-2],sep = "_")))
  newick <- NULL
  size <- NULL
  survival <- NULL
  survivalrate <- NULL

  # survival time period reduce
  new_df = df
  #new_df$time = new_df$time-survivaltime
  for(x in alevel){
    r <- traverse(new_df,x,len-4,innerlabel,survivaltime,period)
    newick <- c(newick,r$newickout)
    size <- c(size,r$size)
    survival <- c(survival,r$survival)
    survivalrate <- c(survivalrate,r$survivalrate)
  }

  (newick <- paste("(",paste(newick,collapse=","),")ALL;",sep=""))
  (size <- paste("(",paste(size,collapse=","),")",dim(df)[1],";",sep=""))
  fit <- surv_fit(Surv(time,status)~1,data=df)
  median <- round(surv_median(fit)$median,1)
  #print(surv_median(fit))
  rate <- round(fit$surv[fit$time>period][1],3)
  (survival <- paste("(",paste(survival,collapse=","),")",median,";",sep=""))
  (survivalrate <- paste("(",paste(survivalrate,collapse=","),")",rate,";",sep=""))
  res$newick <- newick
  res$size <- size
  res$survival <- survival
  res$survivalrate <- survivalrate


  mytree <- read.tree(text=res$newick)
  sizetree <- read.tree(text=res$size)
  survivaltree <- read.tree(text=res$survival)
  survivalratetree <- read.tree(text=res$survivalrate)
  #plot(mytree,family='STXihei')

  x = as_tibble(mytree)
  sizex = as_tibble(sizetree)
  survivalx = as_tibble(survivaltree)
  survivalratex = as_tibble(survivalratetree)
  #x[which(x$label!="" | x$branch.length==NaN),3]=0.1

  x$size = as.numeric(sizex$label)
  x$survival = as.numeric(survivalx$label)
  x$survivalrate = as.numeric(survivalratex$label)

  mytree1 = as.treedata(x)
}

treetoexcel <- function(exceltree,tree,DTSD){

  time <- DTSD$time
  status <- DTSD$status
  timeslicedata <- DTSD$tsdata
  tspatientid <- DTSD$tsid

  node = tree@phylo[["edge"]][,"node"]
  parent = tree@phylo[["edge"]][,"parent"]

  tip.point = sort(setdiff(node,parent))

  node.point = sort(unique(parent))

  #get root node
  non.tip.point = sort(setdiff(node,tip.point))
  rootnode = setdiff(parent,non.tip.point)


  timesize <- length(time) #length(status),length(timeslicedata),length(tspatientid)
  df <- data.frame()
  for (ii in 1:timesize) {
    dataf <- cbind(tspatientid[[ii]],time[[ii]],status[[ii]],timeslicedata[[ii]])
    dataf$time_slice<- ii-1
    df <- rbind(df,dataf)
  }
  ##-----------------------------------------warnings
  df <- df%>% arrange(df$ID,df$time_slice)
  #survpath <- list()

  nodeindex <- which(df$time_slice==0,)
  df$parent_node = rep(NA, nrow(df))
  df$parent_node[nodeindex] <- rootnode
  df$variable_value = rep(NA, nrow(df))
  df$variable_value[nodeindex] <- "ALL"

  dfpath <- list()
  dfpath$df <- df
  dfpath$maxpath <- 0
  for (treepoint in tip.point) {
    dfpath <- getonenodePatients(exceltree,treepoint,tree,dfpath)
  }

  return(dfpath)

}



getonenodePatients <- function(exceltree,treepoint,mytree,dfpath){

  df <- dfpath$df


  node = mytree@phylo[["edge"]][,"node"]
  parent = mytree@phylo[["edge"]][,"parent"]

  tip.point = sort(setdiff(node,parent))
  tip.label = mytree@phylo[["tip.label"]]

  node.point = sort(unique(parent))
  node.label= mytree@phylo[["node.label"]]

  #get root node
  non.tip.point = sort(setdiff(node,tip.point))
  rootnode = setdiff(parent,non.tip.point)

  #treelabel <- c(tip.label,node.label)

  if(treepoint> max(node) | treepoint< min(node)){
    stop("Out of node range")
  }

  #get treepoint survival path
  path <- c(treepoint)

  current_node <- treepoint
  while (current_node!=rootnode) {
    index <- which(node==current_node)
    current_node <- parent[index]
    path <- c(path,current_node)
  }
  #get  treepoint survival periods
  time_slice <- length(path)

  dfpath$maxpath <- max(c(time_slice,dfpath$maxpath))

  for (i in time_slice:1){

    if (i==time_slice){
      newdf <- exceltree

    }else{

      if (path[i] %in% node.point){
        variable <- node.label[which(node.point==path[i])]

        if("all_all" %in% variable){
          next
        }
        varname <- substr(variable, 1,nchar(variable)-2)

        value <- substr(variable, nchar(variable),nchar(variable))
        newdf <- newdf[which(newdf[,paste("time_slices_",time_slice-i,"_variable",sep = "")]==varname,),]
        newdf <- newdf[which(newdf[,paste("time_slices_",time_slice-i,"_varvalue",sep = "")]==value,),]

        varindex <- which(df$ID %in% newdf$ID & df$time_slice == time_slice-i-1 )
        df[varindex,"sub_node"]=path[i]

        varindex <- which(df$ID %in% newdf$ID & df$time_slice == time_slice-i )
        df[varindex,"parent_node"]=path[i]
        df[varindex,"variable_value"]=variable


      }
      if (path[i] %in% tip.point){
        variable <- tip.label[which(tip.point==path[i])]

        if("all_all" %in% variable){

          varindex <- which(df$ID %in% newdf$ID & df$time_slice == time_slice-i-1 )
          df[varindex,"sub_node"]=path[i]

          varindex <- which(df$ID %in% newdf$ID & df$time_slice == time_slice-i )
          df[varindex,"parent_node"]=path[i]
          df[varindex,"variable_value"]=variable

          next
        }
        varname <- substr(variable, 1,nchar(variable)-2)

        value <- substr(variable, nchar(variable),nchar(variable))

        newdf <- newdf[which(newdf[,paste("time_slices_",time_slice-i,"_variable",sep = "")]==varname,),]
        newdf <- newdf[which(newdf[,paste("time_slices_",time_slice-i,"_varvalue",sep = "")]==value,),]

        varindex <- which(df$ID %in% newdf$ID & df$time_slice == time_slice-i-1 )
        df[varindex,"sub_node"]=path[i]

        varindex <- which(df$ID %in% newdf$ID & df$time_slice == time_slice-i )
        df[varindex,"parent_node"]=path[i]
        df[varindex,"variable_value"]=variable

      }

    }
  }
  dfpath$df <- df

  return(dfpath)
}

Try the SurvivalPath package in your browser

Any scripts or data that you put into this service are public.

SurvivalPath documentation built on July 4, 2022, 1:05 a.m.