knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
options(width = 100)
Step I of the MASTA algorithm calculates the mean density function and covariance functions from the large unlabeled data. To accelerate this calculation, we implemented a fast and memory-efficient approach via multicore parallel computing.
library(MASTA)
parallel::detectCores()
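Because the kernel-smoothing step is parallelized, it helps to decide up front how many cores to devote to it. The snippet below is a minimal sketch, assuming you want to leave one core free for the operating system; the variable name n_core_use and the rule itself are illustrative choices, not package requirements.

# Sketch: pick a worker count, reserving one core for the OS (illustrative rule)
n_core_use <- max(1, parallel::detectCores() - 1)
n_core_use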
TrainCode <- data_org$TrainCode
TrainCode[, monthstd := month / fu_time]  # standardized time (range: 0-1)
TrainCode
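Since the FPCA evaluation grid below is defined on [0, 1], it is worth confirming that the standardized time really falls in that range. A small check, assuming (as the comment above states) that month never exceeds fu_time:

# Sanity check: standardized follow-up time should lie in [0, 1]
range(TrainCode$monthstd)
stopifnot(all(TrainCode$monthstd >= 0 & TrainCode$monthstd <= 1))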
In the demo, the TrainCode data is used in the FPCA step; it contains longitudinal encounter records for 20,600 patients (20,000 unlabeled and 600 labeled).
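To tie the data back to the patient count quoted above, one can count distinct patients directly. This is a sketch only: the patient identifier column is assumed to be named id here, which may not match the column name in the packaged data.

# Sketch: count distinct patients. The column name "id" is an assumption;
# substitute the actual patient identifier column in TrainCode.
length(unique(TrainCode$id))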
# count and percentage of non-zero values in columns 4-6 (the three codes)
NZ <- TrainCode[, 4:6] != 0
nz.pct <- round(colMeans(NZ) * 100)
data.frame(nz = colSums(NZ), nz.pct = paste0(nz.pct, "%"))
timing <- function(code, ngrid, n_core) {
  x <- seq(0, 1, length.out = ngrid)   # evaluation grid on standardized time
  pred <- TrainCode[[code]]            # counts for the selected code
  TrainNP <- pred[pred > 0]            # non-zero counts only
  t <- rep(TrainCode$monthstd, pred)   # one time point per encounter
  h1 <- bw.nrd(t)                      # normal reference rule bandwidth
  h2 <- bw.nrd(t)^(5/6)                # second bandwidth, h1^(5/6)
  system.time(FPC.Kern.S(x, t, TrainNP, h1, h2, n_core = n_core))["elapsed"]
}
# timing(code = "pred1", ngrid = 101, n_core = 4)
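As a quick usage example, the helper can be called directly to compare a single-core run against a four-core run for one code; the 101-point grid matches the commented call above, and the exact timings are machine-dependent.

# Illustration only: timings vary by hardware; the speed-up should approach the
# number of cores when the kernel computation dominates the overhead.
t1 <- unname(timing(code = "pred1", ngrid = 101, n_core = 1))
t4 <- unname(timing(code = "pred1", ngrid = 101, n_core = 4))
c(one_core = t1, four_cores = t4, speedup = t1 / t4)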
DF <- expand.grid(
  codes = paste0("pred", 1:3),
  ngrid = c(100, 200, 400) + 1,
  n_core = c(1, 2, 4),
  time = NA
)
for (i in 1:nrow(DF)) {
  code <- as.character(DF$codes[i])
  DF$time[i] <- timing(code, DF$ngrid[i], DF$n_core[i])
}
The figure below shows the elapsed time (in seconds) for the three event codes under different settings. Even for ngrid = 401, the elapsed time is less than one minute for each setting. This linear scalability makes the MASTA algorithm feasible even in studies with very large unlabeled data.
library(ggplot2)
ggplot(DF, aes(x = factor(n_core), y = time)) +
  facet_grid(paste("ngrid =", ngrid) ~ codes) +
  geom_bar(stat = "identity") +
  theme_bw() +
  xlab("Number of CPU Cores") +
  ylab("Elapsed Time (seconds)")
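As a complement to the figure above, the benchmark results can be summarized as speed-up factors relative to the single-core runs, which makes the scalability claim easier to read off directly. This is a sketch using data.table syntax, consistent with how TrainCode is manipulated earlier; exact numbers depend on the machine.

# Sketch: speed-up of each run relative to the single-core run for the same
# code and grid size (speedup = 1 for the single-core rows).
library(data.table)
bench <- as.data.table(DF)
bench[, speedup := time[n_core == 1] / time, by = .(codes, ngrid)]
bench[order(codes, ngrid, n_core)]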