R/29-PROSSTT.R

Defines functions PROSSTT_simulation PROSSTT_estimation

Documented in PROSSTT_estimation PROSSTT_simulation

#' Estimate Parameters From Real Datasets by PROSSTT
#'
#' This function is used to estimate useful parameters from a real dataset by
#' using \code{make_trees} function in simutils package.
#'
#' @param ref_data A count matrix. Each row represents a gene and each column
#' represents a cell.
#' @param verbose Logical.
#' @param other_prior A list with names of certain parameters. Some methods need
#' extra parameters to execute the estimation step, so you must input them. In
#' simulation step, the number of cells, genes, groups, batches, the percent of
#' DEGs are usually customed, so before simulating a dataset you must point it out.
#' See `Details` below for more information.
#' @param seed An integer of a random seed.
#' @importFrom simutils make_trees
#' @return A list contains the estimated parameters and the results of execution
#' detection.
#' @export
#' @details
#' In PROSSTT, users can input cell group information by `other_prior = list(group.condition = xxx)`.
#' If this information is not available, we will detect the potential groups in reference data automatically.
#' For more information about PROSSTT, see `Examples` and `References`.
#'
#' @references
#' Papadopoulos N, Gonzalo P R, Söding J. PROSSTT: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics, 2019, 35(18): 3517-3519. <https://doi.org/10.1093/bioinformatics/btz078>
#'
#' Github URL: <https://github.com/soedinglab/prosstt/>
#'
#' Document URL: <http://wwwuser.gwdg.de/~compbiol/prosstt/doc/>
#'
#' @examples
#' \dontrun{
#' ref_data <- simmethods::data
#' ## estimation
#' estimate_result <- simmethods::PROSSTT_estimation(
#'   ref_data = ref_data,
#'   other_prior = NULL,
#'   verbose = TRUE,
#'   seed = 111
#' )
#' }
#'
PROSSTT_estimation <- function(ref_data,
                               other_prior = NULL,
                               verbose = FALSE,
                               seed){

  ##############################################################################
  ####                            Environment                                ###
  ##############################################################################
  if(!requireNamespace("simutils", quietly = TRUE)){
    message("Splatter is not installed on your device")
    message("Installing simutils...")
    devtools::install_github("duohongrui/simutils")
  }
  ##############################################################################
  ####                               Check                                   ###
  ##############################################################################
  if(!is.matrix(ref_data)){
    ref_data <- as.matrix(ref_data)
  }
  if(!is.null(other_prior[["group.condition"]])){
    group <- other_prior[["group.condition"]]
  }else{
    group <- NULL
  }
  ##############################################################################
  ####                            Estimation                                 ###
  ##############################################################################
  if(verbose){
    message("Estimating parameters using PROSSTT")
  }
  # Seed
  set.seed(seed)
  # Estimation
  estimate_detection <- peakRAM::peakRAM(
    estimate_result <- simutils::make_trees(ref_data,
                                            group = group,
                                            is_Newick = TRUE)
  )
  estimate_result <- list(newick_tree = estimate_result,
                          data_dim = dim(ref_data))
  ##############################################################################
  ####                           Ouput                                       ###
  ##############################################################################
  estimate_output <- list(estimate_result = estimate_result,
                          estimate_detection = estimate_detection)
  return(estimate_output)
}


#' Simulate Datasets by PROSSTT
#'
#' @param parameters A object generated by [simutils::make_trees()]
#' @param other_prior A list with names of certain parameters. Some methods need
#' extra parameters to execute the estimation step, so you must input them. In
#' simulation step, the number of cells, genes, groups, batches, the percent of
#' DEGs are usually customed, so before simulating a dataset you must point it out.
#' See `Details` below for more information.
#' @param return_format A character. Alternative choices: list, SingleCellExperiment,
#' Seurat, h5ad. If you select `h5ad`, you will get a path where the .h5ad file saves to.
#' @param verbose Logical. Whether to return messages or not.
#' @param seed A random seed.
#' @importFrom stringr str_split str_count str_extract_all str_replace
#' @export
#' @details
#' In PROSSTT, `nCells` and `nGenes` are usually customed and users can set
#' `other_prior = list(nCells = 1000, nGenes = 2000)` to simulate 1000 cells and
#' 5000 genes.
#' For more parameters and documents, see `Examples` and `simmethods::get_method("PROSSTT")`.
#'
#' @references
#' Papadopoulos N, Gonzalo P R, Söding J. PROSSTT: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics, 2019, 35(18): 3517-3519. <https://doi.org/10.1093/bioinformatics/btz078>
#'
#' Github URL: <https://github.com/soedinglab/prosstt/>
#'
#' Document URL: <http://wwwuser.gwdg.de/~compbiol/prosstt/doc/>
#'
#' @examples
#' ref_data <- simmethods::data
#' ## estimation
#' estimate_result <- simmethods::PROSSTT_estimation(
#'   ref_data = ref_data,
#'   other_prior = NULL,
#'   verbose = TRUE,
#'   seed = 111
#' )
#'
#' # 1) Simulate with default parameters
#' simulate_result <- simmethods::PROSSTT_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = NULL,
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' ## counts
#' counts <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(counts)
#'
#' # 2) 2000 cells and 5000 genes
#' simulate_result <- simmethods::PROSSTT_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(nCells = 2000,
#'                      nGenes = 5000),
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#'
#' ## counts
#' counts <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(counts)
PROSSTT_simulation <- function(parameters,
                               other_prior = NULL,
                               return_format,
                               verbose = FALSE,
                               seed
){

  ##############################################################################
  ####                            Environment                                ###
  ##############################################################################
  if(!requireNamespace("simutils", quietly = TRUE)){
    message("Splatter is not installed on your device")
    message("Installing simutils...")
    devtools::install_github("duohongrui/simutils")
  }
  ##############################################################################
  ####                               Check                                   ###
  ##############################################################################
  # nGenes
  if(!is.null(other_prior[["nGenes"]])){
    gene_num <- as.integer(other_prior[["nGenes"]])
  }else{
    gene_num <- as.integer(parameters[["data_dim"]][1])
  }
  # nCells
  if(!is.null(other_prior[["nCells"]])){
    ncells <- as.integer(other_prior[["nCells"]])
  }else{
    ncells <- as.integer(parameters[["data_dim"]][2])
  }

  if(is.null(other_prior[["newick_tree"]])){
    data_dim <- parameters[["data_dim"]]
    newick_tree <- parameters[["newick_tree"]]
    inter_cell <- stringr::str_split(newick_tree, pattern = "[)]", simplify = TRUE)
    len_inter <- length(inter_cell)-2
    change_index <- which(as.logical(stringr::str_count(inter_cell, pattern = "^:")))
    inter_cell[change_index] <- paste0(LETTERS[1:len_inter], inter_cell[change_index])
    newick_tree <- paste(inter_cell, collapse = ")")
    edge <- unlist(stringr::str_extract_all(string = newick_tree, pattern = ":[:digit:]+[.]*[:digit:]*"))
    node <- length(edge)+1
    cell_allo <- c(rep(round(ncells/node), node-1), ncells-(round(ncells/node)*(node-1)))
    # Return to users
    message(paste0("nCells: ", ncells))
    message(paste0("nGenes: ", gene_num))
    for(o in 1:length(edge)){
      newick_tree <- stringr::str_replace(newick_tree,
                                          pattern = edge[o],
                                          replacement = paste0(":", as.character(cell_allo[o])))
    }
    newick_tree <- paste0(stringr::str_split(newick_tree, pattern = ";",simplify = T),
                          paste0('Z:', cell_allo[node], ";"))[1]
  }else{
    newick_tree <- other_prior[["newick_tree"]]
  }
  # alpha
  if(!is.null(other_prior[["alpha"]])){
    alpha <- other_prior[["alpha"]]
  }else{
    alpha <- 0.2
  }
  # beta
  if(!is.null(other_prior[["beta"]])){
    beta <- other_prior[["beta"]]
  }else{
    beta <- 3
  }
  # modules
  if(!is.null(other_prior[["modules"]])){
    modules <- other_prior[["modules"]]
  }else{
    modules <- 15
  }

  exec_text <- system.file("python", "PROSSTT_python.py", package = "simmethods")
  if(!requireNamespace("reticulate", quietly = TRUE)){
    message("reticulate is not installed on your device")
    message("Installing reticulate...")
    utils::install.packages("reticulate")
  }
  a <- system("which python", intern = TRUE)
  if(S4Vectors::isEmpty(a)){
    reticulate::use_python(system("which python3", intern = TRUE))
  }else{
    reticulate::use_python(a)
  }
  reticulate::source_python(exec_text)
  simulation_params <- list(newick_string = newick_tree,
                            modules = modules,
                            gene_num = gene_num,
                            seed = as.integer(seed),
                            alpha = alpha,
                            beta = beta)
  ##############################################################################
  ####                            Simulation                                 ###
  ##############################################################################
  if(verbose){
    message("Simulating datasets using PROSSTT")
  }
  # Estimation
  simulate_detection <- peakRAM::peakRAM(
    simulate_result <- do.call("PROSSTT_sim_Python", simulation_params)
  )
  ##############################################################################
  ####                        Format Conversion                              ###
  ##############################################################################
  simulate_result <- simulate_result %>% as.matrix() %>% t()
  colnames(simulate_result) <- paste0("Cell", 1:ncol(simulate_result))
  rownames(simulate_result) <- paste0("Gene", 1:nrow(simulate_result))
  ## col_data
  col_data <- data.frame("cell_name" = colnames(simulate_result))
  ## row_data
  row_data <- data.frame("gene_name" = rownames(simulate_result))
  # Establish SingleCellExperiment
  simulate_result <- SingleCellExperiment::SingleCellExperiment(list(counts = simulate_result),
                                                                colData = col_data,
                                                                rowData = row_data)

  simulate_result <- simutils::data_conversion(SCE_object = simulate_result,
                                               return_format = return_format)

  ##############################################################################
  ####                           Ouput                                       ###
  ##############################################################################
  simulate_output <- list(simulate_result = simulate_result,
                          simulate_detection = simulate_detection)
  return(simulate_output)
}
duohongrui/simmethods documentation built on June 17, 2024, 10:49 a.m.