R/04-SplatPop.R

Defines functions SplatPop_simulation SplatPop_estimation

Documented in SplatPop_estimation SplatPop_simulation

#' Estimate Parameters From Real Datasets by SplatPop
#'
#' This function is used to estimate useful parameters from a real dataset by
#' using `splatPopEstimate` function in Splatter package.
#'
#' @param ref_data A count matrix. Each row represents a gene and each column
#' represents a cell.
#' @param verbose Logical.
#' @param seed An integer of a random seed.
#' @param other_prior A list with names of certain parameters. Some methods need
#' extra parameters to execute the estimation step, so you must input them. In
#' simulation step, the number of cells, genes, groups, batches, the percent of
#' DEGs and other variables are usually customed, so before simulating a dataset
#' you must point it out. See `Details` below for more information.
#' @importFrom splatter splatPopEstimate
#' @importFrom dplyr mutate group_by slice_sample
#'
#' @return A list contains the estimated parameters and the results of execution
#' detection.
#' @export
#' @details
#' When estimating parameters with SplatPop, users can input some extra information (but it is rarely used):
#' 1. means. Matrix of real gene means across a population, where each row is a gene and each column is an individual in the population.
#' 2. eqtl. A data.frame with all or top eQTL pairs from a real eQTL analysis. Must include columns: 'gene_id', 'pval_nominal', and 'slope'.
#' For more information, please check [splatter::splatPopEstimate] or <https://www.bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatPop.html>
#' @references
#' Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome biology, 2017, 18(1): 1-15. <https://doi.org/10.1186/s13059-017-1305-0>
#'
#' Bioconductor URL: <https://bioconductor.org/packages/release/bioc/html/splatter.html>
#'
#' Github URL: <https://github.com/Oshlack/splatter>
#' @examples
#' \dontrun{
#' ref_data <- simmethods::data
#' # Estimate parameters
#' estimate_result <- SplatPop_estimation(ref_data = ref_data,
#'                                        verbose = TRUE,
#'                                        seed = 111)
#' estimate_result <- estimate_result[["estimate_result"]]
#' ## Check the class
#' class(estimate_result) == "SplatPopParams"
#' }
#'
SplatPop_estimation <- function(ref_data,
                                verbose = FALSE,
                                other_prior = NULL,
                                seed
){
  ##############################################################################
  ####                               Check                                   ###
  ##############################################################################
  if(!is.matrix(ref_data)){
    ref_data <- as.matrix(ref_data)
  }
  ##############################################################################
  ####                            Estimation                                 ###
  ##############################################################################
  if(verbose){
    message("Estimating parameters using SplatPop")
  }
  # Seed
  set.seed(seed)
  # Estimation
  estimate_detection <- peakRAM::peakRAM(
    estimate_result <- splatter::splatPopEstimate(ref_data,
                                                  means = other_prior[["means"]],
                                                  eqtl = other_prior[["eqtl"]])
  )
  ##############################################################################
  ####                           Ouput                                       ###
  ##############################################################################
  estimate_output <- list(estimate_result = estimate_result,
                          estimate_detection = estimate_detection)
  return(estimate_output)
}


#' Simulate Datasets by SplatPop
#'
#' This function is used to simulate datasets from learned parameters by `splatPopSimulate`
#' function in Splatter package.
#'
#' @param parameters A object generated by [splatter::splatPopEstimate()]
#' @param other_prior A list with names of certain parameters. Some methods need
#' extra parameters to execute the estimation step, so you must input them. In
#' simulation step, the number of cells, genes, groups, batches, the percent of
#' DEGs and other variables are usually customed, so before simulating a dataset
#' you must point it out.
#' @param return_format A character. Alternatives choices: list, SingleCellExperiment,
#' Seurat, h5ad. If you select `h5ad`, you will get a path where the .h5ad file saves to.
#' @param verbose Logical. Whether to return messages or not.
#' @param seed A random seed.
#' @importFrom splatter splatPopSimulate mockVCF
#' @importFrom BiocGenerics as.data.frame
#'
#' @details
#' In addtion to simulate datasets with default parameters, users want to simulate
#' other kinds of datasets, e.g. a counts matrix with 2 or more cell groups. In
#' SplatPop, you can set extra parameters to simulate datasets.
#'
#' The customed parameters you can set are below:
#' 1. nCells. In SplatPop, you can not set nCells directly and should set batchCells instead. For example, if you want to simulate 1000 cells, you can type `other_prior = list(batchCells = 1000)`. If you type `other_prior = list(batchCells = c(500, 500))`, the simulated data will have two batches.
#' 2. nGenes. You can directly set `other_prior = list(nGenes = 5000)` to simulate 5000 genes.
#' 3. nGroups. You can not directly set `other_prior = list(nGroups = 3)` to simulate 3 groups. Instead, you should set `other_prior = list(prob.group = c(0.2, 0.3, 0.5))` where the sum of group probabilities must equal to 1.
#' 4. de.prob. You can directly set `other_prior = list(de.prob = 0.2)` to simulate DEGs that account for 20 percent of all genes.
#' 5. prob.group. You can directly set `other_prior = list(prob.group = c(0.2, 0.3, 0.5))` to assign three proportions of cell groups. Note that the number of groups always equals to the length of the vector.
#' 6. nBatches. You can not directly set `other_prior = list(nBatches = 3)` to simulate 3 batches. Instead, you should set `other_prior = list(batchCells = c(500, 500, 500))` to reach the goal and the total cells are 1500.
#' 7. If users want to simulate datasets for trajectory inference, just set `other_prior = list(paths = TRUE)`. Simulating trajectory datasets can also specify the parameters of group and batch. See `Examples`.
#'
#' For more customed parameters in SplatPop, please check [splatter::SplatPopParams()].
#' For detailed information about SplatPop, go to <https://www.bioconductor.org/packages/release/bioc/vignettes/splatter/inst/doc/splatPop.html>.
#'
#' @export
#' @references
#' Azodi C B, Zappia L, Oshlack A, et al. splatPop: simulating population scale single-cell RNA sequencing data. Genome biology, 2021, 22(1): 1-16. <https://doi.org/10.1186/s13059-021-02546-1>
#'
#' Bioconductor URL: <https://bioconductor.org/packages/release/bioc/html/splatter.html>
#'
#' Github URL: <https://github.com/Oshlack/splatter>
#'
#' @examples
#' \dontrun{
#' # Load data
#' ref_data <- simmethods::data
#' # Estimate parameters
#' estimate_result <- simmethods::SplatPop_estimation(ref_data = ref_data,
#'                                                    verbose = TRUE,
#'                                                    seed = 10)
#'
#' # (1) Simulate 500 cells (Since we can not set nCells directly, so we can set
#' # batchCells (a numeric vector)) and 500 genes
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(batchCells = 500,
#'                      nGenes = 500),
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' count_data <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(count_data)
#'
#'
#' # (2) Simulate one group and one batch
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = NULL,
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' count_data <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(count_data)
#'
#'
#' # (3) Simulate two groups (de.prob = 0.1) and one batch
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(nGenes = 500,
#'                      prob.group = c(0.4, 0.6)),
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' count_data <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(count_data)
#' ## cell information
#' col_data <- simulate_result[["simulate_result"]][["col_meta"]]
#' table(col_data$group)
#' ## gene information
#' row_data <- simulate_result[["simulate_result"]][["row_meta"]]
#' ### The result of Splat contains the factors of different groups and uses can
#' ### calculate the fold change by division. For example, the DEFactors of A gene
#' ### in Group1 and Group2 are respectively 2 and 1, and the fold change of A gene
#' ### is 2/1=2 or 1/2=0.5.
#' fc_group1_to_group2 <- row_data$DEFacGroup2/row_data$DEFacGroup1
#' table(fc_group1_to_group2 != 1)[2]/500 ## de.prob = 0.1
#' ### number of all DEGs
#' table(row_data$de_gene)
#'
#'
#' # (4) Simulate two groups (de.prob = 0.2) and two batches
#' ## Since we can not set nBatches directly, so we can set batchCells (a numeric vector)
#' ## to determin the number of batches and cells simutaniously.
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(prob.group = c(0.4, 0.6),
#'                      batchCells = c(80, 80),
#'                      nGenes = 500,
#'                      de.prob = 0.2),
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' count_data <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(count_data)
#' col_data <- simulate_result[["simulate_result"]][["col_meta"]]
#' table(col_data$group)
#' table(col_data$batch)
#'
#'
#' # (5) Simulate three groups (de.prob = 0.2) and two batches
#' ## Since we can not set nBatches directly, so we can set batchCells (a numeric vector)
#' ## to determin the number of batches and cells simutaniously.
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(prob.group = c(0.4, 0.3, 0.3),
#'                      batchCells = c(80, 80),
#'                      nGenes = 500,
#'                      de.prob = 0.2),
#'   return_format = "list",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' count_data <- simulate_result[["simulate_result"]][["count_data"]]
#' dim(count_data)
#' col_data <- simulate_result[["simulate_result"]][["col_meta"]]
#' table(col_data$group)
#' table(col_data$batch)
#' ## row data
#' row_data <- simulate_result[["simulate_result"]][["row_meta"]]
#' ### DEGs
#' table(row_data$de_gene)
#' ### fold change of Group1 to Group2
#' fc_group1_to_group2 <- row_data$DEFacGroup2/row_data$DEFacGroup1
#' table(fc_group1_to_group2 > 1)[2]/500
#' ### fold change of Group1 to Group3
#' fc_group1_to_group3 <- row_data$DEFacGroup3/row_data$DEFacGroup1
#' table(fc_group1_to_group3 > 1)[2]/500
#' ### fold change of Group2 to Group3
#' fc_group2_to_group3 <- row_data$DEFacGroup3/row_data$DEFacGroup2
#' table(fc_group2_to_group3 > 1)[2]/500
#'
#' # 6) Simulate trajectory (only one group is simulated by default)
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(nGenes = 500,
#'                      paths = TRUE),
#'   return_format = "SingleCellExperiment",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' ## plot
#' result <- scater::logNormCounts(simulate_result[["simulate_result"]])
#' result <- scater::runPCA(result)
#' plotPCA(result, colour_by = "group")
#'
#' # 7) Simulate trajectory (three groups)
#' simulate_result <- simmethods::SplatPop_simulation(
#'   parameters = estimate_result[["estimate_result"]],
#'   other_prior = list(paths = TRUE,
#'                      nGenes = 500,
#'                      group.prob = c(0.3, 0.4, 0.3),
#'                      de.facLoc = 0.5,
#'                      de.prob = 0.5),
#'   return_format = "SingleCellExperiment",
#'   verbose = TRUE,
#'   seed = 111
#' )
#' ## plot
#' result <- scater::logNormCounts(simulate_result[["simulate_result"]])
#' result <- scater::runPCA(result)
#' plotPCA(result, colour_by = "group")
#' }
#'
SplatPop_simulation <- function(parameters,
                                other_prior = NULL,
                                return_format,
                                verbose = FALSE,
                                seed
){
  ##############################################################################
  ####                               Check                                   ###
  ##############################################################################
  assertthat::assert_that(class(parameters) == "SplatPopParams")
  if(!is.null(other_prior)){
    parameters <- simutils::set_parameters(parameters = parameters,
                                           other_prior = other_prior,
                                           method = "SplatPop")
  }
  if(!is.null(other_prior[["prob.group"]])){
    parameters <- splatter::setParam(parameters,
                                     name = "group.prob",
                                     value = other_prior[["prob.group"]])

  }
  # Get params to check
  params_check <- splatter::getParams(parameters, c("nBatches",
                                                    "batchCells",
                                                    "nGroups",
                                                    "group.prob",
                                                    "de.prob",
                                                    "nCells",
                                                    "nGenes"))

  # check batches
  if(length(params_check[["batchCells"]]) > 1){
    assertthat::assert_that(params_check[["nBatches"]] > 1)
  }else{
    assertthat::assert_that(params_check[["nBatches"]] == 1)
  }
  # check groups
  if(length(params_check[["group.prob"]]) > 1){
    assertthat::assert_that(params_check[["nGroups"]] > 1)
  }else{
    assertthat::assert_that(params_check[["nGroups"]] == 1)
  }
  # DEGs proportion
  de.prob <- params_check[["de.prob"]]
  # Return to users
  message(paste0("nCells: ", params_check[['nCells']]))
  message(paste0("nGenes: ", params_check[['nGenes']]))
  message(paste0("nGroups: ", params_check[['nGroups']]))
  message(paste0("de.prob: ", de.prob))
  message(paste0("nBatches: ", params_check[['nBatches']]))
  ##############################################################################
  ####                            Simulation                                 ###
  ##############################################################################
  if(verbose){
    message("Simulating datasets using SplatPop")
  }
  # Seed
  parameters <- splatter::setParam(parameters, name = "seed", value = seed)
  # de.prob
  parameters <- splatter::setParam(parameters,
                                   name = "de.prob",
                                   value = de.prob/params_check[['nGroups']])
  # vcf
  if(is.null(other_prior[["vcf"]])){
    if(params_check[['nBatches']] == 1){
      vcf <- splatter::mockVCF(n.samples = 1,
                               seed = seed)
    }else{
      vcf <- splatter::mockVCF(n.samples = 10,
                               seed = seed)
    }
  }else vcf <- other_prior[["vcf"]]
  # gff
  if(is.null(other_prior[["gff"]])){
    gff <- splatter::mockGFF(n.genes = params_check[['nGenes']],
                             seed = seed)
  }else gff <- other_prior[["gff"]]
  # Simulation
  if(!is.null(other_prior[["paths"]])){
    cat("Simulating trajectory datasets by SplatPop \n")
    submethod <- "paths"
    if(!is.null(other_prior[["path.from"]])){
      parameters <- splatter::setParam(parameters,
                                       name = "path.from",
                                       value = other_prior[["path.from"]])
    }else{
      parameters <- splatter::setParam(parameters,
                                       name = "path.from",
                                       value = seq(1:params_check[['nGroups']])-1)
    }
    simulate_detection <- peakRAM::peakRAM(
      simulate_result <- splatter::splatPopSimulate(parameters,
                                                    method = "paths",
                                                    vcf = vcf,
                                                    gff = gff,
                                                    eqtl = other_prior[["eqtl"]],
                                                    means = other_prior[["means"]],
                                                    key = other_prior[["key"]],
                                                    verbose = verbose)
    )
  }else{
    simulate_detection <- peakRAM::peakRAM(
      simulate_result <- splatter::splatPopSimulate(parameters,
                                                    vcf = vcf,
                                                    gff = gff,
                                                    eqtl = other_prior[["eqtl"]],
                                                    means = other_prior[["means"]],
                                                    key = other_prior[["key"]],
                                                    verbose = verbose)
    )
  }

  ##############################################################################
  ####                        Format Conversion                              ###
  ##############################################################################
  # col_data
  col_data <- as.data.frame(SummarizedExperiment::colData(simulate_result))
  if(params_check[['nBatches']] != 1){
    col_data <- col_data %>%
      dplyr::mutate("n" = 1:nrow(col_data)) %>%
      dplyr::group_by("Group") %>%
      dplyr::slice_sample(prop = 0.1) %>%
      BiocGenerics::as.data.frame()
    cell_index <- col_data$n
  }
  col_data <- col_data[, c("Cell", "Batch", "Group")]
  colnames(col_data) <- c("cell_name", "batch", "group")
  col_data$cell_name <- paste0("Cell", 1:nrow(col_data))
  rownames(col_data) <- col_data$cell_name
  # counts
  counts <- SingleCellExperiment::counts(simulate_result)
  if(params_check[['nBatches']] != 1){
    counts <- counts[, cell_index]
  }
  colnames(counts) <- paste0("Cell", 1:ncol(counts))
  rownames(counts) <- paste0("Gene", as.numeric(stringr::str_split(rownames(counts), "_", simplify = TRUE)[, 2]))
  counts <- counts[paste0("Gene", 1:nrow(counts)), ]
  # row_data
  row_data <- as.data.frame(SummarizedExperiment::rowData(simulate_result))
  if(params_check[['nGroups']] == 1){
    row_data <- data.frame("gene_name" = paste0("Gene", 1:nrow(counts)))
    rownames(row_data) <- row_data$gene_name
  }else{
    group_fac <- row_data[, grep(colnames(row_data), pattern = "^GroupDE")]
    total_sum <- rowSums(group_fac)
    de_gene <- ifelse(total_sum == params_check[['nGroups']], "no", "yes")
    row_data <- data.frame("gene_name" = paste0("Gene", 1:nrow(counts)),
                           "de_gene" = de_gene)
    row_data <- cbind(row_data, group_fac)
    colnames(row_data) <- c("gene_name", "de_gene", paste0("DEFacGroup", 1:params_check[['nGroups']]))
    rownames(row_data) <- row_data$gene_name
  }
  # Establish SingleCellExperiment
  simulate_result <- SingleCellExperiment::SingleCellExperiment(list(counts = counts),
                                                                colData = col_data,
                                                                rowData = row_data)
  simulate_result <- simutils::data_conversion(SCE_object = simulate_result,
                                               return_format = return_format)

  ##############################################################################
  ####                           Ouput                                       ###
  ##############################################################################
  simulate_output <- list(simulate_result = simulate_result,
                          simulate_detection = simulate_detection)
  return(simulate_output)
}
duohongrui/simmethods documentation built on June 17, 2024, 10:49 a.m.