simulate_data: simulate_data

View source: R/methods.R

simulate_data    R Documentation

simulate_data

Description

This function simulates counts from a linear model.

Usage

simulate_data(
  .data,
  .estimate_object,
  formula_composition,
  formula_variability = NULL,
  .sample = NULL,
  .cell_group = NULL,
  .coefficients = NULL,
  variability_multiplier = 5,
  number_of_draws = 1,
  mcmc_seed = sample(1e+05, 1),
  cores = detectCores()
)

Arguments

.data

A tibble including the following columns: cell_group name, sample name, read counts, factors, P-value, and significance.

.estimate_object

The result of running sccomp_estimate(). It is used to sample from the properties of real data.

formula_composition

A formula. The same formula used to perform the differential cell_group abundance analysis.

formula_variability

A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should include only the factor of interest, because a large amount of data is needed to estimate variability for each factor.

.sample

A column name as symbol. The sample identifier

.cell_group

A column name as symbol. The cell_group identifier

.coefficients

The column names for coefficients, for example, c(b_0, b_1)

variability_multiplier

A real scalar. This can be used for artificially increasing the variability of the simulation for benchmarking purposes.

number_of_draws

An integer. How many copies of the data to draw from the model's joint posterior distribution.

mcmc_seed

An integer. Used for Markov-chain Monte Carlo reproducibility. By default, a random number is sampled from 1 to 100000 (sample(1e+05, 1)). This default can itself be controlled with set.seed().

cores

Integer, the number of cores to be used for parallel calculations.
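
The defaults for mcmc_seed and cores are both computed when the function is called, so it can help to see how they behave in isolation. The following is a minimal base-R sketch, not part of sccomp: the helper draw_default_seed() is a hypothetical name that only mimics the default expression shown in Usage, and the NA fallback for the core count is an assumption about what a cautious caller might want.

```r
# Hypothetical helper mimicking the documented default `sample(1e+05, 1)`
draw_default_seed <- function() sample(1e+05, 1)

# Resetting the RNG state before each draw reproduces the same default seed
set.seed(123)
seed_a <- draw_default_seed()
set.seed(123)
seed_b <- draw_default_seed()

identical(seed_a, seed_b)

# detectCores() can return NA on some platforms, so fall back to one core
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L
```

Either value could then be passed explicitly, for example mcmc_seed = seed_a and cores = n_cores, instead of relying on the defaults.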

Value

A tibble (tbl) with the following columns:

  • sample - A character column representing the sample name.

  • type - A factor column representing the type of the sample.

  • phenotype - A factor column representing the phenotype in the data.

  • count - An integer column representing the original cell counts.

  • cell_group - A character column representing the cell group identifier.

  • b_0 - A numeric column representing the first coefficient used for simulation.

  • b_1 - A numeric column representing the second coefficient used for simulation.

  • generated_proportions - A numeric column representing the generated proportions from the simulation.

  • generated_counts - An integer column representing the generated cell counts from the simulation.

  • replicate - An integer column representing the replicate number for each draw from the posterior distribution.
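
As a rough illustration of how the replicate column might be used, the sketch below builds a toy data frame with a few of the columns listed above and summarises it per draw. The data frame sim and all of its values are invented stand-ins for illustration only; they are not output from simulate_data().

```r
# Toy stand-in for a simulate_data() result; values are made up purely
# to illustrate the column layout (2 samples x 2 cell groups x 2 draws)
sim <- data.frame(
  sample = rep(c("s1", "s2"), each = 2, times = 2),
  cell_group = rep(c("B", "T"), times = 4),
  generated_counts = c(10L, 30L, 25L, 15L, 12L, 28L, 20L, 20L),
  replicate = rep(1:2, each = 4)
)

# Total simulated cells per sample within each posterior draw
totals <- aggregate(generated_counts ~ sample + replicate, data = sim, FUN = sum)

# One data frame per draw, e.g. for benchmarking each replicate separately
by_draw <- split(sim, sim$replicate)
```

With number_of_draws greater than 1, splitting on replicate in this way would give one simulated data set per draw.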

Examples


message("Use the following example after having installed install.packages(\"cmdstanr\", repos = c(\"https://stan-dev.r-universe.dev/\", getOption(\"repos\")))")


  if (instantiate::stan_cmdstan_exists()) {
    data("counts_obj")
    library(dplyr)

    estimate = sccomp_estimate(
      counts_obj,
      ~ type, ~1, sample, cell_group, count,
      cores = 1
    )

    # Set coefficients for cell_groups. In this case all coefficients are 0 for simplicity.
    counts_obj = counts_obj |> mutate(b_0 = 0, b_1 = 0)

    # Simulate data
    simulate_data(counts_obj, estimate, ~type, ~1, sample, cell_group, c(b_0, b_1))
  }



stemangiola/sccomp documentation built on Nov. 15, 2024, 8 a.m.