simData: simData
In HelenaLC/muscat: Multi-sample multi-group scRNA-seq data analysis tools

View source: R/simData.R

simData

R Documentation

simData

Description

Simulation of complex scRNA-seq data

Usage

simData(
  x,
  ng = nrow(x),
  nc = ncol(x),
  ns = NULL,
  nk = NULL,
  probs = NULL,
  dd = TRUE,
  p_dd = diag(6)[1, ],
  paired = FALSE,
  p_ep = 0.5,
  p_dp = 0.3,
  p_dm = 0.5,
  p_type = 0,
  lfc = 2,
  rel_lfc = NULL,
  phylo_tree = NULL,
  phylo_pars = c(ifelse(is.null(phylo_tree), 0, 0.1), 3),
  force = FALSE
)

Arguments

`x`	a `SingleCellExperiment`.
`ng`	number of genes to simulate. Importantly, for the library sizes computed by `prepSim` (= `exp(x$offset)`) to make sense, the number of simulated genes should match with the number of genes in the reference. To simulate a reduced number of genes, e.g. for testing and development purposes, please set `force = TRUE`.
`nc`	number of cells to simulate.
`ns`	number of samples to simulate; defaults to as many as available in the reference to avoid duplicated reference samples. Specifically, the number of samples will be set to `n = nlevels(x$sample_id)` when `dd = FALSE`, `n` per group when `dd, paired = TRUE`, and `floor(n/2)` per group when `dd = TRUE, paired = FALSE`. When a larger number samples should be simulated, set `force = TRUE`.
`nk`	number of clusters to simulate; defaults to the number of available reference clusters (`nlevels(x$cluster_id)`).
`probs`	a list of length 3 containing probabilities of a cell belonging to each cluster, sample, and group, respectively. List elements must be NULL (equal probabilities) or numeric values in [0, 1] that sum to 1.
`dd`	whether or not to simulate differential distributions; if TRUE, two groups are simulated and `ns` corresponds to the number of samples per group, else one group with `ns` samples is simulated.
`p_dd`	numeric vector of length 6. Specifies the probability of a gene being EE, EP, DE, DP, DM, or DB, respectively.
`paired`	logical specifying whether a paired design should be simulated (both groups use the same set of reference samples) or not (reference samples are drawn at random).
`p_ep`, `p_dp`, `p_dm`	numeric specifying the proportion of cells to be shifted to a different expression state in one group (see details).
`p_type`	numeric. Probability of EE/EP gene being a type-gene. If a gene is of class "type" in a given cluster, a unique mean will be used for that gene in the respective cluster.
`lfc`	numeric value to use as mean logFC (logarithm base 2) for DE, DP, DM, and DB type of genes.
`rel_lfc`	numeric vector of relative logFCs for each cluster. Should be of length `nlevels(x$cluster_id)` with `levels(x$cluster_id)` as names. Defaults to factor of 1 for all clusters.
`phylo_tree`	newick tree text representing cluster relations and their relative distance. An explanation of the syntax can be found here. The distance between the nodes, except for the original branch, will be translated in the number of shared genes between the clusters belonging to these nodes (this relation is controlled with `phylo_pars`). The distance between two clusters is defined as the sum of the branches lengths separating them.
`phylo_pars`	vector of length 2 providing the parameters that control the number of type genes. Passed to an exponential PDF (see details).
`force`	logical specifying whether to force simulation when `ng` and/or `ns` don't match the number of available reference genes and samples, respectively.

Details

simData simulates multiple clusters and samples across 2 experimental conditions from a real scRNA-seq data set.

The simulation of type genes can be performed in 2 ways; (1) via p_type to simulate independent clusters, OR (2) via phylo_tree to simulate a hierarchical cluster structure.

For (1), a subset of p_type % of genes are selected per cluster to use a different references genes than the remainder of clusters, giving rise to cluster-specific NB means for count sampling.

For (2), the number of shared/type genes at each node are given by a*G*e^(-b*d), where

a – controls the percentage of shared genes between nodes. By default, at most 10% of the genes are reserved as type genes (when b = 0). However, it is advised to tune this parameter depending on the input prep_sce.
b – determines how the number of shared genes decreases with increasing distance d between clusters (defined through phylo_tree).

Value

a SingleCellExperiment containing multiple clusters & samples across 2 groups as well as the following metadata:

cell metadata (colData(.))

a DataFrame containing, containing, for each cell, it's cluster, sample, and group ID.

gene metadata (rowData(.))

a DataFrame containing, for each gene, it's class (one of "state", "type", "none") and specificity (specs; NA for genes of type "state", otherwise a character vector of clusters that share the given gene).

experiment metadata (metadata(.))

experiment_info: a data.frame summarizing the experimental design.
n_cells: the number of cells for each sample.
gene_info: a data.frame containing, for each gene in each cluster, it's differential distribution category, mean logFC (NA for genes for categories "ee" and "ep"), gene used as reference (sim_gene), dispersion sim_disp, and simulation means for each group sim_mean.A/B.
ref_sids/kidskids: the sample/cluster IDs used as reference.
args: a list of the function call's input arguments.

Author(s)

Helena L Crowell & Anthony Sonrel

References

Crowell, HL, Soneson, C, Germain, P-L, Calini, D, Collin, L, Raposo, C, Malhotra, D & Robinson, MD: On the discovery of population-specific state transitions from multi-sample multi-condition single-cell RNA sequencing data. bioRxiv 713412 (2018). doi: https://doi.org/10.1101/713412

Examples

data(example_sce)
library(SingleCellExperiment)

# prep. SCE for simulation
ref <- prepSim(example_sce)

# simulate data
(sim <- simData(ref, nc = 200,
  p_dd = c(0.9, 0, 0.1, 0, 0, 0),
  ng = 100, force = TRUE,
  probs = list(NULL, NULL, c(1, 0))))

# simulation metadata
head(gi <- metadata(sim)$gene_info)

# should be ~10% DE  
table(gi$category)

# unbalanced sample sizes
sim <- simData(ref, nc = 100, ns = 2,
  probs = list(NULL, c(0.25, 0.75), NULL),
  ng = 10, force = TRUE)
table(sim$sample_id)

# one group only
sim <- simData(ref, nc = 100,
  probs = list(NULL, NULL, c(1, 0)),
  ng = 10, force = TRUE)
levels(sim$group_id)

# HIERARCHICAL CLUSTER STRUCTURE
# define phylogram specifying cluster relations
phylo_tree <- "(('cluster1':0.1,'cluster2':0.1):0.4,'cluster3':0.5);"
# verify syntax & visualize relations
library(phylogram)
plot(read.dendrogram(text = phylo_tree))

# let's use a more complex phylogeny
phylo_tree <- "(('cluster1':0.4,'cluster2':0.4):0.4,('cluster3':
  0.5,('cluster4':0.2,'cluster5':0.2,'cluster6':0.2):0.4):0.4);"
plot(read.dendrogram(text = phylo_tree))

# simulate clusters accordingly
sim <- simData(ref, 
  phylo_tree = phylo_tree, 
  phylo_pars = c(0.1, 3), 
  ng = 500, force = TRUE)
# view information about shared 'type' genes
table(rowData(sim)$class)

HelenaLC/muscat documentation built on June 9, 2025, 9:33 a.m.