Scaffold
is an R package for generating scRNA-seq data by
statistically modelling each step of the experimental
process. Simulation parameters can be estimated from real datasets,
and a comprehensive plotting function is provided to generate summary
figures comparing the real and simulated data. This vignette provides details and examples using the Scaffold package.
To install Scaffold
via GitHub:
devtools::install_github("rhondabacher/scaffold")
After successful installation, the package must be loaded into the working space:
library(scaffold)
The most common way to use Scaffold is to use an input dataset from which various parameters are estimated and then used to simulate data. The input dataset can be formatted as a data matrix or of the SingleCellExperiment
class. The
scaffold
package provides the uneq_ec_data
dataset to demonstrate formatting data as a SingleCellExperiment
class:
# Loading the example data data(uneq_ec_data) # Loading the SingleCellExperiment package if (!requireNamespace("SingleCellExperiment", quietly = TRUE)) BiocManager::install("SingleCellExperiment") library(SingleCellExperiment) # Creating the SingleCellExperiment class object: sce <- SingleCellExperiment(list(counts = uneq_ec_data)) sce
Now that we have the appropriate format for our input dataset, we first run the estimateScaffoldParameters
function to set or estimate simulation parameters. The default parameters are set to simulate nonUMI data from the C1/Smart-seq protocol. To change this, adjust the parameters useUMI
(logical; whether to make the protocol UMI based) and protocol
("C1", "droplet" and "10X" are all acceptable). If the input dataset already contains UMI counts, then this should be specified by setting parameters sceUMI = TRUE
.
scaffoldParams <- estimateScaffoldParameters(sce, sceUMI=FALSE, useUMI = FALSE, protocol="C1")
The estimateScaffoldParameters
function documentation describes all the parameter options. A table of all the parameter options with notes is also given in this vignette in Section 6.
Once the ScaffoldParams
object is constructed, simulations can be
run with the simulateScaffold
function. The inputs to this function are the parameter object and the original sce dataset.
sce_sim <- simulateScaffold(scaffoldParams, sce)
The output sce_sim
is an object of class SingleCellExperiment
that
contains the simulated gene counts.
To access the simulated counts:
simcounts <- counts(sce_sim) simcounts[1:5,1:5]
To simulate data with UMIs or from the droplet/10X protocols the following parameters can be adjusted:
## The following will set the parameters as if the C1 protocol was UMI based: scaffoldParams <- estimateScaffoldParameters(sce, sceUMI = FALSE, useUMI = TRUE, protocol="C1") ## The following will set the parameters as if the droplet/10X protocol was used. scaffoldParams <- estimateScaffoldParameters(sce, sceUMI = FALSE, useUMI = TRUE, protocol="10X")
Let's look at an example generating UMI counts. When the UMI option is used, the output of simulateScaffold
contains two matrices, one having the read counts and the other with the umi counts.
scaffoldParams <- estimateScaffoldParameters(sce, sceUMI = FALSE, useUMI = TRUE, protocol="C1") sce_sim <- simulateScaffold(scaffoldParams, sce) sce_sim
To access the umi counts:
simumis <- sce_sim@assays@data$umi_counts simumis[1:5,1:5]
While many of the parameters can be estimated from the data and set based on a specific protocol, it may be of interest to compare how changing a parameter affects downstream properties of the data. To do this, a base simulation should be run so that the initial mRNA counts are generated and all parameters are pre-estimated.
Below we demonstrate code to simulate a baseline and a single parameter change:
# Baseline simulation scaffoldParams <- estimateScaffoldParameters(sce) sce_sim <- simulateScaffold(scaffoldParams, sce)
Now we want to examine the effect of only a single parameter, e.g. equalization amount. The previous output contains metadata and cell-specific data that can be passed to new simulations. The metadata object contains the simulated initial mRNA counts, this ensures all simulations are on the same starting population. The capture efficiency is estimated within simulateScaffold and output as column data. It is accessed via colData()$capEfficiency
.
# Some estimates are finalized in simulateScaffold so we want to pull those out: scaffoldParams@captureEfficiency <- colData(sce_sim)$capEfficiency # Change one parameter of interest print(scaffoldParams@equalizationAmount) # Previous setting scaffoldParams@equalizationAmount <- 0 # Update setting sce_sim_eq <- simulateScaffold(scaffoldParams, sce, inputInitial = sce_sim@metadata$initialSimCounts)
Scaffold can simulate multiple populations through the numCells
and usePops
parameters in the estimateScaffoldParameters
function. First, specify a vector with the number of cells for each population in numCells
. The parameter usePops
is a list object and should have the following elements:
propGenes
- The proportion of genes having distinct expression compared to the first (reference) population. The first population will be simulated based only on the input dataset and estimated/set parameters.fc_mean
- The average fold-change for expression differences compared to the first population.fc_sd
- Standard deviation of fold-change for expression differences compared to the first population.Fold-changes are simulated from a Normal(fc_mean
, fc_sd
) and the direction is chosen at random. These three elements are vectors with length equal to the number of populations. The first value for each of these elements should be zero to indicate no changes should be made to the reference population.
Below is an example simulating three cell populations each having 50 cells. Setting popHet
ensures that we are simulating a homogeneous population as the reference, this is not required but just for demonstration.
scaffoldParams <- estimateScaffoldParameters(sce, numCells = c(50,50,50), popHet = c(1,1), usePops = list(propGenes = c(0, .6, .4), fc_mean = c(0, 2, 1.5), fc_sd = c(0, .4, .4))) multipop.sce <- simulateScaffold(scaffoldParams, sce)
Below we can visualize the three distinct populations in a t-SNE plot using the scater
R package:
if (!requireNamespace("scater", quietly = TRUE)) BiocManager::install("scater") library(scater) multipop.sce <- logNormCounts(multipop.sce) multipop.sce <- runPCA(multipop.sce, name="PCA", ncomponents=15) multipop.sce <- runTSNE(multipop.sce, perplexity=10, dimred="PCA", n_dimred=10) plotReducedDim(multipop.sce, dimred = "TSNE", colour_by = "cellPopulation", point_alpha = .8, point_size=4) + theme(text = element_text(size=20)) + scale_colour_discrete(name="Population")
Scaffold can additionally simulate continuous cell populations or those having dyanmically expressed genes (e.g. cell development/differentiation) using the useDynamic
parameter. We assume some proportion of genes drive the dynamic process and we simulate their initial mRNA counts via a B-spline. The parameter useDynamic
is a list object and should have the following elements:
propGenes
- The proportion of genes to be simulated with dynamic expression.degree
- The degree of the B-spline (default is 2).knots
- Knots for the B-spline or 'locations' of major change along the continuous path of cells. The default is two knots and the locations are drawn from Uniform(0, .5) and Uniform(.5,1). The knot values must be within (0,1). theta
- the directional changes between the edges and knots, this is a vector of length equal to knots+degree+1. The default is to generate all directions from Normal(5, 5).User-specified values for knots
and theta
must be matrices as these are different for each gene.
set.seed(14) RcppZiggurat::zsetseed(14)
# Say we want 15% of genes dynamic ngenes <- ceiling(.15 * nrow(sce)) myknots <- matrix(runif(2*ngenes, 0, 1), ncol=2, nrow=ngenes) mytheta <- matrix(rnorm(5, 5, 5), ncol=5, nrow=ngenes) scaffoldParams <- estimateScaffoldParameters(sce, numCells = 50, popHet = c(1,1), useDynamic = list(propGenes = .15, degree = 2, knots = myknots, theta = mytheta)) multipop.sce <- simulateScaffold(scaffoldParams, sce)
Let's plot one of the genes that we simulated to be dynamic and a non-dynamic gene:
# Scaffold outputs the dynamic genes at the bottom of the output data matrix cells <- 1:ncol(multipop.sce) par(mfrow=c(1,2)) plot(cells, counts(multipop.sce)[17150,], xlab="Cells", ylab="Simulated Counts", main="Dynamic Gene") plot(cells, counts(multipop.sce)[192,], xlab="Cells", ylab="Simulated Counts", main="non-Dynamic Gene")
To simulate more than 5k cells, it's best to run Scaffold in batches especially when using UMIs as tracking the unique transcripts is memory intensive. Similar to the section for altering parameters, we can similarly estimate the baseline simulation parameters and then batches can be generated consecutively or in parallel (e.g. across high throughput compute cores).
# Baseline simulation scaffoldParams <- estimateScaffoldParameters(sce) sce_sim <- simulateScaffold(scaffoldParams, sce) # To keep all parameters the same, pull out the capture efficiencies and the initial mRNA counts. ## Then this can be used to cells e.g 2k at a time. scaffoldParams@captureEfficiency <- colData(sce_sim)$capEfficiency scaffoldParams@numCells <- 2000 sce_sim_eq <- simulateScaffold(scaffoldParams, sce, inputInitial = sce_sim@metadata$initialSimCounts)
scaffold
offers a function for generating summary figures of
the simulation results called makePlots
. The input is the output of simulateScaffold
and the original sce object.
par(mfrow=c(3,2)) makePlots(sce_sim, sce)
Parameter | Usage/Default | Notes/Tips ------------- | ------------- | ------------- Protocol to simulate from | C1/Smart-seq (default), 10X, droplet | Droplet and 10X have the same default settings. Whether to use UMI | True, False (default) | This is automatically set to TRUE for 10X/droplet protocols. Number of cells | Estimated from input data | Number of genes | Estimated from input data | Gene means | Estimated from input data | Total transcripts | Assumes 300,000 per cell | Degree of heterogeneity | Estimated from input data | A cell-specific scale factor simulate underlying cell population heterogeneity. Capture efficiency | Estimated from input data | Number of pre-amplification cycles | For C1/Smart-seq default is 18 | Usually this is specified in the experimental protocol Number of amplification cycles | Default is 12 for all protocols | Usually this is specified in the experimental protocol Pre-amplification efficiency | Generated from N(0.95, 0.02) for C1/Smart-seq data | This step is not used in 10X/droplet protocols. Amplification efficiency | Generated from N(0.95, 0.02) for all protocols | Cell-specific values for C1/Fluidigm and only one value for 10X/droplet Equalization amount | Value in [0,1] | 0 (complete equalization), 1 (no equalization; default), any other value is partial equalization, see manuscript Methods for details. Tagmentation efficiency | Generated from N(0.95, 0.02) for all protocols | Cell-specific values for C1/Fluidigm and only one value for 10X/droplet Total sequencing depth | Default is the sum of all counts in the input data |
Below are time trials for different protocol settings and number of cells. For UMI protocols, tracking the unique molecules is memory intensive and presents a limitation. We have included a section on how to generate larger numbers of cells in Section 2.5.
Number of Cells | Runtime (minutes) | Protocol ------------- | ------------- | ------------- 100 | 0.14 | C1, non-UMI 1000 | 1.55 | C1, non-UMI 5000 | 9.41 | C1, non-UMI 1000 | 3.73 | UMI/10X 5000 | 24.09 | UMI/10X
Here is the output of sessionInfo
on the system on which this document was compiled:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.