splatPop: simulating single-cell data for populations

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
suppressPackageStartupMessages({
  library("splatter")
  library("scater")
  library("VariantAnnotation")
  library("ggplot2")
})

splatPop logo

Introduction

splatPop is an extension of the splat model that allows you to simulate single cell count data for an entire population of individuals. Like with splat, these simulations resemble real single-cell data because they use parameters estimated from empirical data. Provided with genotype information (VCF) for a population as input, splatPop simulates gene counts for multiple cells for all individuals in the population. Realistic population structure (the pattern of genetic relatedness between individuals in the population) in the simulations is achieved by modelling expression Quantitative Trait Loci (eQTL) effects, where the expression of a gene is associated with the genotype of the individual at a specific loci. Finally, splatPop allows for the simulation of complex datasets with cells from multiple groups (e.g. cell types), cells along differentiation trajectories, and cells from different batches.

The splatPop model

The primary simulation function is splatPopSimulate, which runs through the two main phases:

  1. splatPopSimulateMeans: the simulation of means for all genes for all individuals in the population.
  2. splatPopSimulateSC: the simulation of single-cell counts for all cells for all genes for all individuals.

The second phase is essentially a wrapper around the original splatSimulate() function, which is described in detail here. The figure below describes the first phase. Input parameters that can be estimated from real data have double borders and are shaded by the type of data used (blue = single-cell counts, yellow = population scale bulk/sc-aggregated RNA-seq data, and green = eQTL mapping results). The final output (red) is a matrix of means for each gene and each individual that is used as input to the second phase.

The splatPop model for estimating gene means.

To get started with splatPop, you need genotype information for the population you want to simulate (i.e. a VCF). Genotype information should be provided as a VariantAnnotation object. A mock VariantAnnotation object can be produced using the mockVCF() function. Here we simulate single-cell RNA-sequencing counts for 100 random genes for 6 random samples:

vcf <- mockVCF(n.samples = 6)

sim <- splatPopSimulate(vcf = vcf, "nGenes" = 100)

sim <- logNormCounts(sim)
sim <- runPCA(sim, ncomponents = 10)
plotPCA(sim, colour_by = "Sample")

Detailed look into splatPop

Step 1: Parameter Estimation

The parameters used in splatPop have sensible default values, but can also be estimated from real data provided by the user. For example, gene mean and variance levels are sampled from gamma distributions derived from real population scale RNA-seq data and eQTL effect sizes from a gamma distribution derived from real eQTL mapping results. The default parameters were derived from GTEx data (v7, thyroid tissue). However, they can also be estimated from user provided data using splatPopEstimate(). You can also provide splatPopEstimate() with real single-cell RNA-sequencing data to estimate single-cell parameters as in splatEstimate().

All parameters needed for splatPop simulations are stored in a SplatPopParams object. In addition to the compartments in SplatParams described in detail in the Splat parameters vignette and the parameters that are set manually (described below), SplatPopParams also contains the following parameters that can be estimated from real data:

Let's take a look at the default parameters...

params <- newSplatPopParams()
params

This tells us we have "a Params object of class SplatPopParams" and shows the values of these parameters. As with SplatParams, the parameters that can be estimated by splatPopEstimate are in parentheses, those that can't be estimated are in brackets, and those that have been changed from their default are in ALL CAPS.

For example, we can estimate new parameter values from user provided data...

bulk.means <- mockBulkMatrix(n.genes=100, n.samples=100)
bulk.eqtl <- mockBulkeQTL(n.genes=100)
counts <- mockSCE()

params.est <- splatPopEstimate(means = bulk.means,
                               eqtl = bulk.eqtl,
                               counts = counts)
params.est

Note that splatPopEstimate() will only estimate new parameters if the data required is provided. For example, if you want to simulate data using default gene means and eQTL parameters, but from single-cell parameters estimated from your own real single-cell counts data, you could run splatPopEstimate() with only the counts argument provided.

Step 2: Simulate gene means

The splatPopSimulate() function runs both phases of splatPop, however we can run these two phases separately to highlight their unique functions. The first phase is run using splatPopSimulateMeans().

Input data

This function requires two pieces of input data: genotypes and genes. Mock genotype and gene data can be provided using mockVCF() and mockGFF(), respectively. These mock functions generate random SNP and gene annotation data for chromosome 22. To simulate populations with realistic population structure, the user should provide real (or simulated) genotypes as a VCF file read in as a VariantAnnotation object.

splatPop takes in information about what genes to simulate in three ways:

  1. GFF/GTF (-gff data.frame): Provide a GFF/GTF file as a data.frame object. splatPop will filter out all non-gene features (3rd column != gene). This method uses real gene names and locations, but will randomly assign expression values and eQTL effects to these genes.
  2. Key (-key data.frame): Provide a data.frame object including information about genes you want to simulate. This object must include the gene's name (geneID), chromosome (chromosome), and location (geneMiddle). With just those columns, splatPop will function the same as if a GFF was provided. However, you can also use this object to specify other information. For example, if you provide a desired mean (meanSampled) and variance (cvSampled) for each gene, splatPop will use these instead of randomly sampled values. Finally, if you provide the type (eQTL.type, e.g. NA or global), SNP identifier (eSNP.ID), and effect size (eQTL.EffectSize), splatPop will simulate gene means with these eQTL associations instead of generating eQTL associations randomly.
  3. Randomly (-gff NULL -key NULL): This option will call mockGFF() to generate a random GFF file for a specified chromosome. This is the default option if neither gff or key is provided.

Control parameters

In addition to the parameters estimated from real data, the SplatPopParams object also includes control parameters that must be set by the user. The following SplatPopParams control parameters can be changed using setParams():

In addition to the group specific eQTL effects, each group will have group specific differential expression effects, which are not associated with a genetic variant). These parameters are estimated from real single-cell data as described in splatter.

Output

The output of splatPopSimulateMeans() is a list containing:

Note that when splatPopSimulate() is run, these to objects are contained in the output SingleCellExperiment object (details below). Let's look at a snapshot of some simulated means and the corresponding key...

vcf <- mockVCF(n.samples = 6)
gff <- mockGFF(n.genes = 100)

sim.means <- splatPopSimulateMeans(vcf = vcf, gff = gff,
                                   params = newSplatPopParams())

round(sim.means$means[1:5, 1:6], digits = 2)

print(sim.means$key[1:5, ], digits = 2)

Other examples

Replicate a simulation by providing a gene key

As described above, information about genes can also be provided in a data.frame using the key argument. If you provide splatPopSimulateMeans() with the key output from a previous run, it will generate a new population with the same properties, essentially creating a replicate. Here is a snapshot of such a replicate using the key simulated above:

sim.means.rep2 <- splatPopSimulateMeans(vcf = vcf, key=sim.means$key,
                                        params = newSplatPopParams())

round(sim.means.rep2$means[1:5, 1:6], digits = 2)

Use real population-scale bulk expression data

An important step of splatPopSimulate() is the quantile normalization of simulated gene means for each sample to match a gamma distribution estimated from real single-cell RNA-seq data using splatEstimate() or splatPopEstimate(). This step ensures that even if bulk sequencing data are used to estimate population parameters, the means output from splatPopSimulateMeans() will be distributed like a single-cell dataset.

If you already have bulk expression data for a population, you can use this quantile normalization function directly on that data and use the output as input to splatPopSimulateSC(). Note that this will not simulate eQTL or group effects, just simulate single-cell counts using the bulk means provided.

bulk.qnorm <- splatPopQuantNorm(newSplatPopParams(), bulk.means)
round(bulk.qnorm[1:5, 1:5], 3)

Step 3: Simulate single cell counts

Finally, single cell level data is simulated using splatPopSimulateSC(). Running this function on its own requires the SplatPopParams object, and the two outputs from splatPopSimulateMeans(): the key and the simulated means matrix (or list of matrices if nGroups > 1). The user can also provide additional parameters for the single-cell simulation, for example how many cells to simulate.

Looking at the output of splatPopSimulateSC() we see that it is a single SingleCellExperiment object with a row for each feature (gene) and a column for each cell. The simulated counts are accessed using counts. although it can also hold other expression measures such as FPKM or TPM. Information about each cell (e.g. sample, group, batch) is held in the colData and information about each gene (e.g. location, eQTL effects, and other data from the splatPop key) is held in the rowData.

sim.sc <- splatPopSimulateSC(params=params, 
                             key = sim.means$key,
                             sim.means=sim.means$means, 
                             batchCells=50)
sim.sc

We can visualize these simulations using plotting functions from scater like plotPCA...

sim.sc <- logNormCounts(sim.sc)
sim.sc <- runPCA(sim.sc, ncomponents = 10)
plotPCA(sim.sc, colour_by = "Sample")

splatPop with group, batch, and path effects

Using the same methods as splat, splatPop allows you to simulate single-cell counts for a population with group (e.g. cell-types), batch, and path (e.g. developmental series) effects. Group effects are simulated by splatPopSimulateMeans() and applied to the single cell simulations in splatPopSimulateSC(). Path and batch effects are simulated by splatPopSimulateSC().

Simulating population scale single-cell data with group effects

The population simulated above is an example of a dataset with a single cell type across many samples. However, splatPop also allows you to simulate population-scale data for a mixture of cell-types (i.e. groups).

Two types of group effects are included: group-eQTL and group-differential expression (DE) effects. The number of groups to simulate is set using the group.prob parameter in SplatPopParams. The DE effects are implemented as in the splat simulation, with the user able to control splatPopParam parameters including de.prob, de.downProb, de.facLoc, and de.facScale. For group-specific eQTL, the proportion of eQTL to designate as group-specific eQTL is set using eqtl.group.specific.

When used to simulate single-cell data with group-specific effects, splatSimulatePop also outputs:

params.group <- newSplatPopParams(nGenes = 50,
                                  batchCells = 40,
                                  group.prob = c(0.5, 0.5))

sim.sc.gr2 <- splatPopSimulate(vcf = vcf, params = params.group)

sim.sc.gr2 <- logNormCounts(sim.sc.gr2)
sim.sc.gr2 <- runPCA(sim.sc.gr2, ncomponents = 10)
plotPCA(sim.sc.gr2, colour_by = "Group", shape_by = "Sample")

From the PCA plot above you can see that in this simulation the sample effect outweighs the group effect. But we can tune these parameters to change the relative weight of these effects. First we can decrease the sample effect by increasing the similarity.scale parameter. And second we can increase the group effect by adjusting the eqtl.group.specific and de parameters:

params.group <- newSplatPopParams(batchCells = 40,
                                  nGenes = 50,
                                  similarity.scale = 6,
                                  eqtl.group.specific = 0.6,
                                  de.prob = 0.5,
                                  de.facLoc = 0.5, 
                                  de.facScale = 0.4,
                                  group.prob = c(0.5, 0.5))

sim.sc.gr2 <- splatPopSimulate(vcf = vcf, params = params.group)

sim.sc.gr2 <- logNormCounts(sim.sc.gr2)
sim.sc.gr2 <- runPCA(sim.sc.gr2, ncomponents = 10)
plotPCA(sim.sc.gr2, colour_by = "Group", shape_by = "Sample")

Simulate SC data for population with path and batch effects

Like splat, splatPop also allows you to simulate single-cell data with path or batch effects using the method tag in splatSimulatePop. Note that you can also set method = group, but this is done automatically by setting the group.prob parameter. For more information about these settings, see the Splat parameters vignette.

Batch effects

params.batches <- newSplatPopParams(batchCells = c(20, 20),
                                    nGenes = 50,
                                    similarity.scale = 5,
                                    batch.facLoc = 0.3,
                                    batch.facScale = 0.3)

sim.pop.batches <- splatPopSimulate(vcf = vcf, params = params.batches)
sim.pop.batches <- logNormCounts(sim.pop.batches)
sim.pop.batches <- runPCA(sim.pop.batches, ncomponents = 10)
plotPCA(sim.pop.batches, colour_by = "Batch", shape_by = "Sample",
        ncomponents = 5:6)

Path effects

params.paths <- newSplatPopParams(batchCells = 40,
                                  nGenes = 50,
                                  similarity.scale = 6,
                                  de.facLoc = 0.5,
                                  de.facScale = 0.5,
                                  de.prob = 0.5)

sim.pop.paths <- splatPopSimulate(vcf = vcf, params = params.paths,
                                  method = "paths")
sim.pop.paths <- logNormCounts(sim.pop.paths)
sim.pop.paths <- runPCA(sim.pop.paths, ncomponents = 10)
plotPCA(sim.pop.paths, colour_by = "Step", shape_by = "Sample", 
        ncomponents = 5:6)


Try the splatter package in your browser

Any scripts or data that you put into this service are public.

splatter documentation built on Dec. 3, 2020, 2:01 a.m.