simMultSamples: Simulate paired end reads for multiple future samples based...
In casper: Characterization of Alternative Splicing based on Paired-End Reads

Description Usage Arguments Details Value Author(s) References Examples

Simulate true expression levels and observed data (casper expression estimates) for future samples within each group.

These simulations serve as the basis for sample size calculation: if one were to sequence nsamples new RNA-seq samples, what data would we expect to see? The simulation is posterior predictive, i.e. based on the current available data x.

1 2	simMultSamples(nsim, nsamples, nreads, readLength, fragLength, x, groups='group', distrs, genomeDB, model='LNNMV', verbose=TRUE, mc.cores=1)

`nsim`	Number of simulations to obtain
`nsamples`	Vector indicating number of future samples per group, e.g. `nsamples=c(5,5)` to simulate 5 new samples for 2 groups.
`nreads`	Desired number of paired-end reads per sample. The actual number of aligned reads for any given sample differs from this amount, see details.
`readLength`	Read length, i.e. in an experiment with paired reads at 100bp each, `readLength=100`.
`fragLength`	Desired average insert size (size of RNA molecules after fragmentation). If missing, insert sizes are as obtained from `distrs`
`x`	`ExpressionSet` containing pilot data. `x[[group]]` indicates groups to be compared
`groups`	Name of column in `pData(x)` indicating the groups
`distrs`	Fragment start and length distributions. It can be either an object or a list of objects of class readDistrs. In the latter case, an element is chosen at random for each individual sample to consider uncertainty in these distributions. If not specified, it defaults to data(distrsGSE37704).
`genomeDB`	annotatedGenome object
`model`	Set to `'LNNMV'` to simulate from log-normal normal with modified variance model (Yuan and Kendsiorski, 2006), or to `'GaGa'` to simulate from the GaGa model (Rossell, 2009). See details.
`verbose`	Set to `TRUE` to print progress
`mc.cores`	Number of cores to use in function. `mc.cores>1` requires package `parallel`

The posterior predictive simulations is based on four steps: (1) simulate true expression for each group (mean and SD), (2) simulate true expression for future samples, (3) simulate paired reads for each future sample, (4) estimate expression from the reads via Casper. Below are some more details.

1. Simulate true mean expression in each group and residual variance for each gene. If model=='LNNMV' this is based on the log-normal normal with modified variance model in package EBarrays (Yuan & Kendziorski 2006), if model=='GaGa' this is based on the GaGa model (Rossell, 2009). adapted to take into account that the expression estimates in the pilot data x are noisy (which is why simMultSamples requires the SE / posterior SD associated to exprs(x)). The simulated values are returned in component "simTruth" of the simMultSamples output.

2. Simulate true isoform expression for each of the future samples. These are independent Normal draws with mean and variance generated in step 1. True gene expression is derived from the isoform expressions.

3. Determine the number of reads to be simulated for each gene based on its true expression (generated in step 2) and a Multinomial sampling model. For each sample:

- The number of reads yielded by the experiment is Unif(.8*nreads,1.2*nreads) - A proportion of non-mappable reads is discarded using the power law in Li et al (2014) - Amongst remaining reads, we assume that a proportion Unif(0.6,0.9) were aligned (consistenly with reports from ENCODE project)

The final number of simulated reads is reported in component "simExpr" of the simMultSamples output.

4. Obtain expression estimates from the path counts produced in step 3 via calcExp. These are reported in component "simExpr" of the simMultSamples output.

Object of class simulatedSamples, which extends a list of length nsim. See the class documentation for some helpful methods (e.g. coef, exprs, mergeBatches). Each element is itself a list containing an individual simulation.

`simTruth`	`data.frame` indicating the mean and standard deviation of the Normal distribution used to generate data from each group
`simExpr`	`ExpressionSet` with Casper expression estimates, as returned by `calcExp`. `pData(simExpr)` indicates group information, and `fData(simExpr)` the number of simulated reads for each sample (in columns `'explCnts'`) and across all samples (in column `'readCount'`)

Victor Pena, David Rossell

Rossell D. (2009) GaGa: a Parsimonious and Flexible Model for Differential Expression Analysis. Annals of Applied Statistics, 3, 1035-1051.

Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2015)

Yuan, M. and Kendziorski, C. (2006). A unified approach for simultaneous gene clustering and differential expression identification. Biometrics, 62, 1089-1098.