simulateDE: Simulate Differential Expression Pipeline
In bvieth/powsimR: Power Simulations for RNA-sequencing

simulateDE

R Documentation

Simulate Differential Expression Pipeline

Description

simulateDE is the main function to simulate differential expression for RNA-seq experiments. The simulation parameters are specified with Setup. The user now needs to specify the RNA-seq Analysis Pipeline including preprocessing, normalisation and differential testing method. The return object contains DE test results from all simulations as well as descriptive statistics. The error matrix calculations will be conducted with evaluateDE.

Usage

simulateDE(SetupRes,
Prefilter = NULL,
Imputation = NULL,
Normalisation = c("TMM", "MR", "PosCounts", "UQ",
"scran", "Linnorm", "sctransform",
"SCnorm", "Census", "depth"),
Label = "none",
DEmethod = c("T-Test", "edgeR-LRT", "edgeR-QL",
"edgeR-zingeR", "edgeR-ZINB-WaVE",
"limma-voom", "limma-trend",
"DESeq2", "DESeq2-zingeR", "DESeq2-ZINB-WaVE",
"ROTS", "baySeq", "NOISeq", "EBSeq",
"MAST", "BPSC", "scDD", "DECENT"),
DEFilter = FALSE,
Counts = FALSE,
NCores = NULL,
verbose = TRUE)

Arguments

`SetupRes`	This object specifies the simulation setup. This must be the return object from `Setup`.
`Prefilter`	A character vector specifying the gene expression filtering method to be used prior to normalisation (and possibly imputation). Default is `NULL`, i.e. no filtering. Please consult the Details section for available options.
`Imputation`	A character vector specifying the gene expression imputation method to be used prior to normalisation. Default is `NULL`, i.e. no imputation. Please consult the Details section for available options.
`Normalisation`	Normalisation method to use. Please consult the Details section for available options.
`Label`	A character vector to define whether information about group labels should be used for normalisation. This is only implemented for scran and SCnorm. Possible options include the default `"none"` which means that no sample group information is considered for normalisation; `"known"` means that the simulated group labels are used and `"clustering"` which applies an unsupervised hierarchical clustering to determine the group labels (see `quickCluster` for details).
`DEmethod`	A character vector specifying the DE detection method to be used. Please consult the Details section for available options.
`DEFilter`	A logical vector indicating whether to run DE testing on filtered and/or imputed count data. Default is `FALSE`.
`Counts`	A logical vector indicating whether the simulated count matrix is also provided as output. Default is `FALSE` since the output can be quite large. Note that if DEFilter is `TRUE`, then the returned count matrix will countain the filtered and/or imputed count data.
`NCores`	integer positive number of cores for parallel processing. Default is `NULL`, i.e. 1 core.
`verbose`	Logical vector to indicate whether to show progress report of simulations. Default is `TRUE`.

Details

Here you can find detailed information about preprocessing, imputation, normalisation and differential testing choices. For recommendations concerning single cell RNA-sequencing pipelines, we kindly refer the user to Vieth, et al (2019). A systematic evaluation of single cell RNA-seq analysis pipelines. Nature Communications, 10(1), 4667.
Prefiltering

CountFilter: removes genes that have a mean expression below 0.2.
FreqFilter: removes genes that have more than 80 % dropouts.

Imputation

scImpute: apply the imputation as implemented in scimpute
DrImpute: apply the imputation as implemented in DrImpute.
SAVER: apply the imputation as implemented in saver.
scone: apply the imputation as implemented in scone, defining 'house keeping genes' for the FNR model estimation as those that have less than 20 % dropouts and small variance (i.e. in the lower 20th quartile). If less than 25 genes could be identified, the genes with less than 5 % dropouts are used. If spike-in data is provided then these are used for the FNR model estimation.
MAGIC: apply the imputation as implemented in magic. Please note that the python package MAGIC needs to be installed to use this implementation.

Normalisation

TMM, UQ: employ the edgeR style normalization of weighted trimmed mean of M-values and upperquartile as implemented in calcNormFactors, respectively.
MR, PosCounts: employ the DESeq2 style normalization of median ratio method and a modified geometric mean method as implemented in estimateSizeFactors, respectively.
scran, SCnorm: apply the deconvolution and quantile regression normalization methods developed for sparse RNA-seq data as implemented in computeSumFactors and SCnorm, respectively. Spike-ins can also be supplied for both methods via spikeData. Note, however that this means for scran that the normalisation as implemented in calculateSumFactors is also applied to genes (general.use=TRUE).
Linnorm: apply the normalization method for sparse RNA-seq data as implemented in Linnorm.Norm. For Linnorm, the user can also supply spikeData.
sctransform: apply the normalization method developed for single-cell UMI RNA-seq data as implemented in vst.
Census: converts relative measures of TPM/FPKM values into mRNAs per cell (RPC) without the need of spike-in standards. Census at least needs Lengths for single-end data and preferably MeanFragLengths for paired-end data. The authors state that Census should not be used for UMI data.
depth: Sequencing depth normalisation.

Differential testing

T-Test: A T-Test per gene is applied using log transformed and normalized expression values (i.e. CPM or TPM).
limma-trend, limma-voom: apply differential testing as implemented in lmFit followed by eBayes on counts transformed by voom or by applying mean-variance trend on log CPM values in eBayes.
edgeR-LRT, edgeR-QL: apply differential testing as implemented in glmFit, glmLRT andglmQLFit, glmQLFTest, respectively.
DESeq2: applies differential testing as implemented in DESeq.
ROTS: applies differential testing as implemented in ROTS with 100 permutations on transformed counts (voom).
baySeq: applies differential testing as implemented in getLikelihoods based on negative binomial prior estimates (getPriors.NB).
NOISeq: applies differential testing as implemented in noiseqbio based on CPM values.
EBSeq: applies differential testing as implemented in EBTest.
MAST: applies differential testing as implemented in zlm for zero-inflated model fitting followed by lrTest on log CPM values.
BPSC: applies differential testing as implemented in BPglm on CPM values.
scDD: applies differential testing as implemented in scDD on CPM values.
DECENT: applies differential testing as implemented in decent.
edgeR-zingeR, DESeq2-zingeR: In a first step, the posterior probabilities of the zero-inflated negative binomial component are estimated (see zeroWeightsLS) and used to define a weight matrix for dispersion estimation in estimateDisp. For the edgeR approach, the generalized model as implemented in glmFit is fitted. This is followed by an adapted LRT for differential testing to account for the weighting (see glmWeightedF). For DESeq2, the generalized linear model coefficients are estimated using nbinomWaldTest and the weighting is done by setting the degrees of freedom for the T distribution.
edgeR-ZINB-WaVE, DESeq2-ZINB-WaVE: In a first step, a zero-inflated negative binomial regression model is fitted (see zinbFit) to estimate observational weights (see computeObservationalWeights) used for dispersion estimation in estimateDisp. For the edgeR approach, the generalized model as implemented in glmFit is fitted. This is followed by an adapted LRT for differential testing to account for the weighting (see glmWeightedF). For DESeq2, the generalized linear model coefficients are estimated using nbinomWaldTest and the weighting is done by setting the degrees of freedom for the T distribution.

Value

SimulateRes: Results of DE simulations

pvalue, fdr: 3D array (ngenes * N * nsims) for p-values and FDR from each simulation. Note that FDR values will be empty and the calculation will be done by evaluateDE whenever applicable.
mu,disp,dropout: 3D (ngenes * N * nsims) array for mean, dispersion and dropout of library size factor normalized read counts.
true.mu,true.disp,true.dropout: 3D (ngenes * N * nsims) array for true mean, dispersion and dropout of simulated read counts.
true.depth,est.depth: True simulated and processed (after prefiltering and/or imputation if defined) sequencing depth per sample.
est.sf: Global library size factor estimates per sample.
est.gsf: 3D array (ngenes * N * nsims) for size factor estimates. These are gene- and sample-wise estimates and only for SCnorm and Linnorm normalisation.
elfc,rlfc: 3D array (ngenes * N * nsims) for log fold changes (LFC): elfc is for the DE tool estimated LFC; rlfc is for the LFC estimated from the normalised read counts.
time.taken: The time taken given by proc.time for each simulation, given for preprocessing, normalisation, differential expression testing and moment estimation.

SetupRes: Simulation specifications

DESetup - ... - estSpikeRes: Reiterating the simulated setup defined by Setup.
Pipeline: A list of chosen pipeline tools defined by above arguments.

Counts: Simulated Count Matrices

Counts: List of lists object where Counts[[Sample Size Setup]][[Simulation Run]] containing simulated counts. Note that this will only be returned when Counts is TRUE. In addition, if DEFilter is TRUE then the filtered/imputed counts are returned.

Author(s)

Beate Vieth

Examples

## Not run: 
# estimate gene parameters
data("SmartSeq2_Gene_Read_Counts")
Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_Gene_Read_Counts),
                                    "_"), "[[", 1),
                     stringsAsFactors = F,
                     row.names = colnames(SmartSeq2_Gene_Read_Counts))
data("GeneLengths_mm10")
estparam_gene <- estimateParam(countData = SmartSeq2_Gene_Read_Counts,
                               readData = NULL,
                              batchData = Batches,
                               spikeData = NULL, spikeInfo = NULL,
                               Lengths = GeneLengths_mm10, MeanFragLengths = NULL,
                               RNAseq = 'singlecell', Protocol = 'Read',
                               Distribution = 'ZINB', Normalisation = "scran",
                               GeneFilter = 0.1, SampleFilter = 3,
                               sigma = 1.96, NCores = NULL, verbose = TRUE)
# estimate spike parameters
data("SmartSeq2_SpikeIns_Read_Counts")
data("SmartSeq2_SpikeInfo")
Batches = data.frame(Batch = sapply(strsplit(colnames(SmartSeq2_SpikeIns_Read_Counts),
                                    "_"), "[[", 1),
                     stringsAsFactors = F,
                     row.names = colnames(SmartSeq2_SpikeIns_Read_Counts))
estparam_spike <- estimateSpike(spikeData = SmartSeq2_SpikeIns_Read_Counts,
                                spikeInfo = SmartSeq2_SpikeInfo,
                                MeanFragLength = NULL,
                                batchData = Batches,
                                Normalisation = 'depth')
# define log fold change
p.lfc <- function(x) sample(c(-1,1), size=x,replace=T)*rgamma(x, shape = 1, rate = 2)
# set up simulations
setupres <- Setup(ngenes = 10000, nsims = 10,
                  p.DE = 0.1, pLFC = p.lfc,
                  n1 = c(20,50,100), n2 = c(30,60,120),
                  Thinning = c(1,0.9,0.8), LibSize = 'given',
                  estParamRes = estparam_gene,
                  estSpikeRes = estparam_spike,
                  DropGenes = FALSE,
                  sim.seed = 52679, verbose = TRUE)
# run simulation
simres <- simulateDE(SetupRes = setupres,
                     Prefilter = "FreqFilter", Imputation = NULL,
                     Normalisation = 'scran', Label = 'none',
                     DEmethod = "limma-trend", DEFilter = FALSE,
                     NCores = NULL, verbose = TRUE)
# quick evaluation
evalderes <- evaluateDE(simRes = simres)
# plot evaluation
plotEvalDE(evalRes = evalderes, rate = "marginal",
           quick = TRUE, Annot = FALSE)

## End(Not run)

bvieth/powsimR documentation built on Aug. 19, 2023, 7:48 p.m.