generateSyntheticData | R Documentation |
Generate synthetic count data sets, following the simulation strategy detailed in Soneson and Delorenzi (2013).
generateSyntheticData(
dataset,
n.vars,
samples.per.cond,
n.diffexp,
repl.id = 1,
seqdepth = 1e+07,
minfact = 0.7,
maxfact = 1.4,
relmeans = "auto",
dispersions = "auto",
fraction.upregulated = 1,
between.group.diffdisp = FALSE,
filter.threshold.total = 1,
filter.threshold.mediancpm = 0,
fraction.non.overdispersed = 0,
random.outlier.high.prob = 0,
random.outlier.low.prob = 0,
single.outlier.high.prob = 0,
single.outlier.low.prob = 0,
effect.size = 1.5,
output.file = NULL,
tree = NULL,
prop.var.tree = 1,
model.process = c("BM", "OU"),
selection.strength = 0,
id.condition = NULL,
id.species = as.factor(rep(1, 2 * samples.per.cond)),
check.id.species = TRUE,
lengths.relmeans = NULL,
lengths.dispersions = NULL,
lengths.phylo = TRUE
)
dataset |
A name or identifier for the data set/simulation settings. |
n.vars |
The initial number of genes in the simulated data set. Based on the filtering conditions ( |
samples.per.cond |
The number of samples in each of the two conditions. |
n.diffexp |
The number of genes simulated to be differentially expressed between the two conditions. |
repl.id |
A replicate ID for the specific simulation instance. Useful for example when generating multiple count matrices with the same simulation settings. |
seqdepth |
The base sequencing depth (total number of mapped reads). This number is multiplied by a value drawn uniformly between |
minfact , maxfact |
The minimum and maximum for the uniform distribution used to generate factors that are multiplied with |
relmeans |
A vector of mean values to use in the simulation of data from the Negative Binomial distribution, or |
dispersions |
A vector or matrix of dispersions to use in the simulation of data from the Negative Binomial distribution, or |
fraction.upregulated |
The fraction of the differentially expressed genes that is upregulated in condition 2 compared to condition 1. |
between.group.diffdisp |
Whether or not the dispersion should be allowed to be different between the conditions. Only applicable if |
filter.threshold.total |
The filter threshold on the total count for a gene across all samples. All genes for which the total count across all samples is less than the threshold will be filtered out. |
filter.threshold.mediancpm |
The filter threshold on the median count per million (cpm) for a gene across all samples. All genes for which the median cpm across all samples is less than the threshold will be filtered out. |
fraction.non.overdispersed |
The fraction of the genes that should be simulated according to a Poisson distribution, without overdispersion. The non-overdispersed genes will be divided proportionally between the upregulated, downregulated and non-differentially expressed genes. |
random.outlier.high.prob |
The fraction of 'random' outliers with unusually high counts. |
random.outlier.low.prob |
The fraction of 'random' outliers with unusually low counts. |
single.outlier.high.prob |
The fraction of 'single' outliers with unusually high counts. |
single.outlier.low.prob |
The fraction of 'single' outliers with unusually low counts. |
effect.size |
The strength of the differential expression, i.e., the effect size, between the two conditions. If this is a single number, the effect sizes will be obtained by simulating numbers from an exponential distribution (with rate 1) and adding the results to the |
output.file |
If not |
tree |
a dated phylogenetic tree of class |
prop.var.tree |
the proportion of the common variance explained by the tree for each gene. It can be a scalar, in which case the same parameter is used for all genes. Otherwise it needs to be a vector with length |
model.process |
the process to be used for phylogenetic simulations. One of "BM" or "OU"; defaults to "BM". |
selection.strength |
if the process is "OU", the selection strength parameter. |
id.condition |
A named vector indicating which species is in each condition. Defaults to the first 'samples.per.cond' species in condition '1' and the others in condition '2'. |
id.species |
A factor giving the species for each sample. If a tree is used, should be a named vector with names matching the taxa of the tree. Default to |
check.id.species |
Should the species vector be checked against the tree lengths (if provided)? If TRUE, the function checks that all the samples that share a factor value in
lengths.relmeans |
An optional vector of mean values to use in the simulation of lengths from the Negative Binomial distribution. Should be of length n.vars. Default to |
lengths.dispersions |
An optional vector of dispersions to use in the simulation of data from the Negative Binomial distribution. Should be of length n.vars. Default to |
lengths.phylo |
If TRUE, the lengths are simulated according to a phylogenetic Poisson Log-Normal model on the tree, with a BM process. If FALSE, they are simulated according to an iid negative binomial distribution. In both cases, |
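The two filter arguments described above (filter.threshold.total and filter.threshold.mediancpm) can be illustrated with a minimal base R sketch. The count matrix is made up, and computing cpm from the column sums of the unfiltered matrix is an assumption made for illustration; this is not the package's internal code.

```r
# Hypothetical count matrix: 4 genes (rows) x 4 samples (columns).
counts <- matrix(c(  0,   0,  0,   0,
                     5,   3,  8,   4,
                     0,   1,  0,   0,
                   100, 120, 90, 110),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("g", 1:4), paste0("s", 1:4)))

# Default values from the usage section.
filter.threshold.total <- 1
filter.threshold.mediancpm <- 0

# Counts per million: scale each sample (column) by its library size.
# Assumption: library sizes are the column sums of the unfiltered matrix.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Genes below either threshold are filtered out, so keep those that
# reach both thresholds.
keep <- rowSums(counts) >= filter.threshold.total &
  apply(cpm, 1, median) >= filter.threshold.mediancpm
filtered <- counts[keep, , drop = FALSE]
```

With the default thresholds, only genes with a total count of zero are removed, which is why the number of genes in the final data set can be smaller than n.vars.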
In the comparison function, only results obtained for data sets with the same value of the dataset parameter will be compared. Hence, it is important to give the same value of this parameter to, for example, different replicates generated with the same simulation settings.
For more detailed information regarding the different types of outliers, see Soneson and Delorenzi (2013).
Mean and dispersion parameters (if relmeans and/or dispersions is set to "auto") are sampled from values estimated from the data sets by Pickrell et al (2010) and Cheung et al (2010). The data sets were downloaded from the ReCount web page (Frazee et al (2011)) and processed as detailed by Soneson and Delorenzi (2013).
To get the actual mean value for the Negative Binomial distribution used for the simulation of counts for a given sample, take the column truemeans.S1 (or truemeans.S2, if the sample is in condition S2) of the variable.annotations slot, divide by the sum of the same column and multiply with the base sequencing depth (provided in the info.parameters list) and the depth factor for the sample (given in the sample.annotations data frame). Thus, if you want to provide a vector of mean values as the relmeans argument and have it used 'as-is' in the simulation (for condition S1), set the seqdepth argument to the sum of the values in the relmeans vector, and set minfact and maxfact equal to 1.
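The arithmetic described above can be sketched in base R. All values below are hypothetical stand-ins (they are not read from a real compData object), and the variable names nb.mean, depth.factor and as.is.mean are introduced here for illustration only.

```r
# Hypothetical relative means for three genes in condition S1
# (stand-in for the truemeans.S1 column of variable.annotations).
truemeans.S1 <- c(10, 30, 60)

# Hypothetical base sequencing depth and per-sample depth factor.
seqdepth <- 1e6
depth.factor <- 1.2

# Actual Negative Binomial mean for each gene in this sample:
# normalise the column to sum to 1, then scale by depth and depth factor.
nb.mean <- truemeans.S1 / sum(truemeans.S1) * seqdepth * depth.factor

# 'As-is' case: a sequencing depth equal to sum(relmeans) and a depth
# factor of 1 (minfact = maxfact = 1) return the supplied values unchanged.
as.is.mean <- truemeans.S1 / sum(truemeans.S1) * sum(truemeans.S1) * 1
```

The second computation shows why setting seqdepth to the sum of relmeans cancels the normalisation step.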
When the tree argument is provided (not NULL), the "phylogenetic Poisson log-Normal" model is used for the simulations, possibly with varying gene lengths across species (both lengths.relmeans and lengths.dispersions must be specified or set to "auto"). Phylogenetic simulations use the rTrait function from package phylolm.
A compData object. If output.file is not NULL, the object is saved in the given output.file (which should have an .rds extension).
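A saved result can presumably be reloaded in a later session. The sketch below assumes the file is a standard R serialization readable with base R's readRDS(), which the .rds extension suggests but the text above does not state. A plain list stands in for the compData object so that the example runs on its own.

```r
# Stand-in for the compData object returned by generateSyntheticData()
# (a plain list is used here only to keep the sketch self-contained).
sim.result <- list(dataset = "mydata", repl.id = 1)

# Hypothetical output path with the recommended .rds extension.
output.file <- tempfile(fileext = ".rds")

# Assumption: the object is written in the format produced by saveRDS().
saveRDS(sim.result, output.file)

# Reload the saved object.
reloaded <- readRDS(output.file)
```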
Charlotte Soneson
Soneson C and Delorenzi M (2013): A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:91
Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M and Spielman RS (2010): Polymorphic cis- and trans-regulation of human gene expression. PLoS Biology 8(9):e1000480
Frazee AC, Langmead B and Leek JT (2011): ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12:449
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y and Pritchard JK (2010): Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-772
Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ and Taylor JM (2012): Efficient experimental design and analysis strategies for the detection of differential expression using RNA-sequencing. BMC Genomics 13:484
Stern DB and Crandall KA (2018): The evolution of gene expression underlying vision loss in cave animals. Molecular Biology and Evolution 35:2005-2014
## RNA-Seq data
mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000,
                                    samples.per.cond = 5, n.diffexp = 100)
## Inter-species RNA-Seq data
library(ape)
tree <- read.tree(text = "(((A1:0,A2:0,A3:0):1,B1:1):1,((C1:0,C2:0):1.5,(D1:0,D2:0):1.5):0.5);")
id.species <- factor(c("A", "A", "A", "B", "C", "C", "D", "D"))
names(id.species) <- tree$tip.label
mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000,
                                    samples.per.cond = 4, n.diffexp = 100,
                                    tree = tree,
                                    id.species = id.species,
                                    lengths.relmeans = "auto",
                                    lengths.dispersions = "auto")