simulateSet: simulateSet

simulateSet    R Documentation

simulateSet

Description

Simulates a complete dataset in which the number of genes of each differential distribution type (DE, DP, DM, DB) and each equivalent distribution type (EE, EP) is specified.

Usage

simulateSet(SCdat, numSamples = 100, nDE = 250, nDP = 250, nDM = 250,
  nDB = 250, nEE = 5000, nEP = 4000, sd.range = c(1, 3), modeFC = c(2,
  3, 4), plots = TRUE, plot.file = NULL, random.seed = 284,
  varInflation = NULL, condition = "condition", param = bpparam())

Arguments

SCdat

An object of class SingleCellExperiment that contains normalized single-cell expression and metadata. The assays slot contains a named list of matrices, where the normalized counts are housed in the one named normcounts. This matrix should have one row for each gene and one column for each sample. The colData slot should contain a data.frame with one row per sample and columns that contain metadata for each sample. This data.frame should contain a variable that represents biological condition, in the form of numeric values (either 1 or 2) that indicate which condition each sample belongs to (in the same order as the columns of normcounts). Optional additional metadata about each cell can also be contained in this data.frame, and additional information about the experiment can be contained in the metadata slot as a list.
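As a rough illustration of the layout described above, the following sketch builds a small SingleCellExperiment by hand. The gene and cell names, the Poisson-generated values, and the object names (norm_mat, cell_info, sce) are made up for illustration only; real input should hold normalized expression values.

```r
library(SingleCellExperiment)

set.seed(1)
n_genes <- 20
n_cells <- 10

# Placeholder expression matrix: one row per gene, one column per sample.
# (Illustrative values only; real data should be normalized expression.)
norm_mat <- matrix(
  rpois(n_genes * n_cells, lambda = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("cell", seq_len(n_cells)))
)

# One row of metadata per sample, in the same order as the matrix columns.
# The condition variable is coded numerically as 1 or 2.
cell_info <- data.frame(
  condition = rep(c(1, 2), each = n_cells / 2),
  row.names = colnames(norm_mat)
)

# The matrix must be stored under the assay name "normcounts".
sce <- SingleCellExperiment(
  assays  = list(normcounts = norm_mat),
  colData = cell_info
)

dim(sce)
table(colData(sce)$condition)
```

An object built this way could then be passed as SCdat, with the condition argument left at its default of "condition".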

numSamples

Numeric value for the number of samples to simulate in each condition.

nDE

Number of DE genes to simulate

nDP

Number of DP genes to simulate

nDM

Number of DM genes to simulate

nDB

Number of DB genes to simulate

nEE

Number of EE genes to simulate

nEP

Number of EP genes to simulate

sd.range

Numeric vector of length two that gives the interval (lower, upper) from which the standard deviations of fold changes are randomly selected.

modeFC

Vector of values to use for fold changes between modes for DP, DM, and DB.

plots

Logical indicating whether or not to generate fold change and validation plots

plot.file

Character string naming the file to use if the plots are to be sent to a PDF instead of to the standard output.

random.seed

Numeric value for a call to set.seed for reproducibility.

varInflation

Optional numeric vector with one element for each condition that corresponds to the multiplicative variance inflation factor to use when simulating data. Useful for sensitivity studies to assess the impact of confounding effects on differential variance across conditions. Currently assumes all samples within a condition are subject to the same variance inflation factor.

condition

A character object that contains the name of the column in colData that represents the biological group or condition of interest (e.g. treatment versus control). Note that this variable should only contain two possible values since scDD can currently only handle two-group comparisons. The default option assumes that there is a column named "condition" that contains this variable.

param

A MulticoreParam or SnowParam object of the BiocParallel package that defines a parallel backend. The default option is BiocParallel::bpparam(), which automatically creates a cluster appropriate for the operating system. Alternatively, the user can specify the number of cores they wish to use by first creating the corresponding MulticoreParam (for Linux-like OS) or SnowParam (for Windows) object, and then passing it to simulateSet. For example, to specify a parallel backend with 12 cores on a Linux-like OS, set param=BiocParallel::MulticoreParam(workers=12).
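The backend choice described above might be sketched as follows. The worker count of 2 and the variable names n_workers and param are arbitrary choices for illustration; the remaining arguments simply mirror the values used in the Examples section.

```r
library(scDD)
library(BiocParallel)

# Illustrative worker count; adjust to the machine at hand.
n_workers <- 2

# MulticoreParam relies on forking, which is not available on Windows,
# so fall back to SnowParam there.
param <- if (.Platform$OS.type == "windows") {
  SnowParam(workers = n_workers)
} else {
  MulticoreParam(workers = n_workers)
}

data(scDatEx)

# Pass the backend explicitly instead of relying on bpparam().
SD <- simulateSet(scDatEx, numSamples = 100, nDE = 5, nDP = 5, nDM = 5,
                  nDB = 5, nEE = 5, nEP = 5, sd.range = c(2, 2), modeFC = 4,
                  plots = FALSE, random.seed = 816, param = param)
```

Leaving param at its default is usually sufficient; an explicit object is mainly useful when the number of workers needs to be capped.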

Value

An object of class SingleCellExperiment that contains simulated single-cell expression and metadata. The assays slot contains a named list of matrices, where the simulated counts are housed in the one named normcounts. This matrix has one row for each gene (nDE + nDP + nDM + nDB + nEE + nEP rows) and one column for each sample (numSamples samples in each condition). The colData slot contains a data.frame with one row per sample and a column that represents biological condition, in the form of numeric values (either 1 or 2) that indicate which condition each sample belongs to (in the same order as the columns of normcounts). The rowData slot contains information about the category of the gene (EE, EP, DE, DM, DP, or DB), as well as the simulated fold change value.

References

Korthauer KD, Chu LF, Newton MA, Li Y, Thomson J, Stewart R, Kendziorski C. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biology. 2016 Oct 25;17(1):222. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1077-y

Examples


# Load toy example SingleCellExperiment to simulate from

data(scDatEx)


# check that this object is a member of the SingleCellExperiment class
# and that it contains 142 samples and 500 genes

class(scDatEx)
show(scDatEx)


# set arguments to pass to simulateSet function
# we will simulate 30 genes total; 5 genes of each type;
# and 100 samples in each of two conditions

nDE <- 5
nDP <- 5
nDM <- 5
nDB <- 5
nEE <- 5
nEP <- 5
numSamples <- 100
seed <- 816


# create simulated set with specified numbers of DE, DP, DM, DB, EE, and
# EP genes and specified number of samples; DE genes are 2 standard
# deviations apart, and multimodal genes have a modal distance of
# 4 standard deviations

SD <- simulateSet(scDatEx, numSamples=numSamples, nDE=nDE, nDP=nDP, nDM=nDM,
                  nDB=nDB, nEE=nEE, nEP=nEP, sd.range=c(2,2), modeFC=4, 
                  plots=FALSE, 
                  random.seed=seed)
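One way to inspect the returned object is sketched below, using the standard SingleCellExperiment accessors. The simulation is re-run here only so that the snippet stands on its own, and the name sim_mat is illustrative. The column names inside colData and rowData are not spelled out on this page, so they are printed rather than assumed.

```r
library(scDD)
library(SingleCellExperiment)

data(scDatEx)

# Same settings as the example above, repeated so this block runs by itself.
SD <- simulateSet(scDatEx, numSamples = 100, nDE = 5, nDP = 5, nDM = 5,
                  nDB = 5, nEE = 5, nEP = 5, sd.range = c(2, 2), modeFC = 4,
                  plots = FALSE, random.seed = 816)

# Genes are rows: 5 of each of the six categories gives 30 rows.
dim(SD)

# Simulated expression values are stored in the "normcounts" assay.
sim_mat <- assay(SD, "normcounts")
sim_mat[1:3, 1:3]

# Per-sample metadata, including the condition indicator (1 or 2).
head(colData(SD))

# Per-gene metadata: simulated category and fold change.
head(rowData(SD))
```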

kdkorthauer/scDD documentation built on March 27, 2022, 5:11 a.m.