run_abs_simulation: Run a copy number simulation

Description

This function runs a simulation that explicitly defines copy numbers, which can be used to test whether compositional changes lead to consistent or misleading results under a given analysis. After an experiment is simulated from the copy numbers, those numbers are converted into expected reads for use in a polyester simulation.
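
To illustrate why a compositional change can mislead, here is a minimal base-R sketch. It does not use this package: the copy numbers, the library size, and the helper `expected_reads` are hypothetical, and the conversion assumes reads are simply proportional to each transcript's share of the total copy number (transcript length is ignored).

```r
# Hypothetical copy numbers for four transcripts (illustrative values only).
control   <- c(tx1 = 100, tx2 = 200, tx3 = 300, tx4 = 400)
treatment <- control
treatment["tx4"] <- control["tx4"] * 4   # only tx4 changes, 4-fold up

# Assumed conversion: expected reads are proportional to each transcript's
# share of the total copy number, scaled to a fixed library size.
lib_size <- 1e6
expected_reads <- function(copies, lib_size) lib_size * copies / sum(copies)

reads_control   <- expected_reads(control, lib_size)
reads_treatment <- expected_reads(treatment, lib_size)

# Compare the true fold changes with those seen at fixed sequencing depth.
true_fc     <- treatment / control
apparent_fc <- reads_treatment / reads_control
print(round(rbind(true_fc, apparent_fc), 3))
```

In this sketch the three unchanged transcripts appear to decrease, because the total copy number grew while the library size stayed fixed.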

Usage

run_abs_simulation(sleuth, fasta_file, sample_index = "mean",
  outdir = ".", num_reps = c(10, 10), denom = NULL, seed = 1,
  num_runs = 1, gc_bias = NULL, de_probs = 0.1, de_type = "normal",
  de_levels = c(1.25, 2, 4), dir_probs = 0.5, mean_lib_size = 20 *
  10^6, single_value = TRUE, polyester_sim = FALSE,
  control_condition = NULL, num_cores = 1, include_spikeins = TRUE,
  spikein_mix = "Mix1", spikein_percent = 0.02)

Arguments

sleuth,

a sleuth object or a character string with an R-Data file containing a sleuth object saved using 'sleuth_save'. This object contains results from a real experiment.

fasta_file,

a multiFASTA file with the transcripts to be used in the simulation (required for polyester)

sample_index,

which sample from the real dataset should be used as the starting point for the simulation? You may use a number or string, as long as it is a valid column index for the dataset. If "mean" is given, the default, then the mean of the control samples will be used.

outdir,

where should the simulated reads be written to?

num_reps,

the number of samples in each condition. Note that only two conditions are currently supported, so this must be a vector of length 2.

denom,

the name(s) of transcript(s) that will be used as the denominator for showing how the data will behave after ALR transformation. The default is NULL, which indicates that this function will choose the first feature that is simulated to not change as the denominator.
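
As a rough illustration of the ALR (additive log-ratio) transformation mentioned above, the following base-R sketch divides every feature by one denominator feature and takes the log. The matrix `counts` and the helper `alr` are hypothetical and are not part of this package; only a single denominator is handled here.

```r
# Hypothetical copy numbers: rows are transcripts, columns are samples.
counts <- matrix(
  c(100, 200, 300, 400,
    100, 200, 300, 1600),
  nrow = 4,
  dimnames = list(paste0("tx", 1:4), c("control", "treatment"))
)

# ALR: log of each remaining feature minus the log of the denominator
# feature, computed within each sample (column).
alr <- function(x, denom) {
  ref_log <- log(x[denom, ])
  others  <- setdiff(rownames(x), denom)
  sweep(log(x[others, , drop = FALSE]), 2, ref_log)
}

# tx1 is unchanged between the samples, so it is a valid denominator here.
alr_values <- alr(counts, denom = "tx1")
alr_diff   <- alr_values[, "treatment"] - alr_values[, "control"]
print(alr_diff)
```

Because the denominator does not change, the differences for the unchanged transcripts are zero and the difference for tx4 equals the log of its true fold change.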

seed,

the random seed to be used for reproducibility

num_runs,

the number of simulations to run

gc_bias,

integer vector of length sum(num_reps) giving the GC bias to be used by polyester for each sample. Each value must be between 0 and 7. See ?polyester::simulate_experiment under "gcbias" in the Details section for more information. The default is NULL, which means that all samples will be set to 0 (i.e. no bias).
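
A small sketch of how such a vector might be built and checked in base R. The helper `check_gc_bias` is hypothetical and only mirrors the constraints stated above; it is not a function from this package or from polyester.

```r
num_reps <- c(10, 10)

# No bias for any sample (what the NULL default is described as doing).
gc_bias_none <- rep(0L, sum(num_reps))

# No bias in the first condition, bias model 1 in the second condition.
gc_bias_mixed <- rep(c(0L, 1L), times = num_reps)

# Hypothetical validity check: one value per sample, each between 0 and 7.
check_gc_bias <- function(gc_bias, num_reps) {
  length(gc_bias) == sum(num_reps) && all(gc_bias %in% 0:7)
}

print(check_gc_bias(gc_bias_none, num_reps))
print(check_gc_bias(gc_bias_mixed, num_reps))
```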

de_probs,

vector of same length as num_runs, with numbers between 0 and 1 describing the probability of differential expression for each simulation

de_type,

either "discrete" or "normal" (the default) to indicate using discrete levels of differential expression, or to use a truncated normal distribution for a continuum of differential expression. The levels of discrete DE, or the parameters for the truncated normal, are determined by de_levels.

de_levels,

if de_type is "discrete", this is a vector with one or more numbers > 1 to indicate the levels of differential expression. If de_type is "normal", this is a vector of length 3 specifying the following parameters for the rtruncnorm function: a (the minimum of the truncated normal; it should be > 1), mean, and sd. When the direction is down, the inverse of these levels will be used.

dir_probs,

vector of same length as num_runs, with numbers between 0 and 1 describing the probability of differential expression being increased, given a transcript that is changing.
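
The interplay of de_probs, de_levels, and dir_probs can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: the helper `r_lower_truncnorm` is a hypothetical stand-in for rtruncnorm that uses the inverse-CDF method, and the number of transcripts is arbitrary.

```r
set.seed(1)

n_tx      <- 1000            # arbitrary number of transcripts
de_prob   <- 0.1             # probability that a transcript changes
dir_prob  <- 0.5             # probability that a change is an increase
de_levels <- c(1.25, 2, 4)   # a (lower bound), mean, sd

# Hypothetical stand-in for a normal truncated below at `a`: draw uniform
# probabilities above the CDF value at `a` and map them back with qnorm.
r_lower_truncnorm <- function(n, a, mean, sd) {
  p_low <- pnorm(a, mean = mean, sd = sd)
  qnorm(runif(n, min = p_low, max = 1), mean = mean, sd = sd)
}

is_de     <- runif(n_tx) < de_prob
is_up     <- runif(n_tx) < dir_prob
magnitude <- r_lower_truncnorm(n_tx, de_levels[1], de_levels[2], de_levels[3])

# Unchanged transcripts keep a fold change of 1; decreases use the inverse.
fold_change <- rep(1, n_tx)
fold_change[is_de & is_up]  <- magnitude[is_de & is_up]
fold_change[is_de & !is_up] <- 1 / magnitude[is_de & !is_up]

print(table(changed = is_de, up = is_up))
```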

mean_lib_size,

the average number of reads per library to be simulated (default is 20 million reads). Variability in the exact library size per sample is introduced using a normal distribution with a coefficient of variation of 5.
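
A minimal base-R sketch of drawing per-sample library sizes around the mean. It assumes that the coefficient of variation of "5" means 5%, i.e. a standard deviation of 0.05 times the mean; that reading is an assumption and is not confirmed by the text above.

```r
set.seed(1)

num_reps      <- c(10, 10)
mean_lib_size <- 20 * 10^6
cv            <- 0.05   # assumption: "5" is interpreted as 5%

# One library size per sample, normally distributed around the mean.
lib_sizes <- round(rnorm(sum(num_reps),
                         mean = mean_lib_size,
                         sd   = cv * mean_lib_size))

print(summary(lib_sizes))
```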

single_value,

if TRUE, sizes are calculated for the whole experiment using DESeq2 estimateDispersions; otherwise, sizes are interpolated using the dispersion function from DESeq2 using the mean counts for each condition.

polyester_sim,

should polyester be run? (defaults to FALSE to save time when you are only interested in the ground truth)

control_condition,

what factor level should be used to define the control condition? This is used to select control samples to estimate dispersions for a null distribution, i.e. the variance of estimated counts in an experiment without an expectation of differential expression. The default, NULL, uses all of the samples in the provided sleuth object. Note that if this is specified, DESeq2 will estimate dispersions using an intercept-only model (~1), whereas if it is left NULL, the full formula from the sleuth object will be used (obj$full_formula).

num_cores

the number of cores to be used to run parallel simulations. The default is to use just one.

include_spikeins

if TRUE, will add spike-ins to the simulated experiment.

spikein_mix

character specifying which mix to use; only accepts "Mix1" or "Mix2". If a different mix is desired for each condition, specify a character vector containing a mix for each condition. The default is "Mix1".

spikein_percent

what percent of the total copy numbers in the control condition should be spike-in controls? This is given as a fraction; the default is 0.02 (2%).
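
The arithmetic behind a spike-in fraction can be sketched in base R. This assumes the fraction is taken of the total that includes the spike-ins themselves; the copy numbers are hypothetical, and the package may compute this differently.

```r
# Hypothetical copy numbers in the control condition.
control_copies  <- c(tx1 = 100, tx2 = 200, tx3 = 300, tx4 = 400)
spikein_percent <- 0.02

# Assumption: spike-ins should make up `spikein_percent` of the combined
# total, so solve s / (endogenous + s) = p for s.
endogenous_total <- sum(control_copies)
spikein_total    <- endogenous_total * spikein_percent / (1 - spikein_percent)

# Check the resulting share of spike-in copies in the combined total.
spikein_fraction <- spikein_total / (endogenous_total + spikein_total)
print(spikein_fraction)
```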

Value

returns invisibly a list with three members:


warrenmcg/absSimSeq documentation built on May 29, 2019, 9:57 a.m.