simulateData: Simulate RNA-seq bulk gene expression count data


View source: R/simulateData.R

Description

Simulate bulk RNA-seq gene expression count data from the input simulation parameters. Size factors are either supplied by the user or simulated from N(1,0.25).
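As a rough illustration of the default size factor behaviour, the sketch below draws one size factor per sample from a normal distribution centred at 1. It assumes that the 0.25 in N(1,0.25) is a standard deviation; the page does not say whether it is a standard deviation or a variance, so this is not necessarily what the package does internally.

```r
set.seed(1)   # for reproducibility of this sketch only

n <- 50       # number of samples, matching the default of simulateData()

# Assumption: 0.25 is treated as the standard deviation of the normal draw.
SF <- rnorm(n, mean = 1, sd = 0.25)

summary(SF)
```

A numeric vector of length n such as this could be supplied through the SF argument instead of leaving it NULL.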

Usage

simulateData(
  K = 2,
  B = 1,
  g = 10000,
  n = 50,
  pK = NULL,
  pB = NULL,
  LFCg = 1,
  pDEg = 0.05,
  sigma_g = 0.1,
  LFCb = 1,
  pDEb = 0.5,
  sigma_b = 0,
  beta0 = 12,
  phi0 = 0.35,
  SF = NULL,
  nsims = 25,
  disp = "gene",
  n_pred = 25,
  sim_batch_pred = FALSE,
  LFCb_pred = NULL,
  save_file = TRUE,
  save_dir = NULL,
  save_pref = NULL
)

Arguments

K

integer, number of clusters

B

integer, number of batches

g

integer, number of genes

n

integer, number of samples

pK

vector of length K (optional): proportion of samples in each cluster

pB

vector of length B (optional): proportion of samples in each batch

LFCg

numeric, LFC for cluster-discriminatory genes

pDEg

numeric, proportion of genes that are cluster-discriminatory

sigma_g

numeric, Gaussian noise added to each gene/sample N(0,sigma_g). Default is 0.1

LFCb

numeric, LFC for genes that are differentially expressed across batch. Default is 1.

pDEb

numeric, proportion of genes that are differentially expressed across batch. Default is 0.5.

sigma_b

numeric, batch-specific Gaussian noise (default 0).

beta0

numeric, baseline log2 expression for each gene before LFC is applied

phi0

numeric, baseline overdispersion for each gene

SF

vector of length n (optional), custom size factors from DESeq2. If NULL, simulated from N(1,0.25)

nsims

integer, number of datasets to simulate given the input conditions. Default is 25.

disp

string, either 'gene' or 'cluster', to simulate gene-level or cluster-level dispersions. Default is 'gene'. If disp='cluster', the input phi must be a g x K matrix.

n_pred

integer, number of samples in simulated prediction dataset. Default is 25

sim_batch_pred

boolean: FALSE (no batch effect for prediction samples) or TRUE (batch effect for prediction samples). Default is FALSE.

LFCb_pred

numeric (optional), LFCb for batch-affected genes in the prediction set. If NULL (default), it is set to max(batch_effects) + LFCb/2, i.e. a larger batch effect than in the training set.

save_file

boolean: TRUE (save each set of simulations to file) or FALSE (return the simulations as a list). Default is TRUE.

save_dir

string (optional): directory to save files. Default: 'Simulations/<sigma_g>_<sigma_b>/B<B>'

save_pref

string (optional): prefix of file name to save simulated data to. Default: '<K>_<n>_<LFCg>_<pDEg>_<beta0>_<phi0>'
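To make the roles of beta0, LFCg, pDEg, phi0 and the size factors more concrete, here is a minimal sketch of how such parameters could generate counts. It assumes a negative binomial model with mean SF * 2^(beta0 + LFC) and dispersion phi0, omits batch effects and the sigma_g noise, and uses made-up variable names (cls, de_genes, log2_mu). It is not the package's actual implementation.

```r
set.seed(42)

# Small toy dimensions; the function defaults are g = 10000 and n = 50.
g <- 100; n <- 10; K <- 2
beta0 <- 12; phi0 <- 0.35; LFCg <- 1; pDEg <- 0.05

# Hypothetical cluster labels, alternating 1, 2, 1, 2, ...
cls <- rep(seq_len(K), length.out = n)

# Hypothetical choice: the first round(pDEg * g) genes are cluster-discriminatory.
n_de <- round(pDEg * g)
de_genes <- seq_len(n_de)

# Baseline log2 expression per gene and cluster, shifted by LFCg
# for the discriminatory genes in cluster 2.
log2_mu <- matrix(beta0, nrow = g, ncol = K)
log2_mu[de_genes, 2] <- log2_mu[de_genes, 2] + LFCg

# Size factors fixed at 1 here for simplicity.
SF <- rep(1, n)

# g x n matrix of means: each sample takes its cluster's column, scaled by its size factor.
mu <- sweep(2^log2_mu[, cls], 2, SF, `*`)

# Assumption: negative binomial counts with dispersion phi0 (size = 1 / phi0).
counts <- matrix(rnbinom(g * n, mu = mu, size = 1 / phi0), nrow = g, ncol = n)

dim(counts)
```

Under these assumptions, a larger LFCg separates the clusters more strongly on the discriminatory genes, while a larger phi0 adds more count-level noise.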

Value

If save_file=TRUE, each simulated dataset is saved to '<save_dir>/<save_pref>_sim<1:nsims>_data.RData'. Otherwise, a list of length 'nsims' is returned, with one sim.dat list object per simulation.
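The file naming pattern above can be reproduced to locate saved simulations. The sketch below builds the expected paths from the documented default save_dir and save_pref templates. It assumes numeric values are written with R's default character conversion (for example 0.1 becomes "0.1"), and the name of the object stored inside each .RData file is not stated on this page, so the code lists whatever was loaded rather than guessing it.

```r
# Parameter values matching the documented defaults (nsims reduced for brevity).
K <- 2; B <- 1; n <- 50; LFCg <- 1; pDEg <- 0.05
sigma_g <- 0.1; sigma_b <- 0; beta0 <- 12; phi0 <- 0.35
nsims <- 3

# Default templates as documented:
#   save_dir  = 'Simulations/<sigma_g>_<sigma_b>/B<B>'
#   save_pref = '<K>_<n>_<LFCg>_<pDEg>_<beta0>_<phi0>'
save_dir  <- sprintf("Simulations/%s_%s/B%s", sigma_g, sigma_b, B)
save_pref <- paste(K, n, LFCg, pDEg, beta0, phi0, sep = "_")

# One file per simulation: '<save_dir>/<save_pref>_sim<i>_data.RData'
files <- file.path(save_dir,
                   sprintf("%s_sim%d_data.RData", save_pref, seq_len(nsims)))
print(files)

# Load the first file into its own environment, if it exists,
# and list the objects it contains.
if (file.exists(files[1])) {
  env <- new.env()
  load(files[1], envir = env)
  print(ls(env))
}
```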

Author(s)

David K. Lim, deelim@live.unc.edu

References

https://github.com/DavidKLim/FSCseq

Examples

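# Simulate a single dataset and return it directly instead of saving it to file
sim.dat <- FSCseq::simulateData(B = 1, g = 10000, K = 2, n = 50, LFCg = 1, pDEg = 0.05,
                                beta0 = 12, phi0 = 0.35, nsims = 1, save_file = FALSE)[[1]]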

DavidKLim/FSCseq documentation built on Dec. 12, 2021, 3:46 a.m.