getRNAseqMatrix: Transforms a simulation time-point into RNA-seq-like data.

Description Usage Arguments Details Value Examples

View source: R/stochastic_simulation.R

Description

Transforms a time-point of a simulation into RNA-seq-like data, i.e. simulates a read count for each RNA molecule per individual (each individual is considered as a sample).

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
getRNAseqMatrix(
  simdf,
  insilicosystem,
  samplingTime = max(simdf$time),
  laneEffect = F,
  nLanes = 2,
  propRnasSampled = 0.9,
  samplesLibSize = NULL,
  genesLength = NULL,
  mrnasOnly = T,
  mergeComplexes = F,
  meanLogLibSize_lane = 7,
  sdLogLibSize_lane = 0.5,
  sdLogLibSize_samples = 0.2
)

Arguments

simdf

The data-frame with the result of the simulation (see simulateInSilicoSystem).

insilicosystem

The simulated in silico system (see createInSilicoSystem).

samplingTime

Numeric. Time-point of the simulation to be transformed. By default, the maximum time of the simulation is used.

laneEffect

Boolean. Are the samples processed on different lanes/batches (see sampleLibrarySize)? Ignored if samplesLibSize is provided. Default value is FALSE.

nLanes

Numeric. How many lanes are there in the experiment (see sampleLibrarySize)? Automatically set to 1 if laneEffect = F. Ignored if samplesLibSize is provided. Default value is 2.

propRnasSampled

Numeric. The proportion of molecules of RNAs that are sampled in each individual. Must be between 0 and 1. Default value is 0.9.

samplesLibSize

Vector of expected library size for each individual/sample. If named, the names of the vector must correspond to the names of the individuals as specified in the result of the simulation. If none provided, will be sampled from a log-normal distribution (see sampleLibrarySize). Default value is NULL.

genesLength

Vector of gene length for each gene in the system. Its length must be equal to the number of protein-coding genes in the system (if mrnasOnly is TRUE) or of genes in the system (if mrnasOnly is FALSE). If named, the names must correspond to the ID of the (protein-coding) genes in the system. If none provided, all genes are assumed to be of length 1.

mrnasOnly

Boolean. Are the noncoding RNAs to be discarded before the transformation? If TRUE, read counts will be returned only for protein-coding RNAs. Default value is TRUE.

mergeComplexes

Boolean. Are the RNAs in complex accounted for in the read counts (i.e. are they detected by the RNA-seq experiment)? Default value is FALSE. See also mergeComplexesAbundance.

meanLogLibSize_lane

Numeric. The mean of the log10 mean library size normal distribution (see sampleLibrarySize). Ignored if samplesLibSize is provided. Default value of 7.

sdLogLibSize_lane

Numeric. The sd of the log10 mean library size normal distribution (see sampleLibrarySize). Ignored if samplesLibSize is provided. Default value of 0.5.

sdLogLibSize_samples

Numeric. The sd of the log10 samples library size normal distribution (see sampleLibrarySize). Ignored if samplesLibSize is provided. Default value of 0.2.

Details

The abundance of the RNA form of each gene at time samplingTime is extracted from the result of the simulation. If mrnasOnly = TRUE, non-coding RNAs are discarded. If the simulation contains several trials per individual (see simulateInSilicoSystem), the abundance of each RNA is summed over the different trials for each individual. The abundance of the RNAs in each individual are then transformed into "noisy proportions": for each individual, tot_RNAs times propRnasSampled RNAs molecules are sampled from the total RNA molecules in the individual (where tot_RNAs is the total number of RNA molecules of a given individual). If propRnasSampled is 1, this step simply corresponds to dividing the abundance of each RNA by the total number of RNAs for the individual. Otherwise, if propRnasSampled is less than 1, it introduces some stochasticity in the "measurement" of RNAs as some low-expressed RNAs might not be represented. The noisy proportion of for each RNA is multiplied by the gene length (by default set to 1 for all genes), to reproduce the bias of RNA-seq experiments in which longer genes get more reads. The proportions are re-scaled such that their sum over all RNAs for each individual equals 1. The expected count of each RNA in each individual is then computed as the product of the corresponding noisy proportion and the library size of the individual. The latter can be provided by the user (parameter samplesLibSize); otherwise it is sampled using the function sampleLibrarySize. Finally, the actual read count of each RNA for each individual is sampled from a Poisson distribution with parameter lambda equal to the corresponding expected count.

Value

A list:

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
mysystem = createInSilicoSystem(G = 5, regcomplexes = "none",
                                ploidy = 2, PC.p = 1)
mypop = createInSilicoPopulation(10, mysystem)
sim = simulateInSilicoSystem(mysystem, mypop, simtime = 1000,
                             ntrials = 10, nepochs = 5)
rnaSeq = getRNAseqMatrix(sim$Simulation, mysystem, laneEffect = F)

## With a batch/lane effect on the library size of the samples
rnaSeq = getRNAseqMatrix(sim$Simulation, mysystem, laneEffect = T)

## Providing the library size of each sample/individual
libsize = rnorm(length(unique(sim$Simulation$Ind)), 1e7, 1e5)
names(libsize) = unique(sim$Simulation$Ind)
rnaSeq = getRNAseqMatrix(sim$Simulation, mysystem, samplesLibSize = libsize)

## Accounting for different gene lengths
genes_length = sample(1:200, nrow(mysystem$genes))
names(genes_length) = as.character(1:nrow(mysystem$genes))
rnaSeq = getRNAseqMatrix(sim$Simulation, mysystem,
                         samplesLibSize = libsize, genesLength = genes_length)

sismonr documentation built on Feb. 11, 2020, 9:07 a.m.