generate_fragments: generate a set of fragments from a set of transcripts
In alyssafrazee/polyester: Simulate RNA-seq reads


Convert each sequence in a DNAStringSet to a "fragment" (subsequence)

generate_fragments(
  tObj,
  fraglen = 250,
  fragsd = 25,
  readlen = 100,
  distr = "normal",
  custdens = NULL,
  bias = "none",
  frag_GC_bias = "none"
)

`tObj`	DNAStringSet of sequences from which fragments should be extracted
`fraglen`	Mean fragment length, if drawing fragment lengths from a normal distribution.
`fragsd`	Standard deviation of fragment lengths, if drawing lengths from a normal distribution. Note: `fraglen` and `fragsd` are ignored unless `distr` is 'normal'.
`readlen`	Read length. Default 100. Used only to label read positions.
`distr`	One of 'normal', 'empirical', or 'custom'. If 'normal', draw fragment lengths from a normal distribution with mean `fraglen` and standard deviation `fragsd`. If 'empirical', draw fragment lengths from a fragment length distribution estimated from a real data set. If 'custom', draw fragment lengths from a custom distribution, provided as the `custdens` argument, which should be a density fitted using `logspline`.
`custdens`	If `distr` is 'custom', draw fragments from this density. Should be an object of class `logspline`.
`bias`	One of 'none', 'rnaf', or 'cdnaf' (default 'none'). 'none' represents uniform fragment selection (every possible fragment in a transcript has equal probability of being in the experiment); 'rnaf' represents positional bias that arises in protocols using RNA fragmentation, and 'cdnaf' represents positional bias arising in protocols that use cDNA fragmentation (Li and Jiang 2012). Using the 'rnaf' model, coverage is higher in the middle of the transcript and lower at both ends, and in the 'cdnaf' model, coverage increases toward the 3' end of the transcript. The probability models used come from Supplementary Figure S3 of Li and Jiang (2012).
`frag_GC_bias`	See explanation in `simulate_experiment`.
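
To make the interaction of `fraglen`, `fragsd`, and `bias = "none"` concrete, here is a minimal base-R sketch of drawing one fragment per sequence. It is not polyester's implementation: the function name `sketch_fragments`, the use of plain character strings instead of a DNAStringSet, and the choice to clamp lengths to the transcript length are all assumptions made for illustration.

```r
# Illustrative only: base-R sketch of drawing one fragment per transcript.
# NOT polyester's code; `sketch_fragments` is a hypothetical helper.
sketch_fragments <- function(seqs, fraglen = 250, fragsd = 25) {
  vapply(seqs, function(s) {
    n <- nchar(s)
    # Draw a fragment length from a normal distribution and round it
    len <- round(rnorm(1, mean = fraglen, sd = fragsd))
    # Assumption: clamp to [1, n], so short transcripts come back whole
    len <- max(1, min(n, len))
    # Uniform selection: every valid start position is equally likely
    start <- sample.int(n - len + 1, 1)
    substr(s, start, start + len - 1)
  }, character(1), USE.NAMES = FALSE)
}

set.seed(174)
transcripts <- c(strrep("ACGT", 100), strrep("GATTACA", 20))
frags <- sketch_fragments(transcripts, fraglen = 50, fragsd = 5)
nchar(frags)  # one fragment length per input sequence
```

Each returned string is a contiguous subsequence of its input, which mirrors the "one fragment per element of `tObj`" behaviour described on this page. The positional-bias models ('rnaf', 'cdnaf') would replace the uniform `sample.int` draw with a weighted one.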

The empirical fragment length distribution was estimated using 7 randomly selected RNA-seq samples from the GEUVADIS dataset ('t Hoen et al. 2013), one sample from each laboratory that performed sequencing for that data set. We used Picard's "CollectInsertSizeMetrics" (http://broadinstitute.github.io/picard/), version 1.121, to estimate the insert size distribution based on the read alignments.

DNAStringSet consisting of one randomly selected subsequence per element of tObj.

't Hoen PA, et al. (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

logspline

  library(polyester)
  library(Biostrings)
  data(srPhiX174)

  ## get fragments with lengths drawn from a normal distribution
  set.seed(174)
  srPhiX174_fragments = generate_fragments(srPhiX174, fraglen=15, fragsd=3,
      readlen=4)
  srPhiX174_fragments
  srPhiX174

  ## get fragments with lengths drawn from an empirical distribution
  empirical_frags = generate_fragments(srPhiX174, distr='empirical')
  empirical_frags

  ## get fragments with lengths from a normal distribution, but include
  ## positional bias from cDNA fragmentation:
  biased_frags = generate_fragments(srPhiX174, bias='cdnaf')
  biased_frags