srnadiff: Finding differentially expressed unannotated genomic regions...

View source: R/srnadiff.R

srnadiff R Documentation

Finding differentially expressed unannotated genomic regions from RNA-seq data

Description

srnadiff is a package that finds differentially expressed regions from RNA-seq data at base-resolution level without relying on existing annotation. To do so, the package implements an identify-then-annotate methodology that combines two pipelines: detection of differentially expressed regions (DERs) and quantification of differential expression.

This is the main wrapper for running several key functions from this package. It is meant to be used after an srnadiffExp object has been created. srnadiff implements four methods to produce potential DERs (see Details). Once DERs are detected, the second step in srnadiff is to quantify their statistical significance.

Usage

srnadiff(
  object,
  segMethod = c("hmm", "IR"),
  diffMethod = "DESeq2",
  useParameters = srnadiffDefaultParameters,
  nThreads = 1
)

Arguments

object

An srnadiffExp object.

segMethod

A character vector. The segmentation methods to use: one or more of 'annotation', 'naive', 'hmm' and 'IR', or 'all' to use all methods. The default given in Usage is c("hmm", "IR"). See Details.

diffMethod

A character. The differential expression testing method to use, one of 'DESeq2', 'edgeR', or 'baySeq'. See Details.

useParameters

A named list containing the method parameters to use. If missing, default parameter values are supplied. See parameters for details.

nThreads

integer(1). Number of workers. The default given in Usage is 1; see multicoreWorkers for the number of cores available.

Details

The srnadiff package implements two major methods to produce potential differentially expressed regions: the HMM and IR methods. Briefly, these methods identify contiguous base pairs in the genome that show a differential expression signal; these are then grouped into genomic intervals called differentially expressed regions (DERs).

Once DERs are detected, the second step in an sRNA-diff approach is to quantify their statistical significance. To do so, reads (including fractions of reads) that overlap each expressed region are counted to arrive at a count matrix with one row per region and one column per sample. This count matrix is then analyzed using the standard DESeq2 workflow for differential expression of RNA-seq data, assigning a p-value to each candidate DER. Alternatively, other methods (edgeR, baySeq) can be used.

The main functions for finding differentially expressed regions are srnadiffExp and srnadiff. The first creates an S4 class providing the infrastructure (slots) to store the input data, method parameters, intermediate calculations and results of an sRNA-diff approach. The second implements four methods to find candidate differentially expressed regions and quantifies the statistical significance of the regions found. The implemented methods are described further in the vignette and in the manual page of the srnadiff function.
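The counting step described above can be illustrated with a minimal base-R sketch. It assumes reads and regions are plain start/end intervals on a single chromosome, and that a read contributes the fraction of its length that falls inside a region; the helper countFractionalOverlaps and the toy data are hypothetical, and the package's own counting may weight reads differently.

```r
# Hypothetical helper (not part of srnadiff): fractional overlap counts.
# 'reads' and 'regions' are data frames with 'start' and 'end' columns
# (1-based, inclusive) on a single chromosome.
countFractionalOverlaps <- function(reads, regions) {
    vapply(seq_len(nrow(regions)), function(i) {
        # Number of bases shared by each read and region i (0 if disjoint)
        ovl <- pmin(reads$end, regions$end[i]) -
               pmax(reads$start, regions$start[i]) + 1
        ovl <- pmax(ovl, 0)
        # Each read contributes the fraction of its length inside the region
        sum(ovl / (reads$end - reads$start + 1))
    }, numeric(1))
}

# Toy candidate regions and reads for two samples (made-up values)
regions <- data.frame(start = c(10, 50), end = c(20, 60))
samples <- list(
    ctrl = data.frame(start = c(8, 12, 55), end = c(11, 15, 58)),
    trt  = data.frame(start = c(10, 16, 40), end = c(13, 23, 49))
)

# One row per region, one column per sample
countMatrix <- sapply(samples, countFractionalOverlaps, regions = regions)
rownames(countMatrix) <- paste0("region", seq_len(nrow(regions)))
countMatrix
```

A matrix of this shape (regions in rows, samples in columns) is what the differential expression step takes as input.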

Implemented methods to produce potential differentially expressed regions in srnadiff are:

annotation:

This method simply provides the genomic regions corresponding to the annotation file that is optionally given by the user. It can be a set of known miRNAs, siRNAs, piRNAs, genes, or a combination thereof.

hmm:

This approach assumes that contiguous regions of RNA along the chromosome are either "differentially expressed" or "not". This is captured with a hidden Markov model (HMM) with a binary latent state for each nucleotide: differentially expressed or not differentially expressed. The observations of the HMM are the empirical p-values arising from the differential expression analysis at each nucleotide position. The HMM approach needs emission, transition, and starting probability values (see parameters), which can be tuned by the user. To find the most likely sequence of states from the HMM, the Viterbi algorithm is run. This essentially segments the genome into regions, where a region is defined as a set of consecutive bases showing a common expression signature.
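As an illustration of the decoding step only, here is a minimal two-state Viterbi sketch in base R. It assumes the per-base p-values are discretized into "low" and "high" at a cutoff of 0.05; the function viterbi2, the cutoff, and all probability values are hypothetical and are not the package defaults.

```r
# Minimal two-state Viterbi decoder in log space (illustrative sketch).
# States: 1 = not differentially expressed, 2 = differentially expressed.
# Observations: discretized p-values, 1 = "high", 2 = "low".
viterbi2 <- function(obs, start, trans, emis) {
    n <- length(obs)
    k <- length(start)
    score <- matrix(-Inf, nrow = k, ncol = n)  # best log-probability per state
    back  <- matrix(0L,   nrow = k, ncol = n)  # back-pointers
    score[, 1] <- log(start) + log(emis[, obs[1]])
    for (t in seq_len(n)[-1]) {
        for (s in seq_len(k)) {
            cand <- score[, t - 1] + log(trans[, s])
            back[s, t]  <- which.max(cand)
            score[s, t] <- max(cand) + log(emis[s, obs[t]])
        }
    }
    # Trace back the most likely sequence of states
    path <- integer(n)
    path[n] <- which.max(score[, n])
    for (t in rev(seq_len(n - 1))) path[t] <- back[path[t + 1], t + 1]
    path
}

# Toy per-base p-values and assumed model parameters
pvals <- c(0.8, 0.6, 0.01, 0.02, 0.3, 0.01, 0.9, 0.7)
obs   <- ifelse(pvals < 0.05, 2L, 1L)
start <- c(0.9, 0.1)
trans <- matrix(c(0.9, 0.1,
                  0.1, 0.9), nrow = 2, byrow = TRUE)  # rows: from, cols: to
emis  <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)  # rows: state, cols: symbol

path <- viterbi2(obs, start, trans, emis)

# Collapse runs of state 2 into candidate regions (base coordinates)
runs   <- rle(path)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
ders   <- data.frame(start = starts[runs$values == 2],
                     end   = ends[runs$values == 2])
ders
```

With these toy values the base with p-value 0.3 is absorbed into the surrounding region because the transition probabilities favour staying in the same state, which is the behaviour that distinguishes the HMM from a per-base cutoff.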

IR:

In this approach, for each base, the average of the normalized coverage is calculated across all samples within each condition. This generates one vector of (normalized) mean coverage per condition. These two vectors are then used to compute per-nucleotide log-ratios (in absolute value) across the genome. The method then uses a sliding threshold h that runs across the log-ratio levels, identifying bases with a log-ratio value above h. Regions of contiguous bases passing this threshold are then analyzed using an adaptation of the Aumann and Lindell algorithm for the irreducibility property (Aumann and Lindell (2003)).

naive:

This method is the simplest: given a fixed threshold h, contiguous bases with log-ratio expression (in absolute value) passing this threshold are considered candidate differentially expressed regions.
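The log-ratio computation shared by the IR and naive methods, followed by the fixed-threshold step of the naive method, can be sketched in base R as follows. The coverage values are made up, and the base-2 logarithm and the pseudocount of 1 (used to avoid taking the log of zero) are assumptions. The sliding threshold and the irreducibility step of the IR method are not shown.

```r
# Normalized coverage: one row per base, one column per sample (toy values)
coverage <- cbind(
    ctrl1 = c(5, 5, 6, 5, 40, 42, 41, 5,  5, 6),
    ctrl2 = c(5, 6, 5, 5, 38, 40, 43, 6,  5, 5),
    trt1  = c(5, 5, 5, 6,  4,  5,  5, 5, 30, 5),
    trt2  = c(6, 5, 5, 5,  5,  4,  5, 5, 28, 6)
)
condition <- c("ctrl", "ctrl", "trt", "trt")

# Mean normalized coverage per condition at every base
meanCtrl <- rowMeans(coverage[, condition == "ctrl", drop = FALSE])
meanTrt  <- rowMeans(coverage[, condition == "trt",  drop = FALSE])

# Absolute per-base log-ratio; log2 and the pseudocount are assumptions
logRatio <- abs(log2((meanCtrl + 1) / (meanTrt + 1)))

# Naive method: runs of contiguous bases whose log-ratio exceeds a fixed h
h      <- 1
runs   <- rle(logRatio > h)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
candidates <- data.frame(start = starts[runs$values],
                         end   = ends[runs$values])
candidates
```

Because the log-ratio is taken in absolute value, regions that are higher in either condition are reported; here one candidate comes from higher coverage in the control samples and the other from higher coverage in the treated samples.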

Value

An srnadiffExp object containing additional slots for:

  • regions

  • parameters

  • countMatrix

Author(s)

Matthias Zytnicki and Ignacio González

References

Aumann Y. and Lindell Y. (2003). A Statistical Theory for Quantitative Association Rules. Journal of Intelligent Information Systems, 20(3):255-283.

See Also

regions, parameters, countMatrix and srnadiffExp

Examples

## A typical srnadiff session might look like the following.

## Here we assume that 'bamFiles' is a vector with the full
## paths to the BAM files and the sample and experimental
## design information are stored in a data frame 'sampleInfo'.

## Not run: 

#-- Data preparation
srnaExp <- srnadiffExp(bamFiles, sampleInfo)

#-- Detecting DERs and quantifying differential expression
srnaExp <- srnadiff(srnaExp)

#-- Visualization of the results
plotRegions(srnaExp, regions(srnaExp)[1])

## End(Not run)

srnaExp <- srnadiffExample()
srnaExp <- srnadiff(srnaExp)
srnaExp


mzytnicki/srnadiff documentation built on March 7, 2023, 2:18 a.m.