In agi0917/AlternativeSplicingDocu:

r6objDocuStat <- rqudocuhelper::R6ClassDocuStatus$new()
r6objDocuStat$setProject(psProject = "DEXSeq")

# r6objDocuStat$setVersion(psVersion = "0.0.900")
# r6objDocuStat$setDate(psDate = "31.03.2016")
# r6objDocuStat$setAuthor(psAuthor = "pvr")
# r6objDocuStat$setStatus(psStatus = "Init")
# r6objDocuStat$setVersion(psVersion = "0.0.901")
# r6objDocuStat$setDate(psDate = "01.04.2016")
# r6objDocuStat$setAuthor(psAuthor = "peter")
# r6objDocuStat$setStatus(psStatus = "First changes")
# r6objDocuStat$setVersion(psVersion = "0.0.902")
# r6objDocuStat$setDate(psDate = "22.04.2016")
# r6objDocuStat$setAuthor(psAuthor = "peter")
# r6objDocuStat$setStatus(psStatus = "Adding description on analysis")
# r6objDocuStat$setVersion(psVersion = "0.0.903")
# r6objDocuStat$setStatus(psStatus = "Adding more details on the analysis")
# r6objDocuStat$setVersion(psVersion = "0.0.904")
# r6objDocuStat$setStatus(psStatus = "Starting with report, added TODO-list")
# r6objDocuStat$setVersion(psVersion = "0.0.905")
# r6objDocuStat$setStatus(psStatus = "Adding annotation info to results")
# r6objDocuStat$setVersion(psVersion = "0.0.906")
# r6objDocuStat$setStatus(psStatus = "Extending description on preparation")
# r6objDocuStat$setVersion(psVersion = "0.0.907")
# r6objDocuStat$setStatus(psStatus = "Interpretation of multiple gene-ids")
# r6objDocuStat$setVersion(psVersion = "0.0.908")
# r6objDocuStat$setStatus(psStatus = "Annotations via biomaRt")
# r6objDocuStat$setVersion(psVersion = "0.0.909")
# r6objDocuStat$setStatus(psStatus = "Check assignment of treatment and control")
r6objDocuStat$setVersion(psVersion = "0.0.910")
r6objDocuStat$setStatus(psStatus = "Added subsection on assignment of treatment and control")

r6objDocuStat$include_doc_stat()

Disclaimer

This document describes how to assess differential exon usage using the R/BioC-Package DEXSeq. The description in this document follows closely the vignette of DEXSeq which is available from \url{http://bioconductor.org/packages/release/bioc/html/DEXSeq.html} and also from within R using the command

browseVignettes(package = "DEXSeq")

On the BioC website for DEXSeq the publication by @ARH2012 is mentioned as the reference to be cited for the package. This reference can also be obtained by

citation(package = "DEXSeq")

Introduction

Differential exon usage (DEU) can be analysed by comparing RNA-Seq data from different tissues or under different experimental conditions. DEU is defined as the changes in relative usage of exons caused by experimental conditions.

Since the basic nature of RNA-Seq data is counts of molecules within a given experimental sample, the measure of differential exon usage amounts to being a function of the count differences between the different experimental conditions. The R/BioC-package DEXSeq implements a statistical method based on generalized linear models which offers to control false discovery rate by taking biological variation into account.

People familiar with R/BioC-Package DESeq2 will recognize that the usage of DEXSeq is very similar to DESeq2.

\pagebreak

Preparation

Before being able to analyse differential exon usage, one has to do the following preparations steps.

Alignment: In case, the input data consists of raw read files, they first have to be aligned against a reference genome (not transcriptome) sequence which results in sam or bam files. It is important to use a splice-aware aligner (i.e. an alignment tool that can handle reads that span across introns) such as TopHat2, GSNAP or STAR. DEXSeq works with sam files as input for the reads. In case aligned reads are available in bam format, they can be converted using samtools.
Annotations: The requirements for the annotation are somewhat special, and it is recommended to follow the workflow given in the DEXSeq vignette.
Feature counts: Python scripts coming with package DEXSeq are used for counting the how many exons occur in a given read file sample.

For this document, we use the dataset that was used in the vignette of DEXSeq. It consists of read counts for pasillaGenes and pasillaExons in Drosophila. More information on the used dataset can be obtained from the vignette that comes with the pasilla-package using

browseVignettes(package = "pasilla")

Preparatory steps to load the data consisting of count files and an annotation file and to setup the relevant modelling objects which is a dataframe relating the samples to the experimental conditions, are shown below.

if (!require(pasilla)) {
  source("https://bioconductor.org/biocLite.R")
  biocLite("pasilla")
  require(pasilla)
}

The count data and the annotation file are taken from the package and the sample table dataframe is constructed manually.

inDir <- system.file("extdata", package="pasilla")
list.files(inDir)
countFiles  <-  list.files(inDir, pattern="fb.txt$", full.names=TRUE)
(flattenedFile  <-  list.files(inDir, pattern="gff$", full.names=TRUE))
sampleTable  <-  data.frame(
   row.names = c( "treated1", "treated2", "treated3",
                  "untreated1", "untreated2", "untreated3", "untreated4" ),
   condition = c("knockdown", "knockdown", "knockdown",
                 "control", "control", "control", "control" ),
   libType = c( "single-end", "paired-end", "paired-end",
                "single-end", "single-end", "paired-end", "paired-end" ) )

Analysis workflow

The general procedure of the analysis starts with constructing a DEXSeqDataSet using the function DEXSeq::DEXSeqDataSetFromHTSeq. This function requires four arguments

a vector with names of count files
a dataframe which relates the samples to the different conditions
a design formula
the name of the gff reference file

sDxdFn <- file.path(system.file(package = "AlternativeSplicing", "inst", "extdata", "pasilla"), "dxd.Rd")
bRunAnalysis <- TRUE
if (file.exists(sDxdFn)) {
  suppressPackageStartupMessages( library( "DEXSeq" ) )
  load(file = sDxdFn)
  bRunAnalysis <- FALSE
}

suppressPackageStartupMessages( library( "DEXSeq" ) )
dxd  <-  DEXSeqDataSetFromHTSeq(
   countFiles,
   sampleData=sampleTable,
   design= ~ sample + exon + condition:exon,
   flattenedfile=flattenedFile )

Inspecting the data

Column data

Once a DEXSeqDataSet object is created there are different accessor functions that can be used to inspect the content of a DEXSeqDataSet object. The annotation and all the other information about a DEXSeqDataSet can be obtained using the colData() function.

colData(dxd)

Condition

As seen above the result of colData() is a dataframe showing available information about the DEXSeqDataSet object. The column entitled condition is of particular interest with respect to the assignment of treatment and control samples. The content of the condition column is of type factor and it defines the way how the comparisions are done. Using the statement

dxd$condition

gives a list of the conditions that are used in the model for accessing DEU. It is important that the content of this column is of type factor. This can be verified using the statement

is.factor(dxd$condition)

For our comparison it is important to notice that the later used model fitting routines will assume that the first level is always the control or the untreated condition. The levels of a given factor can be shown using the function levels() as follows

levels(dxd$condition)

This shows that condition r levels(dxd$condition)[1] will be used as control condition. We can change the control condition using the function relevel(). The statement

dxd$condition <- relevel(dxd$condition, "knockdown")

would change the control condition to be "knockdown" instead of "control".

Count data

Counts data can be viewed using the counts() function

head( counts(dxd), 5 )

Two groups of counts

From the output of the functions colData() and counts(), we can see that there are 14 columns, although we have only seven samples. The first seven columns are the number of reads mapping to the specific exonic regions and the last seven columns correspond to the sum of counts mapping to the rest of the exons from the same gene on each sample. The two groups of counts are termed this and others respectively by DEXSeq. The distribution of column indices into the two groups can be viewd by

split( seq_len(ncol(dxd)), colData(dxd)$exon )

In case, we are only interested in the counts of the this group, the following function brings up the first five lines of all feature counts

head( featureCounts(dxd), 5 )

Row data

Details on counting bins can accessed using the function rowData().

head( rowRanges(dxd), 3 )

Annotation of a sample

The function sampleAnnotation gives details about the design table and the sample annotations.

sampleAnnotation( dxd )

Analysis

The following analysis steps are always done following the same principle. The current DEXSeqDataSet object is passed to a function. And the function result is again assiged to the same DEXSeqDataSet object.

Short cut

Package DEXSeq contains a wrapper function called DEXSeq that does all the analysis steps that will be described below in one go. Hence the only thing that needs to be done is to create an instance of a DEXSeqDataSet object using function DEXSeqDataSetFromHTSeq() as shown above and call function DEXSeq() on that DEXSeqDataSet object. This will return a DEXSeqResults object.

In what follows, each single step of the analysis is described in more details

Normalisation

Different samples may show different sequencing depths which might lead to a coverage bias. So called size factors are estimated which measure relative sequencing depth. Size factor estimation in DEXSeq is done the same way as in DESeq and DESeq2 which is implemented in the function estimateSizeFactors

dxd  <- estimateSizeFactors( dxd )

Dispersion estimation

Differential exon usage is based on separating variability of the count data into sources that are related to biological variation and into non-biological factors. Information on the amount of variability due to non-biological factors which is also known as noise is inferred from biological replicates in an experimental data set and is characterized by the so-clled dispersion. Typically the number of replicates is too small to estimated dispersion parameters on the level of individual exons. Therefore information is shared across exons and genes in an intensity-dependent manner, to get reliable dispersion estimates.

So far only simple one-way designs are considered, i.e. samples are grouped according to a common experimental condition which is indicated by the condition factor of the sample table. Samples that share the same level of this one experimental condition are considered to be replicates. It is possible to accomodated more complicated experimental designs than just a single condition factor. The discussion of such models is deferred to a later section

In DEXSeq dispersions are estimated the same way as in package DESeq2. Internally the functions of DESeq2 are called, adapting the parameters to the specific caes of the DEXSeq model. In short, per-exon dispersions are computed using a Cox-Reid adjusted profile likelihood estimation.

Report

The results are reported using the R-package ReporteRs. This package offers a number of functions to generate on the one hand MS-based documents and on the other hand also HTML pages. The report that is generated for the DEXSeq results is very simple. It consists of titles, paragraphs Tables and Links to either other pages or to images which are directly shown in the report.

For the gene-level results so far, we just inserted a link to the page that is generated by DEXSeq::DEXSeqHTML(). It might be more attractive to construct a separate table with relevant genes of interest directly on the main report page and use the links to the plots for each gene that are produced by DEXSeq::DEXSeqHTML().

Annotations

The reports generated by DEXSeq annotate the results using ENSEMBL gene-ids. As those Ids might not be informative enough, some additional annotations are added. The minimal set of additional annotation that is added so far consists of the gene_name and of the description. Both items of information are taken from an annotation file that is given by the S4 Object describing the genome reference build. The name of the annotation file is automatically assigned using the construct

refBuild = "Mus_musculus/Ensembl/GRCm38.PatchesSkipped/Annotation/Version-2015-06-25"
param <- ezParam(list(refBuild=refBuild))

This creates an item called refBuild in the parameter list param. That list item is an S4 object which has many slots one of which is called refAnnotationFile which stored the relevant annotation file belonging to the refBuild that was specified above. In short the following statement returns the name of the annotation file.

sGnAnFn <- param[['ezRef']]@refAnnotationFile

The annotation file is read into a dataframe and from that dataframe, informations in columns gene_name and description are extracted for the set of genes that are in the result table. Since the granularity of the information in the annotation file goes down to single transcripts, a gene_id is very likely to apprear several times. For that reason we extract the unique entries in from the annotation file. If there are multiple entries per gene, we paste the information together using "|" as a separator.

Interpretation of results

In the genetable that summarizes the results of a DEXSeq analysis, some genes are grouped together and their gene-names are separated by a "+" sign. An example of this is shown below (see red arrows).

rcoursetools::insertOdgAsPdf(psOdgFileStem = "GeneIdGroup")

According to \url{https://support.bioconductor.org/p/46294/} this means that multiple genes are merged and the reason for this is that the merged genes seam to share exons which are not obvious to be assigned to a single gene.

Further annotations using biomart

The HTML-Report generating function DEXSeqHTML() has additional arguments that let us specify additional annotation detail for the results. The additional annotation is collected using functionality that is provided by the package biomaRt. The following example statement for calling DEXSeqHTML() with information specifying queries to a biomart was copied from \url{https://www.biostars.org/p/95550}:

DEXSeqHTML( ecs, geneIDs=NULL, fitExpToVar="condition", 
            FDR=0.1, color=c("#FF000080","#0000FF80"),
            mart=ensembl_mart,
            filter="ensembl_gene_id",
            attributes=c("external_gene_id","description") )

What is important in the above call is that argument mart is of type "Mart", i.e., the result of the statment class(mart) must be "Mart". The easiest way to create such a "Mart"-object is to use the function useMart() from package biomaRt. This can be done as follows.

ensembl_mart <- biomaRt::useMart(biomart = "ensembl", dataset="mmusculus_gene_ensembl")

The arguments filter and attributes must be consistent with the results of what is returned by functions listFilters() and listAttributes(), respectively. Those arguments determine the parameters of the query being on the filters which operate as selectors on the rows and the attributes which select specific columns.

Specifying the additional biomart related arguments to DEXSeqHTML() leads to additional columns in the genetable. For each attribute that was specified, there is an additional column in the genetable created by DEXSeqHTML().

\pagebreak