startQuest("Counting ")
This last chapter builds on what we have done in the five previous chapters and
extend on the use of the R/Bioconductor packages. These give an extended
flexibility of usage over tools such as HTSeq
. Very few tools are addressing
the caveats we mentioned in the previous chapters, and this is still a field
under very active development. The aim of this chapter is to give you some
extra insight, i.e. pointers; in case you would in your research be
affected by one of the mentioned (or not) caveats.
h1("Estimating Expression over Genes and Exons")
library(RnaSeqTutorial) bamfiles <- dir(file.path(extdata(),"BAM"), pattern="*.bam$",full.names=TRUE)
This chapter[^6] describes part of an RNA-Seq analysis use-case. RNA-Seq [@Mortazavi:2008p740] was introduced as a new method to perform Gene Expression Analysis, using the advantages of the high throughput of Next-Generation Sequencing (NGS) machines.
h2("Counting reads over known genes and exons")
The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons, transcripts, gene models, etc.
To achieve this, we need to take advantage of all the steps performed previously in the workshop.
the alignments information has to be retrieved
the corresponding annotation need to be fetched and possibly adapted e.g.: as was done in the preceding lecture.
the read coverage per genic feature of interest determined
Can you associate at least a Bioconductor package to every of these tasks?
There are numerous choices, as an example in the following we will go for the following set of packages:
Rsamtools
genomeIntervals
GenomicAlignments
h3("The alignments")
In this section we
will import the data using the GenomicAlignments
package
readGAlignments
function. This will create a GAlignments
object that
contains only the reads that aligned to the genome.
Using what was introduced in the alignments section, read in the first bam file from the bigdata() bam folder. Remember that the protocol used was not strand-specific.
First we scan the bam directory:
bamfiles <- dir(file.path(extdata(), "BAM"), ".bam$", full=TRUE) names(bamfiles) <- sub("_.*", "", basename(bamfiles))
Then we read the first file:
aln <- readGAlignments(bamfiles[1]) strand(aln) <- "*"
As we have seen, many of these reads actually align to multiple locations. In a first basic analysis - i.e.: to get a feel for the data - such reads could be ignored.
Filter the multiple alignment reads. Think of the "NH" tag.
param <- ScanBamParam(tag="NH") nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH aln <- aln[nhs==1,]
Now that we have the alignments ( object) and the synthetic transcript annotation ( object) - the one from the lecture; the same used in the Interlude ; we can quantify gene expression by counting reads over all exons of a gene and summing them together. One thing to keep in mind is that special care must be taken in dealing with reads that overlap more than one feature (e.g. overlapping genes, isoforms), and thus might erroneously be counted several times in different features. To deal with this we can use any of the approaches summarised in Figure [1][count-modes]:
![Overlap modes: Image from the HTSeq package developed by Simon Anders.][count-modes]
The GenomicAlignments
package
summarizeOverlaps
function offer different possibilities to
summarize reads per features:
load("~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda") annot <- syntheticTrx[syntheticTrx$type=="mRNA",] counts1 <- summarizeOverlaps(annot, aln, mode="Union") counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict") counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")
Create a data.frame or a matrix of the results above and figure out if
any differences can be observed. E.g check for difference in the row
standard deviation (using the apply
and
sd
functions).
synthTrxCountsTable <- data.frame( assays(counts1)$counts, assays(counts2)$counts, assays(counts3)$counts) colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty") rownames(synthTrxCountsTable) <- rownames(counts1) sds <- apply(synthTrxCountsTable,1,sd) sum(sds!=0) sum(sds!=0)/length(sds) synthTrxCountsTable[which.max(sds),] syntheticTrx[which.max(sds),]
So it appears that we have 3,176 cases where the counts differ; 8% of the total!!), and that the synthetic transcript "Potri.001G244700.0" shows the largest difference.
quest(14) endQuest() startQuest("Continued feature summarisation")
For a detailed analysis, it would be important to adequatly choose one of the intersection modes above, however for the remainder of this section, we will use the "union" set. As before for reads aligning to multiple places in the genome, choosing to take the union when reads overlap several features is a simplification we may not want to do. There are several methods that probabilistically estimate the expression of overlapping features: RSEM [@Li:2010p4291], cufflinks [@Trapnell:2010p4759], MMSeq [@Turro:2011p4762].
This concludes that section on counting reads per known features. In the next section, we will look at how novel transcribed regions could be identified.
h2("Discovering novel transcribed regions")
Note: This section is not essential to the workshop
One main advantage of RNA-seq experiments over microarrays is that they can be used to identify any transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed elements. To identify such new elements, several methods are available to recreate and annotate transcripts, e.g. Cufflinks [@Trapnell:2010p4759], Oases [@Schulz:2012p4765], Trinity [@Grabherr:2011p4766], to mention some of them. We can use Bioconductor tools as well, to identify loci and quantify counts without prior annotation knowledge. The example here is very crude and is really just a proof of concept of what one could do in a few commands i.e.: R rules.
But as a start and to make the results more precise, the reads have been
realigned using STAR [@Dobin:2013p5293], a very fast and
accurate aligner that use the recent approach of Maximum Exact Matches
(MEMs), see this website for more details.
This MEM approach - actually a derivative called MMP (Maximum Mappable
Prefix) - allow STAR to identify exon-exon junctions without prior
knowledge e.g.: no need for an annotation gff. To start, we re-read
one of the sample alignments using the GenomicAlignments
package
readGAlignments
function.
aln <- readGAlignments( BamFile(file.path(extdata(),"BAM","207_subset_sortmerna_trimmomatic_sorted.bam")))
h3("Defining transcribed regions")
The process begins with calculating the coverage, using the method from
the GenomicAlignments
package:
cover <- coverage(aln)
cover # this object is compressed to save space. It is an RLE (Running Length Encoding) # we can look at a section of chromosome 4 say between bp 1 and 1000 # which gives us the number of read overlapping each of those bases as.vector(cover[["Chr19"]])[6553903:6561936]
The coverage shows us how many reads overlap every single base in the genome. It is actually split per chromosomes.
The next step is to define, "islands" of expression. These can be
created using the slice
function. The peak height for
the islands can be determined with the viewMaxs
function and the island widths can be found using the
width
function:
islands <- slice(cover, 1) islandPeakHeight <- viewMaxs(islands) islandWidth <- width(islands)
While some more sophisticated approaches can be used to find exons de
novo, we can use a simple approach whereby we select islands whose
maximum peak height is 2 or more and whose width is 150bp (150%)
of the read size) or more to be candidate exons. The
elementLengths
function shows how many of these
candidate exons appear on each chromosome:
candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=150L] candidateExons[["Chr19"]]
Remember that we used an aligner which is capable of mapping reads across splice junctions in the genome.
sum(cigarOpTable(cigar(aln))[,"N"] > 0)
There are 473,809 reads that span exon-exon junctions (EEJs).
Let"s look up such a potential EEJ:
aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",] aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",][3:6,]
There are respectively 4 reads spanning what looks like introns of
997 bp. Note that the GenomicAlignments
package
Galignments
class is aware of splicing junctions. Have
a look at the coverage for the first intron:
cover[["Chr19"]][131278:132275]
Now, we can have a look if we can identify a specific motif around the EEJ. We cheery pick - almost at random really - 2 EEJs, with 7 supporting reads each. Then using their cigar string, we calculate the position of the putative acceptor and donor sites.
splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",] cherry.pick.sel <- c(13492:13498,13511:13517) read.start <- start(splice.reads)[cherry.pick.sel] donor.pos <- read.start - 1 + as.numeric(sapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M"),"[",1)) acceptor.pos <- read.start - 1 + sapply( lapply( lapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M|N"),"[",1:2), as.integer), sum)
Then we load the chromosome Chr19 sequence, so that we can determine the sequence in the vicinity of the EEJs.
Chr19 <- readDNAStringSet(file.path(extdata(), "FASTA", "Ptrichocarpa_v3.0_210-Chr19.fa.gz"))
And retrieve the sequences spanning the EEJs. We pad each of them by 10bp on both sides.
donor <- Views(subject=Chr19[[1]], start=donor.pos-8, end=donor.pos+11) acceptor <- Views(subject=Chr19[[1]], start=acceptor.pos-10, end=acceptor.pos+9)
Oups, is not something missing in our reasoning? Yes, indeed, we have not considered the strand until now! A visual evaluation told us that the mRNA is probably on the minus strand whereas the second is on the plus strand; we consquently correct the sequences created above.
donor <- DNAStringSet(donor) minus.acceptor <- reverseComplement(donor[1:7]) acceptor <- DNAStringSet(acceptor) minus.donor <- reverseComplement(acceptor[1:7]) donor[1:7] <- minus.donor acceptor[1:7] <- minus.acceptor
Let"s see if there"s a consensus in the sequences of 20bp centered around the potential acceptor and donor sites.
library(seqLogo) pwm <- makePWM(cbind( alphabetByCycle(donor)[c("A","C","G","T"),]/length(donor), alphabetByCycle(acceptor)[c("A","C","G","T"),]/length(acceptor)) ) seqLogo(pwm)
![GC content in aligned reads][seqlogo]
Clearly the logo in the figure above is not exceptional, but from only 2 EEJs, we can already see that the donor site at position 10-11 is GT and the acceptor site at position 30-31 is AG, i.e.: the canonical sites. Moreover, we can see a relative - or at least I want to see it because I know it must be there - enrichment for Ts in the intron sequence, a known phenomenon. Hence, using a de-novo approach complemented by additional criteria can prove very efficient.
This concludes the section on summarizing counts. As you could realize,
juggling with the different package for manipulating the alignment and
annotation requires some coding. To facilitate this a number of
"workflow" package are available at Bioconductor. The next section gives
a brief introduction of easyRNASeq
(obviously, a
biased selection ")
h2("Using easyRNASeq")
Let us redo what was done in the previous section using the
easyRNASeq
package.
load the easyRNASeq
package and its companion data
package: RnaSeqTutorial
.
Look up the help page on the simpleRNASeq
function
or use show
to lookup that function definition and
arguments.
The data package contains 4 BAM files located as indicated
below. Create a BamFileList object from them.
The data package also contains a gff3 formated file - given below.
Create a AnnotParam
object from it.
Knowing that the RNA-Sequencing protocol was single-end and
unstranded, create the appropriate BamParam
object.
Finally, combine the AnnotParam
and
BamParam
into a RnaSeqParam
object and call the simpleRNASeq
function,
summarizing the reads by exons.
The BAM files are:
dir(file.path(extdata(), pattern="^[A,T].*\\.bam$", full.names=TRUE)
The GFF3 file is:
system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")
And now we can proceed:
## we start by loading the packages library("easyRNASeq") library("RnaSeqTutorial") ## looking up the function definition show(simpleRNASeq) ## creating the BamFileList bamFileList <- getBamFileList( dir(path=system.file("extdata", package="RnaSeqTutorial"), pattern="^[A,T].*\\.bam$", full.names=TRUE)) ## creating the AnnotParam object annotParam <- AnnotParam(datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")) ## creating the BamParam object bamParam <- BamParam(paired=FALSE,stranded=FALSE) ## creating the RnaSeqParam rnaSeqParam <- RnaSeqParam(annotParam=annotParam, bamParam=bamParam, countBy="exons") ## and we then get a SummarizedExperiment containing the counts table sexp <- simpleRNASeq( bamFiles=bamFileList, param=rnaSeqParam, verbose=TRUE ) ## and look at the counts exon.count <- assays(sexp)$exons head(exon.count[rowSums(exon.count)>0,])
And that is it. In a few commands - one really - you have achieved what it takes us all the day to do!
Try to re-run simpleRNASeq
with a minimalistic set of
argument to check it"s parameter detection ability.
sexp <- simpleRNASeq( bamFiles=bamFileList, param=RnaSeqParam(annotParam=annotParam), verbose=TRUE )
h3("Details")
The simpleRNASeq
function currently accepts the
following type
of datasource
-
specified through an AnnotParam
object:
"biomaRt" use biomaRt to retrieve the annotation
"env" use a RangedData
or
GRanges
class object present in the environment
"gff" reads in a gff version 3 file
"gtf" reads in a gtf file
"rda" load an RData object. The object needs to be named and of
class RangedData
or GRanges
.
The reads can be summarized by:
exons
features (any features such as introns, enhancers, etc.)
transcripts
genes Ideally these are the result of a process similar to the one
from the lecture where a gene is represented by the set of non
overlapping loci (i.e. synthetic exons) that represents all the
possible exons and UTRs of a gene. Such as I call them "synthetic
transcripts" are essential when counting reads as they ensure that
no reads will be accounted for several times. E.g., a gene can
have different isoforms, using different exons, overlapping exons,
in which case summarizing by exons might result in counting a read
several times, once per overlapping exon. See the section \(7.1\)
of the easyRNASeq
vignette for details.
The results are returned in an SummarizedExperiment
class object.
Have a closer look at the SummarizedExperiment
object
.
See the GenomicRange
package
SummarizedExperiment
class for more details on last
three accessors used in the following.
## the counts head(assays(sexp)[[1]]) ## some non empty counts head(assays(sexp)[[1]][rowSums(assays(sexp)[[1]])!=0,]) ## the sample info colData(sexp) ## the 'features' info rowRanges(sexp)
For more details and a complete overview of the
easyRNASeq
package capabilities, have a look at the
easyRNASeq
vignette.
vignette("easyRNASeq")
h3("Caveats")
easyRNASeq
is still under active development and as
such the current version may still behave unexpectedly.
In addition, advanced checks are conducted on the data provided by the user to make decision on the overall process. This may need refinement.
The concerns raised by the analysis reported there
https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html
by Robinson et al. have been adressed. Both the original
easyRNASeq
method and the
GenomicAlignments
approach are provided, the later
one being the default.
Integration of recent tools for Differential Expression expression
analysis such as DESeq2
and
DEXSeq
is pending. Planned is an integration of
the limma
for enabling the
voom+limma
paradigm. Ideally,
easyRNASeq
would select the most appropriate
analysis to be conducted based on the report by Soneson and
Delorenzi[@Soneson:2013p5778].
h2("Where to from here")
After obtaining the count table, numerous downstream analyses are
available. Most often, such count tables are generated in a differential
expression experimental setup. In that case, packages such as
DESeq
, DEXSeq
,
edgeR
, limma
(see voom+limma in the
limma vignette), are some of the possibilities available in
Bioconductor. Have a look at [@dillies:2012p5465] and
[@Soneson:2013p5778] to decide which tool/approach is the
best suited for your experimental design. But, of course, counts can as
well be used for other purposes such as visualization, using e.g.: the
rtracklayer
and GViz
packages.
Actually, there"s no real limitation of what one can achieve with a
count table and it does not need be an RNA-Seq experiment; look at the
DiffBind
package for an example of using ChIP-Seq data
for differential binding analyses.
quest(15) endQuest()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.