startQuest("Counting ")
This last chapter builds on what we have done in the five previous chapters and
extend on the use of the R/Bioconductor packages. These give an extended
flexibility of usage over tools such as HTSeq. Very few tools are addressing
the caveats we mentioned in the previous chapters, and this is still a field
under very active development. The aim of this chapter is to give you some
extra insight, i.e. pointers; in case you would in your research be
affected by one of the mentioned (or not) caveats.
h1("Estimating Expression over Genes and Exons")
library(RnaSeqTutorial) bamfiles <- dir(file.path(extdata(),"BAM"), pattern="*.bam$",full.names=TRUE)
This chapter[^6] describes part of an RNA-Seq analysis use-case. RNA-Seq [@Mortazavi:2008p740] was introduced as a new method to perform Gene Expression Analysis, using the advantages of the high throughput of Next-Generation Sequencing (NGS) machines.
h2("Counting reads over known genes and exons")
The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons, transcripts, gene models, etc.
To achieve this, we need to take advantage of all the steps performed previously in the workshop.
the alignments information has to be retrieved
the corresponding annotation need to be fetched and possibly adapted e.g.: as was done in the preceding lecture.
the read coverage per genic feature of interest determined
Can you associate at least a Bioconductor package to every of these tasks?
There are numerous choices, as an example in the following we will go for the following set of packages:
Rsamtools
genomeIntervals
GenomicAlignments
h3("The alignments")
In this section we
will import the data using the GenomicAlignments package
readGAlignments function. This will create a GAlignments object that
contains only the reads that aligned to the genome.
Using what was introduced in the alignments section, read in the first bam file from the bigdata() bam folder. Remember that the protocol used was not strand-specific.
First we scan the bam directory:
bamfiles <- dir(file.path(extdata(), "BAM"), ".bam$", full=TRUE) names(bamfiles) <- sub("_.*", "", basename(bamfiles))
Then we read the first file:
aln <- readGAlignments(bamfiles[1]) strand(aln) <- "*"
As we have seen, many of these reads actually align to multiple locations. In a first basic analysis - i.e.: to get a feel for the data - such reads could be ignored.
Filter the multiple alignment reads. Think of the "NH" tag.
param <- ScanBamParam(tag="NH") nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH aln <- aln[nhs==1,]
Now that we have the alignments ( object) and the synthetic transcript annotation ( object) - the one from the lecture; the same used in the Interlude ; we can quantify gene expression by counting reads over all exons of a gene and summing them together. One thing to keep in mind is that special care must be taken in dealing with reads that overlap more than one feature (e.g. overlapping genes, isoforms), and thus might erroneously be counted several times in different features. To deal with this we can use any of the approaches summarised in Figure [1][count-modes]:
![Overlap modes: Image from the HTSeq package developed by Simon Anders.][count-modes]
The GenomicAlignments package
summarizeOverlaps function offer different possibilities to
summarize reads per features:
load("~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda") annot <- syntheticTrx[syntheticTrx$type=="mRNA",] counts1 <- summarizeOverlaps(annot, aln, mode="Union") counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict") counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")
Create a data.frame or a matrix of the results above and figure out if
any differences can be observed. E.g check for difference in the row
standard deviation (using the apply and
sd functions).
synthTrxCountsTable <- data.frame( assays(counts1)$counts, assays(counts2)$counts, assays(counts3)$counts) colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty") rownames(synthTrxCountsTable) <- rownames(counts1) sds <- apply(synthTrxCountsTable,1,sd) sum(sds!=0) sum(sds!=0)/length(sds) synthTrxCountsTable[which.max(sds),] syntheticTrx[which.max(sds),]
So it appears that we have 3,176 cases where the counts differ; 8% of the total!!), and that the synthetic transcript "Potri.001G244700.0" shows the largest difference.
quest(14) endQuest() startQuest("Continued feature summarisation")
For a detailed analysis, it would be important to adequatly choose one of the intersection modes above, however for the remainder of this section, we will use the "union" set. As before for reads aligning to multiple places in the genome, choosing to take the union when reads overlap several features is a simplification we may not want to do. There are several methods that probabilistically estimate the expression of overlapping features: RSEM [@Li:2010p4291], cufflinks [@Trapnell:2010p4759], MMSeq [@Turro:2011p4762].
This concludes that section on counting reads per known features. In the next section, we will look at how novel transcribed regions could be identified.
h2("Discovering novel transcribed regions")
Note: This section is not essential to the workshop
One main advantage of RNA-seq experiments over microarrays is that they can be used to identify any transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed elements. To identify such new elements, several methods are available to recreate and annotate transcripts, e.g. Cufflinks [@Trapnell:2010p4759], Oases [@Schulz:2012p4765], Trinity [@Grabherr:2011p4766], to mention some of them. We can use Bioconductor tools as well, to identify loci and quantify counts without prior annotation knowledge. The example here is very crude and is really just a proof of concept of what one could do in a few commands i.e.: R rules.
But as a start and to make the results more precise, the reads have been
realigned using STAR [@Dobin:2013p5293], a very fast and
accurate aligner that use the recent approach of Maximum Exact Matches
(MEMs), see this website for more details.
This MEM approach - actually a derivative called MMP (Maximum Mappable
Prefix) - allow STAR to identify exon-exon junctions without prior
knowledge e.g.: no need for an annotation gff. To start, we re-read
one of the sample alignments using the GenomicAlignments package
readGAlignments function.
aln <- readGAlignments( BamFile(file.path(extdata(),"BAM","207_subset_sortmerna_trimmomatic_sorted.bam")))
h3("Defining transcribed regions")
The process begins with calculating the coverage, using the method from
the GenomicAlignments package:
cover <- coverage(aln)
cover
# this object is compressed to save space. It is an RLE (Running Length Encoding)
# we can look at a section of chromosome 4 say between bp 1 and 1000
# which gives us the number of read overlapping each of those bases
as.vector(cover[["Chr19"]])[6553903:6561936]
The coverage shows us how many reads overlap every single base in the genome. It is actually split per chromosomes.
The next step is to define, "islands" of expression. These can be
created using the slice function. The peak height for
the islands can be determined with the viewMaxs
function and the island widths can be found using the
width function:
islands <- slice(cover, 1) islandPeakHeight <- viewMaxs(islands) islandWidth <- width(islands)
While some more sophisticated approaches can be used to find exons de
novo, we can use a simple approach whereby we select islands whose
maximum peak height is 2 or more and whose width is 150bp (150%)
of the read size) or more to be candidate exons. The
elementLengths function shows how many of these
candidate exons appear on each chromosome:
candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=150L] candidateExons[["Chr19"]]
Remember that we used an aligner which is capable of mapping reads across splice junctions in the genome.
sum(cigarOpTable(cigar(aln))[,"N"] > 0)
There are 473,809 reads that span exon-exon junctions (EEJs).
Let"s look up such a potential EEJ:
aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",] aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",][3:6,]
There are respectively 4 reads spanning what looks like introns of
997 bp. Note that the GenomicAlignments package
Galignments class is aware of splicing junctions. Have
a look at the coverage for the first intron:
cover[["Chr19"]][131278:132275]
Now, we can have a look if we can identify a specific motif around the EEJ. We cheery pick - almost at random really - 2 EEJs, with 7 supporting reads each. Then using their cigar string, we calculate the position of the putative acceptor and donor sites.
splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",] cherry.pick.sel <- c(13492:13498,13511:13517) read.start <- start(splice.reads)[cherry.pick.sel] donor.pos <- read.start - 1 + as.numeric(sapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M"),"[",1)) acceptor.pos <- read.start - 1 + sapply( lapply( lapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M|N"),"[",1:2), as.integer), sum)
Then we load the chromosome Chr19 sequence, so that we can determine the sequence in the vicinity of the EEJs.
Chr19 <- readDNAStringSet(file.path(extdata(), "FASTA", "Ptrichocarpa_v3.0_210-Chr19.fa.gz"))
And retrieve the sequences spanning the EEJs. We pad each of them by 10bp on both sides.
donor <- Views(subject=Chr19[[1]], start=donor.pos-8, end=donor.pos+11) acceptor <- Views(subject=Chr19[[1]], start=acceptor.pos-10, end=acceptor.pos+9)
Oups, is not something missing in our reasoning? Yes, indeed, we have not considered the strand until now! A visual evaluation told us that the mRNA is probably on the minus strand whereas the second is on the plus strand; we consquently correct the sequences created above.
donor <- DNAStringSet(donor) minus.acceptor <- reverseComplement(donor[1:7]) acceptor <- DNAStringSet(acceptor) minus.donor <- reverseComplement(acceptor[1:7]) donor[1:7] <- minus.donor acceptor[1:7] <- minus.acceptor
Let"s see if there"s a consensus in the sequences of 20bp centered around the potential acceptor and donor sites.
library(seqLogo) pwm <- makePWM(cbind( alphabetByCycle(donor)[c("A","C","G","T"),]/length(donor), alphabetByCycle(acceptor)[c("A","C","G","T"),]/length(acceptor)) ) seqLogo(pwm)
![GC content in aligned reads][seqlogo]
Clearly the logo in the figure above is not exceptional, but from only 2 EEJs, we can already see that the donor site at position 10-11 is GT and the acceptor site at position 30-31 is AG, i.e.: the canonical sites. Moreover, we can see a relative - or at least I want to see it because I know it must be there - enrichment for Ts in the intron sequence, a known phenomenon. Hence, using a de-novo approach complemented by additional criteria can prove very efficient.
This concludes the section on summarizing counts. As you could realize,
juggling with the different package for manipulating the alignment and
annotation requires some coding. To facilitate this a number of
"workflow" package are available at Bioconductor. The next section gives
a brief introduction of easyRNASeq (obviously, a
biased selection ")
h2("Using easyRNASeq")
Let us redo what was done in the previous section using the
easyRNASeq package.
load the easyRNASeq package and its companion data
package: RnaSeqTutorial.
Look up the help page on the simpleRNASeq function
or use show to lookup that function definition and
arguments.
The data package contains 4 BAM files located as indicated
below. Create a BamFileList object from them.
The data package also contains a gff3 formated file - given below.
Create a AnnotParam object from it.
Knowing that the RNA-Sequencing protocol was single-end and
unstranded, create the appropriate BamParam
object.
Finally, combine the AnnotParam and
BamParam into a RnaSeqParam
object and call the simpleRNASeq function,
summarizing the reads by exons.
The BAM files are:
dir(file.path(extdata(), pattern="^[A,T].*\\.bam$", full.names=TRUE)
The GFF3 file is:
system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")
And now we can proceed:
## we start by loading the packages library("easyRNASeq") library("RnaSeqTutorial") ## looking up the function definition show(simpleRNASeq) ## creating the BamFileList bamFileList <- getBamFileList( dir(path=system.file("extdata", package="RnaSeqTutorial"), pattern="^[A,T].*\\.bam$", full.names=TRUE)) ## creating the AnnotParam object annotParam <- AnnotParam(datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")) ## creating the BamParam object bamParam <- BamParam(paired=FALSE,stranded=FALSE) ## creating the RnaSeqParam rnaSeqParam <- RnaSeqParam(annotParam=annotParam, bamParam=bamParam, countBy="exons") ## and we then get a SummarizedExperiment containing the counts table sexp <- simpleRNASeq( bamFiles=bamFileList, param=rnaSeqParam, verbose=TRUE ) ## and look at the counts exon.count <- assays(sexp)$exons head(exon.count[rowSums(exon.count)>0,])
And that is it. In a few commands - one really - you have achieved what it takes us all the day to do!
Try to re-run simpleRNASeq with a minimalistic set of
argument to check it"s parameter detection ability.
sexp <- simpleRNASeq( bamFiles=bamFileList, param=RnaSeqParam(annotParam=annotParam), verbose=TRUE )
h3("Details")
The simpleRNASeq function currently accepts the
following type of datasource -
specified through an AnnotParam object:
"biomaRt" use biomaRt to retrieve the annotation
"env" use a RangedData or
GRanges class object present in the environment
"gff" reads in a gff version 3 file
"gtf" reads in a gtf file
"rda" load an RData object. The object needs to be named and of
class RangedData or GRanges.
The reads can be summarized by:
exons
features (any features such as introns, enhancers, etc.)
transcripts
genes Ideally these are the result of a process similar to the one
from the lecture where a gene is represented by the set of non
overlapping loci (i.e. synthetic exons) that represents all the
possible exons and UTRs of a gene. Such as I call them "synthetic
transcripts" are essential when counting reads as they ensure that
no reads will be accounted for several times. E.g., a gene can
have different isoforms, using different exons, overlapping exons,
in which case summarizing by exons might result in counting a read
several times, once per overlapping exon. See the section \(7.1\)
of the easyRNASeq vignette for details.
The results are returned in an SummarizedExperiment
class object.
Have a closer look at the SummarizedExperiment object
.
See the GenomicRange package
SummarizedExperiment class for more details on last
three accessors used in the following.
## the counts head(assays(sexp)[[1]]) ## some non empty counts head(assays(sexp)[[1]][rowSums(assays(sexp)[[1]])!=0,]) ## the sample info colData(sexp) ## the 'features' info rowRanges(sexp)
For more details and a complete overview of the
easyRNASeq package capabilities, have a look at the
easyRNASeq vignette.
vignette("easyRNASeq")
h3("Caveats")
easyRNASeq is still under active development and as
such the current version may still behave unexpectedly.
In addition, advanced checks are conducted on the data provided by the user to make decision on the overall process. This may need refinement.
The concerns raised by the analysis reported there
https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html
by Robinson et al. have been adressed. Both the original
easyRNASeq method and the
GenomicAlignments approach are provided, the later
one being the default.
Integration of recent tools for Differential Expression expression
analysis such as DESeq2 and
DEXSeq is pending. Planned is an integration of
the limma for enabling the
voom+limma paradigm. Ideally,
easyRNASeq would select the most appropriate
analysis to be conducted based on the report by Soneson and
Delorenzi[@Soneson:2013p5778].
h2("Where to from here")
After obtaining the count table, numerous downstream analyses are
available. Most often, such count tables are generated in a differential
expression experimental setup. In that case, packages such as
DESeq, DEXSeq,
edgeR, limma (see voom+limma in the
limma vignette), are some of the possibilities available in
Bioconductor. Have a look at [@dillies:2012p5465] and
[@Soneson:2013p5778] to decide which tool/approach is the
best suited for your experimental design. But, of course, counts can as
well be used for other purposes such as visualization, using e.g.: the
rtracklayer and GViz packages.
Actually, there"s no real limitation of what one can achieve with a
count table and it does not need be an RNA-Seq experiment; look at the
DiffBind package for an example of using ChIP-Seq data
for differential binding analyses.
quest(15) endQuest()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.