In UPSCb/RnaSeqTutorial: RNA-Seq Tutorial

    startQuest("Counting ")

This last chapter builds on what we have done in the five previous chapters and extend on the use of the R/Bioconductor packages. These give an extended flexibility of usage over tools such as HTSeq. Very few tools are addressing the caveats we mentioned in the previous chapters, and this is still a field under very active development. The aim of this chapter is to give you some extra insight, i.e. pointers; in case you would in your research be affected by one of the mentioned (or not) caveats.

    h1("Estimating Expression over Genes and Exons")

    library(RnaSeqTutorial)
    bamfiles <- dir(file.path(extdata(),"BAM"),
                   pattern="*.bam$",full.names=TRUE)

This chapter[^6] describes part of an RNA-Seq analysis use-case. RNA-Seq [@Mortazavi:2008p740] was introduced as a new method to perform Gene Expression Analysis, using the advantages of the high throughput of Next-Generation Sequencing (NGS) machines.

    h2("Counting reads over known genes and exons")

The goal of this use-case is to generate a count table for the selected genic features of interest, i.e. exons, transcripts, gene models, etc.

To achieve this, we need to take advantage of all the steps performed previously in the workshop.

the alignments information has to be retrieved
the corresponding annotation need to be fetched and possibly adapted e.g.: as was done in the preceding lecture.
the read coverage per genic feature of interest determined

Can you associate at least a Bioconductor package to every of these tasks?

There are numerous choices, as an example in the following we will go for the following set of packages:

Rsamtools
genomeIntervals
GenomicAlignments

    h3("The alignments")

In this section we will import the data using the GenomicAlignments package readGAlignments function. This will create a GAlignments object that contains only the reads that aligned to the genome.

Using what was introduced in the alignments section, read in the first bam file from the bigdata() bam folder. Remember that the protocol used was not strand-specific.

First we scan the bam directory:

    bamfiles <- dir(file.path(extdata(), "BAM"), ".bam$", full=TRUE)
    names(bamfiles) <- sub("_.*", "", basename(bamfiles))

Then we read the first file:

    aln <- readGAlignments(bamfiles[1])
    strand(aln) <- "*"

As we have seen, many of these reads actually align to multiple locations. In a first basic analysis - i.e.: to get a feel for the data - such reads could be ignored.

Filter the multiple alignment reads. Think of the "NH" tag.

    param <- ScanBamParam(tag="NH")
    nhs <- scanBam(bamfiles[[1]], param=param)[[1]]$tag$NH
    aln <- aln[nhs==1,]

Now that we have the alignments ( object) and the synthetic transcript annotation ( object) - the one from the lecture; the same used in the Interlude ; we can quantify gene expression by counting reads over all exons of a gene and summing them together. One thing to keep in mind is that special care must be taken in dealing with reads that overlap more than one feature (e.g. overlapping genes, isoforms), and thus might erroneously be counted several times in different features. To deal with this we can use any of the approaches summarised in Figure [1][count-modes]:

![Overlap modes: Image from the HTSeq package developed by Simon Anders.][count-modes]

The GenomicAlignments package summarizeOverlaps function offer different possibilities to summarize reads per features:

    load("~/Ptrichocarpa_v3.0_210_synthetic_transcripts.rda")
    annot <- syntheticTrx[syntheticTrx$type=="mRNA",]
    counts1 <- summarizeOverlaps(annot, aln, mode="Union")
    counts2 <- summarizeOverlaps(annot, aln, mode="IntersectionStrict")
    counts3 <- summarizeOverlaps(annot, aln, mode="IntersectionNotEmpty")

Create a data.frame or a matrix of the results above and figure out if any differences can be observed. E.g check for difference in the row standard deviation (using the apply and sd functions).

    synthTrxCountsTable <- data.frame( 
                                 assays(counts1)$counts,
                                 assays(counts2)$counts,
                                 assays(counts3)$counts)
    colnames(synthTrxCountsTable) <- c("union","intStrict","intNotEmpty")
    rownames(synthTrxCountsTable) <- rownames(counts1)
    sds <- apply(synthTrxCountsTable,1,sd)
    sum(sds!=0)
    sum(sds!=0)/length(sds)

    synthTrxCountsTable[which.max(sds),]
    syntheticTrx[which.max(sds),]

So it appears that we have 3,176 cases where the counts differ; 8% of the total!!), and that the synthetic transcript "Potri.001G244700.0" shows the largest difference.

    quest(14)
endQuest()
startQuest("Continued feature summarisation")

For a detailed analysis, it would be important to adequatly choose one of the intersection modes above, however for the remainder of this section, we will use the "union" set. As before for reads aligning to multiple places in the genome, choosing to take the union when reads overlap several features is a simplification we may not want to do. There are several methods that probabilistically estimate the expression of overlapping features: RSEM [@Li:2010p4291], cufflinks [@Trapnell:2010p4759], MMSeq [@Turro:2011p4762].

This concludes that section on counting reads per known features. In the next section, we will look at how novel transcribed regions could be identified.

    h2("Discovering novel transcribed regions")

Note: This section is not essential to the workshop

One main advantage of RNA-seq experiments over microarrays is that they can be used to identify any transcribed molecule, including unknown transcripts and isoforms, as well as other regulatory transcribed elements. To identify such new elements, several methods are available to recreate and annotate transcripts, e.g. Cufflinks [@Trapnell:2010p4759], Oases [@Schulz:2012p4765], Trinity [@Grabherr:2011p4766], to mention some of them. We can use Bioconductor tools as well, to identify loci and quantify counts without prior annotation knowledge. The example here is very crude and is really just a proof of concept of what one could do in a few commands i.e.: R rules.

But as a start and to make the results more precise, the reads have been realigned using STAR [@Dobin:2013p5293], a very fast and accurate aligner that use the recent approach of Maximum Exact Matches (MEMs), see this website for more details. This MEM approach - actually a derivative called MMP (Maximum Mappable Prefix) - allow STAR to identify exon-exon junctions without prior knowledge e.g.: no need for an annotation gff. To start, we re-read one of the sample alignments using the GenomicAlignments package readGAlignments function.

    aln <- readGAlignments(
     BamFile(file.path(extdata(),"BAM","207_subset_sortmerna_trimmomatic_sorted.bam")))

    h3("Defining transcribed regions")

The process begins with calculating the coverage, using the method from the GenomicAlignments package:

    cover <- coverage(aln)

    cover
    # this object is compressed to save space. It is an RLE (Running Length Encoding)
    # we can look at a section of chromosome 4 say between bp 1 and 1000
    # which gives us the number of read overlapping each of those bases
    as.vector(cover[["Chr19"]])[6553903:6561936]

The coverage shows us how many reads overlap every single base in the genome. It is actually split per chromosomes.

The next step is to define, "islands" of expression. These can be created using the slice function. The peak height for the islands can be determined with the viewMaxs function and the island widths can be found using the width function:

    islands <- slice(cover, 1)
    islandPeakHeight <- viewMaxs(islands)
    islandWidth <- width(islands)

While some more sophisticated approaches can be used to find exons de novo, we can use a simple approach whereby we select islands whose maximum peak height is 2 or more and whose width is 150bp (150%) of the read size) or more to be candidate exons. The elementLengths function shows how many of these candidate exons appear on each chromosome:

    candidateExons <- islands[islandPeakHeight >= 2L & islandWidth >=150L]
    candidateExons[["Chr19"]]

Remember that we used an aligner which is capable of mapping reads across splice junctions in the genome.

    sum(cigarOpTable(cigar(aln))[,"N"] > 0)

There are 473,809 reads that span exon-exon junctions (EEJs).

Let"s look up such a potential EEJ:

    aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",]
    aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",][3:6,]

There are respectively 4 reads spanning what looks like introns of 997 bp. Note that the GenomicAlignments package Galignments class is aware of splicing junctions. Have a look at the coverage for the first intron:

    cover[["Chr19"]][131278:132275]

Now, we can have a look if we can identify a specific motif around the EEJ. We cheery pick - almost at random really - 2 EEJs, with 7 supporting reads each. Then using their cigar string, we calculate the position of the putative acceptor and donor sites.

    splice.reads <- aln[cigarOpTable(cigar(aln))[,"N"] > 0 & seqnames(aln) == "Chr19",]
    cherry.pick.sel <- c(13492:13498,13511:13517)
    read.start <- start(splice.reads)[cherry.pick.sel]
    donor.pos <- read.start - 1 +
    as.numeric(sapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M"),"[",1))

    acceptor.pos <- read.start - 1 +
     sapply(
       lapply(
         lapply(strsplit(cigar(splice.reads)[cherry.pick.sel],"M|N"),"[",1:2),
         as.integer),
       sum)

Then we load the chromosome Chr19 sequence, so that we can determine the sequence in the vicinity of the EEJs.

    Chr19 <- readDNAStringSet(file.path(extdata(),
                                       "FASTA",
                                       "Ptrichocarpa_v3.0_210-Chr19.fa.gz"))

And retrieve the sequences spanning the EEJs. We pad each of them by 10bp on both sides.

    donor <- Views(subject=Chr19[[1]],
                        start=donor.pos-8,
                        end=donor.pos+11)
    acceptor <- Views(subject=Chr19[[1]],
                      start=acceptor.pos-10,
                       end=acceptor.pos+9)

Oups, is not something missing in our reasoning? Yes, indeed, we have not considered the strand until now! A visual evaluation told us that the mRNA is probably on the minus strand whereas the second is on the plus strand; we consquently correct the sequences created above.

    donor <- DNAStringSet(donor)
    minus.acceptor <- reverseComplement(donor[1:7])

    acceptor <- DNAStringSet(acceptor)
    minus.donor <- reverseComplement(acceptor[1:7])

    donor[1:7] <- minus.donor
    acceptor[1:7] <- minus.acceptor

Let"s see if there"s a consensus in the sequences of 20bp centered around the potential acceptor and donor sites.

    library(seqLogo)
    pwm <- makePWM(cbind(
     alphabetByCycle(donor)[c("A","C","G","T"),]/length(donor),
     alphabetByCycle(acceptor)[c("A","C","G","T"),]/length(acceptor))
    )
    seqLogo(pwm)

![GC content in aligned reads][seqlogo]

Clearly the logo in the figure above is not exceptional, but from only 2 EEJs, we can already see that the donor site at position 10-11 is GT and the acceptor site at position 30-31 is AG, i.e.: the canonical sites. Moreover, we can see a relative - or at least I want to see it because I know it must be there - enrichment for Ts in the intron sequence, a known phenomenon. Hence, using a de-novo approach complemented by additional criteria can prove very efficient.

This concludes the section on summarizing counts. As you could realize, juggling with the different package for manipulating the alignment and annotation requires some coding. To facilitate this a number of "workflow" package are available at Bioconductor. The next section gives a brief introduction of easyRNASeq (obviously, a biased selection ")

    h2("Using easyRNASeq")

Let us redo what was done in the previous section using the easyRNASeq package.

load the easyRNASeq package and its companion data package: RnaSeqTutorial.
Look up the help page on the simpleRNASeq function or use show to lookup that function definition and arguments.
The data package contains 4 BAM files located as indicated below. Create a BamFileList object from them.
The data package also contains a gff3 formated file - given below. Create a AnnotParam object from it.
Knowing that the RNA-Sequencing protocol was single-end and unstranded, create the appropriate BamParam object.
Finally, combine the AnnotParam and BamParam into a RnaSeqParam object and call the simpleRNASeq function, summarizing the reads by exons.

The BAM files are:

    dir(file.path(extdata(),
                  pattern="^[A,T].*\\.bam$",
                  full.names=TRUE)

The GFF3 file is:

    system.file(
       "extdata",
       "Dmel-mRNA-exon-r5.52.gff3",
       package="RnaSeqTutorial")

And now we can proceed:

    ## we start by loading the packages
    library("easyRNASeq")
    library("RnaSeqTutorial")

    ## looking up the function definition
    show(simpleRNASeq)

    ## creating the BamFileList
    bamFileList <- getBamFileList(
                    dir(path=system.file("extdata",
                        package="RnaSeqTutorial"),
                        pattern="^[A,T].*\\.bam$",
                        full.names=TRUE))

    ## creating the AnnotParam object
    annotParam <- AnnotParam(datasource=system.file(
                           "extdata",
                           "Dmel-mRNA-exon-r5.52.gff3",
                           package="RnaSeqTutorial"))

    ## creating the BamParam object
    bamParam <- BamParam(paired=FALSE,stranded=FALSE)

    ## creating the RnaSeqParam 
    rnaSeqParam <- RnaSeqParam(annotParam=annotParam,
                              bamParam=bamParam,
                              countBy="exons")

    ## and we then get a SummarizedExperiment containing the counts table
    sexp <- simpleRNASeq(
       bamFiles=bamFileList,
       param=rnaSeqParam,
       verbose=TRUE
       )

    ## and look at the counts
    exon.count <- assays(sexp)$exons

    head(exon.count[rowSums(exon.count)>0,])

And that is it. In a few commands - one really - you have achieved what it takes us all the day to do!

Try to re-run simpleRNASeq with a minimalistic set of argument to check it"s parameter detection ability.

    sexp <- simpleRNASeq(
       bamFiles=bamFileList,
       param=RnaSeqParam(annotParam=annotParam),
       verbose=TRUE
       )

    h3("Details")

The simpleRNASeq function currently accepts the following type of datasource - specified through an AnnotParam object:

"biomaRt" use biomaRt to retrieve the annotation
"env" use a RangedData or GRanges class object present in the environment
"gff" reads in a gff version 3 file
"gtf" reads in a gtf file
"rda" load an RData object. The object needs to be named and of class RangedData or GRanges.

The reads can be summarized by:

exons
features (any features such as introns, enhancers, etc.)
transcripts
genes Ideally these are the result of a process similar to the one from the lecture where a gene is represented by the set of non overlapping loci (i.e. synthetic exons) that represents all the possible exons and UTRs of a gene. Such as I call them "synthetic transcripts" are essential when counting reads as they ensure that no reads will be accounted for several times. E.g., a gene can have different isoforms, using different exons, overlapping exons, in which case summarizing by exons might result in counting a read several times, once per overlapping exon. See the section \(7.1\) of the easyRNASeq vignette for details.

The results are returned in an SummarizedExperiment class object.

Have a closer look at the SummarizedExperiment object .

See the GenomicRange package SummarizedExperiment class for more details on last three accessors used in the following.

    ## the counts
    head(assays(sexp)[[1]])
    ## some non empty counts
    head(assays(sexp)[[1]][rowSums(assays(sexp)[[1]])!=0,])
    ## the sample info
    colData(sexp)
    ## the 'features' info
    rowRanges(sexp)

For more details and a complete overview of the easyRNASeq package capabilities, have a look at the easyRNASeq vignette.

    vignette("easyRNASeq")

    h3("Caveats")

easyRNASeq is still under active development and as such the current version may still behave unexpectedly.

In addition, advanced checks are conducted on the data provided by the user to make decision on the overall process. This may need refinement.
The concerns raised by the analysis reported there https://stat.ethz.ch/pipermail/bioc-devel/2012-September/003608.html by Robinson et al. have been adressed. Both the original easyRNASeq method and the GenomicAlignments approach are provided, the later one being the default.
Integration of recent tools for Differential Expression expression analysis such as DESeq2 and DEXSeq is pending. Planned is an integration of the limma for enabling the voom+limma paradigm. Ideally, easyRNASeq would select the most appropriate analysis to be conducted based on the report by Soneson and Delorenzi[@Soneson:2013p5778].

    h2("Where to from here")

After obtaining the count table, numerous downstream analyses are available. Most often, such count tables are generated in a differential expression experimental setup. In that case, packages such as DESeq, DEXSeq, edgeR, limma (see voom+limma in the limma vignette), are some of the possibilities available in Bioconductor. Have a look at [@dillies:2012p5465] and [@Soneson:2013p5778] to decide which tool/approach is the best suited for your experimental design. But, of course, counts can as well be used for other purposes such as visualization, using e.g.: the rtracklayer and GViz packages. Actually, there"s no real limitation of what one can achieve with a count table and it does not need be an RNA-Seq experiment; look at the DiffBind package for an example of using ChIP-Seq data for differential binding analyses.