In UPSCb/RnaSeqTutorial: RNA-Seq Tutorial

    options(digits=2)
    library(RnaSeqTutorial)

    startQuest("Annotation packages")

    h1("Genes and Genomes")

In the previous chapter, we have seen how aligned reads can be manipulated and what are some of the crucial caveats to be aware of; e.g. stranded vs. non-stranded data, multi-mapping reads, etc.. In the present chapter, we will look at how genic annotation are retrieved and processed. Here again we will introduce how this is commonly done and what caveats this implies.

    h2("Manipulating annotations")

Commonly, gene annotation are made available through flat-files encoded in the ggf3 (or its derivatives) format. gff3 stands for Generic Feature Format version 3. This is probably the most commonly used format, in particular when it comes to recently sequenced genomes. Another common format is the gtf (Gene Transfer Format), an alternative to the gff2 format. This format is arguably popular from its exclusive use by the EnsEMBL consortium.

In the Day03 folder, locate the P. trichocarpa gff3 file. Using the cut, sort and uniq commands, let us extract some key statistics from the file.

Find the number of features (replace ... by the file name your found) :

    cut -f3 share/Day03/... | sort | uniq -c

cut splits every line based on a default separator (space by default). The -f options return the selected "field", i.e. here the content of the third column. sort and uniq are self-explanatory. The -c option is to count the occurences.

quest(1)
endQuest()
startQuest("Examining the GFF3")

Let us find out some more about the gff3 structure, but for that let us use R.

First, we load the necessary libraries

library(IRanges)
library(genomeIntervals)

Next, we read in the GFF3 file

gff <- readGff3(file.path(extdata(),
                          "GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
                quiet=TRUE)

And have a look at the Genome_Intervals object content

gff
nrow(gff[gff$type=="exon",])
nrow(gff[gff$type=="mRNA",])
nrow(gff[gff$type=="gene",])

We get the same information as previously. Let us do now some validation:

    nrow(unique(gff[gff$type=="exon",])) / nrow(gff[gff$type=="gene",])

quest(2)
endQuest()
startQuest("Duplicated exons")

We will now use the duplicated function to identify the duplicated exons.

sel <- duplicated(gff[gff$type=="exon",])
firstPosition <- which(sel)[1]
firstDuplicate <- gff[gff$type=="exon",][firstPosition,]

Next, we lookup all exons located at the same position

sel <- gff$type=="exon"
gff[sel,][seqnames(gff[sel,]) == seqnames(firstDuplicate) &
              gff[sel,1] == firstDuplicate[,1] &
              gff[sel,2] == firstDuplicate[,2],]

quest(3)
endQuest()
startQuest("Validating with genomeTools")

GenomeTools [@Gremme2013] offers a nice utility to validate a gff3 file. Let's move back to the terminal and check the validaity of the gff3 file. The tool will only report errors so if it deems everything is fine, expect no output. The gff3 validation tool has many parameters, which all set on in the following command to make the test as stringent as possible.

    cd && mkdir gff3 && cd gff3
    gt gff3 -force -tidy yes -addids yes -fixregionboundaries yes \
    -retainids yes -sort yes -checkids yes \
    -o Ptrichocarpa_v3.0_210_gene_exons-validated.gff3 \
    ~/share/Day03/data/reference/gff/Ptrichocarpa_v3.0_210_gene_exons.gff3 \
    2>&1  | grep -v "##sequence-region"

Was that what we expected? It would appear that despite checking IDs -checkids yes, identical genomic features - e.g. duplicated exons - pass the validation.

quest(4)
endQuest()
startQuest("Creating a synthetic set of transcripts")

The duplicated exons we observe have different IDs, which is how they circumvent the previous gff3 validation, i.e. by breaking the gff3 format specifications.

The issue with identical exons identified by different IDs is that these will bias the gene expression summarisation. Consider a single mRNA molecule giving raise to a single read; if it overlaps with a duplicated exon, that single molecule would be counted twice! Or alternatively discarded, as the read could not be unambiguously assigned to a single feature!

To prevent that from happening, as long as we are not planning on measuring transcript, but gene expression, we need to collapse the annotation; i.e. we need to create a synthetic transcript that regroups all the gene's exons.

This process relies on identifying all the exons of a gene and merging any overlapping exons into super-exons. We can do that using the R Bioconductor easyRNASeq [@easyRNASeq] createSyntheticTranscripts function. So back to RStudio:

synthTrx <- createSyntheticTranscripts(
    file.path(extdata(),"GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
    verbose=FALSE)

If you are interested in the details of the implementation, they are available at the end of this section.

Next, we save our flattened annotation for later use at the counting stage.

writeGff3(synthTrx,file="~/gff3/Ptrichocarpa_v3.0_210_synthetic_transcripts.gff3")

quest(5)
endQuest()
startQuest("Validating the new file")

As we have modified our annotation, a good practice is to validate it again, so let us run GenomeTools again:

    cd gff3
    gt gff3 -force -tidy yes -addids yes -fixregionboundaries yes \
    -retainids yes -sort yes -checkids yes \
    -o Ptrichocarpa_v3.0_210_synthetic_transcripts-validated.gff3 \
    Ptrichocarpa_v3.0_210_synthetic_transcripts.gff3 \
    2>&1  | grep -v "##sequence-region"

Unsurprisingly this works, but we are going to be very suspicious and go back to R to check.

Adapt the previous R commands we used to check the duplicates in the original gff3 file. What do you observe?

quest(6)
endQuest()
startQuest("Summing up synthetic transcripts")

Obviously, annotation are always to be taken with a grain of salt as we have discussed in the lecture and exemplified here. The P. trichocarpa annotation is actually of a decent quality (nonwithstanding its poor application of the gff3 standard!).

IMPORTANT NOTE The present issue with the gff3 format is common. Using the recent years de-facto standard for counting (i.e. htseq-count, but note that this paradigm is changing with the advent of pseudo-alignment counting methods such as kallisto or salmon), it can lead to dramatic issues as the example below displays.

library(DESeq2)
setwd("~/share/Day02/data/htseq-comp/")
tab <- data.frame("SampleID"=sub("_subset.txt","",dir("full-gff/")),
                    "File"=dir("full-gff/"),
                    "Sample"=sub("_subset.txt","",dir("full-gff/")))

full <- DESeqDataSetFromHTSeqCount(tab,"full-gff",design = ~Sample)

synth <- DESeqDataSetFromHTSeqCount(tab,"synth-gff",design = ~Sample)

par(mfrow=c(1,2))
barplot(colSums(counts(synth)),ylim=c(1,1e6),las=2,main="Synthetic Transcripts gff3")
barplot(colSums(counts(full)),ylim=c(1,1e6),las=2,main="Original Transcripts gff3")

The reason counting is biased is that htseq-count only parses the exon lines, and hence only retrieves the Parent of the exon, i.e. the mRNA and not the gene feature.

NOTE Using a gtf file instead avoids that issue as exons are annotated with both the transcript_id and gene_id. It does not prevent the issue caused by overlapping genes though, so caution is still warranted.

quest(7)
endQuest()
startQuest("Looking out")

Before looking into how to retrieve annotations, one last question, looking forward

quest(8)
endQuest()
startQuest("Retrieving annotations")

    h2("Retrieving annotations")

Bioconductor provides extensive annotation resources, summarized in the following figure. These can be gene-, or genome-centric. Annotations can be provided in packages curated by Bioconductor, or obtained from web-based resources.
Gene-centric AnnotationDbi packages include:

Organism level: e.g. org.Mm.eg.db, Homo.sapiens.
Platform level: e.g. hgu133plus2.db, hgu133plus2.probes, hgu133plus2.cdf.
Homology level: e.g. hom.Dm.inp.db.
System biology level: GO.db, KEGG.db, Reactome.db.

Examples of genome-centric packages include:

GenomicFeatures, to represent genomic features, including constructing reproducible feature or transcript databases from local file or web resources.
Pre-built transcriptome packages, e.g. TxDb.Hsapiens.UCSC.hg19.knownGene based on the UCSC hg19 knownGenes track.
BSgenome for whole genome sequence representation and manipulation.
Pre-built genomes, e.g., BSgenome.Hsapiens.UCSC.hg19 based on the UCSC hg19 build.

Web-based resources include

biomaRt to query biomart resource for genes, sequence, SNPs, and etc.
rtracklayer for interfacing with browser tracks, especially the UCSC genome browser.
genome projects often provide gff3 (General Feature Format) or gtf (Gene Transfer Format) formatted files as annotation sources, which can be manipulated with the rtracklayer or genomeIntervals packages.

![Annotation Packages: the big picture][dbtypes]

Here, we will focus on the Genome-centric type of approaches using the GenomicFeatures package.

h2("Genome-centric annotations with `GenomicFeatures`")

Genome-centric packages are very useful for annotations involving genomic coordinates. It is straight-forward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.

Load the ‘transcript.db’ package relevant to the dm3 build of D. melanogaster.
Use select and friends to select the Flybase gene ids of the top table and the Flybase transcript names (TXNAME) and Entrez gene identifiers (GENEID).
Use cdsBy to extract all coding sequences, grouped by transcript.
Subset the coding sequences to contain just the transcripts relevant to the top table.
How many transcripts are there?
What is the structure of the first transcript’s coding sequence?
Load the ‘BSgenome’ package for the dm3 build of D. melanogaster.
Use the coding sequences ranges of the previous part of this exercise to extract the underlying DNA sequence, using the extractTranscriptSeqs function.
Use Biostrings’s translate function to convert DNA to amino acid sequences.

The following loads the relevant Transcript.db package, and creates a more convenient alias to the TranscriptDb instance defined in the package.

    library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
    txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

We can discover available keys (using keys) and columns (columns) in txdb, and then use select to retrieve the transcripts associated with each differentially expressed gene. The mapping between gene and transcript is not one-to-one – some genes have more than one transcripts, which is biologically relevant, and some may have none, which might be due to different database versions being used or database cross-references (xref) inconsistancies.

    set.seed(123)
    fbids <- sample(keys(txdb),10,FALSE)
    txnm <- select(txdb, fbids, "TXNAME", "GENEID")
    nrow(txnm)
    head(txnm, 3)

The TranscriptDb instances can be queried for data that is more structured than simple data frames, and in particular return GRanges or GRangesList instances to represent genomic coordinates. These queries are performed using cdsBy (coding sequence), transcriptsBy (transcripts), , where the function argument by specifies how coding sequences or transcripts are grouped. Here we extract the coding sequences grouped by transcript, returning the transcript names, and subset the resulting GRangesList to contain just the transcripts of interest to us. The sixth transcript is composed of 8 distinct coding sequence regions.

    cds <- cdsBy(txdb, "tx", use.names=TRUE)[txnm$TXNAME[6]]
    cds[1]

The following code loads the appropriate BSgenome package; the object refers to the whole genome sequence represented in this package. The remaining steps extract the DNA sequence of each transcript, and translates these to amino acid sequences. Issues of strand are handled correctly.

    library(BSgenome.Dmelanogaster.UCSC.dm3)
    txx <- extractTranscriptSeqs(Dmelanogaster, cds)
    length(txx)
    head(txx, 3)
    head(translate(txx), 3)

quest(9)
endQuest()
startQuest("Putting a GFF3 to use")

Pseudo-alignment tools, such as e.g. kallisto use transcripts' sequence to build their index and perform the pseudo-alignment and quantification. Here we will use the gffread tool to extract a set of transcripts sequences from a genome fasta file and a gff3 annotation file.

As we want to use a tool that can quantify transcripts expression (not summarising it by gene !!!), we will use the original gff3 file.

    cd && mkdir fasta && cd fasta
    gffread -w Ptrichocarpa_v3.0_210_transcripts.fa \
    ~/share/Day03/data/reference/gff/Ptrichocarpa_v3.0_210_gene_exons.gff3 \
    -g ~/share/Day03/data/reference/fasta/Ptrichocarpa_v3.0_210.fa

This file now contains all the transcripts sequences extracted from the genome using the gff annotation. This file will be the input for the generation of the e.g. kallisto index.

h2("Appendix")

One major caveat estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, can be aligned to several features ( transcripts or genes) if those alignments are of equivalent quality. This happens as a result of gene duplication and the presence of repetitive or common domains, for example. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a “synthetic” transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript-gene structure with a one to one relationship. As this procedure varies from organism to organism, there is, to the best of our knowledge, no tool available for performing this step.

Here, we will inspire ourselves (ha-ha) from the documentation of the R/Bioconductor easyRNASeq package [@easyRNASeq]; paragraph 7.1 of the package vignette.

First, we load the necessary libraries

library(IRanges)
library(genomeIntervals)

Next, we read in the GFF3 file

gff <- readGff3(file.path(extdata(),
                          "GFF3/Ptrichocarpa_v3.0_210_gene_exons.gff3.gz"),
                quiet=TRUE)

And have a look at the Genome_Intervals object content

gff
nrow(gff[gff$type=="exon",])
nrow(gff[gff$type=="mRNA",])
nrow(gff[gff$type=="gene",])

First, we identify the ID and Parents of all mRNA features and create a mapping table from it.

sel <- gff$type == "mRNA"
transcriptGeneMapping <- data.frame(getGffAttribute(gff[sel], "ID"), 
                                    getGffAttribute(gff[sel], "Parent")
)
head(transcriptGeneMapping)

Next, we select the exon features and sort them in "groups" by their Parent.

sel <- gff$type=="exon"
rngList<- split(IRanges(start=gff[sel,1],end=gff[sel,2]),
                transcriptGeneMapping[match(
                  sapply(strsplit(getGffAttribute(gff[sel,],"Parent"),","),"[",1),
                  transcriptGeneMapping$ID),"Parent"])
rngList
mostExons <- rev(names(table(elementNROWS(rngList))))[1]
mostExons

Then, we reduce the exon structure

rngList<- IRanges::reduce(rngList)
rngList
rev(names(table(elementNROWS(rngList))))[1]

Once we have reduced the transcript complexity, we can reconstruct a GFF structure from it.

First, we get the gff information; here we simply duplicate the first exon of every gene by the number of synthetic exons per gene. The content will be updated afterwards.

exons <- gff[sel,]
synthTrx<- exons[rep(
  match(names(rngList),
        transcriptGeneMapping[
          match(sapply(strsplit(getGffAttribute(exons,"Parent"),","),"[",1),
                transcriptGeneMapping$ID),"Parent"]),
  elementNROWS(rngList)),]

Then, we update the coordinates, and change the source. This latter step is to document our changes; i.e. to make it obvious to anyone using that the newly generated annotation are not the original ones.

synthTrx[,1]<- unlist(start(rngList))
synthTrx[,2]<- unlist(end(rngList))
levels(synthTrx$source)<- "inhouse"

Next, we get the exon number for both strands. As the exons are correctly ordered on the minus strands by our first command - i.e. in decreasing order, we have to revert those on the plus strand.

exonNumber<- lapply(elementNROWS(rngList),":",1)
sel<- strand(synthTrx)[cumsum(elementNROWS(rngList))] == "+"
exonNumber[sel]<- sapply(exonNumber[sel],rev)

This is followed by the attributes (the ninth column) update.

synthTrx$gffAttributes<- paste("ID=",
                                         rep(names(rngList),elementNROWS(rngList)),
                                         ":",unlist(exonNumber),";Parent=",
                                         rep(names(rngList),elementNROWS(rngList)),".0",sep="")

This gives us a similar results to that of the easyRNASeq createSyntheticTranscripts function with the exception that we only generated feature of type exon