startQuest("Annotation packages")
    h1("Annotation of Genes and Genomes")
    options(digits=2)
    library(RnaSeqTutorial)
    library(org.Dm.eg.db)

Here, we will deviate a little from our and genomes, a recent and a not yet model genome for trees, respectively, and look into other functionalities than the GenomicFeatures package makeTranscriptDbFromGFF function to retrieve annotation for genes and genomes.

h2("Annotation")

Bioconductor provides extensive annotation resources, summarized in the following figure. These can be gene-, or genome-centric. Annotations can be provided in packages curated by , or obtained from web-based resources.
Gene-centric AnnotationDbi packages include:

Examples of genome-centric packages include:

Web-based resources include

![Annotation Packages: the big picture][dbtypes]

h3("Gene-centric annotations with `AnnotationDbi`")

Organism-level (‘org’) packages contain mappings between a central identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form org.Sp.id.db (e.g. org.Sc.sgd.db) where Sp is a 2-letter abbreviation of the organism (e.g. Sc for Saccharomyces cerevisiae) and id is an abbreviation (in lower-case) describing the type of central identifier (e.g. sgd for gene identifiers assigned by the Saccharomyces Genome Database, or eg for Entrez gene ids). The “How to use the ‘.db’ annotation packages” vignette in the AnnotationDbi package (org packages are only one type of “.db” annotation packages) is a key reference. The ‘.db’ and most other Bioconductor annotation packages are updated every 6 months.

Annotation packages contain an object named after the package itself. These objects are collectively called AnnotationDb objects, with more specific classes named OrgDb, ChipDb or TranscriptDb objects. Methods that can be applied to these objects include columns, keys, keytypes and select.

    quest(list( "What is the name of the org package (Entrez gene ids) for Drosophila?" = 
            list("org.Dmel.sp.db",
                 "org.Dm.sp.db",
                 "org.Dm.eg.db",
                 "org.Dmel.eg.db")),3, 
          comment="To reiterate: the syntax is <em>org.Sp.id.db</em> (e.g. <code> org.Sc.sgd.db 
          </code>) where <em>Sp</em> is a 2-letter abbreviation of the organism (e.g. <code >Sc
          </code> for <em>Saccharomyces cerevisiae</em>) and <em>id</em> is an abbreviation (in
          lower-case) describing the type of central identifier (e.g. <code > sgd </code> for 
          gene identifiers assigned by the <em>Saccharomyces</em> Genome Database, or <code > 
          eg </code> for Entrez gene ids). In this case therefore <code > org.Dm.eg.db </code>")
endQuest()
startQuest("Database interactions")

The OrgDb object is named org.Dm.eg.db.

    columns(org.Dm.eg.db)
    keytypes(org.Dm.eg.db)
    uniprotKeys <- head(keys(org.Dm.eg.db, keytype="UNIPROT"))
    cols <- c("SYMBOL", "PATH")
    select(org.Dm.eg.db, keys=uniprotKeys, columns=cols, keytype="UNIPROT")

Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar:

    kegg <- select(org.Dm.eg.db, "00310", c("UNIPROT", "SYMBOL"), "PATH")
    nrow(kegg)
    head(kegg, 3)

For convenience, the RnaSeqTutorial package contains , an object representing the results of a RNA-seq gene-level differential expression analysis retrieved from the package. This package wraps data from the study Brooks2010 of the pasilla gene, involved in alternative splicing. The following code loads part of this data and creates a ‘top table’ of the ten most differentially represented genes. This top table is then coerced to a data.frame.

    library(edgeR)
    library(org.Dm.eg.db)
    data(lrTest)
    tt <- as.data.frame(topTags(lrTest))

Extract the Flybase gene identifiers (FLYBASE) from the row names of this table and map them to their corresponding Entrez gene (ENTREZID) and symbol ids (SYMBOL) using select. Use merge to add the results of select to the top table.

    fbids <- rownames(tt)
    cols <- c("ENTREZID", "SYMBOL")
    anno <- select(org.Dm.eg.db, fbids, cols, "FLYBASE")
    ttanno <- merge(tt, anno, by.x=0, by.y="FLYBASE")
    dim(ttanno)
    head(ttanno, 3)
    quest(list( "What would be the SYMBOL for the FLYBASE id FBgn0085359 (try to find out using R!)" = 
                  list("CG1544",
                       "Kal1",
                       "CG34330",
                       "Ama")),3,
          comment="A minimal command to find out the answer would be: 
          <code>select(org.Dm.eg.db, 'FBgn0085359', 'SYMBOL' , 'FLYBASE')</code>")
    endQuest()
    startQuest("GenomicFeatures")
h3("Genome-centric annotations with `GenomicFeatures`")

Genome-centric packages are very useful for annotations involving genomic coordinates. It is straight-forward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other examples of the types of operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.

The following loads the relevant Transcript.db package, and creates a more convenient alias to the TranscriptDb instance defined in the package.

    library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
    txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

We can discover available keys (using keys) and columns (columns) in txdb, and then use select to retrieve the transcripts associated with each differentially expressed gene. The mapping between gene and transcript is not one-to-one – some genes have more than one transcripts, which is biologically relevant, and some may have none, which might be due to different database versions being used or database cross-references (xref) inconsistancy,

    txnm <- select(txdb, fbids, "TXNAME", "GENEID")
    nrow(txnm)
    head(txnm, 3)

The TranscriptDb instances can be queried for data that is more structured than simple data frames, and in particular return GRanges or GRangesList instances to represent genomic coordinates. These queries are performed using cdsBy (coding sequence), transcriptsBy (transcripts), , where the function argument by specifies how coding sequences or transcripts are grouped. Here we extract the coding sequences grouped by transcript, returning the transcript names, and subset the resulting GRangesList to contain just the transcripts of interest to us. The first transcript is composed of 6 distinct coding sequence regions.

    cds <- cdsBy(txdb, "tx", use.names=TRUE)[txnm$TXNAME[!is.na(txnm$TXNAME)]]
    length(cds)
    cds[1]

The following code loads the appropriate BSgenome package; the object refers to the whole genome sequence represented in this package. The remaining steps extract the DNA sequence of each transcript, and translates these to amino acid sequences. Issues of strand are handled correctly.

    library(BSgenome.Dmelanogaster.UCSC.dm3)
    txx <- extractTranscriptSeqs(Dmelanogaster, cds)
    length(txx)
    head(txx, 3)
    head(translate(txx), 3)
h3("Using biomaRt")

The biomaRt package offers access to the online biomart resource. this consists of several data base resources, referred to as ‘marts’. Each mart allows access to multiple data sets; the biomaRt package provides methods for mart and data set discovery, and a standard method getBM to retrieve data.

Load the biomaRt package and list the available marts. Choose the ensembl mart and list the datasets for that mart. Set up a mart to use the ensembl mart and the hsapiens_gene_ensembl dataset.

A biomaRt dataset can be accessed via getBM. In addition to the mart to be accessed, this function takes filters and attributes as arguments. Use filterOptions and listAttributes to discover values for these arguments. Call getBM using filters and attributes of your choosing.

    library(biomaRt)
    head(listMarts(), 3)                      ## list the marts
    head(listDatasets(useMart("ensembl")), 3) ## mart datasets
    ensembl <-                                ## fully specified mart
       useMart("ensembl", dataset = "hsapiens_gene_ensembl")

    head(listFilters(ensembl), 3)             ## filters
    myFilter <- "chromosome_name"
    head(filterOptions(myFilter, ensembl), 3) ## return values
    myValues <- c("21", "22")
    head(listAttributes(ensembl), 3)          ## attributes
    myAttributes <- c("ensembl_gene_id","chromosome_name")

    ## assemble and query the mart
    res <- getBM(attributes =  myAttributes, filters =  myFilter,
                values =  myValues, mart = ensembl)

Use head(res) to see the results.

h3("What’s next?")

In this chapter we have seen how to retrieve Annotation from different sources using Bioconductor packages. However, for novel or emerging organisms, such resources are seldom available whereas the annotation are often available as General Feature Format (GFF) (http://www.sequenceontology.org/gff3.shtml) or Gene Transfer Format(GTF) (http://www.ensembl.org/info/website/upload/gff.html) formatted files. These can easily be read in R using the genomeIntervals or rtracklayer packages. The easyRNASeq package provide helper function to convert these into RangedData or GRangesList objects. Finally, the GenomicFeatures package also has a makeTranscriptDbFromGFF function.

In the previous chapter , we have seen how to manipulate raw reads and their information. In the present chapter, we learned how to get genomic and genic information. These are the “raw” data you will most likely have to start working with on a RNA-Seq or ChIP-Seq project. Before we can proceed and look at how we would manipulate the alignments of these reads onto their respective genome, it is essential that we take a look at the quality of these reads. This will be the topic of the next chapter.

    quest(list("How would you query your txDb object for exons by gene? (use the help search if you're unsure!)" = list("exonsBygene(txDb)",
                      "exonsBy(txDb,by='gene')",
                      "select(txdb, fbids, 'EXONRANK','GENEID')",
                      "Ask Nico")),2)


UPSCb/RnaSeqTutorial documentation built on Nov. 24, 2020, 12:40 a.m.