Learn about various annotation package types
Learn the basics of querying these resources
Discuss annotations in regard to Bioc data structures
Get in some practice
library(BiocAnno2016) library(hugene20sttranscriptcluster.db) library(EnsDb.Mmusculus.v79) library(org.Hs.eg.db) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(Homo.sapiens) library(BSgenome) library(BSgenome.Hsapiens.UCSC.hg19) library(AnnotationHub)
Map a known ID to other functional or positional information
We have data and statistics, and we want to add other useful information
The end result might be as simple as a data.frame or HTML table, or as complex as a
RangedSummarizedExperiment
load(system.file("data/eset.Rdata", package = "BiocAnno2016")) eset
head(exprs(eset)) head(pData(phenoData(eset)))
head(pData(featureData(eset)))
Validity checking
Subsetting
Function dispatch
Automatic behaviors
Difficult to create
Cumbersome to extract data by hand
Useful only within R
df <- data.frame("Package type" = c("ChipDb","OrgDb","TxDb/EnsDb","OrganismDb","BSgenome","Others","AnnotationHub","biomaRt"), Example = c("hugene20sttranscriptcluster.db","org.Hs.eg.db","TxDb.Hsapiens.UCSC.hg19.knownGene; EnsDb.Hsapiens.v75", "Homo.sapiens","BSgenome.Hsapiens.UCSC.hg19","GO.db; KEGG.db", "Online resource","Online resource"), check.names = FALSE) knitr::kable(df)
The main function is select
:
select(annopkg, keys, columns, keytype)
Where
annopkg is the annotation package
keys are the IDs that we know
columns are the values we want
keytype is the type of key used
Say we have analyzed data from an Affymetrix Human Gene ST 2.0 array and want to know what the genes are. For purposes of this lab, we just select some IDs at random.
library(hugene20sttranscriptcluster.db) set.seed(12345) ids <- featureNames(eset)[sample(1:25000, 5)] ids select(hugene20sttranscriptcluster.db, ids, "SYMBOL")
How do you know what the central keys are?
If it's a ChipDb, the central key are the manufacturer's probe IDs
It's sometimes in the name - org.Hs.eg.db, where 'eg' means Entrez Gene ID
You can see examples using e.g., head(keys(annopkg)), and infer from that
But note that it's never necessary to know the central key, as long as you specify the keytype
What keytypes or columns are available for a given annotation package?
keytypes(hugene20sttranscriptcluster.db) columns(hugene20sttranscriptcluster.db)
There is one issue with select
however.
ids <- c('16737401','16657436' ,'16678303') select(hugene20sttranscriptcluster.db, ids, c("SYMBOL","MAP"))
mapIds
functionAn alternative to select
is mapIds
, which gives control of
duplicates
Same arguments as select
with slight differences
The columns argument can only specify one column
The keytype argument must be specified
An additional argument, multiVals used to control duplicates
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID")
Default is first
, where we just choose the first of the
duplicates. Other choices are list
, CharacterList
, filter
,
asNA
or a user-specified function.
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "list")
mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "CharacterList") mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "filter") mapIds(hugene20sttranscriptcluster.db, ids, "SYMBOL", "PROBEID", multiVals = "asNA")
Using either the hugene20sttranscriptcluster.db or org.Hs.eg.db package,
What gene symbol corresponds to Entrez Gene ID 1000?
What is the Ensembl Gene ID for PPARG?
What is the UniProt ID for GAPDH?
How many of the probesets from the ExpressionSet (eset) we loaded map to a single gene? How many don't map to a gene at all?
TxDb packages contain positional information; the contents can be inferred by the package name
TxDb.Species.Source.Build.Table
TxDb.Hsapiens.UCSC.hg19.knownGene
Homo sapiens
UCSC genome browser
hg19 (their version of GRCh37)
knownGene table
TxDb.Dmelanogaster.UCSC.dm3.ensGene TxDb.Athaliana.BioMart.plantsmart22
EnsDb packages are similar to TxDb packages, but based on Ensembl mappings
EnsDb.Hsapiens.v79
EnsDb.Mmusculus.v79
EnsDb.Rnorvegicus.v79
As with ChipDb and OrgDb packages, select
and mapIds
can be used
to make queries
select(TxDb.Hsapiens.UCSC.hg19.knownGene, c("1","10"), c("TXNAME","TXCHROM","TXSTART","TXEND"), "GENEID") select(EnsDb.Hsapiens.v79, c("1", "10"), c("GENEID","GENENAME","SEQNAME","GENESEQSTART","GENESEQEND"), "ENTREZID")
But this is not how one normally uses them...
The normal use case for transcript packages is to extract positional
information into a GRanges
or GRangesList
object. An example is
the genomic position of all genes:
gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) gns
Or the genomic position of all transcripts by gene:
txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene) txs
Positional information can be extracted for transcripts
, genes
, coding
sequences (cds
), promoters
and exons
.
Positional information can be extracted for most of the above, grouped
by a second element. For example, our transcriptsBy
call was all
transcripts, grouped by gene.
More detail on these *Ranges objects is beyond the scope of this workshop, but why we want them is not.
The main rationale for *Ranges objects is to allow us to easily select and subset data based on genomic position information. This is really powerful!
GRanges
and GRangesLists
act like data.frames and lists, and can
be subsetted using the [
function. As a really artificial example:
txs[txs %over% gns[1:2,]]
Gene expression changes near differentially methylated CpG islands
Closest genes to a set of interesting SNPs
Genes near DNAseI hypersensitivity clusters
Number of CpGs measured over Gene X by Chip Y
SummarizedExperiment objects are like ExpressionSets, but the row-wise annotations are GRanges, so you can subset by genomic locations:
How many transcripts does PPARG have, according to UCSC?
Does Ensembl agree?
How many genes are between 2858473 and 3271812 on chr2 in the hg19 genome?
GRanges
like this - GRanges("chr2", IRanges(2858473,3271812))
OrganismDb packages are meta-packages that contain an OrgDb, a TxDb, and a GO.db package and allow cross-queries between those packages.
All previous accessors work; select
, mapIds
, transcripts
, etc.
library(Homo.sapiens)
Homo.sapiens
Updateable - can change TxDb object
columns and keytypes span all underlying objects
Calls to TxDb accessors include a 'columns' argument
head(genes(Homo.sapiens, columns = c("ENTREZID","ALIAS","UNIPROT")),4)
Get all the GO terms for BRCA1
What gene does the UCSC transcript ID uc002fai.3 map to?
How many other transcripts does that gene have?
Get all the transcripts from the hg19 genome build, along with their Ensembl gene ID, UCSC transcript ID and gene symbol
BSgenome packages contain sequence information for a given
species/build. There are many such packages - you can get a listing
using available.genomes
library(BSgenome) head(available.genomes())
We can load and inspect a BSgenome package
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens
The main accessor is getSeq
, and you can get data by sequence (e.g.,
entire chromosome or unplaced scaffold), or by
passing in a GRanges object, to get just a region.
getSeq(Hsapiens, "chr1") getSeq(Hsapiens, gns["5467",])
The Biostrings package contains most of the code for dealing with
these *StringSet
objects - please see the Biostrings vignettes and
help pages for more information.
AnnotationHub is a package that allows us to query and download many different annotation objects, without having to explicitly install them.
library(AnnotationHub) hub <- AnnotationHub()
library(AnnotationHub) hub <- AnnotationHub() hub
Finding the 'right' resource on AnnotationHub is like using Google - a well posed query is necessary to find what you are after. Useful queries are based on
Data provider
Data class
Species
Data source
names(mcols(hub))
unique(hub$dataprovider)
unique(hub$rdataclass)
head(unique(hub$species)) length(unique(hub$species))
unique(hub$sourcetype)
qry <- query(hub, c("granges","homo sapiens","ensembl")) qry
qry$sourceurl
whatIwant <- qry[["AH50377"]]
We can use these data as they are, or convert to a TxDb format:
GRCh38TxDb <- makeTxDbFromGRanges(whatIwant) GRCh38TxDb
How many resources are on AnnotationHub for Atlantic salmon (Salmo salar)?
Get the most recent Ensembl build for domesticated dog (Canis familiaris) and make a TxDb
The biomaRt package allows queries to an Ensembl Biomart server. We can see the choices of servers that we can use:
library(biomaRt) listMarts()
And we can then check for the available data sets on a particular server.
mart <- useMart("ENSEMBL_MART_ENSEMBL") head(listDatasets(mart))
After setting up a mart
object pointing to the server and data set
that we care about, we can make queries. We first set up the mart
object.
mart <- useMart("ENSEMBL_MART_ENSEMBL","hsapiens_gene_ensembl")
Queries are of the form
getBM(attributes, filters, values, mart)
where
attributes are the things we want
filters are the types of IDs we have
values are the IDs we have
mart is the mart
object we set up
Both attributes and filters have rather inscrutable names, but a listing can be accessed using
atrib <- listAttributes(mart) filts <- listFilters(mart) head(atrib) head(filts)
A simple example query
afyids <- c("1000_at","1001_at","1002_f_at","1007_s_at") getBM(c("affy_hg_u95av2", "hgnc_symbol"), c("affy_hg_u95av2"), afyids, mart)
Get the Ensembl gene IDs and HUGO symbol for Entrez Gene IDs 672, 5468 and 7157
What do you get if you query for the 'gene_exon' for GAPDH?
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.