knitr::opts_chunk$set(error = TRUE, cache = FALSE, eval = TRUE)
In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The
r Biocpkg("biomaRt") package, provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give
r Biocpkg("biomaRt") users direct access to a diverse set of data and enable a wide range of powerful online queries from R.
Every analysis with
r Biocpkg("biomaRt") starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function
listMarts() will display all available BioMart web services
useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by
listMarts(). In the next example we choose to query the Ensembl BioMart database.
ensembl <- useMart("ensembl")
BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function
listDatasets(). Note: use the function
head() to display only the first 5 entries as the complete list has
r nrow(listDatasets(ensembl)) entries.
datasets <- listDatasets(ensembl) head(datasets)
To select a dataset we can update the
Mart object using the function
useDataset(). In the example below we choose to use the hsapiens dataset.
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
getBM() function has three arguments that need to be introduced: filters, attributes and values.
Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value 'X'. The
listFilters() function shows you all available filters in the selected dataset.
filters = listFilters(ensembl) filters[1:5,]
Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The
listAttributes() function displays all available attributes in the selected dataset.
attributes = listAttributes(ensembl) attributes[1:5,]
getBM() function is the main query function in
r Biocpkg("biomaRt"). It has four main arguments:
attributes: is a vector of attributes that one wants to retrieve (= the output of the query).
filters: is a vector of filters that one wil use as input to the query.
values: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below).
mart: is an object of class
Mart, which is created by the
Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a
r Biocpkg("biomaRt") query. Let's make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.
The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the
listFilters() function respectively. Let's now run the query:
affyids=c("202763_at","209310_s_at","207500_at") getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)
To save time and computing resources
r Biocpkg("biomaRt") will attempt to identify when you are re-running a query you have executed before. Each time a new query is run, the results will be saved to a cache on your computer. If a query is identified as having been run previously, rather than submitting the query to the server, the results will be loaded from the cache.
You can get some information on the size and location of the cache using the function
The cache can be deleted using the command
biomartCacheClear(). This will remove all cached files.
There are a small number of non-Ensembl databases that offer a BioMart interface to their data. The
r Biocpkg("biomaRt") package can be used to access these in a very similar fashion to Ensembl. The majority of
r Biocpkg("biomaRt") functions will work in the same manner, but the construction of the initial Mart object requires slightly more setup. In this section we demonstrate the setting requires to query Wormbase ParaSite and Phytozome
To demonstrate the use of the
r Biocpkg("biomaRt") package with non-Ensembl databases the next query is performed using the Wormbase ParaSite BioMart. In this example, we use the
listMarts() function to find the name of the available marts, given the URL of Wormbase. We use this to connect to Wormbase BioMart using the
useMart() function. Note that we use the
https address and must provide the port as
443. Queries to WormBase will fail without these options.
listMarts(host = "parasite.wormbase.org") wormbase <- useMart(biomart = "parasite_mart", host = "https://parasite.wormbase.org", port = 443)
We can then use functions described earlier in this vignette to find and select the gene dataset, and print the first 6 available attributes and filters. Then we use a list of gene names as filter and retrieve associated transcript IDs and the transcript biotype.
listDatasets(wormbase) wormbase <- useDataset(mart = wormbase, dataset = "wbps_gene") head(listFilters(wormbase)) head(listAttributes(wormbase)) getBM(attributes = c("external_gene_id", "wbps_transcript_id", "transcript_biotype"), filters = "gene_name", values = c("unc-26","his-33"), mart = wormbase)
The Phytozome BioMart can be accessed with the following settings. Note that we use the
https address. Queries to Phytozome will fail without specifying this.
listMarts(host = "https://phytozome.jgi.doe.gov") phytozome <- useMart(biomart = "phytozome_mart", host = "https://phytozome.jgi.doe.gov") listDatasets(phytozome) wormbase <- useDataset(mart = phytozome, dataset = "phytozome")
Once this is set up the usual
r Biocpkg("biomaRt") functions can be used to interogate the database options and run queries.
getBM(attributes = c("organism_name", "gene_name1"), filters = "gene_name_filter", values = "82092", mart = phytozome)
Version 13 of Phyotozome can be found at https://phytozome-next.jgi.doe.gov/ and if you wish to query that version the URL used to create the Mart object must reflect that.
phytozome_v13 <- useMart(biomart = "phytozome_mart", dataset = "phytozome", host = "https://phytozome-next.jgi.doe.gov")
r Biocpkg("biomaRt") package can be used with a local install of a public BioMart database or a locally developed BioMart database and web service.
In order for
r Biocpkg("biomaRt") to recognize the database as a BioMart, make sure that the local database you create has a name conform with
database_mart_version where database is the name of the database and version is a version number. No more underscores than the ones showed should be present in this name. A possible name is for example
More information on installing a local copy of a BioMart database or develop your own BioMart database and webservice can be found on http://www.biomart.org
Once the local database is installed you can use
r Biocpkg("biomaRt") on this database by:
listMarts(host="www.myLocalHost.org", path="/myPathToWebservice/martservice") mart=useMart("nameOfMyMart",dataset="nameOfMyDataset",host="www.myLocalHost.org", path="/myPathToWebservice/martservice")
For more information on how to install a public BioMart database see: http://www.biomart.org/install.html and follow link databases.
In order to provide a more consistent interface to all annotations in
keys() have been implemented to wrap
some of the existing functionality above. These methods can be called
in the same manner that they are used in other parts of the project
except that instead of taking a
AnnotationDb derived class
they take instead a
Mart derived class as their 1st argument.
Otherwise usage should be essentially the same. You still use
columns() to discover things that can be extracted from a
keytypes() to discover which things can be
used as keys with
mart <- useMart(dataset="hsapiens_gene_ensembl",biomart='ensembl') head(keytypes(mart), n=3) head(columns(mart), n=3)
And you still can use
keys() to extract potential keys, for a
particular key type.
k = keys(mart, keytype="chromosome_name") head(k, n=3)
keys(), you can even take advantage of the extra
arguments that are available for others keys methods.
k = keys(mart, keytype="chromosome_name", pattern="LRG") head(k, n=3)
keys() method will not work with all key
types because they are not all supported.
But you can still use
select() here to extract columns of data
that match a particular set of keys (this is basically a wrapper for
affy=c("202763_at","209310_s_at","207500_at") select(mart, keys=affy, columns=c('affy_hg_u133_plus_2','entrezgene_id'), keytype='affy_hg_u133_plus_2')
So why would we want to do this when we already have functions like
getBM()? For two reasons: 1) for people who are familiar
with select and it's helper methods, they can now proceed to use
r Biocpkg("biomaRt") making the same kinds of calls that are already familiar to them and 2) because the select method is implemented in many places elsewhere, the fact that these methods are shared allows for more convenient programmatic access of all these resources. An example of a package that takes advantage of this is the
r Biocpkg("OrganismDbi") package. Where several packages can be accessed as if they were one resource.
Note: if the function
useMart() runs into proxy problems you should set your proxy first before calling any
r Biocpkg("biomaRt") functions.
You can do this using the Sys.putenv command:
Sys.setenv("http_proxy" = "http://my.proxy.org:9999")
Some users have reported that the workaround above does not work, in this case an alternative proxy solution below can be tried:
options(RCurlOptions = list(proxy="uscache.kcc.com:80",proxyuserpwd="------:-------"))
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.