The biomaRt users guide

knitr::opts_chunk$set(error = TRUE, cache = FALSE, eval = TRUE)

Introduction

In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The r Biocpkg("biomaRt") package, provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give r Biocpkg("biomaRt") users direct access to a diverse set of data and enable a wide range of powerful online queries from R.

Selecting a BioMart database and dataset

Every analysis with r Biocpkg("biomaRt") starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function listMarts() will display all available BioMart web services

options(width=100)
library("biomaRt")
listMarts()

The useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by listMarts(). In the next example we choose to query the Ensembl BioMart database.

ensembl <- useMart("ensembl")

BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function listDatasets(). Note: use the function head() to display only the first 5 entries as the complete list has r nrow(listDatasets(ensembl)) entries.

datasets <- listDatasets(ensembl)
head(datasets)

To select a dataset we can update the Mart object using the function useDataset(). In the example below we choose to use the hsapiens dataset.

ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:

ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

How to build a biomaRt query

The getBM() function has three arguments that need to be introduced: filters, attributes and values. Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value 'X'. The listFilters() function shows you all available filters in the selected dataset.

filters = listFilters(ensembl)
filters[1:5,]

Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The listAttributes() function displays all available attributes in the selected dataset.

attributes = listAttributes(ensembl)
attributes[1:5,]

The getBM() function is the main query function in r Biocpkg("biomaRt"). It has four main arguments:

Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a r Biocpkg("biomaRt") query. Let's make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.

The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the listAttributes() and listFilters() function respectively. Let's now run the query:

affyids=c("202763_at","209310_s_at","207500_at")
getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id'), 
      filters = 'affy_hg_u133_plus_2', 
      values = affyids, 
      mart = ensembl)

Result Caching

To save time and computing resources r Biocpkg("biomaRt") will attempt to identify when you are re-running a query you have executed before. Each time a new query is run, the results will be saved to a cache on your computer. If a query is identified as having been run previously, rather than submitting the query to the server, the results will be loaded from the cache.

You can get some information on the size and location of the cache using the function biomartCacheInfo():

biomartCacheInfo()

The cache can be deleted using the command biomartCacheClear(). This will remove all cached files.

Using a BioMart other than Ensembl

There are a small number of non-Ensembl databases that offer a BioMart interface to their data. The r Biocpkg("biomaRt") package can be used to access these in a very similar fashion to Ensembl. The majority of r Biocpkg("biomaRt") functions will work in the same manner, but the construction of the initial Mart object requires slightly more setup. In this section we demonstrate the setting requires to query Wormbase ParaSite and Phytozome

Wormbase

To demonstrate the use of the r Biocpkg("biomaRt") package with non-Ensembl databases the next query is performed using the Wormbase ParaSite BioMart. In this example, we use the listMarts() function to find the name of the available marts, given the URL of Wormbase. We use this to connect to Wormbase BioMart using the useMart() function. Note that we use the https address and must provide the port as 443. Queries to WormBase will fail without these options.

listMarts(host = "parasite.wormbase.org")
wormbase <- useMart(biomart = "parasite_mart", 
                   host = "https://parasite.wormbase.org", 
                   port = 443)

We can then use functions described earlier in this vignette to find and select the gene dataset, and print the first 6 available attributes and filters. Then we use a list of gene names as filter and retrieve associated transcript IDs and the transcript biotype.

listDatasets(wormbase)
wormbase <- useDataset(mart = wormbase, dataset = "wbps_gene")
head(listFilters(wormbase))
head(listAttributes(wormbase))
getBM(attributes = c("external_gene_id", "wbps_transcript_id", "transcript_biotype"), 
      filters = "gene_name", 
      values = c("unc-26","his-33"), 
      mart = wormbase)

Phytozome

Version 12

The Phytozome BioMart can be accessed with the following settings. Note that we use the https address. Queries to Phytozome will fail without specifying this.

listMarts(host = "https://phytozome.jgi.doe.gov")
phytozome <- useMart(biomart = "phytozome_mart", 
                host = "https://phytozome.jgi.doe.gov")
listDatasets(phytozome)
wormbase <- useDataset(mart = phytozome, dataset = "phytozome")

Once this is set up the usual r Biocpkg("biomaRt") functions can be used to interogate the database options and run queries.

getBM(attributes = c("organism_name", "gene_name1"), 
      filters = "gene_name_filter", 
      values = "82092", 
      mart = phytozome)

Version 13

Version 13 of Phyotozome can be found at https://phytozome-next.jgi.doe.gov/ and if you wish to query that version the URL used to create the Mart object must reflect that.

phytozome_v13 <- useMart(biomart = "phytozome_mart", 
                dataset = "phytozome", 
                host = "https://phytozome-next.jgi.doe.gov")

Local BioMart databases

The r Biocpkg("biomaRt") package can be used with a local install of a public BioMart database or a locally developed BioMart database and web service. In order for r Biocpkg("biomaRt") to recognize the database as a BioMart, make sure that the local database you create has a name conform with database_mart_version where database is the name of the database and version is a version number. No more underscores than the ones showed should be present in this name. A possible name is for example ensemblLocal_mart_46.

Minimum requirements for local database installation

More information on installing a local copy of a BioMart database or develop your own BioMart database and webservice can be found on http://www.biomart.org Once the local database is installed you can use r Biocpkg("biomaRt") on this database by:

listMarts(host="www.myLocalHost.org", path="/myPathToWebservice/martservice")
mart=useMart("nameOfMyMart",dataset="nameOfMyDataset",host="www.myLocalHost.org", path="/myPathToWebservice/martservice")

For more information on how to install a public BioMart database see: http://www.biomart.org/install.html and follow link databases.

Using select()

In order to provide a more consistent interface to all annotations in Bioconductor the select(), columns(), keytypes() and keys() have been implemented to wrap some of the existing functionality above. These methods can be called in the same manner that they are used in other parts of the project except that instead of taking a AnnotationDb derived class they take instead a Mart derived class as their 1st argument. Otherwise usage should be essentially the same. You still use columns() to discover things that can be extracted from a Mart, and keytypes() to discover which things can be used as keys with select().

mart <- useMart(dataset="hsapiens_gene_ensembl",biomart='ensembl')
head(keytypes(mart), n=3)
head(columns(mart), n=3)

And you still can use keys() to extract potential keys, for a particular key type.

k = keys(mart, keytype="chromosome_name")
head(k, n=3)

When using keys(), you can even take advantage of the extra arguments that are available for others keys methods.

k = keys(mart, keytype="chromosome_name", pattern="LRG")
head(k, n=3)

Unfortunately the keys() method will not work with all key types because they are not all supported.

But you can still use select() here to extract columns of data that match a particular set of keys (this is basically a wrapper for getBM()).

affy=c("202763_at","209310_s_at","207500_at")
select(mart, keys=affy, columns=c('affy_hg_u133_plus_2','entrezgene_id'),
  keytype='affy_hg_u133_plus_2')

So why would we want to do this when we already have functions like getBM()? For two reasons: 1) for people who are familiar with select and it's helper methods, they can now proceed to use r Biocpkg("biomaRt") making the same kinds of calls that are already familiar to them and 2) because the select method is implemented in many places elsewhere, the fact that these methods are shared allows for more convenient programmatic access of all these resources. An example of a package that takes advantage of this is the r Biocpkg("OrganismDbi") package. Where several packages can be accessed as if they were one resource.

Connection troubleshooting

Note: if the function useMart() runs into proxy problems you should set your proxy first before calling any r Biocpkg("biomaRt") functions.
You can do this using the Sys.putenv command:

Sys.setenv("http_proxy" = "http://my.proxy.org:9999")

Some users have reported that the workaround above does not work, in this case an alternative proxy solution below can be tried:

options(RCurlOptions = list(proxy="uscache.kcc.com:80",proxyuserpwd="------:-------"))

Session Info

sessionInfo()
warnings()


Try the biomaRt package in your browser

Any scripts or data that you put into this service are public.

biomaRt documentation built on Feb. 11, 2021, 2:01 a.m.