
rentrez

rentrez provides functions that work with the NCBI EUtils API to search for and download data from the various NCBI databases.

The package hasn't been thoroughly tested yet, but functions for each of the EUtils endpoints are implemented. If you try the package and find bugs, please let me know.

Install

rentrez is on CRAN, so you can get the latest stable release with install.packages("rentrez"). This repository will sometimes be a little ahead of the CRAN version; if you want the latest (and possibly greatest) version you can either download the archive above and install it with R CMD INSTALL, or use Hadley Wickham's devtools:

library(devtools)
install_github("rentrez", "ropensci")

The EUtils API

rentrez presumes you already know your way around the EUtils API, which is well documented. Make sure you read the documentation and, in particular, be aware of the NCBI's usage policies; try to limit very large requests to off-peak (USA) times.
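One easy way to stay on the right side of those policies is to pause between calls whenever you make many requests in a loop. A minimal sketch (my_terms here is just a hypothetical vector of search terms, not part of the package):

# pause between repeated requests so as not to overload the NCBI's
# servers; my_terms is a hypothetical vector of search terms
results <- lapply(my_terms, function(term){
    Sys.sleep(0.5)    # wait half a second between requests
    entrez_search(db="pubmed", term=term)
})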

The functions in rentrez are designed to create URLs in the form required by the API, fetch the resulting file and parse information from it. The specific examples below illustrate how the functions work.
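For instance, a PubMed search boils down to fetching an ESearch URL like the one sketched here (the real functions handle quoting and the many optional parameters for you):

# a rough sketch of the kind of URL entrez_search() builds and fetches
base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
paste(base, "esearch.fcgi?db=pubmed&term=Hox[Gene]", sep="")
# [1] "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Hox[Gene]"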

Examples

To see how the package works, let's look at a couple of possible uses of the library.

Getting data from that great paper you've just read

Let's say I've just read a paper on the evolution of Hox genes, Di-Poï et al. (2010), and I want to get the data required to replicate their results. First, I need the unique ID for this paper in PubMed (the PMID). Annoyingly, many journals don't give PMIDs for their papers, but we can use entrez_search to find the paper using the doi field:

hox_paper <- entrez_search(db="pubmed", term="10.1038/nature08789[doi]")
(hox_pmid <- hox_paper$ids)
        # [1] 20203609

Now, what sorts of data are available from other NCBI databases for this paper?

hox_data <- entrez_link(db="all", id=hox_pmid, dbfrom="pubmed")
str(hox_data)
#List of 11
# $ pubmed_nuccore            : chr [1:32] "290760437"  ...
# $ pubmed_pmc_refs           : chr [1:11] "3218823" ...
# $ pubmed_protein            : chr [1:49] "290760438" ...
# $ pubmed_pubmed             : chr [1:128] "20203609" ...
# $ pubmed_pubmed_citedin     : chr [1:10] "22016857"  ...
# $ pubmed_pubmed_combined    : chr [1:6] "20203609" " ...
# $ pubmed_pubmed_five        : chr [1:6] "20203609" ...
# $ pubmed_pubmed_reviews     : chr [1:31] "20203609"  ...
# $ pubmed_pubmed_reviews_five: chr [1:6] "20203609"  ...
# $ pubmed_taxonomy_entrez    : chr [1:14] "742354"  ...
# $ file                      :Classes 'XMLInternalDocument' ...

Each of the character vectors in this object contains unique IDs for records in the named database. These functions try to make the most useful bits of the returned files available to users, but they also return the original file in case you want to dive into the XML yourself.
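For example, here's a small sketch of that sort of dive, assuming (as the str output above suggests) that the raw ELink XML is stored in the file element:

library(XML)
# pull the name of each linked database straight from the raw ELink XML
xpathSApply(hox_data$file, "//LinkName", xmlValue)
# [1] "pubmed_nuccore" "pubmed_pmc_refs" ...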

In this case we'll get the protein sequences as GenBank files, using entrez_fetch:

hox_proteins <- entrez_fetch(db="protein", id=hox_data$pubmed_protein, file_format="gb")
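Like the other fetch results below, hox_proteins is one long character string, so it can go straight to a file for other programs to use:

write(hox_proteins, "hox_proteins.gb")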

Retrieving datasets associated with a particular organism

I like spiders. So let's say I want to learn a little more about New Zealand's endemic "black widow", the katipo. Specifically, the katipo has in the past been split into two species; can we make a phylogeny to test this idea?

The first step here is to use the function entrez_search to find datasets that include katipo sequences. The popset database has sequences arising from phylogenetic or population-level studies, so let's start there.

library(rentrez)
katipo_search <- entrez_search(db="popset", term="Latrodectus katipo[Organism]")
katipo_search$count
        # [1] 6

In these search results, count is the total number of hits for the search term. We can use entrez_summary to learn a little about these datasets. Because different databases return different XML files, entrez_summary gives you back an XML file to process further. In this case, a little XPath can tell us about each dataset.

library(XML)
summaries <- entrez_summary(db="popset", id=katipo_search$ids)
xpathSApply(summaries, "//Item[@Name='Title']", xmlValue)
        #[1] "Latrodectus katipo 18S ribosomal RNA gene ..."
        #[2] "Latrodectus katipo cytochrome oxidase subunit 1 (COI)..."
        #[3] "Latrodectus 18S ribosomal RNA gene..."
        #[4] "Latrodectus cytochrome 1 oxidase subunit 1 (COI)...""
        #[5] "Latrodectus tRNA-Leu (trnL) gene ... ""                                               
        #[6] "Theridiidae cytochrome oxidase subunit I (COI) gene ..."

Let's just get the two mitochondrial loci (COI and trnL), using entrez_fetch:

COI_ids <- katipo_search$ids[c(2,6)]
trnL_ids <- katipo_search$ids[5]
COI <- entrez_fetch(db="popset", id=COI_ids, file_format="fasta")
trnL <- entrez_fetch(db="popset", id=trnL_ids, file_format="fasta")

The "fetched" results are fasta formatted characters, which can be written to disk easily:

write(COI, "Test/COI.fasta")      
write(trnL, "Test/trnL.fasta")

Once you've got the sequences you can do what you want with them, but I wanted a phylogeny, so let's make one with ape:

library(ape)
coi <- read.dna("Test/COI.fasta", "fasta")
# clustal() calls the external Clustal aligner, which needs to be installed
coi_aligned <- clustal(coi)
tree <- nj(dist.dna(coi_aligned))
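From there ape's usual tools apply, so a quick look at the result is one line:

plot(tree)    # draw the neighbour-joining tree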

WebEnv and big queries

The NCBI provides search history features, which can be useful for dealing with large lists of IDs (which will not fit in a single URL) or for repeated searches. As an example, we will go searching for COI sequences from all the snail (Gastropoda) species we can find in the nucleotide database:

library(rentrez)
snail_search <- entrez_search(db="nuccore", term="Gastropoda[Organism] AND COI[Gene]", retmax=200, usehistory="y")

Because we set usehistory to "y", the snail_search object contains a unique ID for the search history (WebEnv) and the position of this particular query within that history (QueryKey). Instead of pasting the 200 IDs we turned up into a new URL, we can use these web history features to fetch the sequences:

cookie <- snail_search$WebEnv
qk <- snail_search$QueryKey
snail_coi <- entrez_fetch(db="nuccore", WebEnv=cookie, query_key=qk, file_format="fasta", retmax=10)
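If a search turns up more records than you want to download in one request, the same WebEnv/QueryKey pair can be reused to fetch the results in chunks. A minimal sketch, assuming retstart (the index of the first record to return) is passed through to the EUtils URL like the other arguments:

# fetch every hit 50 records at a time, reusing the web history
starts <- seq(0, as.numeric(snail_search$count) - 1, by=50)
chunks <- sapply(starts, function(start){
    entrez_fetch(db="nuccore", WebEnv=cookie, query_key=qk,
                 file_format="fasta", retmax=50, retstart=start)
})
snail_coi_all <- paste(chunks, collapse="")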

Trendy topics in genetics

This one is a little more trivial, but you can also use rentrez to search PubMed, and the EUtils API allows you to limit searches by the year in which a paper was published. That gives us a chance to find the trendiest -omics going around (this involves quite a lot of repeated searching, so if you want to run your own version be sure to do it at off-peak times).

Let's start by making a function that finds the number of records matching a given search term for each of several years (using the mindate and maxdate terms from the EUtils API):

library(rentrez)
papers_by_year <- function(years, search_term){
    sapply(years, function(y){
        entrez_search(db="pubmed", term=search_term,
                      mindate=y, maxdate=y, retmax=0)$count
    })
}

With that we can fetch the data for each term and, by searching with no term at all, find the total number of papers published in each year:

years <- 1990:2011
total_papers <- papers_by_year(years, "")
omics <- c("genomic", "epigenomic", "metagenomic", "proteomic", "transcriptomic", "pharmacogenomic", "connectomic" )
trend_data <- sapply(omics, function(t) papers_by_year(years, t))
trend_props <- trend_data/total_papers

That's the data, let's plot it:

library(reshape)
library(ggplot2)
# add the years as a column so melt() can use them as the id variable
trend_df <- melt(data.frame(years=years, trend_props), id.vars="years")
p <- ggplot(trend_df, aes(years, value, colour=variable))
png("trendy.png", width=500, height=250)
p + geom_line(size=1) + scale_y_log10("proportion of papers")
dev.off()

Giving us... well, this:

[trendy.png: each -omics term's share of PubMed papers, 1990-2011, on a log scale]
