knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options))
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "..."
  if (length(lines) == 1) {
    if (length(x) > lines) {
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(
      if (abs(lines[1]) > 1 | lines[1] < 0) more else NULL,
      x[lines],
      if (length(x) > lines[abs(length(lines))]) more else NULL
    )
  }
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})

options(reutils.api.key = NULL)
options(reutils.rcurl.connecttimeout = 50)
library(reutils)
reutils is an R package for interfacing with NCBI databases such as PubMed, GenBank, or GEO via the Entrez Programming Utilities (EUtils). It provides access to the nine basic eutils: einfo, esearch, esummary, epost, efetch, elink, egquery, espell, and ecitmatch.
Please check the relevant usage guidelines when using these services. Note that Entrez server requests are subject to frequency limits. Consider obtaining an NCBI API key if you are a heavy user of the E-utilities.
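As a minimal sketch of supplying an API key, the snippet below reads the key from an environment variable and stores it in the reutils.api.key option, which is the option the setup code of this vignette resets to NULL. The variable name NCBI_API_KEY is an assumption made for illustration; use whatever name you keep your key under.

```r
# Read the key from the environment so it never appears in scripts.
# NCBI_API_KEY is a hypothetical variable name chosen for this example.
api_key <- Sys.getenv("NCBI_API_KEY", unset = NA)

# Only set the option when a non-empty key is actually available.
if (!is.na(api_key) && nzchar(api_key)) {
  options(reutils.api.key = api_key)
}

# Connection timeout option, as used in the setup code of this vignette.
options(reutils.rcurl.connecttimeout = 50)
```

Keeping the key out of the source file means the same script can be shared without exposing credentials.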
With nine E-Utilities, NCBI provides a programmatic interface to the Entrez query and database system for searching and retrieving requested data. Each of these tools corresponds to an R function in the reutils package described below.
esearch
esearch: search and retrieve a list of primary UIDs or the NCBI History Server information (queryKey and webEnv). The objects returned by esearch can be passed on directly to epost, esummary, elink, or efetch.
efetch
efetch: retrieve data records from NCBI in a specified retrieval type and retrieval mode as given in this table. Data are returned as XML or text documents.
esummary
esummary: retrieve Entrez database summaries (DocSums) from a list of primary UIDs (provided as a character vector or as an esearch object).
elink
elink: retrieve a list of UIDs (and relevancy scores) from a target database that are related to a set of UIDs provided by the user. The objects returned by elink can be passed on directly to epost, esummary, or efetch.
einfo
einfo: provide field names, term counts, last update, and available updates for each database.
epost
epost: upload primary UIDs to the user's Web Environment on the Entrez history server for subsequent use with esummary, elink, or efetch.
esearch: Searching the Entrez databases
Let's search PubMed for articles with Chlamydia psittaci in the title that were published in 2020 and retrieve a list of PubMed IDs (PMIDs).
pmid <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed")
pmid
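The search term above combines two field-tagged pieces, one restricted to the title field and one to the publication date. As a rough sketch of how such terms can be assembled programmatically, the helper below (entrez_term, a hypothetical name that is not part of reutils) pastes each term to its field tag and joins the pieces with a Boolean operator. It defaults to the uppercase form AND used in the NCBI documentation, which is an assumption about preferred style rather than something this vignette requires.

```r
# Hypothetical helper, not part of reutils: build a field-tagged Entrez
# query string from parallel vectors of terms and field tags.
entrez_term <- function(terms, fields, op = "AND") {
  stopifnot(length(terms) == length(fields))
  tagged <- sprintf("%s[%s]", terms, fields)
  paste(tagged, collapse = paste0(" ", op, " "))
}

term <- entrez_term(c("Chlamydia psittaci", "2020"), c("titl", "pdat"))
term
```

The resulting string could then be passed as the first argument to esearch in place of the hand-written term.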
Alternatively we can collect the PMIDs on the history server.
pmid2 <- esearch("Chlamydia psittaci[titl] and 2020[pdat]", "pubmed", usehistory = TRUE)
pmid2
We can also use esearch to search GenBank. Here we do a search for polymorphic membrane proteins (PMPs) in Chlamydiaceae.
cpaf <- esearch("Chlamydiaceae[orgn] and PMP[gene]", "nucleotide")
cpaf
Some accessors for esearch objects:
getUrl(cpaf)
getError(cpaf)
database(cpaf)
Extract a vector of GIs:
uid(cpaf)
Get query key and web environment:
querykey(pmid2)
webenv(pmid2)
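To illustrate what these two values are for, here is a small sketch of how a query key and a web environment string appear as the query_key and WebEnv parameters of a raw E-utilities request. This only builds a URL string for inspection; it is not how reutils is implemented internally, and history_url as well as the placeholder WebEnv value are hypothetical names made up for the example.

```r
# Base address of the NCBI E-utilities.
eutils_base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Hypothetical helper: compose a request URL that refers to UIDs stored
# on the history server instead of listing them explicitly.
history_url <- function(util, db, query_key, web_env) {
  sprintf("%s/%s.fcgi?db=%s&query_key=%s&WebEnv=%s",
          eutils_base, util, db, query_key, web_env)
}

# "PLACEHOLDER_WEBENV" stands in for a real value returned by webenv().
url <- history_url("esummary", "pubmed", 1, "PLACEHOLDER_WEBENV")
url
```

In practice there is no need to build such URLs by hand, since an esearch object created with usehistory = TRUE can be passed on directly to esummary, elink, or efetch.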
Extract the content of an EUtil request as XML.
content(cpaf, "xml")
Or extract parts of the XML data using the reference class method #xmlValue()
and an XPath expression:
cpaf$xmlValue("//Id")
esummary: Retrieving summaries from primary IDs
esummary retrieves document summaries (docsums) from a list of primary IDs.
Let's find out what the first entry for PMP is about:
esum <- esummary(cpaf[1])
esum
We can also parse docsums into a tibble:
esum <- esummary(cpaf[1:4])
content(esum, "parsed")
efetch: Downloading full records from Entrez
First we search the protein database for sequences of the chlamydial protease activity factor, CPAF:
cpaf <- esearch("Chlamydia[orgn] and CPAF", "protein")
cpaf
Let's fetch the FASTA record for the first protein. To do that, we have to set rettype = "fasta" and retmode = "text":
cpaff <- efetch(cpaf[1], db = "protein", rettype = "fasta", retmode = "text")
cpaff
Now we can write the sequence to a fasta file by first extracting the data from the efetch object using content():
write(content(cpaff), file = "~/cpaf.fna")
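Before writing a record to disk it can be useful to check what the text actually contains. The sketch below works on a toy FASTA string that stands in for the character data returned by content(); the record is a made-up placeholder, not a real sequence. It separates the definition line from the sequence lines and writes the record to a temporary file.

```r
# Toy FASTA text standing in for content(cpaff); NOT a real record.
fasta_txt <- ">example_id hypothetical protein\nMKKLLA\nGGHTRV\n"

# Split into lines and drop any empty ones.
fasta_lines <- strsplit(fasta_txt, "\n", fixed = TRUE)[[1]]
fasta_lines <- fasta_lines[nzchar(fasta_lines)]

# The first line is the definition line; the rest is the sequence.
defline <- sub("^>", "", fasta_lines[1])
sequence <- paste(fasta_lines[-1], collapse = "")

# Write to a temporary file rather than the home directory.
out <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, out)
```

Writing to a path from tempfile() avoids overwriting existing files while experimenting; swap in a permanent path once the output looks right.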
cpafx <- efetch(cpaf, db = "protein", rettype = "fasta", retmode = "xml")
cpafx
aa <- cpafx$xmlValue("//TSeq_sequence")
aa
defline <- cpafx$xmlValue("//TSeq_defline")
defline
einfo: Information about the Entrez databases
You can use einfo to obtain a list of all database names accessible through the Entrez utilities:
einfo()
For each of these databases, we can use einfo again to obtain more information:
einfo("taxonomy")