# Declarations base_epm_ver <- '3.0' stab_epm_ver <- '3.1.3' this_epm_ver <- '3.1.3' # pre-load libs and data library(easyPubMed) data("epm_samples") # Collect custom f(x) rebuild_uili <- epm_samples$fx$rebuild_uili rebuild_li <- epm_samples$fx$rebuild_li rebuild_df <- epm_samples$fx$rebuild_df fabricate_epm_obj <- epm_samples$fx$fabricate_epm_obj slice_epm_obj_to_special <- epm_samples$fx$slice_epm_obj_to_special # Collect data blca_2018 <- epm_samples$bladder_cancer_2018 blca_40y <- epm_samples$bladder_cancer_40y # Fabricate objects (vignette version) epm <- fabricate_epm_obj(blca_2018$demo_data_03) epm_xmpl_01 <- slice_epm_obj_to_special(epm, mode = 1) epm_xmpl_02 <- slice_epm_obj_to_special(epm, mode = 2) epm_xmpl_03 <- fabricate_epm_obj(blca_2018$demo_data_04) epm_xmpl_04 <- fabricate_epm_obj(blca_40y$demo_data_01) epm_xmpl_06 <- fabricate_epm_obj(blca_2018$demo_data_02) epm_xmpl_07 <- fabricate_epm_obj(blca_2018$demo_data_05)
PubMed is an online repository of references and abstracts of publications in the fields of medicine and life sciences. Pubmed is a free resource that is developed and maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH). PubMed homepage is located at the following URL: https://pubmed.ncbi.nlm.nih.gov/. Other than its web portal, PubMed can be programmatically queried via the NCBI Entrez E-utilities interface.
easyPubMed is an open-source R interface to the Entrez Programming Utilities aimed at allowing programmatic access to PubMed in the R environment. The package is suitable for downloading large number of records, and includes a collection of functions to perform basic processing of the Entrez/PubMed query responses. The library supports either XML or TXT ("medline") format.
This vignette covers the key functionalities of easyPubMed
and provides some informative examples to get started.
This vignette covers easyPubMed
version r this_epm_ver
. Compatibility with previous versions of easyPubMed
is NOT implied nor guaranteed.
easyPubMed
is an open-source software, under the GPL-3 license and comes with ABSOLUTELY NO WARRANTY. For more questions about the GPL-3 license terms, see www.gnu.org/licenses.
This R library was written based on the information included in the Entrez Programming Utilities Help manual authored by Eric Sayers, PhD and available on the NCBI Bookshelf (NBK25500).
This R library is NOT endorsed, supported, maintained NOR affiliated with NCBI.
r this_epm_ver
Simplified Pipeline. The process of retrieving and analyzing Pubmed records has been updated and simplified. The revised pipeline includes three steps: 1) submit a query; 2) fetch records; 3) extract information. The corresponding functions are discussed below in this vignette.
Automatic Job splitting into Sub-Queries. The Entrez server imposes a strict
n=10,000 limit to the number of records that can be programmatically
retrieved from a single query. Whenever possible, the easyPubMed
library automatically attempts to split queries returning large number of
records into lists of smaller, manageable queries.
The easyPubMed S4 Class. Here we introduce a new S4 class
(easyPubMed
) that was designed to better store and manage query information,
retrieved records and associated meta-data.
This S4 class comes with a series of methods and is
aimed at improving data handling and analysis reproducibility. Raw or processed
data can be obtained from an easyPubMed
object via the appropriate
getter functions (as discussed below).
Additional Parsed Fields. The new epm_parse()
function supports
extraction of additional information from Pubmed records (compared to
previous versions of this R library). Extracted fields now include
mesh_codes, grant_ids, references and conflict of interest statements
(cois) among others. For more information, see the examples below.
Compact Output. Unlike previous versions of easyPubMed
, it is now
possible to collapse author information (i.e., author names) into a
single string. This way, the output (parsed data.frame
) only
includes one row per record.
To install the stable version (r stab_epm_ver
) of easyPubMed
from CRAN, you can run the following line of code.
install.packages("easyPubMed")
Dev versions of the library are hosted on GitHub. If interested, you can install the latest dev version using the devtools
library.
devtools::install_github("dami82/easyPubMed")
The first section of the tutorial covers how to use easyPubMed
(version r base_epm_ver
or later) for querying PubMed, retrieving records from the Entrez History Server, and analyzing results. The second part of the tutorial provides additional examples and information about special cases and advanced operations.
The typical easyPubMed
pipeline is a three-step process.
Submit a PubMed query via epm_query()
. This function submits a query to Entrez/PubMed, reads the server response, and depending on the anticipated number of results it may automatically split the query into multiple sub-queries (whenever needed and/or possible). The query string should be built using standard PubMed syntax, i.e. the user can use the same tags/keywords (e.g., '[AU]' or '[PDAT]') as the PubMed web portal. This function returns an easyPubMed
object.
Download PubMed records via epm_fetch()
. This function reads the information included in an easyPubMed
object and downloads records from the Entrez/PubMed server. Records can be retrieved in XML (default) or Medline formats. By default, raw records are stored in the 'raw' slot of the easyPubMed
object, and can be accessed via the get_epm_raw()
function. Alternatively, records can be written to local files.
[Optional] Extract Information from raw XML records via epm_parse()
. This function is used to identify XML fields of interest in each record, extract the corresponding information, and cast them into a structured data.frame. Results are stored in the 'data' slot of the easyPubMed
object, and can be accessed via the get_epm_data()
function.
PubMed Query and Record Retrieval
The code below illustrates the typical steps of an easyPubMed
analysis. All data (raw records as well as processed data) are stored in the resulting easyPubMed
object. In this example, n=1,597 records were retrieved and processed. This took about 14 min using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04. Data parsing (epm_parse()
) is the step taking the longest time to complete.
# Load library library(easyPubMed) # Define Query String my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' # Submit the Query epm <- epm_query(my_query) # Retrieve Records (xml format) epm <- epm_fetch(epm, format = 'xml') # Extract Information epm <- epm_parse(epm) # All results are stored in an easyPubMed object. epm
epm
Get Meta data
Meta data are attached to each easyPubMed
object and provide information about the record query job (e.g., number of expected records; date when the query was performed) as well as type/format of the downloaded data (e.g., format and encoding of the raw data). A unique identifier (UID) is also included to track different objects/query jobs. Meta data can be requested from an easyPubMed
object via the get_epm_meta()
function, which returns a list.
job_meta <- get_epm_meta(x = epm) head(job_meta)
Get Raw Records
Raw PubMed records can be obtained from an easyPubMed
(after epm_fetch()
has been completed) via the get_epm_raw()
function, which returns a named list. Each element includes one PubMed record. The name of each element corresponds to its PubMed record identifier (PMID).
raw_records <- get_epm_raw(epm) # elements are named after the corresponding PMIDs head(names(raw_records)) # elements include raw PubMed records first_record <- raw_records[[1]] # Show excerpt (from record #1) cat(substr(first_record, 1, 1200))
Get Processed Data
Processed data can be obtained from an easyPubMed
(after epm_parse()
has been executed) via the get_epm_data()
function Processed data are returned as a data.frame. By default, each row corresponds to a PubMed record. This default behavior can be modified by tuning the compact_output
and max_authors
arguments (see section below). The columns/fields extracted include record identifiers, journal name, publication date, title, abstract, MeSH codes, author names and affiliations.
proc_data <- get_epm_data(epm) # show an excerpt (first 6 records, selected columns) slctd_fields <- c('pmid', 'doi', 'jabbrv', 'year', 'month', 'day') head(proc_data[, slctd_fields])
A comprehensive list of the fields that are extracted from raw XML records and returned as columns of the processed data object (data.frame) is shown below.
pmid: PubMed Record Unique Reference Number (Identifier).
doi: Digital Object Identifier.
pmc: PubMed Central Unique Reference Number (PMCID) (if available).
journal: Journal Name (full-length).
jabbrv: Journal Name (abbreviation).
lang: Language (e.g., 'eng').
year: Publication Date (
month: Publication Date (
day: Publication Date (
title: Record Title.
abstract: Record Abstract.
mesh_codes: Medical Subject Heading Codes (e.g., 'D001749').
mesh_terms: Medical Subject Heading Codes (e.g., 'Urinary Bladder Neoplasms').
grant_ids: Reference Number of Funding Grants Supporting the Publication (if available/provided).
references: Identifiers (PMIDs or DOIs) of Publications
coi: Conflict of Interest Statement (if available/provided).
authors: List of Author Names. This field is returned when the compact_output
argument is set to TRUE
(epm_parse()
function). Otherwise, the following columns are included: 'lastname', 'forename', 'address', 'email'.
affiliation: Address and/or Affiliation Associated with the First Author of the Study.
Get Record Identifiers (PMIDs)
The identifiers (PMIDs) of records included in an easyPubMed
object (after epm_fetch()
has been executed) can be obtained via the get_epm_uilist()
function, which returns a character vector. PMIDs are automatically detected and extracted from all downloaded records, independently of the raw record format.
# Get PMIDs all_pmids <- get_epm_uilist(epm) # Show excerpt head(all_pmids)
This section includes a few examples of less-common easyPubMed
pipelines and operations. Please, contact the package maintainer for additional questions.
The easyPubMed
library comes with two special Query functions that are designed to address specific goals:
Query Entrez/PubMed by exact match of a full-length title (article title): epm_query_by_fulltitle()
.
Execute a PubMed Query by providing a list of record identifiers (PMIDs): epm_query_by_pmid()
.
These special query functions may replace the first step of the easyPubMed
pipeline. After the query step has been completed, record retrieval proceeds as outlined above, i.e., via the epm_fetch()
function.
Query by Article Title
It is possible to query PubMed for a record of interest by providing its full-length title as query string and via the epm_query_by_fulltitle()
function. This function takes a string (character vector of length 1) as its fulltitle
argument. The string should NOT include new-line characters (e.g., \n) or multi-spaces, as those may prevent the exact-match search. These special characters are NOT removed automatically (by design). You can use regular expressions (e.g., gsub()
) to clean a fulltitle
string before performing the query. An example is shown below.
# Article Title (including new-line chars) my_title <- "Role of gemcitabine and cisplatin as neoadjuvant chemotherapy in muscle invasive bladder cancer: Experience over the last decade." # Unpolished title string cat(my_title) # Clean the title my_title <- gsub('[[:space:]]+', ' ', my_title) # Clean title string cat(my_title)
# Query and fetch epm_xmpl_01 <- epm_query_by_fulltitle(fulltitle = my_title) epm_xmpl_01 <- epm_fetch(epm_xmpl_01) epm_xmpl_01
epm_xmpl_01
Query Using a List of PMIDs
The epm_query_by_pmid()
takes a character vector as its pmids
argument. If a long list of PMIDs is provided (n>50), the function automatically splits the query job into multiple 50-record sub-jobs. The resulting 'easyPubMed' object displays '
my_pmids <- c('31572460', '31511849', '31411998') epm_xmpl_02 <- epm_query_by_pmid(pmids = my_pmids) epm_xmpl_02 <- epm_fetch(epm_xmpl_02) epm_xmpl_02
epm_xmpl_02
The epm_fetch()
function supports three different formats. The default format is xml
. Alternatively, the medline
and uilist
formats are also supported. Briefly, the medline
option returns records in plain text format (see example below). On the contrary, the uilist
format simply requests the identifiers (PMIDs) of all records returned by a query (no additional record content is retrieved from Entrez/PubMed). Note that non-XML records cannot be used to extract record information via epm_parse()
.
# Define Query String my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' # Submit the Query epm_xmpl_03 <- epm_query(my_query) # Retrieve Records (request 'medline' format!) epm_xmpl_03 <- epm_fetch(epm_xmpl_03, format = 'medline') # Get records xmpl_03_raw <- get_epm_raw(epm_xmpl_03) # Elements are named after the corresponding PMIDs head(names(xmpl_03_raw))
xmpl_03_raw <- get_epm_raw(epm_xmpl_03) head(names(xmpl_03_raw))
# Elements include raw PubMed records first_record <- xmpl_03_raw[[1]] # Show an Excerpt (record n. 12, first 18 lines) cat(head(first_record, n=20), sep = '\n')
In easyPubMed
(version r base_epm_ver
or later) there are no dedicated functions for downloading large numbers of records. Large query jobs are still carried out via the epm_query()
and epm_fetch()
functions, which will attempt to split a single query into a list of manageable sub-jobs. An example is shown below. Briefly, we performed a query that returned n=20,825 records. The job was automatically split in n=4 sub-jobs, records were downloaded and parsed. The whole operation took about 3h 28m using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04 (i.e., about 0.6s per record).
# Define Query String blca_query <- '"bladder cancer"[Ti] AND ("1980"[PDAT]:"2020"[PDAT])' # Submit the Query epm_xmpl_04 <- epm_query(blca_query) # Retrieve Records (medline format) epm_xmpl_04 <- epm_fetch(epm_xmpl_04) # Parse all records epm_xmpl_04 <- epm_parse(epm_xmpl_04) # Show Object epm_xmpl_04
# Show Object
epm_xmpl_04
Unlike previous versions of easyPubMed
, there are no dedicated functions to write PubMed records to a local disk. Starting from easyPubMed
version r base_epm_ver
, this operation is performed by tuning the arguments of the epm_fetch()
function and by setting the write_to_file
to TRUE.
Write Files to the Local Disc.
There are 4 arguments that can be adjusted to fine-tune the behavior of epm_fetch()
and write PubMed records to local files.
write_to_file
: logical (defaults to FALSE
). If TRUE
, raw PubMed records are written to a local disk.
outfile_path
: string pointing to an existing local directory (defaults to NULL
). This argument is evaluated only when write_to_file
is set to TRUE
. Files including the raw PubMed records will be saved at the indicated location. If NULL
, the current directory is used.
outfile_prefix
: string indicating a prefix used for the name of files written to the local disk (defaults to NULL
). If NULL
, a unique prefix (prefix pattern: easypubmed_job_yyyymmddhhmmss_
) is automatically generated. Files may be overwritten is a non-unique prefix is used.
store_contents
: logical (defaults to TRUE
). This argument is used to control whether a copy of the raw records should be stored in the current easyPubMed
object (raw
slot). If write_to_file
is set to TRUE
and store_contents
is set to FALSE
, raw records are written to disk and no copies are stored in the current object. This option is recommended for very large queries and replaces the batch_pubmed_download()
function (available until version 2.32). If both write_to_file
and store_contents
are set to TRUE
, records will be written to the local disk (backup copy) and also stored in the current easyPubMed
object.
# Define Query String my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' # Submit the Query epm_xmpl_05 <- epm_query(my_query) # Retrieve Records epm_xmpl_05 <- epm_fetch(epm_xmpl_05, write_to_file = TRUE) # Check if file exists dir(pattern = '^easypubmed')
print('easypubmed_job_202311201513_batch_01.txt')
Read Files From the Local Disc.
It is possible to import local files storing raw PubMed records for further processing via the epm_import_xml()
function. This function can be used if the following 3 conditions are met:
records were retrieved in the XML format. Other formats (e.g., the "medline" format) are NOT supported;
records were downloaded and written using easyPubMed
, version 3.0
or later. Compatibility with data/files downloaded using other tools or former versions of the easyPubMed
library is NOT guaranteed.
if multiple files are imported together, all files were written as part of the same epm_fetch()
job. Note that different jobs that used the same query are considered as independent jobs.
Users should feed the epm_import_xml()
function a character vector of file names (of length >= 1), where each element indicates a text file to be read and imported.
epm_import_xml()
function will warn that only a subset of the expected files were imported, and then the parsing step will continue using all records included in the selected file(s). # Import XML records from saved file epm_xmpl_06 <- epm_import_xml(x = 'easypubmed_job_202311201513_batch_01.txt') # Show Object epm_xmpl_06
epm_xmpl_06
As we outlined above, information can be extracted from raw records ("xml" format) via the epm_parse()
function. Results (data.frame
) are stored in the same easyPubMed
object (data
slot) and can be requested via the get_epm_data()
function. Users can adjust the way information are extracted and formatted from PubMed records by tweaking the epm_parse()
function arguments. The most important arguments are discussed below.
Compact vs. extended output.
A new feature of easyPubMed
(version r base_epm_ver
or later) is the capacity of tuning the author information extraction process. The compact_output
and max_authors
arguments can be adjusted to get the desired behavior.
compact_output
: logical (defaults to TRUE
). Author names are returned in a compact format (i.e., author names are collapsed together) when this argument is set to TRUE
, and each row in the final data.frame corresponds to a PubMed record. If FALSE
, each row corresponds to a single author of the publication and the record-specific data are recycled for all included authors (legacy approach, this is similar to the typical output of the corresponding function in easyPubMed
ver. 2.30 and earlier). When compact_output
is set to FALSE
, the processed data row number is typically bigger than the number or raw records that were retrieved. The behavior managed via the compact_output
argument is a new feature of easyPubMed
version r base_epm_ver
or later.
max_authors
: numeric, maximum number of authors to retrieve. If this is set to -1, only the last author is extracted. If this is set to 1, only the first author is returned. If this is set to 2, the first and the last authors are extracted. If this is set to any other positive number (i), up to the leading (i-1) authors are retrieved together with the last author. If this is set to a number larger than the number of authors in a record, all authors are returned. Note that at least 1 author has to be retrieved, therefore a value of 0 is not accepted (coerced to -1).
Citations.
The epm_parse()
function can now extract citation information (if available). This feature was introduced in easyPubMed
version r base_epm_ver
. The max_references
and ref_id_type
arguments can be adjusted to obtain information in the desired format.
max_references
: numeric, maximum number of references to return (from each PubMed record).
ref_id_type
: string, must be one of the following values: c('pmid', 'doi')
. Type of identifier used to describe citation references.
In the example below, n=1,597 records were retrieved and processed. This took less about 7 min using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04.
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' # Submit the Query epm_xmpl_07 <- epm_query(my_query) # Retrieve Records epm_xmpl_07 <- epm_fetch(epm_xmpl_07) # Parse (custom params) epm_xmpl_07 <- epm_parse(epm_xmpl_07, max_authors = 3, compact_output = TRUE, max_references = 5, ref_id_type = 'pmid') # Request parsed data epm_data <- get_epm_data(epm_xmpl_07) # Columns of interest cols_of_int <- c('pmid', 'doi', 'authors', 'jabbrv', 'year', 'references') # Show an excerpt head(epm_data[, cols_of_int])
epm_data <- get_epm_data(epm_xmpl_07) # Columns of interest cols_of_int <- c('pmid', 'doi', 'authors', 'jabbrv', 'year', 'references') # Show an excerpt head(epm_data[, cols_of_int])
The new pipeline proposed in easyPubMed
version r this_epm_ver
replaces the old pipeline based on the following functions: get_pubmed_ids()
, fetch_pubmed_data()
, and table_articles_byAuth()
. We are planning to phase out these functions by the end of 2024.
The current version of easyPubMed
still includes revised versions of the get_pubmed_ids()
, fetch_pubmed_data()
, and table_articles_byAuth()
for legacy purposes. The new versions of these functions should return output that is compatible with old versions of easyPubMed
(with the exception of get_pubmed_ids()
). Other functions (e.g., batch_pubmed_download ()
or fetch_all_pubmed_ids()
) have been discontinued and/or replaced as outlined below.
Function Replacement Map
article_to_df()
-> epm_parse_record()
articles_to_list()
-> discontinued
batch_pubmed_download()
-> epm_fetch()
[write_to_file = TRUE
]
custom_grep()
-> EPM_custom_grep()
[not exported]
fetch_all_pubmed_ids()
-> epm_fetch()
[format = 'uilist'
]
fetch_pubmed_data()
-> epm_fetch()
get_pubmed_ids_by_fulltitle()
-> epm_query_by_fulltitle()
get_pubmed_ids()
-> epm_query()
table_articles_byAuth()
-> epm_parse()
and then get_epm_data()
trim_address()
-> discontinued
fetch_PMID_data()
-> epm_query_by_pmid()
and then epm_fetch()
extract_article_ids()
-> EPM_detect_pmid()
[not exported]
fetch_pubmed_data_by_PMID()
-> epm_query_by_pmid()
and then epm_fetch()
Dev version of easyPubMed
on GitHub Website
Additional Resources and Tutorials are made available via the official website.
easyPubMed official website including news, vignettes, and further information https://www.data-pulse.com/dev_site/easypubmed/
Sayers, E. A General Introduction to the E-utilities (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK25497/
PubMed Help (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK3827/
Thank you very much for using easyPubMed and/or reading this vignette. Please, feel free to contact me (author/maintainer) for feedback, questions and suggestions: my email is
If you use easyPubMed in a scientific publication, please cite this R package in the Materials and Methods section of the paper. Thanks!
I may be open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this vignette, feel free to contact me via email.
easyPubMed Copyright (C) 2017-2023 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
sessionInfo()
Success! - by Damiano Fantini - r format(Sys.time(), format = '%b %d, %Y')
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.