In dami82/easyPubMed: Search and Retrieve Scientific Publication Records from PubMed

# Declarations
base_epm_ver <- '3.0'
stab_epm_ver <- '3.1.3'
this_epm_ver <- '3.1.3'



# pre-load libs and data
library(easyPubMed)
data("epm_samples")

# Collect custom f(x)
rebuild_uili <- epm_samples$fx$rebuild_uili
rebuild_li <- epm_samples$fx$rebuild_li
rebuild_df <- epm_samples$fx$rebuild_df
fabricate_epm_obj <- epm_samples$fx$fabricate_epm_obj
slice_epm_obj_to_special <- epm_samples$fx$slice_epm_obj_to_special

# Collect data
blca_2018 <- epm_samples$bladder_cancer_2018
blca_40y <- epm_samples$bladder_cancer_40y

# Fabricate objects (vignette version)
epm <- fabricate_epm_obj(blca_2018$demo_data_03)
epm_xmpl_01 <- slice_epm_obj_to_special(epm, mode = 1)
epm_xmpl_02 <- slice_epm_obj_to_special(epm, mode = 2)
epm_xmpl_03 <- fabricate_epm_obj(blca_2018$demo_data_04)
epm_xmpl_04 <- fabricate_epm_obj(blca_40y$demo_data_01)
epm_xmpl_06 <- fabricate_epm_obj(blca_2018$demo_data_02)
epm_xmpl_07 <- fabricate_epm_obj(blca_2018$demo_data_05)

PubMed is an online repository of references and abstracts of publications in the fields of medicine and life sciences. Pubmed is a free resource that is developed and maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH). PubMed homepage is located at the following URL: https://pubmed.ncbi.nlm.nih.gov/. Other than its web portal, PubMed can be programmatically queried via the NCBI Entrez E-utilities interface.

easyPubMed is an open-source R interface to the Entrez Programming Utilities aimed at allowing programmatic access to PubMed in the R environment. The package is suitable for downloading large number of records, and includes a collection of functions to perform basic processing of the Entrez/PubMed query responses. The library supports either XML or TXT ("medline") format. This vignette covers the key functionalities of easyPubMed and provides some informative examples to get started.

Notes

This vignette covers easyPubMed version r this_epm_ver. Compatibility with previous versions of easyPubMed is NOT implied nor guaranteed.
easyPubMed is an open-source software, under the GPL-3 license and comes with ABSOLUTELY NO WARRANTY. For more questions about the GPL-3 license terms, see www.gnu.org/licenses.
This R library was written based on the information included in the Entrez Programming Utilities Help manual authored by Eric Sayers, PhD and available on the NCBI Bookshelf (NBK25500).
This R library is NOT endorsed, supported, maintained NOR affiliated with NCBI.

New features of easyPubMed version `r this_epm_ver`

Simplified Pipeline. The process of retrieving and analyzing Pubmed records has been updated and simplified. The revised pipeline includes three steps: 1) submit a query; 2) fetch records; 3) extract information. The corresponding functions are discussed below in this vignette.
Automatic Job splitting into Sub-Queries. The Entrez server imposes a strict n=10,000 limit to the number of records that can be programmatically retrieved from a single query. Whenever possible, the easyPubMed library automatically attempts to split queries returning large number of records into lists of smaller, manageable queries.
The easyPubMed S4 Class. Here we introduce a new S4 class (easyPubMed) that was designed to better store and manage query information, retrieved records and associated meta-data. This S4 class comes with a series of methods and is aimed at improving data handling and analysis reproducibility. Raw or processed data can be obtained from an easyPubMed object via the appropriate getter functions (as discussed below).
Additional Parsed Fields. The new epm_parse() function supports extraction of additional information from Pubmed records (compared to previous versions of this R library). Extracted fields now include mesh_codes, grant_ids, references and conflict of interest statements (cois) among others. For more information, see the examples below.
Compact Output. Unlike previous versions of easyPubMed, it is now possible to collapse author information (i.e., author names) into a single string. This way, the output (parsed data.frame) only includes one row per record.

Installation

Stable Version

To install the stable version (r stab_epm_ver) of easyPubMed from CRAN, you can run the following line of code.

install.packages("easyPubMed")

Dev Version(s)

Dev versions of the library are hosted on GitHub. If interested, you can install the latest dev version using the devtools library.

devtools::install_github("dami82/easyPubMed")

Tutorial

The first section of the tutorial covers how to use easyPubMed (version r base_epm_ver or later) for querying PubMed, retrieving records from the Entrez History Server, and analyzing results. The second part of the tutorial provides additional examples and information about special cases and advanced operations.

Overview

The typical easyPubMed pipeline is a three-step process.

Submit a PubMed query via epm_query(). This function submits a query to Entrez/PubMed, reads the server response, and depending on the anticipated number of results it may automatically split the query into multiple sub-queries (whenever needed and/or possible). The query string should be built using standard PubMed syntax, i.e. the user can use the same tags/keywords (e.g., '[AU]' or '[PDAT]') as the PubMed web portal. This function returns an easyPubMed object.
Download PubMed records via epm_fetch(). This function reads the information included in an easyPubMed object and downloads records from the Entrez/PubMed server. Records can be retrieved in XML (default) or Medline formats. By default, raw records are stored in the 'raw' slot of the easyPubMed object, and can be accessed via the get_epm_raw() function. Alternatively, records can be written to local files.
[Optional] Extract Information from raw XML records via epm_parse(). This function is used to identify XML fields of interest in each record, extract the corresponding information, and cast them into a structured data.frame. Results are stored in the 'data' slot of the easyPubMed object, and can be accessed via the get_epm_data() function.

PubMed Query and Record Retrieval

The code below illustrates the typical steps of an easyPubMed analysis. All data (raw records as well as processed data) are stored in the resulting easyPubMed object. In this example, n=1,597 records were retrieved and processed. This took about 14 min using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04. Data parsing (epm_parse()) is the step taking the longest time to complete.

# Load library
library(easyPubMed)

# Define Query String
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm <- epm_query(my_query) 

# Retrieve Records (xml format)
epm <- epm_fetch(epm, format = 'xml')

# Extract Information
epm <- epm_parse(epm)

# All results are stored in an easyPubMed object.
epm

epm

Get Meta data

Meta data are attached to each easyPubMed object and provide information about the record query job (e.g., number of expected records; date when the query was performed) as well as type/format of the downloaded data (e.g., format and encoding of the raw data). A unique identifier (UID) is also included to track different objects/query jobs. Meta data can be requested from an easyPubMed object via the get_epm_meta() function, which returns a list.

job_meta <- get_epm_meta(x = epm)
head(job_meta)

Get Raw Records

Raw PubMed records can be obtained from an easyPubMed (after epm_fetch() has been completed) via the get_epm_raw() function, which returns a named list. Each element includes one PubMed record. The name of each element corresponds to its PubMed record identifier (PMID).

raw_records <- get_epm_raw(epm)

# elements are named after the corresponding PMIDs
head(names(raw_records))
# elements include raw PubMed records
first_record <- raw_records[[1]] 

# Show excerpt (from record #1)
cat(substr(first_record, 1, 1200))

Get Processed Data

Processed data can be obtained from an easyPubMed (after epm_parse() has been executed) via the get_epm_data() function Processed data are returned as a data.frame. By default, each row corresponds to a PubMed record. This default behavior can be modified by tuning the compact_output and max_authors arguments (see section below). The columns/fields extracted include record identifiers, journal name, publication date, title, abstract, MeSH codes, author names and affiliations.

proc_data <- get_epm_data(epm)

# show an excerpt (first 6 records, selected columns)
slctd_fields <- c('pmid', 'doi', 'jabbrv', 'year', 'month', 'day')
head(proc_data[, slctd_fields])

A comprehensive list of the fields that are extracted from raw XML records and returned as columns of the processed data object (data.frame) is shown below.

pmid: PubMed Record Unique Reference Number (Identifier).
doi: Digital Object Identifier.
pmc: PubMed Central Unique Reference Number (PMCID) (if available).
journal: Journal Name (full-length).
jabbrv: Journal Name (abbreviation).
lang: Language (e.g., 'eng').
year: Publication Date ( field), Year.
month: Publication Date ( field), Month.
day: Publication Date ( field), Day.
title: Record Title.
abstract: Record Abstract.
mesh_codes: Medical Subject Heading Codes (e.g., 'D001749').
mesh_terms: Medical Subject Heading Codes (e.g., 'Urinary Bladder Neoplasms').
grant_ids: Reference Number of Funding Grants Supporting the Publication (if available/provided).
references: Identifiers (PMIDs or DOIs) of Publications
coi: Conflict of Interest Statement (if available/provided).
authors: List of Author Names. This field is returned when the compact_output argument is set to TRUE (epm_parse() function). Otherwise, the following columns are included: 'lastname', 'forename', 'address', 'email'.
affiliation: Address and/or Affiliation Associated with the First Author of the Study.

Get Record Identifiers (PMIDs)

The identifiers (PMIDs) of records included in an easyPubMed object (after epm_fetch() has been executed) can be obtained via the get_epm_uilist() function, which returns a character vector. PMIDs are automatically detected and extracted from all downloaded records, independently of the raw record format.

# Get PMIDs
all_pmids <- get_epm_uilist(epm)

# Show excerpt
head(all_pmids)

Advanced Operations

This section includes a few examples of less-common easyPubMed pipelines and operations. Please, contact the package maintainer for additional questions.

Non-standard PubMed Queries

The easyPubMed library comes with two special Query functions that are designed to address specific goals:

Query Entrez/PubMed by exact match of a full-length title (article title): epm_query_by_fulltitle().
Execute a PubMed Query by providing a list of record identifiers (PMIDs): epm_query_by_pmid().

These special query functions may replace the first step of the easyPubMed pipeline. After the query step has been completed, record retrieval proceeds as outlined above, i.e., via the epm_fetch() function.

Query by Article Title

It is possible to query PubMed for a record of interest by providing its full-length title as query string and via the epm_query_by_fulltitle() function. This function takes a string (character vector of length 1) as its fulltitle argument. The string should NOT include new-line characters (e.g., \n) or multi-spaces, as those may prevent the exact-match search. These special characters are NOT removed automatically (by design). You can use regular expressions (e.g., gsub()) to clean a fulltitle string before performing the query. An example is shown below.

# Article Title (including new-line chars)
my_title <- "Role of gemcitabine and cisplatin as 
             neoadjuvant chemotherapy in muscle invasive bladder cancer: 
             Experience over the last decade."

# Unpolished title string
cat(my_title)
# Clean the title
my_title <- gsub('[[:space:]]+', ' ', my_title)

# Clean title string
cat(my_title)

# Query and fetch
epm_xmpl_01 <- epm_query_by_fulltitle(fulltitle = my_title)
epm_xmpl_01 <- epm_fetch(epm_xmpl_01)
epm_xmpl_01

epm_xmpl_01

Query Using a List of PMIDs

The epm_query_by_pmid() takes a character vector as its pmids argument. If a long list of PMIDs is provided (n>50), the function automatically splits the query job into multiple 50-record sub-jobs. The resulting 'easyPubMed' object displays '' as value of the query_string meta data field. An example is shown below.

my_pmids <- c('31572460', '31511849', '31411998')

epm_xmpl_02 <- epm_query_by_pmid(pmids = my_pmids)
epm_xmpl_02 <- epm_fetch(epm_xmpl_02)
epm_xmpl_02

epm_xmpl_02

Retrieve non-XML Records

The epm_fetch() function supports three different formats. The default format is xml. Alternatively, the medline and uilist formats are also supported. Briefly, the medline option returns records in plain text format (see example below). On the contrary, the uilist format simply requests the identifiers (PMIDs) of all records returned by a query (no additional record content is retrieved from Entrez/PubMed). Note that non-XML records cannot be used to extract record information via epm_parse().

# Define Query String
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm_xmpl_03 <- epm_query(my_query) 

# Retrieve Records (request 'medline' format!)
epm_xmpl_03 <- epm_fetch(epm_xmpl_03, format = 'medline')

# Get records
xmpl_03_raw <- get_epm_raw(epm_xmpl_03)

# Elements are named after the corresponding PMIDs
head(names(xmpl_03_raw))

xmpl_03_raw <- get_epm_raw(epm_xmpl_03)
head(names(xmpl_03_raw))

# Elements include raw PubMed records
first_record <- xmpl_03_raw[[1]] 

# Show an Excerpt (record n. 12, first 18 lines)
cat(head(first_record, n=20), sep = '\n')

Queries Returning Large Numbers of Records

In easyPubMed (version r base_epm_ver or later) there are no dedicated functions for downloading large numbers of records. Large query jobs are still carried out via the epm_query() and epm_fetch() functions, which will attempt to split a single query into a list of manageable sub-jobs. An example is shown below. Briefly, we performed a query that returned n=20,825 records. The job was automatically split in n=4 sub-jobs, records were downloaded and parsed. The whole operation took about 3h 28m using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04 (i.e., about 0.6s per record).

# Define Query String
blca_query <- '"bladder cancer"[Ti] AND ("1980"[PDAT]:"2020"[PDAT])'

# Submit the Query
epm_xmpl_04 <- epm_query(blca_query) 

# Retrieve Records (medline format)
epm_xmpl_04 <- epm_fetch(epm_xmpl_04)

# Parse all records
epm_xmpl_04 <- epm_parse(epm_xmpl_04)

# Show Object
epm_xmpl_04

# Show Object
epm_xmpl_04

Save Raw Records Locally

Unlike previous versions of easyPubMed, there are no dedicated functions to write PubMed records to a local disk. Starting from easyPubMed version r base_epm_ver, this operation is performed by tuning the arguments of the epm_fetch() function and by setting the write_to_file to TRUE.

Write Files to the Local Disc.

There are 4 arguments that can be adjusted to fine-tune the behavior of epm_fetch() and write PubMed records to local files.

write_to_file: logical (defaults to FALSE). If TRUE, raw PubMed records are written to a local disk.
outfile_path: string pointing to an existing local directory (defaults to NULL). This argument is evaluated only when write_to_file is set to TRUE. Files including the raw PubMed records will be saved at the indicated location. If NULL, the current directory is used.
outfile_prefix: string indicating a prefix used for the name of files written to the local disk (defaults to NULL). If NULL, a unique prefix (prefix pattern: easypubmed_job_yyyymmddhhmmss_) is automatically generated. Files may be overwritten is a non-unique prefix is used.
store_contents: logical (defaults to TRUE). This argument is used to control whether a copy of the raw records should be stored in the current easyPubMed object (raw slot). If write_to_file is set to TRUE and store_contents is set to FALSE, raw records are written to disk and no copies are stored in the current object. This option is recommended for very large queries and replaces the batch_pubmed_download() function (available until version 2.32). If both write_to_file and store_contents are set to TRUE, records will be written to the local disk (backup copy) and also stored in the current easyPubMed object.

# Define Query String
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm_xmpl_05 <- epm_query(my_query) 

# Retrieve Records
epm_xmpl_05 <- epm_fetch(epm_xmpl_05, write_to_file = TRUE)

# Check if file exists
dir(pattern = '^easypubmed')

print('easypubmed_job_202311201513_batch_01.txt')

Read Files From the Local Disc.

It is possible to import local files storing raw PubMed records for further processing via the epm_import_xml() function. This function can be used if the following 3 conditions are met:

records were retrieved in the XML format. Other formats (e.g., the "medline" format) are NOT supported;
records were downloaded and written using easyPubMed, version 3.0 or later. Compatibility with data/files downloaded using other tools or former versions of the easyPubMed library is NOT guaranteed.
if multiple files are imported together, all files were written as part of the same epm_fetch() job. Note that different jobs that used the same query are considered as independent jobs.

Users should feed the epm_import_xml() function a character vector of file names (of length >= 1), where each element indicates a text file to be read and imported.

Note. For queries returning very large number of records, it could be convenient to parse records in multiple batches (e.g., one file at the time). This approach may be preferred because of memory efficiency and/or compatibility with parallelization. In these instances, the epm_import_xml() function will warn that only a subset of the expected files were imported, and then the parsing step will continue using all records included in the selected file(s).

# Import XML records from saved file
epm_xmpl_06 <- epm_import_xml(x = 'easypubmed_job_202311201513_batch_01.txt')

# Show Object
epm_xmpl_06

epm_xmpl_06

Alternative Approaches for Record Parsing

As we outlined above, information can be extracted from raw records ("xml" format) via the epm_parse() function. Results (data.frame) are stored in the same easyPubMed object (data slot) and can be requested via the get_epm_data() function. Users can adjust the way information are extracted and formatted from PubMed records by tweaking the epm_parse() function arguments. The most important arguments are discussed below.

Compact vs. extended output.

A new feature of easyPubMed (version r base_epm_ver or later) is the capacity of tuning the author information extraction process. The compact_output and max_authors arguments can be adjusted to get the desired behavior.

compact_output: logical (defaults to TRUE). Author names are returned in a compact format (i.e., author names are collapsed together) when this argument is set to TRUE, and each row in the final data.frame corresponds to a PubMed record. If FALSE, each row corresponds to a single author of the publication and the record-specific data are recycled for all included authors (legacy approach, this is similar to the typical output of the corresponding function in easyPubMed ver. 2.30 and earlier). When compact_output is set to FALSE, the processed data row number is typically bigger than the number or raw records that were retrieved. The behavior managed via the compact_output argument is a new feature of easyPubMed version r base_epm_ver or later.
max_authors: numeric, maximum number of authors to retrieve. If this is set to -1, only the last author is extracted. If this is set to 1, only the first author is returned. If this is set to 2, the first and the last authors are extracted. If this is set to any other positive number (i), up to the leading (i-1) authors are retrieved together with the last author. If this is set to a number larger than the number of authors in a record, all authors are returned. Note that at least 1 author has to be retrieved, therefore a value of 0 is not accepted (coerced to -1).

Citations.

The epm_parse() function can now extract citation information (if available). This feature was introduced in easyPubMed version r base_epm_ver. The max_references and ref_id_type arguments can be adjusted to obtain information in the desired format.

max_references: numeric, maximum number of references to return (from each PubMed record).
ref_id_type: string, must be one of the following values: c('pmid', 'doi'). Type of identifier used to describe citation references.

In the example below, n=1,597 records were retrieved and processed. This took less about 7 min using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04.

my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm_xmpl_07 <- epm_query(my_query) 

# Retrieve Records
epm_xmpl_07 <- epm_fetch(epm_xmpl_07)

# Parse (custom params)
epm_xmpl_07 <- epm_parse(epm_xmpl_07, 
                         max_authors = 3, compact_output = TRUE, 
                         max_references = 5, ref_id_type = 'pmid')

# Request parsed data
epm_data <- get_epm_data(epm_xmpl_07)

# Columns of interest
cols_of_int <- c('pmid',  'doi', 'authors', 'jabbrv', 'year', 'references')

# Show an excerpt
head(epm_data[, cols_of_int])

epm_data <- get_epm_data(epm_xmpl_07)

# Columns of interest
cols_of_int <- c('pmid',  'doi', 'authors', 'jabbrv', 'year', 'references')

# Show an excerpt
head(epm_data[, cols_of_int])

Software Maintenance and Life Cycle

The new pipeline proposed in easyPubMed version r this_epm_ver replaces the old pipeline based on the following functions: get_pubmed_ids(), fetch_pubmed_data(), and table_articles_byAuth(). We are planning to phase out these functions by the end of 2024.
The current version of easyPubMed still includes revised versions of the get_pubmed_ids(), fetch_pubmed_data(), and table_articles_byAuth() for legacy purposes. The new versions of these functions should return output that is compatible with old versions of easyPubMed (with the exception of get_pubmed_ids()). Other functions (e.g., batch_pubmed_download () or fetch_all_pubmed_ids()) have been discontinued and/or replaced as outlined below.
Function Replacement Map
article_to_df() -> epm_parse_record()
articles_to_list() -> discontinued
batch_pubmed_download() -> epm_fetch() [write_to_file = TRUE]
custom_grep() -> EPM_custom_grep() [not exported]
fetch_all_pubmed_ids() -> epm_fetch() [format = 'uilist']
fetch_pubmed_data() -> epm_fetch()
get_pubmed_ids_by_fulltitle() -> epm_query_by_fulltitle()
get_pubmed_ids() -> epm_query()
table_articles_byAuth() -> epm_parse() and then get_epm_data()
trim_address() -> discontinued
fetch_PMID_data() -> epm_query_by_pmid() and then epm_fetch()
extract_article_ids() -> EPM_detect_pmid() [not exported]
fetch_pubmed_data_by_PMID() -> epm_query_by_pmid() and then epm_fetch()

Additional Information

More info, other examples and vignettes, and Advanced Guides

Dev version of easyPubMed on GitHub Website
Additional Resources and Tutorials are made available via the official website.

References

easyPubMed official website including news, vignettes, and further information https://www.data-pulse.com/dev_site/easypubmed/
Sayers, E. A General Introduction to the E-utilities (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK25497/
PubMed Help (NCBI) https://www.ncbi.nlm.nih.gov/books/NBK3827/

Feedback, Citations and Collaborations

Thank you very much for using easyPubMed and/or reading this vignette. Please, feel free to contact me (author/maintainer) for feedback, questions and suggestions: my email is .
If you use easyPubMed in a scientific publication, please cite this R package in the Materials and Methods section of the paper. Thanks!
I may be open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this vignette, feel free to contact me via email.

easyPubMed Copyright (C) 2017-2023 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SessionInfo

sessionInfo()

Success! - by Damiano Fantini - r format(Sys.time(), format = '%b %d, %Y').

dami82/easyPubMed documentation built on Jan. 4, 2024, 6:21 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dami82/easyPubMed
Search and Retrieve Scientific Publication Records from PubMed

In dami82/easyPubMed: Search and Retrieve Scientific Publication Records from PubMed

Notes

New features of easyPubMed version `r this_epm_ver`

Installation

Stable Version

Dev Version(s)

Tutorial

Overview

Advanced Operations

Non-standard PubMed Queries

Retrieve non-XML Records

Queries Returning Large Numbers of Records

Save Raw Records Locally

Alternative Approaches for Record Parsing

Software Maintenance and Life Cycle

Additional Information

More info, other examples and vignettes, and Advanced Guides

References

Feedback, Citations and Collaborations

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

dami82/easyPubMed Search and Retrieve Scientific Publication Records from PubMed

In dami82/easyPubMed: Search and Retrieve Scientific Publication Records from PubMed

Notes

New features of easyPubMed version r this_epm_ver

Installation

Stable Version

Dev Version(s)

Tutorial

Overview

Advanced Operations

Non-standard PubMed Queries

Retrieve non-XML Records

Queries Returning Large Numbers of Records

Save Raw Records Locally

Alternative Approaches for Record Parsing

Software Maintenance and Life Cycle

Additional Information

More info, other examples and vignettes, and Advanced Guides

References

Feedback, Citations and Collaborations

SessionInfo

R Package Documentation

Browse R Packages

We want your feedback!

dami82/easyPubMed
Search and Retrieve Scientific Publication Records from PubMed

New features of easyPubMed version `r this_epm_ver`