# Declarations
base_epm_ver <- '3.0'
stab_epm_ver <- '3.1.3'
this_epm_ver <- '3.1.3'



# pre-load libs and data
library(easyPubMed)
data("epm_samples")

# Collect custom f(x)
rebuild_uili <- epm_samples$fx$rebuild_uili
rebuild_li <- epm_samples$fx$rebuild_li
rebuild_df <- epm_samples$fx$rebuild_df
fabricate_epm_obj <- epm_samples$fx$fabricate_epm_obj
slice_epm_obj_to_special <- epm_samples$fx$slice_epm_obj_to_special

# Collect data
blca_2018 <- epm_samples$bladder_cancer_2018
blca_40y <- epm_samples$bladder_cancer_40y

# Fabricate objects (vignette version)
epm <- fabricate_epm_obj(blca_2018$demo_data_03)
epm_xmpl_01 <- slice_epm_obj_to_special(epm, mode = 1)
epm_xmpl_02 <- slice_epm_obj_to_special(epm, mode = 2)
epm_xmpl_03 <- fabricate_epm_obj(blca_2018$demo_data_04)
epm_xmpl_04 <- fabricate_epm_obj(blca_40y$demo_data_01)
epm_xmpl_06 <- fabricate_epm_obj(blca_2018$demo_data_02)
epm_xmpl_07 <- fabricate_epm_obj(blca_2018$demo_data_05)

PubMed is an online repository of references and abstracts of publications in the fields of medicine and life sciences. PubMed is a free resource that is developed and maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH). The PubMed homepage is located at the following URL: https://pubmed.ncbi.nlm.nih.gov/. In addition to its web portal, PubMed can be queried programmatically via the NCBI Entrez E-utilities interface.

easyPubMed is an open-source R interface to the Entrez Programming Utilities aimed at allowing programmatic access to PubMed in the R environment. The package is suitable for downloading large numbers of records, and includes a collection of functions to perform basic processing of the Entrez/PubMed query responses. The library supports both XML and TXT ("medline") formats. This vignette covers the key functionalities of easyPubMed and provides some informative examples to get started.


Notes


New features of easyPubMed version 3.1.3


Installation

Stable Version

To install the stable version (3.1.3) of easyPubMed from CRAN, you can run the following line of code.

install.packages("easyPubMed")

Dev Version(s)

Dev versions of the library are hosted on GitHub. If interested, you can install the latest dev version using the devtools library.

devtools::install_github("dami82/easyPubMed")

Tutorial

The first section of the tutorial covers how to use easyPubMed (version 3.0 or later) for querying PubMed, retrieving records from the Entrez History Server, and analyzing results. The second part of the tutorial provides additional examples and information about special cases and advanced operations.


Overview

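The typical easyPubMed pipeline is a three-step process: submit a query to PubMed (epm_query()), retrieve the matching records (epm_fetch()), and extract information from the raw records (epm_parse()).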


PubMed Query and Record Retrieval

The code below illustrates the typical steps of an easyPubMed analysis. All data (raw records as well as processed data) are stored in the resulting easyPubMed object. In this example, n=1,597 records were retrieved and processed. This took about 14 min using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04. Data parsing (epm_parse()) is the step taking the longest time to complete.

# Load library
library(easyPubMed)

# Define Query String
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm <- epm_query(my_query) 

# Retrieve Records (xml format)
epm <- epm_fetch(epm, format = 'xml')

# Extract Information
epm <- epm_parse(epm)

# All results are stored in an easyPubMed object.
epm

Get Meta data

Meta data are attached to each easyPubMed object and provide information about the record query job (e.g., number of expected records; date when the query was performed) as well as type/format of the downloaded data (e.g., format and encoding of the raw data). A unique identifier (UID) is also included to track different objects/query jobs. Meta data can be requested from an easyPubMed object via the get_epm_meta() function, which returns a list.

job_meta <- get_epm_meta(x = epm)
head(job_meta)

Get Raw Records

Raw PubMed records can be obtained from an easyPubMed object (after epm_fetch() has been completed) via the get_epm_raw() function, which returns a named list. Each element includes one PubMed record. The name of each element corresponds to its PubMed record identifier (PMID).

raw_records <- get_epm_raw(epm)

# elements are named after the corresponding PMIDs
head(names(raw_records))
# elements include raw PubMed records
first_record <- raw_records[[1]] 

# Show excerpt (from record #1)
cat(substr(first_record, 1, 1200))
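
Because each raw XML record is a plain character string, a single field can be pulled out with base R before running the full parsing step. The lines below are a minimal sketch, assuming the record contains PMID and ArticleTitle elements as in standard PubMed XML. Both mock_record and extract_first_tag() are hypothetical stand-ins introduced for illustration; they are not part of easyPubMed.

```r
# Hypothetical stand-in for one element of get_epm_raw(epm); not real PubMed data
mock_record <- paste0(
  "<PubmedArticle><MedlineCitation>",
  "<PMID Version=\"1\">00000001</PMID>",
  "<Article><ArticleTitle>An example title.</ArticleTitle></Article>",
  "</MedlineCitation></PubmedArticle>")

# Return the text of the first <tag>...</tag> element, or NA if the tag is absent
extract_first_tag <- function(xml_string, tag) {
  # (?s) lets '.' match new-lines; the optional group skips tag attributes
  pattern <- paste0("(?s)<", tag, "(\\s[^>]*)?>(.*?)</", tag, ">")
  hit <- regmatches(xml_string, regexpr(pattern, xml_string, perl = TRUE))
  if (length(hit) == 0) return(NA_character_)
  sub(pattern, "\\2", hit, perl = TRUE)
}

extract_first_tag(mock_record, "ArticleTitle")
extract_first_tag(mock_record, "PMID")
```

For anything beyond a quick look at one or two fields, epm_parse() remains the intended route.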

Get Processed Data

Processed data can be obtained from an easyPubMed object (after epm_parse() has been executed) via the get_epm_data() function. Processed data are returned as a data.frame. By default, each row corresponds to a PubMed record. This default behavior can be modified by tuning the compact_output and max_authors arguments (see section below). The columns/fields extracted include record identifiers, journal name, publication date, title, abstract, MeSH codes, author names and affiliations.

proc_data <- get_epm_data(epm)

# show an excerpt (first 6 records, selected columns)
slctd_fields <- c('pmid', 'doi', 'jabbrv', 'year', 'month', 'day')
head(proc_data[, slctd_fields])
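
Since the processed data are a regular data.frame, standard base R tools apply. The sketch below counts records per journal. It uses mock_data, a made-up stand-in that borrows only the column names shown above (pmid, jabbrv, year); the values and the column types are assumptions.

```r
# Hypothetical stand-in for get_epm_data(epm); all values are made up
mock_data <- data.frame(
  pmid   = c("00000001", "00000002", "00000003", "00000004"),
  jabbrv = c("J Urol", "Eur Urol", "J Urol", "BJU Int"),
  year   = c("2018", "2018", "2018", "2018"),
  stringsAsFactors = FALSE)

# Count records per journal, most frequent first
journal_counts <- sort(table(mock_data$jabbrv), decreasing = TRUE)
head(journal_counts)
```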

A comprehensive list of the fields that are extracted from raw XML records and returned as columns of the processed data object (data.frame) is shown below.


Get Record Identifiers (PMIDs)

The identifiers (PMIDs) of records included in an easyPubMed object (after epm_fetch() has been executed) can be obtained via the get_epm_uilist() function, which returns a character vector. PMIDs are automatically detected and extracted from all downloaded records, independently of the raw record format.

# Get PMIDs
all_pmids <- get_epm_uilist(epm)

# Show excerpt
head(all_pmids)
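
A character vector of PMIDs is convenient for quick sanity checks, for example to verify that every expected record was retrieved. The following is a small sketch with made-up identifiers; requested_pmids and retrieved_pmids are hypothetical names, with the latter standing in for the output of get_epm_uilist().

```r
# Hypothetical stand-ins (made-up PMIDs)
requested_pmids <- c("30000001", "30000002", "30000003")
retrieved_pmids <- c("30000001", "30000003")  # e.g., output of get_epm_uilist()

# PMIDs that were requested but are not among the retrieved records
missing_pmids <- setdiff(requested_pmids, retrieved_pmids)
missing_pmids

# TRUE if no PMID occurs more than once
anyDuplicated(retrieved_pmids) == 0
```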

Advanced Operations

This section includes a few examples of less-common easyPubMed pipelines and operations. Please contact the package maintainer for additional questions.

Non-standard PubMed Queries

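The easyPubMed library comes with two special query functions that are designed to address specific goals: epm_query_by_fulltitle(), which searches for a record by its full-length title, and epm_query_by_pmid(), which requests records from a list of PMIDs.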

These special query functions may replace the first step of the easyPubMed pipeline. After the query step has been completed, record retrieval proceeds as outlined above, i.e., via the epm_fetch() function.


Query by Article Title

It is possible to query PubMed for a record of interest by passing its full-length title to the epm_query_by_fulltitle() function. This function takes a string (character vector of length 1) as its fulltitle argument. The string should NOT include new-line characters (e.g., \n) or multiple consecutive spaces, as those may prevent the exact-match search. These special characters are NOT removed automatically (by design). You can use regular expressions (e.g., gsub()) to clean a fulltitle string before performing the query. An example is shown below.

# Article Title (including new-line chars)
my_title <- "Role of gemcitabine and cisplatin as 
             neoadjuvant chemotherapy in muscle invasive bladder cancer: 
             Experience over the last decade."

# Unpolished title string
cat(my_title)
# Clean the title
my_title <- gsub('[[:space:]]+', ' ', my_title)

# Clean title string
cat(my_title)
# Query and fetch
epm_xmpl_01 <- epm_query_by_fulltitle(fulltitle = my_title)
epm_xmpl_01 <- epm_fetch(epm_xmpl_01)
epm_xmpl_01
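
When several titles need cleaning, the gsub() step can be wrapped in a small helper. The sketch below is one possible version; clean_title() is a hypothetical function, not part of easyPubMed, and it additionally trims leading and trailing blanks, on the assumption that those could also interfere with exact matching.

```r
# Hypothetical helper (not part of easyPubMed): normalize a title string
clean_title <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  x <- gsub("[[:space:]]+", " ", x)  # collapse new-lines, tabs, and multi-spaces
  trimws(x)                          # drop leading/trailing blanks
}

raw_title <- "  Role of gemcitabine and cisplatin as\n   neoadjuvant chemotherapy  "
clean_title(raw_title)
```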

Query Using a List of PMIDs

The epm_query_by_pmid() function takes a character vector as its pmids argument. If a long list of PMIDs is provided (n>50), the function automatically splits the query job into multiple 50-record sub-jobs. The resulting 'easyPubMed' object displays '' as the value of the query_string meta data field. An example is shown below.

my_pmids <- c('31572460', '31511849', '31411998')

epm_xmpl_02 <- epm_query_by_pmid(pmids = my_pmids)
epm_xmpl_02 <- epm_fetch(epm_xmpl_02)
epm_xmpl_02
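
To make the 50-record splitting concrete, the sketch below shows how a long PMID vector can be divided into batches of at most 50 elements in base R. It only illustrates the idea; the internal implementation used by epm_query_by_pmid() is not shown in this vignette, and the PMIDs are made up.

```r
# Hypothetical list of 120 PMIDs (made-up identifiers)
many_pmids <- as.character(30000001:30000120)

# Assign each PMID to a batch of at most 50 records
batch_size <- 50
batch_id <- ceiling(seq_along(many_pmids) / batch_size)
pmid_batches <- split(many_pmids, batch_id)

# Number of PMIDs per batch
lengths(pmid_batches)
```

With 120 identifiers, this yields two full batches and a final batch holding the remaining 20.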

Retrieve non-XML Records

The epm_fetch() function supports three different formats. The default format is xml; the medline and uilist formats are also supported. Briefly, the medline option returns records in plain text format (see example below). In contrast, the uilist format simply requests the identifiers (PMIDs) of all records returned by a query (no additional record content is retrieved from Entrez/PubMed). Note that non-XML records cannot be used to extract record information via epm_parse().

# Define Query String
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm_xmpl_03 <- epm_query(my_query) 

# Retrieve Records (request 'medline' format!)
epm_xmpl_03 <- epm_fetch(epm_xmpl_03, format = 'medline')

# Get records
xmpl_03_raw <- get_epm_raw(epm_xmpl_03)

# Elements are named after the corresponding PMIDs
head(names(xmpl_03_raw))
# Elements include raw PubMed records
first_record <- xmpl_03_raw[[1]] 

# Show an Excerpt (first record, first 20 lines)
cat(head(first_record, n=20), sep = '\n')  
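
Since epm_parse() does not handle medline records, a few fields can be recovered by hand. The sketch below assumes that a record is a character vector of lines (as the head() call above suggests) and that it follows the usual MEDLINE layout: a short upper-case tag, a hyphen, then the value, with indented continuation lines. mock_medline is a made-up stand-in, and all object names are hypothetical.

```r
# Hypothetical stand-in for one medline-format record (made-up content)
mock_medline <- c(
  "PMID- 00000001",
  "TI  - An example title that is long enough",
  "      to wrap onto a second line.",
  "AU  - Doe J",
  "AU  - Roe R")

# Lines starting with a tag open a new field; other lines continue the previous one
tag_regex <- "^[A-Z]{2,4} *- "
is_tag_line <- grepl(tag_regex, mock_medline)
field_id <- cumsum(is_tag_line)

# Strip the tag prefix, trim blanks, and join continuation lines
field_text <- tapply(mock_medline, field_id, function(z) {
  paste(trimws(sub(tag_regex, "", z)), collapse = " ")
})
field_tag <- sub(" *-.*$", "", mock_medline[is_tag_line])

# Group values by tag (repeated tags, e.g. AU, yield several values)
fields <- split(as.vector(field_text), field_tag)
fields$TI
fields$AU
```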

Queries Returning Large Numbers of Records

In easyPubMed (version 3.0 or later) there are no dedicated functions for downloading large numbers of records. Large query jobs are still carried out via the epm_query() and epm_fetch() functions, which will attempt to split a single query into a list of manageable sub-jobs. An example is shown below. Briefly, we performed a query that returned n=20,825 records. The job was automatically split into n=4 sub-jobs, and the records were downloaded and parsed. The whole operation took about 3h 28m using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04 (i.e., about 0.6s per record).

# Define Query String
blca_query <- '"bladder cancer"[Ti] AND ("1980"[PDAT]:"2020"[PDAT])'

# Submit the Query
epm_xmpl_04 <- epm_query(blca_query) 

# Retrieve Records (default xml format, required by epm_parse())
epm_xmpl_04 <- epm_fetch(epm_xmpl_04)

# Parse all records
epm_xmpl_04 <- epm_parse(epm_xmpl_04)

# Show Object
epm_xmpl_04
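
The per-record figure quoted in this section follows from simple arithmetic on the reported run. The lines below reproduce it and, as a rough planning aid, extrapolate to another job size. The extrapolation assumes that time scales linearly with the number of records on comparable hardware, which is an assumption rather than a measured result; n_planned is a hypothetical value.

```r
# Figures reported for the 40-year bladder cancer query
n_records <- 20825
elapsed_sec <- 3 * 3600 + 28 * 60   # 3h 28m expressed in seconds

# Average time per record
sec_per_record <- elapsed_sec / n_records
round(sec_per_record, 2)

# Rough estimate (in minutes) for a hypothetical job of 5,000 records
n_planned <- 5000
est_minutes <- n_planned * sec_per_record / 60
round(est_minutes)
```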

Save Raw Records Locally

Unlike previous versions of easyPubMed, there are no dedicated functions to write PubMed records to a local disk. Starting from easyPubMed version 3.0, this operation is performed by tuning the arguments of the epm_fetch() function, i.e., by setting the write_to_file argument to TRUE.

Write Files to the Local Disk.

There are 4 arguments that can be adjusted to fine-tune the behavior of epm_fetch() and write PubMed records to local files.

# Define Query String
my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm_xmpl_05 <- epm_query(my_query) 

# Retrieve Records
epm_xmpl_05 <- epm_fetch(epm_xmpl_05, write_to_file = TRUE)

# Check if file exists
dir(pattern = '^easypubmed')
print('easypubmed_job_202311201513_batch_01.txt')
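
After a download, it can be useful to confirm that the output files exist and are not empty. The sketch below does this with base R only. To keep it self-contained, it first creates a dummy file in a temporary folder; the folder, the file name, and the file content are all hypothetical and only mimic the naming pattern shown above.

```r
# Stand-in setup: create a dummy file in a temp folder (name is hypothetical)
out_dir <- tempfile("epm_demo_")
dir.create(out_dir)
writeLines("<PubmedArticleSet></PubmedArticleSet>",
           file.path(out_dir, "easypubmed_job_demo_batch_01.txt"))

# List matching files and report their size in bytes
epm_files <- list.files(out_dir, pattern = "^easypubmed", full.names = TRUE)
data.frame(file = basename(epm_files),
           bytes = file.size(epm_files))
```

In a real session, out_dir would be replaced by the folder where epm_fetch() wrote its files.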

Read Files From the Local Disk.

It is possible to import local files storing raw PubMed records for further processing via the epm_import_xml() function. This function can be used if the following 3 conditions are met:

Users should feed the epm_import_xml() function a character vector of file names (of length >= 1), where each element indicates a text file to be read and imported.

# Import XML records from saved file
epm_xmpl_06 <- epm_import_xml(x = 'easypubmed_job_202311201513_batch_01.txt')

# Show Object
epm_xmpl_06

Alternative Approaches for Record Parsing

As we outlined above, information can be extracted from raw records ("xml" format) via the epm_parse() function. Results (data.frame) are stored in the same easyPubMed object (data slot) and can be requested via the get_epm_data() function. Users can adjust how information is extracted from PubMed records and formatted by tweaking the epm_parse() function arguments. The most important arguments are discussed below.

Compact vs. extended output.

A new feature of easyPubMed (version 3.0 or later) is the ability to tune the author information extraction process. The compact_output and max_authors arguments can be adjusted to get the desired behavior.


Citations.

The epm_parse() function can now extract citation information (if available). This feature was introduced in easyPubMed version 3.0. The max_references and ref_id_type arguments can be adjusted to obtain information in the desired format.


In the example below, n=1,597 records were retrieved and processed. This took about 7 min using a 2-vCPUs, 4Gb-memory machine running on Ubuntu 20.04.

my_query <- '"bladder cancer"[Ti] AND "2018"[PDAT]' 

# Submit the Query
epm_xmpl_07 <- epm_query(my_query) 

# Retrieve Records
epm_xmpl_07 <- epm_fetch(epm_xmpl_07)

# Parse (custom params)
epm_xmpl_07 <- epm_parse(epm_xmpl_07, 
                         max_authors = 3, compact_output = TRUE, 
                         max_references = 5, ref_id_type = 'pmid')

# Request parsed data
epm_data <- get_epm_data(epm_xmpl_07)

# Columns of interest
cols_of_int <- c('pmid',  'doi', 'authors', 'jabbrv', 'year', 'references')

# Show an excerpt
head(epm_data[, cols_of_int])

Software Maintenance and Life Cycle


Additional Information

More info, other examples and vignettes, and Advanced Guides


References


Feedback, Citations and Collaborations


easyPubMed Copyright (C) 2017-2023 Damiano Fantini. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.


SessionInfo

sessionInfo()

Success! - by Damiano Fantini - r format(Sys.time(), format = '%b %d, %Y').



dami82/easyPubMed documentation built on Jan. 4, 2024, 6:21 a.m.