In biostatistically/PubscrapeR: Scrape PubMed, Compile, and Export to csv

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The pubscraper package is a helper and a wrapper for businessPubMed which extracts author information from PubMed query results for a user-defined set of search terms. The scrape2csv function was written for the purpose of automating the process of querying the same set of search terms over a defined list of journals to extract author contact information and produce a list with unduplicated contacts. However, the scrape2csv function can be used for PubMed queries over all journals and there are options to produce csv files other than author contact information.

Default exports

Query results from selected journals (if a selection is specified) are compiled, cleaned of duplicates in regards to author contact information (name and email), and then exported to csv. As a default, raw query results for each journal are also exported to csv and a report of several different types of counts of unique observations (journal, article, author, email, and contact information) are provided and exported to csv.

Optional exports

Options for exporting other data include: 1) a list of journals with a count of articles pertaining to the query for each journal; 2) a list of authors with their affiliation, a count of articles and a count of jounals for each author; and 3) a list of publication titles pertaining to the query with journal, year, and first author name.

Installing the pubscraper package

The pubscraper package can be installed from GitHub via devtools (devtools package is available on CRAN). It requires easyPubMed and businessPubMed (both available on GitHub) for performing the queries, and dplyr. Tidyverse is suggested.

## tidyverse is suggested since functions from dplyr and tidyr are used
# install.packages("tidyverse") 
# install.pacakges("here")
# install.packages("devtools")
devtools::install_github("dami82/easyPubMed")
devtools::install_github("dami82/businessPubMed")
devtools::install_github("biostatistically/pubscraper")

library(tidyverse)
library(easyPubMed)
library(businessPubMed)
library(here)
library(pubscraper)

Basics

The arguments

Here's the structure of the entire function. There are a lot of options to customize your search and output, but only two of the parameters are required for you to define: narrow (the search terms) and start (the start of the range of dates for PubMed to search).

scrape2csv <- function(narrow, 
                       broad=NULL,
                       operator=NULL,
                       journals=NULL, 
                       start, 
                       end=NULL, 
                       title=NULL, 
                       outpath=NULL, 
                       newfolder=TRUE, 
                       raw=TRUE,
                       clist=TRUE,
                       alist=FALSE,
                       plist=FALSE,
                       jlist=FALSE,
                       report=TRUE)

The details

narrow : ( REQUIRED; character vector of length 1) This parameter has a lot of flexibility. One way to use this argument is for key search terms using AND as the operator between terms (therefore the parameter name of narrow). Each term must be enclosed in double quotes, and the whole set of terms needs to be enclosed in parentheses. e.g. ("keyword" AND "some key phrase" AND "another key phrase") NOTE- for advanced users, this parameter can be used to enter all desired search terms with any Boolean operator(s).
broad : (optional; character vector of length 1) Again, there is flexibility of how this is defined. In the same vein as above, one way to use this is for key search terms using OR as the operator between terms. e.g. ("keyword1" OR "key phrase1")
operator : (optional; character string) Boolean operator to be placed between narrow and broad; if not specified, 'AND' will be used
journals : (optional; character vector of length j) a list of journals to query for the set of search terms. _Note- When specified, the query will iterate over the list of journals for the set of search terms defined
start : ( REQUIRED; character string) the earliest publication date you want to search
end : (optional) the latest publication date you want to search; if not specified, default of 3000/12/31 will be used
title : (optional; character string) title of your query which will be used to name folders and output files NOTE- best to keep it short and simple
outpath : (optional; character string) directory where you want your output to be placed; if not specified, the working directory will be used
newfolder : If TRUE, a new folder will be created for your output; default is TRUE
raw : If TRUE, raw query results are exported to csv; default is TRUE
clist : If TRUE, author contact information with emails will be exported to csv; default is TRUE
alist : If TRUE, a list of authors with a count of articles and a count of journals for each author is exported to csv; default is FALSE
plist : If TRUE, a list of articles with journal and first author name is exported to csv; default is FALSE
jlist = If TRUE, a list of journals with a count of number of articles is exported to csv; default is FALSE
report = If TRUE, a query report will be exported to csv; default is TRUE

Example

Let's say we want to know which journals have published articles on the effect of social media on mental health in adolescents in 2020. We'd also like to know how many articles each journal has published on the subject. In addition, we want a list of publication titles with their first authors. We can set up our query by:

using ("social media" AND "adolescents") in the narrow search term parameter. (It's narrow since you're meant to put AND between each search term. There is flexibility in this, however. See the next section on customizing the query even further.) narrow = '(“social media” AND “adolescents”)'
using ("mental health" OR "mental illness") in the broad search term parameter. (It's broad since you're meant to put OR between each search term. This parameter is optional.) broad = '("mental health" OR "mental illness")'
using "AND" for the operator between narrow and broad operator = "AND" Note that this does not need to be explicitly defined since the default is AND.
setting the start date to 2020/01/01 (or 2020/01) start = "2020/01/01"

The resulting query will be ("social media" AND "adolescents") AND ("mental health" OR "mental illness") for articles published from 2020/01/01 to 3000/12/31. The end date is not a typo - 3000/12/31 is the default end of publication date range in PubMed. Note- This can result in queries containing articles that have a published date in the future. So if you want only want articles that have a publication date in 2020, then end = "2020/12/31" needs to be specified.

Since we want a list of journals with a count of publications matching search criteria for each journal, we specify jlist = TRUE. We also specify plist = TRUE, since we want a separate list of publication titles with their first authors.

If we don't change the defaults of raw = TRUE,clist = TRUE, and report = TRUE, we will also obtain raw results, a contact list for the authors, and a query report as csv files, even if these parameters are left out of the code. So, according to the code for this example, we will obtain the following csv files:

raw query search results of journals, publication titles, authors, author affiliation, author email (since raw = TRUE by default)
a list of author contacts with author name, email, and affiliation, as well as the number of publications and the number of journals (since clist = TRUE by default)
a list of journals with the total number of publication titles appearing in the search for each journal (since we specify jlist = TRUE)
publication list with title, journal, year, as well as first author first name and last name (since we specify plist = TRUE)
a report of query stats (since report = TRUE by default)

When a title is defined, output files will be created with the prefix specified by title. In this example, title = "SocialMedia_run01". Note- Best to keep titles as simple as possible since they go into file nomenclature. Since newfolder = FALSE, the csv files are exported directly to the directory specified by outpath.

n_terms <- '("social media" AND "adolescents")'
b_terms <- '("mental health" OR " mental illness")'
my.operator <- "AND" #explicitly being specified eventhough AND is the default
my.start <- "2020/01/01"
my.end <- "2020/12/31"
my.title <- "SocialMedia_run01"  
my.path <- "/Users/ivy/example/"

scrape2csv(narrow = n_terms, 
           broad = b_terms, 
           operator = my.operator, 
           start = my.start, 
           end = my.end, 
           title = my.title, 
           outpath = my.path, 
           newfolder = FALSE, 
           plist = TRUE,
           jlist = TRUE)

Customizing for more advanced searches

For help on advanced PubMed searches, see https://pubmed.ncbi.nlm.nih.gov/advanced/

For this second example, let's say we're interested in obtaining a contact list of researchers who have published articles between 2011 and 2020 in two journals,"Sleep Health" and "Journal of Adolescent Health", pertaining to the effect of social media on mental health in adolescents but not young adults. We can set up our customized search query by:

assigning ("adolescents" OR "teens") NOT "young adults" AND ("mental illness" OR "mental health") AND ("social media" OR "social network") to the narrow parameter and ignore the broad argument narrow = '("adolescents" OR "teens") NOT "young adults" AND ("mental illness" OR "mental health") AND ("social media" OR "social network")'
creating a vector of length 2 with the two journal names so that the query will loop through the list of journals iteratively journals = c("Sleep Health", "J Adolesc Health")
setting start and end dates as 2011 and 2020, respectively start = "2011" end = "2020"

Let's say that we want our author contact list and other default output exported to a new folder and we want a custom prefix added to each output file. Since by default, the contact list is exported and a new folder created, we don't need to specify clist = TRUE or set newfolder = TRUE, but it doesn't hurt to do so. We set our custom prefix by defining title = "SocialMedia_run02". Note- It’s best practice to title your runs differently since files and folders can be overwritten if they have the same name.

The following csv files are exported to a new folder named SocialMedia_run02 into the working directory since we don't specify outpath, and each file exported would have the same prefix (SocialMedia_run02): raw query search results of articles, authors, author affiliation, author email for each specified journal (since raw = TRUE by default) author contact information compiled from both journals, then cleaned of duplicates by author first name, last name, and email (since clist = TRUE by default) * a report of query stats (since report = TRUE by default)

my.custom <- '("adolescents" OR "teens") NOT "young adults" AND ("mental illness" OR "mental health") AND ("social media" OR "social network")'
my.journals <- c("Sleep Health", "J Adolesc Health")
my.start <- "2011"
my.end <- "2020"
my.title <- "SocialMedia_run02"
scrape2csv(narrow = my.custom, 
           journals = my.journals, 
           start = my.start, 
           end = my.end, 
           title = my.title)