knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The pubscraper package is a helper and wrapper for businessPubMed, which extracts author information from PubMed query results for a user-defined set of search terms. The scrape2csv function automates the process of querying the same set of search terms over a defined list of journals, extracting author contact information, and producing a deduplicated contact list. However, scrape2csv can also be used for PubMed queries across all journals, and there are options to export csv files beyond author contact information.

Default exports

Query results from the selected journals (if a selection is specified) are compiled, cleaned of duplicate author contact information (name and email), and exported to csv. By default, raw query results for each journal are also exported to csv, along with a report of counts of unique observations (journals, articles, authors, emails, and contact records).
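As a sketch, the deduplication step behaves like a dplyr::distinct() call on author name and email. The data frame and column names below are hypothetical illustrations, not the package's internals:

```r
library(dplyr)

# Hypothetical raw query results: the same author appears in two journals
raw_results <- data.frame(
  lastname  = c("Smith", "Smith", "Lee"),
  firstname = c("Ana", "Ana", "Jun"),
  email     = c("ana@uni.edu", "ana@uni.edu", "jun@lab.org"),
  journal   = c("Sleep Health", "J Adolesc Health", "Sleep Health")
)

# Keep one row per unique contact (name + email)
contacts <- raw_results %>%
  distinct(lastname, firstname, email)

nrow(contacts)  # 2 unique contacts remain
```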

Optional exports

Options for exporting other data include: 1) a list of journals with a count of articles matching the query for each journal; 2) a list of authors with their affiliation, a count of articles, and a count of journals for each author; and 3) a list of publication titles matching the query, with journal, year, and first author name.
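Option 1, for example, amounts to counting unique articles per journal. A minimal dplyr sketch with made-up data (not the package's internal code):

```r
library(dplyr)

# Hypothetical article-level query results
articles <- data.frame(
  journal = c("Sleep Health", "Sleep Health", "J Adolesc Health"),
  title   = c("Article A", "Article B", "Article C")
)

# Count articles per journal, most prolific journal first
journal_counts <- articles %>%
  count(journal, sort = TRUE)

journal_counts  # Sleep Health has n = 2, J Adolesc Health has n = 1
```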

Installing the pubscraper package

The pubscraper package can be installed from GitHub via devtools (the devtools package is available on CRAN). It requires easyPubMed and businessPubMed (both available on GitHub) for performing the queries, as well as dplyr. The tidyverse is suggested.

## tidyverse is suggested since functions from dplyr and tidyr are used
# install.packages("tidyverse") 
# install.packages("here")
# install.packages("devtools")
devtools::install_github("dami82/easyPubMed")
devtools::install_github("dami82/businessPubMed")
devtools::install_github("biostatistically/pubscraper")
library(tidyverse)
library(easyPubMed)
library(businessPubMed)
library(here)
library(pubscraper)

Basics

The arguments

Here's the structure of the entire function. There are many options to customize your search and output, but only two parameters are required: narrow (the search terms) and start (the beginning of the publication date range for PubMed to search).

scrape2csv <- function(narrow, 
                       broad=NULL,
                       operator=NULL,
                       journals=NULL, 
                       start, 
                       end=NULL, 
                       title=NULL, 
                       outpath=NULL, 
                       newfolder=TRUE, 
                       raw=TRUE,
                       clist=TRUE,
                       alist=FALSE,
                       plist=FALSE,
                       jlist=FALSE,
                       report=TRUE)

The details

* narrow: the search terms (required)
* broad: an optional second set of search terms combined with narrow via operator
* operator: the Boolean operator joining narrow and broad ("AND" by default)
* journals: an optional character vector of journals to restrict the query to
* start, end: the publication date range (end defaults to PubMed's 3000/12/31)
* title: a prefix added to output file names
* outpath: the directory to which csv files are exported
* newfolder: whether to create a new folder (named after title) for the output
* raw, clist, alist, plist, jlist, report: toggles for the exports described above

Example

Let's say we want to know which journals published articles in 2020 on the effect of social media on mental health in adolescents, and how many such articles each journal published. In addition, we want a list of publication titles with their first authors. We can set up our query with the code shown below.

The resulting query will be ("social media" AND "adolescents") AND ("mental health" OR "mental illness") for articles published from 2020/01/01 to 3000/12/31. The end date is not a typo: 3000/12/31 is the default end of the publication date range in PubMed. Note that this can return articles with a publication date in the future, so if you only want articles published in 2020, end = "2020/12/31" needs to be specified.
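Presumably, narrow and broad are joined by operator and the date range is expressed in PubMed's [PDAT] field syntax. A hypothetical reconstruction of the assembled query string (not the package's exact code):

```r
narrow   <- '("social media" AND "adolescents")'
broad    <- '("mental health" OR "mental illness")'
operator <- "AND"
start    <- "2020/01/01"
end      <- "3000/12/31"  # PubMed's default end of the publication date range

# Hypothetical assembly of the final PubMed query
query <- paste0(
  narrow, " ", operator, " ", broad,
  ' AND ("', start, '"[PDAT] : "', end, '"[PDAT])'
)
query
```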

Since we want a list of journals with a count of publications matching search criteria for each journal, we specify jlist = TRUE. We also specify plist = TRUE, since we want a separate list of publication titles with their first authors.

If we don't change the defaults of raw = TRUE, clist = TRUE, and report = TRUE, we will also obtain the raw results, an author contact list, and a query report as csv files, even if these parameters are left out of the code. So, according to the code for this example, we will obtain five csv files: the raw query results, the deduplicated author contact list, the query report, the journal list with article counts, and the publication title list.

When title is defined, output files are created with the prefix specified by title; in this example, title = "SocialMedia_run01". Note: it is best to keep titles as simple as possible since they become part of the file names. Since newfolder = FALSE, the csv files are exported directly to the directory specified by outpath.

n_terms <- '("social media" AND "adolescents")'
b_terms <- '("mental health" OR "mental illness")'
my.operator <- "AND" # explicitly specified even though AND is the default
my.start <- "2020/01/01"
my.end <- "2020/12/31"
my.title <- "SocialMedia_run01"  
my.path <- "/Users/ivy/example/"

scrape2csv(narrow = n_terms, 
           broad = b_terms, 
           operator = my.operator, 
           start = my.start, 
           end = my.end, 
           title = my.title, 
           outpath = my.path, 
           newfolder = FALSE, 
           plist = TRUE,
           jlist = TRUE)
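The title prefix and outpath combine into the output file names roughly like this. The "_contacts" suffix below is purely illustrative, not necessarily the suffix scrape2csv actually uses:

```r
title   <- "SocialMedia_run01"
outpath <- "/Users/ivy/example/"

# Hypothetical file name assembly for one of the exported csv files
out_file <- paste0(outpath, title, "_contacts.csv")
out_file  # "/Users/ivy/example/SocialMedia_run01_contacts.csv"
```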

Customizing for more advanced searches

For help on advanced PubMed searches, see https://pubmed.ncbi.nlm.nih.gov/advanced/

For this second example, let's say we're interested in obtaining a contact list of researchers who published articles between 2011 and 2020 in two journals, "Sleep Health" and "Journal of Adolescent Health", pertaining to the effect of social media on mental health in adolescents but not young adults. We can set up our customized search query as follows.

Let's say we want our author contact list and the other default output exported to a new folder, with a custom prefix added to each output file. Since by default the contact list is exported and a new folder is created, we don't need to specify clist = TRUE or newfolder = TRUE, but it doesn't hurt to do so. We set our custom prefix by defining title = "SocialMedia_run02". Note: it's best practice to give each run a different title, since files and folders with the same name will be overwritten.

The following csv files are exported to a new folder named SocialMedia_run02 in the working directory (since we don't specify outpath), and each file has the prefix SocialMedia_run02:

* raw query search results of articles, authors, author affiliations, and author emails for each specified journal (since raw = TRUE by default)
* author contact information compiled from both journals, then cleaned of duplicates by author first name, last name, and email (since clist = TRUE by default)
* a report of query stats (since report = TRUE by default)

my.custom <- '("adolescents" OR "teens") NOT "young adults" AND ("mental illness" OR "mental health") AND ("social media" OR "social network")'
my.journals <- c("Sleep Health", "J Adolesc Health")
my.start <- "2011"
my.end <- "2020"
my.title <- "SocialMedia_run02"
scrape2csv(narrow = my.custom, 
           journals = my.journals, 
           start = my.start, 
           end = my.end, 
           title = my.title)


biostatistically/PubscrapeR documentation built on Dec. 31, 2020, 8:55 p.m.