The pubscraper package is a helper and a wrapper for businessPubMed, which extracts author information from PubMed query results for a user-defined set of search terms. The scrape2csv function was written to automate querying the same set of search terms over a defined list of journals, extracting author contact information, and producing a contact list free of duplicates. However, scrape2csv can also be used for PubMed queries over all journals, and it has options to produce csv files other than the author contact list.
Query results from the selected journals (if a selection is specified) are compiled, cleaned of duplicate author contact information (name and email), and then exported to csv. By default, raw query results for each journal are also exported to csv, and a report with counts of unique observations (journal, article, author, email, and contact information) is exported to csv.
Options for exporting other data include: 1) a list of journals with a count of articles pertaining to the query for each journal; 2) a list of authors with their affiliation, a count of articles, and a count of journals for each author; and 3) a list of publication titles pertaining to the query with journal, year, and first author name.
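The duplicate cleaning and per-journal counts described above can be pictured with a small base R sketch. The data frame and its column names (journal, title, firstname, lastname, email) are made up for illustration and are not necessarily what scrape2csv produces; the package itself may use dplyr for these steps.

```r
# Toy stand-in for raw query results: one row per author per article.
# All column names and values here are hypothetical.
raw_results <- data.frame(
  journal   = c("Sleep Health", "Sleep Health", "J Adolesc Health", "J Adolesc Health"),
  title     = c("Paper A", "Paper A", "Paper B", "Paper C"),
  firstname = c("Ana", "Ben", "Ana", "Cy"),
  lastname  = c("Lee", "Ode", "Lee", "Roe"),
  email     = c("ana@example.org", "ben@example.org", "ana@example.org", "cy@example.org"),
  stringsAsFactors = FALSE
)

# Contact list: keep the first row for each name + email combination.
contact_cols <- c("firstname", "lastname", "email")
contacts <- raw_results[!duplicated(raw_results[contact_cols]), contact_cols]

# Journal list: count distinct articles per journal.
articles <- unique(raw_results[c("journal", "title")])
journal_counts <- as.data.frame(
  table(journal = articles$journal),
  responseName = "n_articles",
  stringsAsFactors = FALSE
)

print(contacts)
print(journal_counts)
```

Here the author who appears in both journals is kept once in the contact list, while each article still counts toward its own journal.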
The pubscraper package can be installed from GitHub via devtools (the devtools package is available on CRAN). It requires easyPubMed and businessPubMed (both available on GitHub) for performing the queries, and dplyr. Tidyverse is suggested.
    ## tidyverse is suggested since functions from dplyr and tidyr are used
    # install.packages("tidyverse")
    # install.packages("here")
    # install.packages("devtools")
    devtools::install_github("dami82/easyPubMed")
    devtools::install_github("dami82/businessPubMed")
    devtools::install_github("biostatistically/pubscraper")
    library(tidyverse)
    library(easyPubMed)
    library(businessPubMed)
    library(here)
    library(pubscraper)
Here's the structure of the entire function. There are many options to customize your search and output, but only two parameters are required: narrow (the search terms) and start (the start of the date range for PubMed to search).
    scrape2csv <- function(narrow, broad = NULL, operator = NULL, journals = NULL,
                           start, end = NULL, title = NULL, outpath = NULL,
                           newfolder = TRUE, raw = TRUE, clist = TRUE,
                           alist = FALSE, plist = FALSE, jlist = FALSE,
                           report = TRUE)
* narrow: (REQUIRED; character vector of length 1) This parameter has a lot of flexibility. One way to use this argument is for key search terms using AND as the operator between terms (hence the parameter name narrow). Each term must be enclosed in double quotes, and the whole set of terms needs to be enclosed in parentheses, e.g. ("keyword" AND "some key phrase" AND "another key phrase"). NOTE- for advanced users, this parameter can be used to enter all desired search terms with any Boolean operator(s).
* broad: (optional; character vector of length 1) Again, there is flexibility in how this is defined. In the same vein as above, one way to use this is for key search terms using OR as the operator between terms, e.g. ("keyword1" OR "key phrase1").
* operator: (optional; character string) Boolean operator to be placed between narrow and broad; if not specified, 'AND' will be used.
* journals: (optional; character vector of length j) a list of journals to query for the set of search terms. Note- When specified, the query will iterate over the list of journals for the set of search terms defined.
* start: (REQUIRED; character string) the earliest publication date you want to search.
* end: (optional) the latest publication date you want to search; if not specified, the default of 3000/12/31 will be used.
* title: (optional; character string) title of your query, which will be used to name folders and output files. NOTE- best to keep it short and simple.
* outpath: (optional; character string) directory where you want your output to be placed; if not specified, the working directory will be used.
* newfolder: If TRUE, a new folder will be created for your output; default is TRUE.
* raw: If TRUE, raw query results are exported to csv; default is TRUE.
* clist: If TRUE, author contact information with emails will be exported to csv; default is TRUE.
* alist: If TRUE, a list of authors with a count of articles and a count of journals for each author is exported to csv; default is FALSE.
* plist: If TRUE, a list of articles with journal and first author name is exported to csv; default is FALSE.
* jlist: If TRUE, a list of journals with a count of articles is exported to csv; default is FALSE.
* report: If TRUE, a query report will be exported to csv; default is TRUE.

Let's say we want to know which journals have published articles on the effect of social media on mental health in adolescents in 2020. We'd also like to know how many articles each journal has published on the subject. In addition, we want a list of publication titles with their first authors. We can set up our query by:
* defining the narrow search term parameter. (It's narrow since you're meant to put AND between each search term. There is flexibility in this, however. See the next section on customizing the query even further.)
    narrow = '("social media" AND "adolescents")'
* defining the broad search term parameter. (It's broad since you're meant to put OR between each search term. This parameter is optional.)
    broad = '("mental health" OR "mental illness")'
* defining the operator between narrow and broad. Note that this does not need to be explicitly defined since the default is AND.
    operator = "AND"
* defining the start of the publication date range.
    start = "2020/01/01"
The resulting query will be ("social media" AND "adolescents") AND ("mental health" OR "mental illness") for articles published from 2020/01/01 to 3000/12/31. The end date is not a typo: 3000/12/31 is the default end of the publication date range in PubMed. Note- This can return articles that have a publication date in the future. So if you only want articles with a publication date in 2020, then end = "2020/12/31" needs to be specified.
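As an illustration of how the two sets of terms combine, the sketch below assembles the same query string from plain character vectors. build_terms is a hypothetical helper, not part of pubscraper, and the final paste only mirrors the query described above; it is not the package's internal code.

```r
# Hypothetical helper: wrap each term in double quotes, join the terms with
# one Boolean operator, and enclose the whole set in parentheses.
build_terms <- function(terms, op = c("AND", "OR")) {
  op <- match.arg(op)
  quoted <- paste0('"', trimws(terms), '"')
  paste0("(", paste(quoted, collapse = paste0(" ", op, " ")), ")")
}

n_terms <- build_terms(c("social media", "adolescents"), "AND")
b_terms <- build_terms(c("mental health", "mental illness"), "OR")

# Join the narrow and broad sets with the operator (AND in this example).
full_query <- paste(n_terms, "AND", b_terms)
cat(full_query, "\n")
```

Building the strings this way also trims stray spaces inside the quotes, which are easy to introduce when typing the terms by hand.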
Since we want a list of journals with a count of publications matching the search criteria for each journal, we specify jlist = TRUE. We also specify plist = TRUE, since we want a separate list of publication titles with their first authors.
If we don't change the defaults of raw = TRUE, clist = TRUE, and report = TRUE, we will also obtain raw results, a contact list for the authors, and a query report as csv files, even if these parameters are left out of the code. So, according to the code for this example, we will obtain the following csv files:
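* raw query results (since raw = TRUE by default)
* author contact information, cleaned of duplicates (since clist = TRUE by default)
* a list of journals with a count of articles for each journal (since jlist = TRUE)
* a list of publication titles with journal and first author name (since plist = TRUE)
* a report of query stats (since report = TRUE by default)

When a title is defined, output files will be created with the prefix specified by title. In this example, title = "SocialMedia_run01". Note- Best to keep titles as simple as possible since they become part of the file names. Since newfolder = FALSE, the csv files are exported directly to the directory specified by outpath.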
    n_terms <- '("social media" AND "adolescents")'
    b_terms <- '("mental health" OR "mental illness")'
    my.operator <- "AND"  # specified explicitly even though AND is the default
    my.start <- "2020/01/01"
    my.end <- "2020/12/31"
    my.title <- "SocialMedia_run01"
    my.path <- "/Users/ivy/example/"
    scrape2csv(narrow = n_terms, broad = b_terms, operator = my.operator,
               start = my.start, end = my.end, title = my.title,
               outpath = my.path, newfolder = FALSE, plist = TRUE, jlist = TRUE)
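After the run, one way to check what was written is to list the csv files that start with the title prefix. This is a sketch in base R; list_run_files is a hypothetical helper, and the only things it assumes about the file names are the title prefix and a .csv extension.

```r
# Hypothetical helper: list csv files in `dir` whose names start with `prefix`.
# Set recursive = TRUE to also look inside subfolders (e.g. when
# newfolder = TRUE was used for the run).
list_run_files <- function(dir, prefix, recursive = FALSE) {
  files <- list.files(dir, full.names = TRUE, recursive = recursive)
  name <- basename(files)
  files[startsWith(name, prefix) & grepl("\\.csv$", name, ignore.case = TRUE)]
}

# Returns an empty character vector if the directory has no matching files.
list_run_files("/Users/ivy/example/", "SocialMedia_run01")
```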
For help on advanced PubMed searches, see https://pubmed.ncbi.nlm.nih.gov/advanced/
For this second example, let's say we're interested in obtaining a contact list of researchers who have published articles between 2011 and 2020 in two journals, "Sleep Health" and "Journal of Adolescent Health", pertaining to the effect of social media on mental health in adolescents but not young adults. We can set up our customized search query by:
narrow = '("adolescents" OR "teens") NOT "young adults" AND ("mental illness" OR "mental health") AND ("social media" OR "social network")'
journals = c("Sleep Health", "J Adolesc Health")
start = "2011"
end = "2020"
Let's say that we want our author contact list and other default output exported to a new folder, and we want a custom prefix added to each output file. Since the contact list is exported and a new folder is created by default, we don't need to specify clist = TRUE or newfolder = TRUE, but it doesn't hurt to do so. We set our custom prefix by defining title = "SocialMedia_run02". Note- It's best practice to title your runs differently since files and folders can be overwritten if they have the same name.
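One way to avoid reusing a title by accident is to check for an existing folder before the run. The helper below is a hypothetical sketch, not part of pubscraper; it only looks for a folder with the same name as the title, so it does not guard against same-named files when newfolder = FALSE.

```r
# Hypothetical helper: return `title` unchanged if no folder of that name
# exists under `outpath`; otherwise append _v2, _v3, ... until the name is free.
unique_title <- function(title, outpath = getwd()) {
  candidate <- title
  i <- 1
  while (dir.exists(file.path(outpath, candidate))) {
    i <- i + 1
    candidate <- sprintf("%s_v%d", title, i)
  }
  candidate
}

# Checks the working directory by default.
my.title <- unique_title("SocialMedia_run02")
```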
The following csv files are exported to a new folder named SocialMedia_run02 in the working directory (since we don't specify outpath), and each exported file has the same prefix (SocialMedia_run02):
* raw query search results of articles, authors, author affiliation, and author email for each specified journal (since raw = TRUE by default)
* author contact information compiled from both journals, then cleaned of duplicates by author first name, last name, and email (since clist = TRUE by default)
* a report of query stats (since report = TRUE by default)
    my.custom <- '("adolescents" OR "teens") NOT "young adults" AND ("mental illness" OR "mental health") AND ("social media" OR "social network")'
    my.journals <- c("Sleep Health", "J Adolesc Health")
    my.start <- "2011"
    my.end <- "2020"
    my.title <- "SocialMedia_run02"
    scrape2csv(narrow = my.custom, journals = my.journals, start = my.start,
               end = my.end, title = my.title)
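When journals is supplied, the search terms are queried once per journal. The sketch below shows one way such per-journal query strings could be built; per_journal_queries is a hypothetical helper, and the [Journal] field tag is PubMed search syntax that scrape2csv may or may not use internally.

```r
# Hypothetical helper: build one query string per journal by appending a
# journal restriction to the search terms. Not pubscraper's internal code.
per_journal_queries <- function(terms, journals) {
  vapply(
    journals,
    function(j) paste0("(", terms, ') AND "', j, '"[Journal]'),
    character(1)
  )
}

terms <- '("adolescents" OR "teens") NOT "young adults"'
my.journals <- c("Sleep Health", "J Adolesc Health")

# The result is a character vector named by journal.
queries <- per_journal_queries(terms, my.journals)
cat(queries, sep = "\n")
```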