devtools::install_github("julianflowers/myScrapers", force = TRUE)

This vignette shows how to extract abstracts from PubMed and perform simple topic modelling on them. It uses functions from the myScrapers package, which can be installed from GitHub with the command above.


Obtaining article abstracts

The first step is to search PubMed with the pubmedAbstractR function. This is a wrapper for RISmed and interacts with the NCBI E-utilities API. It takes 5 arguments; the examples below use search, n, end and ncbi_key.

In addition, it is recommended that you obtain an API key for NCBI. Instructions on how to obtain a key are available from NCBI. Once you have a key, you should store it as an environment variable.
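As a minimal sketch of storing and reading the key, the base R calls below set an environment variable for the current session and read it back. The variable name ncbi_key matches the one read later in this vignette; the value shown is a placeholder, not a real key.

```r
# Placeholder value: replace with your own NCBI API key.
# For a setting that persists across sessions, add a line such as
#   ncbi_key=your-api-key
# to your ~/.Renviron file and restart R.
Sys.setenv(ncbi_key = "your-api-key")

# Read the key back and warn if it is missing.
key <- Sys.getenv("ncbi_key")
if (!nzchar(key)) {
  warning("No NCBI API key found in the ncbi_key environment variable.")
}
```

Keeping the key in an environment variable means it never has to appear in scripts that are shared or committed to version control.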

There are two other arguments for extracting authors and MeSH headings (keywords). These are set to FALSE by default.

In this example we show how to search for articles on population health management.

## load key

key <- Sys.getenv("ncbi_key")

## initialise with n = 1 - this will tell us how many abstracts our query returns

query <- "population health management[tw]"
n <- 1
end <- 2020

## search

out <- pubmedAbstractR(search = query, n = n, end = end, ncbi_key = key)

The number of abstracts the query returns is stored in out$n_articles, and the search term as translated by the API is stored in out$search.

Let us download them.

n <- out$n_articles

results <- pubmedAbstractR(search = query, n = n, end = end, ncbi_key = key)
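The two-step pattern above (probe with n = 1, then request everything) could be wrapped in a small helper. The sketch below is illustrative only: fetch_all_abstracts and fake_fetch are hypothetical names, not part of myScrapers, and the stand-in function exists only so the call pattern can be shown without contacting NCBI.

```r
# Hypothetical helper (not part of myScrapers): probe the result count with
# n = 1, then request every matching abstract.
fetch_all_abstracts <- function(query, end, key,
                                fetch = myScrapers::pubmedAbstractR) {
  probe <- fetch(search = query, n = 1, end = end, ncbi_key = key)
  total <- probe$n_articles
  if (is.null(total) || total < 1) {
    stop("The query returned no articles: ", query)
  }
  fetch(search = query, n = total, end = end, ncbi_key = key)
}

# Offline stand-in for pubmedAbstractR with made-up data, used only to
# demonstrate the call pattern.
fake_fetch <- function(search, n, end, ncbi_key) {
  list(
    n_articles = 3,
    abstracts = data.frame(abstract = paste("abstract", seq_len(min(n, 3))))
  )
}

demo <- fetch_all_abstracts("population health management[tw]", 2020, "",
                            fetch = fake_fetch)
nrow(demo$abstracts)
```

In real use the fetch argument would be left at its default, so that the helper calls pubmedAbstractR with the same arguments used earlier in this section.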


Topic modelling

Topic modelling is a form of unsupervised machine learning that can help us classify texts. There are two main packages in R for this, topicmodels and stm. In this workflow we use the NLP package udpipe to tokenise and annotate texts, and topicmodels to classify and visualise documents.
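To make the idea concrete, here is a minimal sketch of fitting a topic model with udpipe and topicmodels on a made-up token table. It illustrates the general approach only and is not necessarily what the myScrapers functions do internally; the toy data, k = 2 and the control settings are all assumptions chosen for the example.

```r
library(udpipe)
library(topicmodels)

# Toy token table standing in for annotated abstracts (made-up data).
tokens <- data.frame(
  doc_id = rep(c("d1", "d2", "d3", "d4"), each = 4),
  term = c("health", "population", "risk", "care",
           "health", "care", "hospital", "risk",
           "model", "topic", "text", "corpus",
           "text", "topic", "model", "word"),
  stringsAsFactors = FALSE
)

# Count terms per document and build a sparse document-term matrix.
dtf <- document_term_frequencies(tokens, document = "doc_id", term = "term")
dtm <- document_term_matrix(dtf)

# Fit a 2-topic LDA model with Gibbs sampling; the seed makes the run repeatable.
lda <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1, iter = 200))

terms(lda, 3)   # top 3 terms for each topic
topics(lda)     # most likely topic for each document
```

With real abstracts the token table would come from the annotation step, and k would be chosen by inspecting the resulting topics.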

To facilitate this process we have added 4 functions to the myScrapers package. These are annotate_abstracts, abstract_nounphrases, abstract_topics and abstract_topic_viz.

Let's illustrate how the flow works.


The first step is to parse the abstracts. Note: this can take some time.

anno <- annotate_abstracts(abstract = results$abstracts$abstract, pmid = results$abstracts$DOI)


Creating noun phrases

This step takes the annotated data created in the previous step and creates phrases. Phrases can enrich the topic modelling step, but this step is optional.

np <- abstract_nounphrases(anno)

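## the original filter condition was lost; dropping rows with a missing term is an assumption
np %>%
  filter(!is.na(term)) %>%
  head(10) %>%
  select(doc_id, sentence, term)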
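For readers curious how noun phrases can be derived from annotated tokens, the sketch below uses two udpipe functions, as_phrasemachine and keywords_phrases, on a hand-made token table. Whether abstract_nounphrases works this way internally is not stated here, so treat this as one common approach rather than a description of the package; the tokens and part-of-speech tags are made up.

```r
library(udpipe)

# Hand-made annotation for one sentence (made-up data): tokens with
# universal part-of-speech tags, as udpipe annotation would provide.
anno_toy <- data.frame(
  token = c("Population", "health", "management", "improves", "patient", "outcomes"),
  upos = c("NOUN", "NOUN", "NOUN", "VERB", "NOUN", "NOUN"),
  stringsAsFactors = FALSE
)

# Recode the tags to the one-letter scheme used for phrase patterns
# (N = noun, A = adjective, and so on).
anno_toy$phrase_tag <- as_phrasemachine(anno_toy$upos, type = "upos")

# Find token sequences whose tags match a simple noun-phrase pattern.
phrases <- keywords_phrases(
  x = anno_toy$phrase_tag,
  term = tolower(anno_toy$token),
  pattern = "(A|N)*N(P+D*(A|N)*N)*",
  is_regex = TRUE,
  detailed = FALSE
)

# Keep multi-word phrases only.
subset(phrases, ngram > 1)
```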

Create topics

topics <- abstract_topics(k = 10, x = np)

Visualising topics

topic <- myScrapers::abstract_topic_viz(x = np, m = topics$model, scores = topics$scores, n = 10)

Visualising all topics

figures <- map(1:10, ~(abstract_topic_viz(x = np, m = topics$model, scores = topics$scores, n = .x)))

