This package provides tools to download records from the NCBI PubMed database based on user-specified search criteria, and to add CrossRef citation data for the returned records. The output is in tidy data format, facilitating downstream analysis using tools from the 'tidyverse'.

This vignette illustrates how to use the package to download an author's records from PubMed and build a word cloud from the titles of their publications.

I will be using Prof Rolf-Detlef Treede, a renowned scientist in the field of pain research, in this example.

Step 1

Load the packages required for this vignette.

# install.packages("devtools")
# devtools::install_github("kamermanpr/pubmedRecords")

library(pubmedRecords)

library(dplyr)
library(ggthemes) # provides tableau_color_pal(), used for the word cloud palette
library(stringr)
library(tidyr)
library(tidytext)
library(wordcloud2)

Step 2

Enter search parameters and perform a search using the get_records function.

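The function parameters used in this example are search_terms, min_date, max_date, api_key, pub_type, and date_type (see the function documentation for details).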

Retrieving the records can take a while when a search matches many records, so I suggest that you run count_records before get_records (they take the same parameters) to check how many record queries will be made before executing a request.

# Search for journal articles by RD Treede in the journal "PAIN"
# that were published between 1 January 2000 and 31 December 2018
df <- get_records(search_terms = "Treede RD[AU] AND Pain[TA]",
                  min_date = '2000/01/01',
                  max_date = '2018/12/31',
                  api_key = NULL, # Add only if you have one (see documentation)
                  pub_type = 'journal article',
                  date_type = 'PDAT')

# Repeat the search without the journal filter ([TA]) to cover all journals.
# Note that this call overwrites df from the search above.
df <- get_records(search_terms = "Treede RD[AU]",
                  min_date = '2000/01/01',
                  max_date = '2018/12/31',
                  pub_type = 'journal article',
                  date_type = 'PDAT')
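The search_terms strings above combine PubMed field tags: [AU] for author and [TA] for journal title abbreviation. As a minimal sketch, a small helper can assemble such strings; build_query is a hypothetical name used for illustration only and is not part of pubmedRecords.

```r
# Hypothetical helper (not part of pubmedRecords): assemble a PubMed
# search string from an author name and an optional journal abbreviation.
build_query <- function(author, journal = NULL) {
  # Tag the author name with the [AU] field tag
  query <- paste0(author, "[AU]")
  # Append the journal filter only when a journal is supplied
  if (!is.null(journal)) {
    query <- paste0(query, " AND ", journal, "[TA]")
  }
  query
}

# Author restricted to one journal
pain_query <- build_query("Treede RD", "Pain")

# Author across all journals
all_query <- build_query("Treede RD")
```

Either string could then be passed as the search_terms argument.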

Have a quick look at the output.

# Print first 10 lines
print(df)

# View structure
glimpse(df)

Each author of a paper occupies a separate row, with the article-level information repeated for every author of that article. Making each row a unique co-author record keeps the data in a tidy format and makes it easier to filter records by co-author. The downside is that the returned data frame can be quite large because of all the duplicated information.
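The one-row-per-author layout can be illustrated with a toy data frame. This is only a sketch: the data are made up, and the column names pmid and author are assumptions for illustration rather than the names get_records actually returns.

```r
library(dplyr)

# Made-up stand-in for the get_records output. The column names 'pmid'
# and 'author' are assumptions, not necessarily the package's own names.
records <- tibble(
  pmid   = c("1", "1", "1", "2", "2"),
  title  = c("Paper A", "Paper A", "Paper A", "Paper B", "Paper B"),
  author = c("Treede RD", "Smith J", "Lee K", "Treede RD", "Smith J")
)

# Filtering by co-author is a plain row filter
with_lee <- records %>%
  filter(author == "Lee K")

# Collapse back to one row per article when author detail is not needed
articles <- records %>%
  distinct(pmid, title)
```

Here with_lee holds the single article co-authored by "Lee K", and articles holds one row for each of the two papers.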


Although not essential for this example, you can add CrossRef citation counts to the records using the get_citations function. This function takes the output from get_records as its input.

Adding citations can also take a while if there are a lot of records.

# Add a column called "crossref_citations" to the first 6 observations
df_citations <- get_citations(head(df))

# View structure
glimpse(df_citations)
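Because each article appears once per author, summing a citation column directly over all rows would count the same article several times. The sketch below uses made-up data to show the difference; the crossref_citations and title columns follow the names used in this vignette, while the author column name and all values are assumptions.

```r
library(dplyr)

# Made-up records: Paper A has two authors, Paper B has one.
# The 'author' column name and all values are illustrative assumptions.
cited <- tibble(
  title              = c("Paper A", "Paper A", "Paper B"),
  author             = c("Treede RD", "Smith J", "Treede RD"),
  crossref_citations = c(10, 10, 4)
)

# Naive total: Paper A is counted once per author
naive_total <- sum(cited$crossref_citations)

# Keep one row per article first, then sum
per_article <- cited %>%
  distinct(title, crossref_citations)
article_total <- sum(per_article$crossref_citations)
```

With these toy values the naive total is 24, whereas the per-article total is 14.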

Step 3

Now that we have the data, we can generate the word cloud from the article titles.

First, select the title column and keep only the unique titles.

words <- df %>% 
  # Select the title column
  select(title) %>% 
  # extract unique entries only
  unique(.)

Second, extract 2-ngrams (bigrams) from the titles and drop any that contain a stop word or a numeral.

tidy_words <- words %>%
    # Split each title into overlapping 2-ngrams
    unnest_tokens(word, title, token = "ngrams", n = 2) %>%
    # Split each 2-ngram into its two words
    separate(word, into = c('word1', 'word2'), sep = ' ') %>%
    # Remove 2-ngrams in which either word is a stop word
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    # Convert terms containing numerals to NA
    mutate(word1 = ifelse(str_detect(word1, '[0-9]'),
                          yes = NA,
                          no = word1)) %>%
    mutate(word2 = ifelse(str_detect(word2, '[0-9]'),
                          yes = NA,
                          no = word2)) %>%
    # Remove NA
    filter(!is.na(word1)) %>%
    filter(!is.na(word2)) %>%
    # Join the word columns back together to form 2-ngrams
    unite(word, word1, word2, sep = ' ')

Third, count the number of occurrences of each 2-ngram.

ngram_count <- tidy_words %>%
    count(word) %>%
    arrange(desc(n))
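The extraction and counting steps can be tried offline on a few made-up titles. This is a compact sketch, not the vignette's exact pipeline: it drops numerals with a direct filter instead of converting them to NA first, and the titles are invented for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tidytext)

# Invented titles, used only to show what the pipeline produces
titles <- tibble(title = c(
  "Neuropathic pain in the clinic",
  "Mechanisms of neuropathic pain",
  "A 10 year follow up of neuropathic pain"
))

bigram_counts <- titles %>%
  # Split each title into overlapping 2-ngrams (lower-cased by default)
  unnest_tokens(word, title, token = "ngrams", n = 2) %>%
  separate(word, into = c("word1", "word2"), sep = " ") %>%
  # Drop 2-ngrams in which either word is a stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Drop 2-ngrams in which either word contains a numeral
  filter(!str_detect(word1, "[0-9]"),
         !str_detect(word2, "[0-9]")) %>%
  unite(word, word1, word2, sep = " ") %>%
  # Count occurrences, most frequent first
  count(word, sort = TRUE)
```

The most frequent entry is "neuropathic pain", which occurs once in each of the three titles.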

Fourth, extract the top 100 2-ngrams and plot them.

word_cloud <- ngram_count %>% 
  # head() returns at most 100 rows, and fewer if fewer are available
  head(100) %>% 
  rename(freq = n)
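If a search yields fewer distinct 2-ngrams than the number requested, selecting rows by position and selecting them with head() give different results. A minimal base R sketch with made-up counts:

```r
# Made-up counts with only three rows
counts <- data.frame(
  word = c("neuropathic pain", "chronic pain", "pain threshold"),
  n    = c(3L, 2L, 1L)
)

# Indexing past the last row pads the result with NA rows
by_index <- counts[1:5, ]

# head() stops at the last available row
by_head <- head(counts, 5)
```

Here by_index has five rows, two of them filled with NA, while by_head has only the three real rows. NA rows passed on to a plotting function may cause warnings or blank entries.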

wordcloud2(data = word_cloud,
           fontFamily = 'arial',
           size = 0.4,
           color = tableau_color_pal(palette = 'Color Blind')(10))


kamermanpr/pubmedRecords documentation built on Feb. 5, 2023, 1:22 a.m.