library(pubmedRecords) library(dplyr) library(ggthemes) library(stringr) library(tidyr) library(tidytext) library(wordcloud2)
This package provides tools to download records from the NCBI PubMed database based on user-specified search criteria, and to add CrossRef citation data for the returned records. The output is in tidy data format, facilitating downstream analysis using tools from the 'tidyverse'.
This vignette illustrates the use of the package to download data for an author from PubMed and the building a word-cloud from the titles of their publications.
I will be using Prof Rolf-Detlef Treede, a renowned scientist in the field of pain research, in this example.
Load the packages required for this vignette.
# install.packages("devtools") # devtools::install_github("kamermanpr/pubmedRecords") library(pubmedRecords) library(dplyr) library(stringr) library(tidyr) library(tidytext) library(wordcloud2)
Enter search parameters and perform a search using the get_records
function.
The function parameters are:
[TA]: Journal title (e.g., J Pain)
For a full set of search fields tags: PubMed search field tags.
Note that the article publication type, date type, and date range are modified using the pub_type, date_type, min_date and max_date arguments below.
pub_type: A character string specifying the type of publication the search must return. The default value is 'journal article'. For more information: PubMed article types.
min_date: A character string in the format 'YYYY/MM/DD', 'YYYY/MM' or 'YYYY' specifying the starting date of the search. The default value is 1966/01/01'.
max_date: A character string in the format 'YYYY/MM/DD', 'YYYY/MM' or 'YYYY' specifying the end date of the search. The default value is Sys.Date()
.
date_type: A character string specifying the publication date type that is being specified in the search. Available values are:
EDAT: Date the entry was added to PubMed.
has_abstract: Logical specifying whether the returned records should be limited to those records that have an abstract. The default value is TRUE.
api_key: An API character string obtained from the users NBCI account. The key is not essential, but it specifying a key gives substantially faster record query rates.
Returning the records can take a while if there are a lot of records, so I suggest that you use count_records
before get_records
(they use the same parameters) to check how many record queries will be made before executing a request.
# Search for journal articles by RD Treede in the journal "PAIN" and # which were published between 1 January 2000 and 31 December 2018 df <- get_records(search_terms = "Treede RD[AU] AND Pain[TA]", min_date = '2000/01/01', max_date = '2018/12/31', api_key = NULL, # Add only if you have one (see documentation) pub_type = 'journal article', date_type = 'PDAT')
df <- get_records(search_terms = "Treede RD[AU]", min_date = '2000/01/01', max_date = '2018/12/31', pub_type = 'journal article', date_type = 'PDAT')
Have a quick look at the output.
# Print first 10 lines print(df) # View structure glimpse(df)
Each author of a paper is found on a separate row, with the rest of the information duplicated down the authors of a given article. Making each row a unique co-author record helps keep the data in a tidy format, and makes filtering records by co-authors easier. The downside is that the returned dataframe can be quite large because of all the duplicated information.
Although not essential for this example, you can added CrossRef citation counts to the records using the citation_metrics
function. This function requires you to pass to it the output from get_records
.
The addition of citations also can take a while if there are a lot of records.
# Add a column called "crossref_citations" to the first 6 observations df_citations <- get_citations(head(df)) # View structure glimpse(df_citations)
Now that we have the data we can generate the wordcloud from article titles.
words <- df %>% # Select the title column select(title) %>% # extract unique entries only unique(.)
tidy_words <- words %>% unnest_tokens(word, title, token = "ngrams", n = 2) %>% # Remove stopwords separate(word, into = c('word1', 'word2'), sep = ' ') %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% # Convert terms containing numerals to NA mutate(word1 = ifelse(str_detect(word1, '[0-9]'), yes = NA, no = paste(word1))) %>% mutate(word2 = ifelse(str_detect(word2, '[0-9]'), yes = NA, no = paste(word2))) %>% # Remove NA filter(!is.na(word1)) %>% filter(!is.na(word2)) %>% # Join word columns them back together to form 2-ngrams unite(word, word1, word2, sep = ' ')
ngram_count <- tidy_words %>% count(word) %>% arrange(desc(n))
word_cloud <- ngram_count[1:100, ] %>% rename(freq = n) wordcloud2(data = word_cloud, fontFamily = 'arial', size = 0.4, color = tableau_color_pal(palette = 'Color Blind')(10))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.