In kamclean/impactr:

knitr::opts_chunk$set(collapse = FALSE)
library(dplyr);library(impactr)

Search Publication Data

1. Search Pubmed

The search_pubmed() function provides a focussed method

Note: This is not intended to replicate the search function within pubmed. This is intended for the specific use of identifying the publications associated with an individual or group of authors.

At present the search can be refined in 2 ways:

Timeperiod (date_min and date_max): The default is no limitations.
Keywords (keywords): This will search for the keyword(s) within All Fields on PubMed.
- Multiple keywords can be supplied as a list (e.g. keywords = c("surgery", "infection")), and will be searched for in combination (via "OR" Boolean operator).
- Boolean operators ("AND", "OR", and "NOT") can be used to combine search terms within a keyword (e.g. keywords = "surgery AND infection")
- The asterisk wildcard search ("*") is supported to further refine the search.

search <- impactr::search_pubmed(search_list = c("mclean ka", "ots r", "drake tm", "harrison em"),
                        date_min = "2018/01/01", date_max = "2020/05/01")

The output from search_pubmed() is a list of PMIDs resulting from this search (e.g. r length(search) in the above case).

2. Sift Pubmed Results

There may be 1000s of records which meet the search criteria specified

list_pmid - can be identified via the search_pubmed() function or another method.
Note: There is a limit to the number of PMID that can be supplied to the PubMed API at once via this method (approximately 200 PMID). It is highly suggested that if you wish to sift > 200 records, these are supplied to the function separately.
authors = If these authors are linked (e.g. likelihood of being co-authors), then the authors parameter is suggested to be used.
Note: This is generally the rate-limiting step for this function (particularly when publications contain hundreds or thousands or authors).
Author affiliations (affiliations): This will search within the information held on authors for:
A match within any of the affiliations listed for all authors.
A match witin any of the affiliations listed specifically for authors supplied in authors (if provided).
Keywords (keywords): This will search for the keyword(s) within the title, abstract, and journal name. Otherwise this parameter functions as in the search_pubmed() function (see above).

Note: This function

extract <- impactr::extract_pmid(pmid = search, get_authors = TRUE, get_altmetric = F, get_impact = F)

extract <- readr::read_rds( here::here("vignettes/extract_pmid.rds"))

sifted <- extract %>%
  sift(authors = c("mclean ka", "ots r", "drake tm", "harrison em"),
                affiliations = c("edinburgh", "lothian"),
                keyword = c("surg"))

head(sifted$wheat, 10)

head(sifted$chaff, 10)

relevance <- readr::read_csv(here::here("vignettes/sifted_relevance.csv"), show_col_types = F) %>%
  dplyr::mutate(var_id = as.character(var_id),
                relevance = factor(relevance) %>% forcats::fct_rev())

accuracy_check <- bind_rows(sifted$wheat %>% mutate(designation = "Wheat"),
                           sifted$chaff %>% mutate(designation = "Chaff")) %>%
  dplyr::left_join(relevance, by = "var_id") %>%
  dplyr::mutate(designation = factor(designation, levels = c("Wheat", "Chaff")))

Based on the above search of r length(search) identified using pubmed_search() there are:

r nrow(sifted$wheat) identified as wheat (e.g. more likely to be relevant), of which 27/39 (69.2%) were relevant when maunally screened.
r nrow(sifted$chaff) identified as chaff (e.g. less likely to be relevant), of which 12/235 (5.1%) were relevant when maunally screened.

However, it should be noted that all publications that met 2 or more criteria (n=nrow(sifted$wheat %>% dplyr::filter(criteria_met>1))) were relevant. The function has been designed with sensitivity in mind to maximise negative predictive value.

accuracy_table <- accuracy_check %>%
  dplyr::select(designation, relevance) %>%
  table()

accuracy_table

The accuracy of the search entirely depends upon the parameters used - 75.9% (n=22/29) would not have been highlighted if a keyword ("surg") was not used.

venn_data <- accuracy_check %>%
  dplyr::filter(relevance=="Yes") %>%
  dplyr::filter(designation=="Wheat") %>%
  dplyr::mutate(`Multiple Specified Authors` = ifelse(author_multi_n>1, 1, 0),
                `Affiliation (Specified Author)` = ifelse(affiliations_author=="Yes", 1, 0),
                `Affiliation (Any Author)` = ifelse(affiliations_any=="Yes", 1, 0),
                `Keyword` = ifelse(keyword=="Yes", 1, 0)) %>%
  dplyr::select(`Multiple Specified Authors`:`Keyword`) %>%
  impactr::format_intersect() %>%
  dplyr::select(-degree) %>%
  dplyr::filter(combination!="") %>% # patients who are recorded as asymptomatic (on these 3 variables)
  tidyr::pivot_wider(names_from = combination, values_from = n) %>%
  unlist()

grDevices::png(filename = here::here("vignettes/plot/venn_sift.png"),
               height = 3.6, width = 5.6, units = "in", res=300)

plot(eulerr::euler(venn_data),
     edges = list("black", alpha = 0.8),
     fill = list(alpha = 0.5),
     quantities = list(fontsize = 15), legend = list(alpha = 1))

dev.off()

Let's add some more parameters in to try to improve sensitivity. This will include a common coauthor ("wigmore sj"), and another common topic of publications ("liver").

sifted2 <- extract %>%
  sift(authors = c("mclean ka", "ots r", "drake tm", "harrison em", "wigmore sj"),
                affiliations = c("edinburgh", "lothian"),
                keyword = c("surg", "liver"))

Based on the above criteria:

r nrow(sifted2$wheat) identified as wheat (e.g. more likely to be relevant), of which 35/55 (63.6%) were relevant when maunally screened.
r nrow(sifted2$chaff) identified as chaff (e.g. less likely to be relevant), of which 4/212 (1.8%) were relevant when maunally screened.

bind_rows(sifted2$wheat %>% mutate(designation = "Wheat"),
                           sifted2$chaff %>% mutate(designation = "Chaff")) %>%
  dplyr::left_join(relevance, by = "var_id") %>%
  dplyr::mutate(designation = factor(designation, levels = c("Wheat", "Chaff"))) %>%
  dplyr::select(designation, relevance) %>%
  table()