Extract Publication Data

Keeping accurate, up-to-date records of the output of an author or research group is time-consuming, but it remains essential for evaluating research impact. Several online repositories, such as PubMed and CrossRef, store this information and can already be queried through an API from R, so the data can be extracted automatically.

Excellent packages already exist to access this data: RISmed for PubMed and rcrossref for CrossRef. While these provide a vast quantity of useful information, it is often not returned in a dataframe/tibble format, so significant post-processing is needed to obtain an easily usable output.
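To give a feel for the kind of post-processing involved, the following is a minimal base R sketch that flattens nested records into one row per publication. The `records` list and the `flatten_record()` helper are hypothetical and the values are made up; this is not the actual structure returned by RISmed or rcrossref, nor how impactr is implemented.

```r
# Hypothetical records, shaped like the nested lists an API client might return.
# All values are invented for illustration only.
records <- list(
  list(id = "A1", title = "First paper",  authors = c("Smith J", "Jones K")),
  list(id = "B2", title = "Second paper", authors = c("Lee M"))
)

# Flatten one record into a single row, collapsing the author vector
# into one comma-separated string and counting the authors.
flatten_record <- function(rec) {
  data.frame(
    id      = rec$id,
    title   = rec$title,
    auth_n  = length(rec$authors),
    authors = paste(rec$authors, collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# Bind the per-record rows into one data frame.
pubs <- do.call(rbind, lapply(records, flatten_record))
```

Each publication ends up as a single row, which is the shape that downstream tools such as dplyr or knitr::kable() work with most easily.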


1. Extract Data

The functions extract_pmid() (PubMed) and extract_doi() (CrossRef) use unique identifiers to extract important information regarding publications. This can be used for the purposes of citation, cataloguing publications, or for further evaluation of impact (see ImpactR: Citations).

However, the information extracted is only as accurate and complete as the records held in these repositories. Some additional editing may therefore be required to remove heterogeneity, make corrections, or supply missing data.
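As an illustration of the kind of editing this can involve, here is a small base R sketch that normalises DOI formatting and flags rows with missing fields. The data frame, the `journal` and `needs_review` columns, and the `clean_doi()` helper are all hypothetical and are not part of impactr.

```r
# Hypothetical extracted output with inconsistent DOI formatting
# and missing values (all values invented for illustration).
pubs <- data.frame(
  pmid    = c("1", "2", "3"),
  doi     = c("10.1000/ABC.123", "https://doi.org/10.1000/xyz.456", NA),
  journal = c("Journal A", NA, "Journal C"),
  stringsAsFactors = FALSE
)

# Normalise DOIs: trim whitespace, drop any resolver prefix, lower-case.
# DOIs are case-insensitive, so lower-casing is safe for matching.
clean_doi <- function(x) {
  ok <- !is.na(x)                      # leave missing values untouched
  x[ok] <- trimws(x[ok])
  x[ok] <- sub("^https?://(dx\\.)?doi\\.org/", "", x[ok], ignore.case = TRUE)
  x[ok] <- tolower(x[ok])
  x
}

pubs$doi <- clean_doi(pubs$doi)

# Flag rows that still need manual completion.
pubs$needs_review <- is.na(pubs$doi) | is.na(pubs$journal)
```

Flagging rather than dropping incomplete rows keeps the full publication list intact while making it obvious which records need attention.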

a). extract_pmid()

The extract_pmid() function only requires a vector/list of PubMed identification numbers to extract publication information.

By default, the function also extracts the authors (auth_group, auth_n, authors) and the associated altmetric score (altmetric). Both steps are optional, as they can extend the run time of the function (particularly when there is a large number of authors).

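# Load magrittr so the %>% pipe is available
library(magrittr)

# Extract publication data for three PubMed IDs, with the optional
# altmetric and impact steps switched off
out_pubmed <- impactr::extract_pmid(pmid = c(26769786, 26195471, 30513129),
                                    get_altmetric = FALSE, get_impact = FALSE)
# Columns to widen in the rendered table
col_pubmed <- which(colnames(out_pubmed) %in% c("author", "title"))

# Shorten the author string for display, then render a scrollable HTML table
out_pubmed %>%
  dplyr::mutate(author = paste0(substr(author_list, 1, 50), "...")) %>%
  knitr::kable(format = "html", escape = FALSE) %>%
  kableExtra::column_spec(col_pubmed, width_min = "6in") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  kableExtra::scroll_box(width = "1000px")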

In general, the information on PubMed appears to be the most accurate and up-to-date. However, the Digital Object Identifier (DOI) is occasionally not updated to reflect the final DOI for the paper. In that case the DOI can either be amended manually, or the publisher can be contacted to correct it.
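One way to amend an outdated DOI by hand is a small lookup keyed by PMID. The sketch below uses base R only; the `out` data frame, the `doi_fix` vector, and every value in them are made up for illustration.

```r
# Hypothetical extract: the DOI for PMID "222" is assumed to be outdated.
out <- data.frame(
  pmid = c("111", "222", "333"),
  doi  = c("10.1000/a", "10.1000/old", "10.1000/c"),
  stringsAsFactors = FALSE
)

# Manually curated corrections, keyed by PMID (invented values).
doi_fix <- c("222" = "10.1000/final")

# Overwrite only the rows that have a correction.
hit <- out$pmid %in% names(doi_fix)
out$doi[hit] <- unname(doi_fix[out$pmid[hit]])
```

Keeping the corrections in a separate named vector means they can be re-applied each time the data is re-extracted.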


b). extract_doi()

The extract_doi() function only requires a vector/list of Digital Object Identifiers (DOI), and uses the rcrossref package to extract publication information.

By default, the function also extracts the authors (auth_group, auth_n, authors) and the associated altmetric score (altmetric). Both steps are optional, as they can extend the run time of the function (particularly when there is a large number of authors). Note also that CrossRef tends to record authorship less completely than PubMed.

# Example output from extract_doi(), using the DOIs returned by extract_pmid()
out_doi <- impactr::extract_doi(doi = out_pubmed$doi,
                                get_authors = TRUE, get_altmetric = FALSE, get_impact = FALSE)
# Columns to widen in the rendered table
col_doi1 <- which(colnames(out_doi) %in% c("title"))
col_doi2 <- which(colnames(out_doi) %in% c("doi", "author_group", "author"))

# Truncate long author strings for display; the condition was missing in the
# original and is assumed here to be "longer than 50 characters"
out_doi %>%
  dplyr::mutate(author = ifelse(nchar(author) > 50, paste0(substr(author, 1, 50), "..."), author)) %>%
  knitr::kable(format = "html", escape = FALSE) %>%
  kableExtra::column_spec(col_doi1, width_min = "7in") %>%
  kableExtra::column_spec(col_doi2, width_min = "2.5in") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
  kableExtra::scroll_box(width = "1000px")
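Because authorship may be recorded differently in the two repositories, it can be useful to compare author counts for the same paper. The following is a rough base R sketch using made-up data; the `pubmed` and `crossref` data frames are hypothetical stand-ins with only `doi` and `auth_n` columns, not real output from extract_pmid() or extract_doi().

```r
# Hypothetical author counts from each source (values invented for illustration).
pubmed   <- data.frame(doi = c("10.1000/a", "10.1000/b"), auth_n = c(5, 12),
                       stringsAsFactors = FALSE)
crossref <- data.frame(doi = c("10.1000/a", "10.1000/b"), auth_n = c(5, 3),
                       stringsAsFactors = FALSE)

# Join the two sources on DOI, keeping both author counts side by side.
both <- merge(pubmed, crossref, by = "doi",
              suffixes = c("_pubmed", "_crossref"))

# Flag papers where the two sources disagree on the number of authors.
both$auth_mismatch <- both$auth_n_pubmed != both$auth_n_crossref
```

Papers flagged in `auth_mismatch` are candidates for checking against the published author list.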


