ropensci/tidypmc: Parse Full Text XML Documents from PubMed Central

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "# "
)

The tidypmc package parses XML documents in the Open Access subset of PubMed Central. Download the full text of an article using pmc_xml.

library(tidypmc)
doc <- pmc_xml("PMC2231364")
doc

The package includes five functions to parse the xml_document.

|R function    |Description                                                                |
|:-------------|:--------------------------------------------------------------------------|
|pmc_text      |Split section paragraphs into sentences with full path to subsection titles|
|pmc_caption   |Split figure, table and supplementary material captions into sentences     |
|pmc_table     |Convert table nodes into a list of tibbles                                 |
|pmc_reference |Format references cited into a tibble                                      |
|pmc_metadata  |List journal and article metadata in front node                            |

pmc_text splits paragraphs into sentences and includes the full path to each subsection title. It also removes any tables, figures or formulas nested within paragraph tags, replaces superscripted references with bracketed numbers, and marks other superscripts and subscripts with carets and underscores.

options(width=100)
library(dplyr)
txt <- pmc_text(doc)
txt
count(txt, section)

pmc_caption splits figure, table and supplementary material captions into sentences.

options(width=100)
cap1 <- pmc_caption(doc)
filter(cap1, sentence == 1)

pmc_table formats tables by collapsing multiline headers, expanding rowspan and colspan attributes and adding subheadings into a new column.

options(width=100)
tab1 <- pmc_table(doc)
sapply(tab1, nrow)
tab1[[1]]

Captions and footnotes are added as attributes.

attributes(tab1[[1]])
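As a minimal sketch of how metadata can travel with a table as attributes, the example below attaches a caption and a footnote to a small made-up data frame using base R. The attribute names caption and footnotes, the column names and the text are assumptions for illustration and are not taken from the pmc_table output.

```r
# Made-up table standing in for one element of the pmc_table list
tab <- data.frame(
  gene = c("hmuR", "ybtA"),
  fold_change = c(2.1, 3.4),
  stringsAsFactors = FALSE
)

# Attach hypothetical caption and footnote text as attributes
attr(tab, "caption") <- "Hypothetical caption: genes induced under iron limitation."
attr(tab, "footnotes") <- "Hypothetical footnote: values are means of three replicates."

# Retrieve a single attribute by name, or list every attribute name
attr(tab, "caption")
names(attributes(tab))
```

Because the attributes are stored on the object itself, they stay with the table when it is pulled out of the list, although some operations that rebuild a data frame may drop them.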

Use collapse_rows to join column names and cell values into a semicolon-delimited string, which can then be searched using the functions in the next section.

options(width=100)
collapse_rows(tab1, na.string="-")
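To make the idea of a collapsed row concrete, here is a rough base R sketch that joins column names and cell values into one semicolon-delimited string per row. The helper collapse_one, the name=value layout and the rule that cells equal to na.string are skipped are all assumptions made for this illustration; the actual collapse_rows output may be formatted differently.

```r
# Made-up table with one placeholder cell ("-")
tab <- data.frame(
  Gene = c("hmuR", "ybtA"),
  Product = c("hemin receptor", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse a one-row data frame into "name=value; ..."
collapse_one <- function(row, na.string = "-") {
  values <- vapply(row, as.character, character(1))
  # Assumption: skip missing cells, empty cells and cells equal to na.string
  keep <- !is.na(values) & values != "" & values != na.string
  paste0(names(values)[keep], "=", values[keep], collapse = "; ")
}

text <- vapply(
  seq_len(nrow(tab)),
  function(i) collapse_one(tab[i, ]),
  character(1)
)
text
```

A single string per row is convenient because the same regular expression searches used on sentences can then be applied to tables.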

pmc_reference extracts the id, pmid, authors, year, title, journal, volume, pages, and DOIs from reference tags.

options(width=100)
ref1 <- pmc_reference(doc)
ref1

Finally, pmc_metadata saves journal and article metadata to a list.

pmc_metadata(doc)

Searching text

A few functions search within the pmc_text output or the collapsed pmc_table output. separate_text uses the stringr package to extract any text that matches a regular expression.

options(width=100)
separate_text(txt, "[ATCGN]{5,}")
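As a rough illustration of this kind of extraction, the sketch below pulls runs of five or more nucleotide letters out of three made-up sentences. It uses base R (regmatches and gregexpr) as a stand-in for the stringr call, and the sentences and the sentence and match column names are invented for the example, so it should not be read as the implementation of separate_text.

```r
# Made-up sentences; only the first and third contain nucleotide runs
x <- c(
  "The primer ATCGGCTA was used.",
  "No sequence here.",
  "Both GGGTTTAAAC and NNNNN matched."
)

# Find every match of the pattern in each sentence
m <- regmatches(x, gregexpr("[ATCGN]{5,}", x))

# One row per match, keeping the index of the sentence it came from
hits <- data.frame(
  sentence = rep(seq_along(x), lengths(m)),
  match = unlist(m),
  stringsAsFactors = FALSE
)
hits
```

Keeping one row per match, rather than one row per sentence, makes it straightforward to count or filter matches afterwards.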

A few wrappers search pre-defined patterns and add an extra step to expand matched ranges. separate_refs matches references within brackets using \\[[0-9, -]+\\] and expands ranges like [7-9].

options(width=100)
x <- separate_refs(txt)
x
filter(x, id == 8)
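The range expansion step can be sketched in base R as follows. The helper expand_refs is hypothetical and only illustrates the idea of turning a bracketed citation such as [7-9] into individual reference numbers; it is not the code used by separate_refs.

```r
# Hypothetical helper: expand one bracketed citation into reference numbers
expand_refs <- function(x) {
  # Drop the brackets, then split the list on commas
  parts <- trimws(strsplit(gsub("\\[|\\]", "", x), ",")[[1]])
  ids <- lapply(parts, function(p) {
    if (grepl("-", p, fixed = TRUE)) {
      # A range such as "7-9" becomes the full sequence 7, 8, 9
      ends <- as.integer(strsplit(p, "-", fixed = TRUE)[[1]])
      seq(ends[1], ends[2])
    } else {
      as.integer(p)
    }
  })
  unlist(ids)
}

# Locate bracketed citations in a made-up sentence, then expand each one
s <- "Iron uptake has been reviewed [2, 5-7, 11] and revisited [7-9]."
cites <- regmatches(s, gregexpr("\\[[0-9, -]+\\]", s))[[1]]
expanded <- lapply(cites, expand_refs)
expanded
```

After expansion, each sentence can be linked to every reference it cites, which is what makes a lookup by a single reference id possible.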

separate_genes expands microbial gene operons like hmsHFRS into four separate genes.

options(width=100)
separate_genes(txt)
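A minimal sketch of the operon expansion, assuming an operon name is a three-letter stem followed by one capital letter per gene. The helper expand_operon is hypothetical, and the pattern that separate_genes actually uses to recognize operons may differ.

```r
# Hypothetical helper: split an operon name into its individual genes
expand_operon <- function(x) {
  stem <- substr(x, 1, 3)                        # e.g. "hms"
  suffix <- strsplit(substring(x, 4), "")[[1]]   # e.g. "H" "F" "R" "S"
  paste0(stem, suffix)
}

genes <- expand_operon("hmsHFRS")
genes
```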

Finally, separate_tags expands locus tag ranges.

options(width=100)
collapse_rows(tab1, na.string="-") %>%
  separate_tags("YPO")
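Expanding a locus tag range can be sketched in base R as below. The helper expand_tags and the example range are made up for illustration, and the sketch assumes both ends of the range share the same prefix and the same number of digits; separate_tags may handle more cases than this.

```r
# Hypothetical helper: expand "YPO0281-YPO0283" into every tag in the range
expand_tags <- function(x, prefix = "YPO") {
  # Pull out the numeric part of the start and end tags
  nums <- regmatches(x, gregexpr("[0-9]+", x))[[1]]
  width <- nchar(nums[1])
  ids <- seq(as.integer(nums[1]), as.integer(nums[2]))
  # Re-pad with leading zeros so the tags keep their original width
  paste0(prefix, formatC(ids, width = width, flag = "0"))
}

tags <- expand_tags("YPO0281-YPO0283")
tags
```

Padding matters here because converting "0281" to a number drops the leading zero, and the tag would otherwise no longer match the identifiers used elsewhere.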

Using `xml2`

The pmc_* functions use the xml2 package for parsing and may fail in some situations, so it helps to know how to parse an xml_document directly. Use cat and as.character to view the raw markup of nodes returned by xml_find_all.

library(xml2)
refs <- xml_find_all(doc, "//ref")
refs[1]
cat(as.character(refs[1]))
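The same inspection steps can be tried offline on a small inline document. The XML below is made up for illustration and is much simpler than a real PubMed Central reference list, so the tag names inside each ref are assumptions.

```r
library(xml2)

# Made-up reference list; real PMC references contain many more tags
xml <- '<ref-list>
  <ref id="B1"><mixed-citation><surname>Smith</surname> <year>2001</year> <article-title>A made-up title</article-title></mixed-citation></ref>
  <ref id="B2"><mixed-citation><surname>Jones</surname> <year>2005</year> <article-title>Another made-up title</article-title></mixed-citation></ref>
</ref-list>'
doc0 <- read_xml(xml)

refs0 <- xml_find_all(doc0, "//ref")

# Print the raw markup of the first node to see which tags are available
cat(as.character(refs0[1]))

# Pull an attribute and one child value from every reference
ids <- xml_attr(refs0, "id")
years <- xml_text(xml_find_first(refs0, ".//year"))
ids
years
```

Viewing the raw markup first is useful because journals tag references differently, and the XPath expression has to match the tags that are actually present.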

Many journals cite references with superscript numbers. When the text is extracted, the number runs directly into the preceding word, as in results9 below.

# doc1 <- pmc_xml("PMC6385181")
doc1 <- read_xml(system.file("extdata/PMC6385181.xml", package = "tidypmc"))
gsub(".*\\. ", "", xml_text(xml_find_all(doc1, "//sec/p"))[2])

Find the tags using xml_find_all and then update the nodes by adding brackets or other text.

bib <- xml_find_all(doc1, "//xref[@ref-type='bibr']")
bib[1]
xml_text(bib) <- paste0(" [", xml_text(bib), "]")
bib[1]

The text is now separated from the reference. Note that the pmc_text function adds the brackets by default.

gsub(".*\\. ", "", xml_text(xml_find_all(doc1, "//sec/p"))[2])
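The same find-and-update approach can be applied to other inline tags. The sketch below marks superscripts with a caret and subscripts with an underscore in a made-up paragraph; the sup and sub tag names and the exact marker format are assumptions, and the markers written by pmc_text may look different.

```r
library(xml2)

# Made-up paragraph with one superscript and two subscripts
doc2 <- read_xml(
  "<article><sec><p>Cells were grown with 10 mM Ca<sup>2+</sup> and H<sub>2</sub>O<sub>2</sub>.</p></sec></article>"
)
before <- xml_text(xml_find_all(doc2, "//sec/p"))

# Prefix the text of each node so the markup survives text extraction
sup <- xml_find_all(doc2, "//sup")
xml_text(sup) <- paste0("^", xml_text(sup))
sub <- xml_find_all(doc2, "//sub")
xml_text(sub) <- paste0("_", xml_text(sub))

after <- xml_text(xml_find_all(doc2, "//sec/p"))
before
after
```

The update modifies the document in place, so read the file again if the original text is still needed.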

Genes, species and many other terms are often enclosed in italic tags. You can mark these nodes using the same code as above, or simply list all the italicized names and search the text or tables for matches. The example below searches the text for three-letter gene names.

library(tibble)
x <- xml_name(xml_find_all(doc, "//*"))
tibble(tag=x) %>%
  count(tag, sort=TRUE)
it <- xml_text(xml_find_all(doc, "//sec//p//italic"), trim=TRUE)
it2 <- tibble(italic=it) %>%
  count(italic, sort=TRUE)
it2
filter(it2, nchar(italic) == 3)
separate_text(txt, c("fur", "cys", "hmu", "ybt", "yfe", "yfu", "ymt"))
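The italic workflow can also be tried offline. The sketch below lists the italicized terms in a made-up paragraph, keeps the three-letter names and combines them into a single regular expression with word boundaries. The XML, the variable names and the decision to combine names with | are assumptions for this example, not a description of how separate_text handles a vector of patterns.

```r
library(xml2)

# Made-up paragraph with gene and species names in italic tags
doc3 <- read_xml(
  "<article><sec><p>The <italic>fur</italic> and <italic>hmu</italic> loci of <italic>Yersinia pestis</italic> respond to iron; <italic>fur</italic> is a repressor.</p></sec></article>"
)

# Count each italicized term, most frequent first
it <- xml_text(xml_find_all(doc3, "//sec//p//italic"), trim = TRUE)
counts <- sort(table(it), decreasing = TRUE)

# Keep three-letter names and build one pattern that matches any of them
genes <- names(counts)[nchar(names(counts)) == 3]
pattern <- paste0("\\b(", paste(genes, collapse = "|"), ")\\b")

# Search the plain paragraph text for the gene names
p_text <- xml_text(xml_find_all(doc3, "//sec/p"))
found <- regmatches(p_text, gregexpr(pattern, p_text))[[1]]
found
```

The word boundaries keep a short name such as fur from matching inside a longer word such as sulfur.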