knitr::opts_chunk$set(collapse = TRUE, comment = "# ")
The `tidypmc` package parses XML documents in the Open Access subset of PubMed Central. Download the full text using `pmc_xml`.
library(tidypmc)
doc <- pmc_xml("PMC2231364")
doc
The package includes five functions to parse the `xml_document`.
|R function      |Description                                                                |
|:---------------|:--------------------------------------------------------------------------|
|`pmc_text`      |Split section paragraphs into sentences with full path to subsection titles|
|`pmc_caption`   |Split figure, table and supplementary material captions into sentences     |
|`pmc_table`     |Convert table nodes into a list of tibbles                                 |
|`pmc_reference` |Format references cited into a tibble                                      |
|`pmc_metadata`  |List journal and article metadata in front node                            |
`pmc_text` splits paragraphs into sentences. It also removes any tables, figures or
formulas nested within paragraph tags, replaces superscripted references with
bracketed numbers, marks other superscripts and subscripts with carets and
underscores, and includes the full path to the subsection title.
options(width=100)
library(dplyr)
txt <- pmc_text(doc)
txt
count(txt, section)
`pmc_caption` splits figure, table and supplementary material captions into sentences.
options(width=100)
cap1 <- pmc_caption(doc)
filter(cap1, sentence == 1)
`pmc_table` formats tables by collapsing multiline headers, expanding rowspan and
colspan attributes and adding subheadings into a new column.
options(width=100)
tab1 <- pmc_table(doc)
sapply(tab1, nrow)
tab1[[1]]
Captions and footnotes are added as attributes.
attributes(tab1[[1]])
Use `collapse_rows` to join the column names and cell values of each row into a
semicolon-delimited string, which you can then search using the functions in the
next section.
options(width=100)
collapse_rows(tab1, na.string="-")
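To make the idea of a collapsed row concrete, here is a minimal base R sketch. It is not the `collapse_rows` implementation: the table is made up, `collapse_row_sketch` is a hypothetical helper, and both the `name=value` layout and the rule of skipping cells that are `NA` or equal to `na.string` are assumptions, so the real output format may differ.

```r
# Hypothetical stand-in for one parsed table (not data from the article)
tab <- data.frame(
  Tag  = c("YPO0001", "YPO0002"),
  Gene = c("fur", NA),
  stringsAsFactors = FALSE
)

# Join "name=value" pairs for each row into one semicolon-delimited string.
# collapse_row_sketch is an illustrative helper, not a tidypmc function.
collapse_row_sketch <- function(df, na.string = "-") {
  apply(df, 1, function(row) {
    # Assumption: skip missing cells and cells equal to na.string
    keep <- !is.na(row) & row != na.string
    paste(paste0(names(df)[keep], "=", row[keep]), collapse = "; ")
  })
}

collapsed <- collapse_row_sketch(tab)
collapsed
```

Because every row becomes a single string, any regular expression search that works on sentences can be reused on tables.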
`pmc_reference` extracts the id, pmid, authors, year, title, journal, volume, pages,
and DOI from reference tags.
options(width=100)
ref1 <- pmc_reference(doc)
ref1
Finally, `pmc_metadata` saves journal and article metadata to a list.
pmc_metadata(doc)
There are a few functions to search within the `pmc_text` or collapsed `pmc_table` output.
`separate_text` uses the stringr package to extract any text matching a regular expression.
options(width=100)
separate_text(txt, "[ATCGN]{5,}")
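The same kind of extraction can be sketched in base R without downloading an article. This is an illustration only: the sentences are made up, and the code uses `gregexpr` and `regmatches` rather than stringr, so it is not how `separate_text` itself is written.

```r
# Made-up sentences standing in for the sentence column of pmc_text output
sentences <- c(
  "The primer ATCGGATTCA was used for amplification.",
  "No sequence is given here.",
  "Motifs GGATN and TTTTTTAA were both found."
)

pattern <- "[ATCGN]{5,}"

# gregexpr finds every match; regmatches extracts the matched substrings
hits <- regmatches(sentences, gregexpr(pattern, sentences))

# One row per match, keeping the index of the sentence it came from
matches <- data.frame(
  sentence = rep(seq_along(sentences), lengths(hits)),
  match    = unlist(hits),
  stringsAsFactors = FALSE
)
matches
```

Sentences without a match drop out of the result, and a sentence with several matches contributes one row per match.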
A few wrappers search pre-defined patterns and add an extra step to expand matched ranges. `separate_refs` matches references within brackets using `\\[[0-9, -]+\\]` and expands ranges like `[7-9]`.
options(width=100)
x <- separate_refs(txt)
x
filter(x, id == 8)
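The range expansion step can be sketched in a few lines of base R. `expand_refs_sketch` is a hypothetical helper written for this illustration, and the sentence is made up. Only the bracket pattern comes from the text above.

```r
# expand_refs_sketch is a hypothetical helper, not part of tidypmc.
# It takes one bracketed citation such as "[2, 7-9]" and returns every
# reference number it covers.
expand_refs_sketch <- function(citation) {
  inner <- gsub("\\[|\\]", "", citation)        # drop the brackets
  parts <- trimws(strsplit(inner, ",")[[1]])    # split on commas
  ids <- lapply(parts, function(p) {
    ends <- as.integer(strsplit(p, "-")[[1]])
    # A part with two ends is a range; otherwise it is a single number
    if (length(ends) == 2) seq(ends[1], ends[2]) else ends
  })
  unlist(ids)
}

sentence <- "Iron uptake is regulated by Fur [2, 7-9] in many bacteria [12]."

# Same bracket pattern that the text quotes for separate_refs
cites <- regmatches(sentence, gregexpr("\\[[0-9, -]+\\]", sentence))[[1]]
lapply(cites, expand_refs_sketch)
```

Expanding the range is what makes a filter on a single reference id find sentences that only cite it as part of a range.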
`separate_tags` expands locus tag ranges.
options(width=100)
collapse_rows(tab1, na.string="-") %>%
  separate_tags("YPO")
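A locus tag range can be expanded the same way as a reference range, with the added step of restoring the prefix and zero padding. The helper below is hypothetical and only a sketch: it assumes both ends of the range share the prefix and the same number of digits, which may not hold for every table.

```r
# expand_tags_sketch is a hypothetical helper, not part of tidypmc.
# It expands a range such as "YPO1994-YPO1996" into individual locus tags,
# assuming both ends share the prefix and use the same number of digits.
expand_tags_sketch <- function(range, prefix = "YPO") {
  ends <- strsplit(range, "-")[[1]]
  digits <- sub(prefix, "", ends, fixed = TRUE)   # keep the numeric part
  width <- nchar(digits[1])                       # remember zero padding
  nums <- seq(as.integer(digits[1]), as.integer(digits[length(digits)]))
  paste0(prefix, formatC(nums, width = width, flag = "0"))
}

expand_tags_sketch("YPO1994-YPO1996")
expand_tags_sketch("YPO0010")
```

A single tag passes through unchanged, so the helper can be applied to every match without first checking whether it is a range.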
xml2
The `pmc_*` functions use the xml2 package for parsing and may fail in some situations, so
it helps to know how to parse `xml_documents`. Use `cat` and `as.character` to view nodes
returned by `xml_find_all`.
library(xml2)
refs <- xml_find_all(doc, "//ref")
refs[1]
cat(as.character(refs[1]))
Many journals use superscripts for cited references, so in the extracted text the
reference number is fused to the preceding word, like `results9` below.
# doc1 <- pmc_xml("PMC6385181")
doc1 <- read_xml(system.file("extdata/PMC6385181.xml", package = "tidypmc"))
gsub(".*\\. ", "", xml_text(xml_find_all(doc1, "//sec/p"))[2])
Find the tags using `xml_find_all` and then update the nodes by adding brackets
or other text.
bib <- xml_find_all(doc1, "//xref[@ref-type='bibr']")
bib[1]
xml_text(bib) <- paste0(" [", xml_text(bib), "]")
bib[1]
The text is now separated from the reference. Note the `pmc_text`
function adds the brackets by default.
gsub(".*\\. ", "", xml_text( xml_find_all(doc1, "//sec/p"))[2])
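For experimenting without the packaged article, the same steps can be run on a tiny inline document. The paragraph below is made up and only loosely imitates the article markup. The `xref` element and its `ref-type='bibr'` attribute are the only parts taken from the example above.

```r
library(xml2)

# Made-up paragraph (not taken from the article)
p <- read_xml(
  "<p>We confirmed earlier results<xref ref-type='bibr' rid='CR9'>9</xref> here.</p>"
)

# Before the update, the citation number is fused to the preceding word
before <- xml_text(p)
before

# Wrap the text of each citation node in brackets with a leading space
bib <- xml_find_all(p, "//xref[@ref-type='bibr']")
xml_text(bib) <- paste0(" [", xml_text(bib), "]")

# After the update, the citation is bracketed and separated from the word
after <- xml_text(p)
after
```

Note that the assignment modifies the document in place, so `p` itself changes even though only `bib` appears on the left-hand side.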
Genes, species and many other terms are often included within italic tags. You can mark these nodes using the same code as above, or simply list all the names in italics and search the text or tables for matches, for example the three-letter gene names in the text below.
library(tibble)
x <- xml_name(xml_find_all(doc, "//*"))
tibble(tag=x) %>% count(tag, sort=TRUE)
it <- xml_text(xml_find_all(doc, "//sec//p//italic"), trim=TRUE)
it2 <- tibble(italic=it) %>% count(italic, sort=TRUE)
it2
filter(it2, nchar(italic) == 3)
separate_text(txt, c("fur", "cys", "hmu", "ybt", "yfe", "yfu", "ymt"))