The Open Access subset of PubMed Central (PMC) includes 2.5 million articles
from biomedical and life sciences journals. The full text XML files are freely
available for text mining from the REST service or FTP site but can be
challenging to parse. For example, section tags are nested to arbitrary depths,
formulas and tables may return incomprehensible text blobs, and superscripted
references are pasted at the end of words. The functions in the tidypmc
package are intended to return readable text and maintain the document
structure, so gene names and other terms can be associated with specific
sections, paragraphs, sentences or table rows.
Use remotes to install the package.
remotes::install_github("ropensci/tidypmc")
Download a single XML document like PMC2231364 from the REST service using
the `pmc_xml` function.
options(width=100)
library(tidypmc)
library(tidyverse)
doc <- pmc_xml("PMC2231364")
doc
The europepmc package includes additional functions to search PMC
and download full text. Be sure to include the `OPEN_ACCESS` field in
the search since these are the only articles with full text XML available.
options(width=100)
library(europepmc)
yp <- epmc_search("title:(Yersinia pestis virulence) OPEN_ACCESS:Y")
select(yp, pmcid, pubYear, title) %>%
  print(n=5)
Save all of the search results to a list of XML documents using the `epmc_ftxt`
or `pmc_xml` function.
docs <- map(yp$pmcid, epmc_ftxt)
See the PMC FTP vignette for details on parsing the large XML files on the FTP site with 10,000 articles each.
The package includes five functions to parse the `xml_document`.

|R function      |Description                                                                |
|:---------------|:--------------------------------------------------------------------------|
|`pmc_text`      |Split section paragraphs into sentences with full path to subsection titles|
|`pmc_caption`   |Split figure, table and supplementary material captions into sentences     |
|`pmc_table`     |Convert table nodes into a list of tibbles                                 |
|`pmc_reference` |Format references cited into a tibble                                      |
|`pmc_metadata`  |List journal and article metadata in front node                            |
The `pmc_text` function uses the tokenizers package to split section paragraphs into
sentences. The function also removes any tables, figures or formulas that are nested
within paragraph tags, replaces superscripted references with brackets, adds carets and
underscores to other superscripts and subscripts, and includes the full path to the
subsection title.
options(width=110)
txt <- pmc_text(doc)
txt
count(txt, section, sort=TRUE)
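To make the idea of a "full path to the subsection title" concrete, here is a minimal base R sketch. It is not the package's implementation: the nested list `toy_doc`, the helper `section_paths`, and the `"; "` separator are all assumptions chosen for illustration.

```r
# Toy nested structure standing in for nested <sec> nodes (hypothetical data).
toy_doc <- list(
  title = "Results",
  paragraphs = c("Overview paragraph."),
  subsections = list(
    list(
      title = "Microarray analysis",
      paragraphs = c("First finding.", "Second finding."),
      subsections = list()
    )
  )
)

# Hypothetical helper: walk the sections recursively, carrying the titles
# seen so far, and return one row per paragraph with its section path.
section_paths <- function(sec, path = character()) {
  path <- c(path, sec$title)
  here <- data.frame(
    section = rep(paste(path, collapse = "; "), length(sec$paragraphs)),
    text = sec$paragraphs,
    stringsAsFactors = FALSE
  )
  below <- lapply(sec$subsections, section_paths, path = path)
  do.call(rbind, c(list(here), below))
}

paths <- section_paths(toy_doc)
paths
```

Because the path travels with each paragraph, a term found in a paragraph can be traced back to every enclosing section, however deep the nesting.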
Load the tidytext package for further text processing.
options(width=110)
library(tidytext)
x1 <- unnest_tokens(txt, word, text) %>%
  anti_join(stop_words) %>%
  filter(!word %in% 1:100)
filter(x1, str_detect(section, "^Results"))
filter(x1, str_detect(section, "^Results")) %>%
  count(word, sort = TRUE)
The `pmc_table` function formats tables by collapsing multiline headers,
expanding rowspan and colspan attributes and adding subheadings into a new column.
options(width=110)
tbls <- pmc_table(doc)
map_int(tbls, nrow)
tbls[[1]]
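Collapsing a multiline header can be pictured with a small base R sketch. The header matrix below is hypothetical, and joining the cells with a space is an assumption rather than the package's documented behavior.

```r
# Two header rows from a hypothetical table; "" marks an empty header cell.
header <- rbind(
  c("Gene", "Expression", "Expression"),
  c("",     "26C",        "37C")
)

# Paste the non-empty cells of each column into a single column name.
col_names <- apply(header, 2, function(h) paste(h[h != ""], collapse = " "))
col_names
```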
Use `collapse_rows` to join column names and cell values in a semicolon-delimited
string (and then search using functions in the next section).
options(width=110)
collapse_rows(tbls, na.string="-")
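The following base R sketch shows one way a table row could be joined into a single searchable string. The toy table, the helper `collapse_one`, and the `name=value` layout are assumptions for illustration, and the way `na.string` is treated here (as a cell value to skip) may differ from the package.

```r
# Hypothetical table with one missing value.
toy_tbl <- data.frame(
  Gene = c("tauD", "ureA"),
  Fold = c("2.5", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join "column=value" pairs with semicolons,
# skipping missing cells and cells equal to na.string.
collapse_one <- function(row, na.string = "-") {
  keep <- !is.na(row) & row != na.string
  paste(paste0(names(row)[keep], "=", row[keep]), collapse = "; ")
}

collapsed <- unname(apply(toy_tbl, 1, collapse_one))
collapsed
```

Keeping the column name next to each value means a later pattern match still shows which column the hit came from.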
The other three `pmc` functions are described in the package vignette.
There are a few functions to search within the `pmc_text` or collapsed
`pmc_table` output. `separate_text` uses the stringr package to extract any
regular expression or vector of words.
options(width=110)
separate_text(txt, "[ATCGN]{5,}")
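For readers who want to see the underlying idea without the package, the same pattern can be applied with base R. This is only a rough analogue of `separate_text`: the sentences are made up, and the output columns are hypothetical names, not the package's.

```r
# Made-up sentences; only the first and third contain nucleotide runs.
sentences <- c(
  "The primer ATCGGCTA was used.",
  "No sequence here.",
  "Probes GGGTTTAAAC and NNNNNACGT."
)

# Extract every run of five or more A, T, C, G or N characters.
m <- regmatches(sentences, gregexpr("[ATCGN]{5,}", sentences))

# One row per match, keeping the index of the sentence it came from.
hits <- data.frame(
  match = unlist(m),
  sentence = rep(seq_along(m), lengths(m)),
  stringsAsFactors = FALSE
)
hits
```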
A few wrappers search pre-defined patterns and add an extra step to expand
matched ranges. `separate_refs` matches references within brackets using
`\\[[0-9, -]+\\]` and expands ranges like `[7-9]`.
options(width=110)
separate_refs(txt)
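The expansion step can be sketched in base R as follows. `expand_refs` is a hypothetical helper written for this example, not a tidypmc function, and the sample sentence is invented.

```r
# Hypothetical helper: expand a bracketed citation string such as
# "[1, 3-5]" into the individual reference numbers.
expand_refs <- function(x) {
  inner <- gsub("\\[|\\]", "", x)              # drop the brackets
  parts <- trimws(strsplit(inner, ",")[[1]])   # split on commas
  nums <- lapply(parts, function(p) {
    ends <- as.integer(strsplit(p, "-")[[1]])
    if (length(ends) == 2) seq(ends[1], ends[2]) else ends
  })
  unlist(nums)
}

# Find bracketed references with the same pattern, then expand each one.
sentence <- "Plague is caused by Yersinia pestis [7-9] and related species [1, 3]."
found <- regmatches(sentence, gregexpr("\\[[0-9, -]+\\]", sentence))[[1]]
found
lapply(found, expand_refs)
```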
`separate_genes` will find microbial genes like tauD (with a
capitalized 4th letter) and expand operons like `tauABCD` into
four genes. `separate_tags` will find and expand locus tag ranges, as in the example below.
options(width=110)
collapse_rows(tbls, na.string="-") %>%
  separate_tags("YPO") %>%
  filter(id == "YPO1855")
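Both expansions can be illustrated with short base R helpers. `expand_operon` and `expand_tags` are hypothetical names used only in this sketch, and the sketch assumes a three-letter gene stem and a `PREFIXnnnn-PREFIXnnnn` range format.

```r
# Hypothetical helper: "tauABCD" -> "tauA" "tauB" "tauC" "tauD"
expand_operon <- function(x) {
  stem <- substr(x, 1, 3)                        # three-letter lowercase stem
  suffix <- strsplit(substring(x, 4), "")[[1]]   # one capital letter per gene
  paste0(stem, suffix)
}

# Hypothetical helper: "YPO1853-YPO1856" -> "YPO1853" ... "YPO1856"
expand_tags <- function(x, prefix = "YPO") {
  digits <- sub(prefix, "", strsplit(x, "-")[[1]], fixed = TRUE)
  ends <- as.integer(digits)
  # Keep the zero-padded width of the first tag in the range.
  fmt <- paste0(prefix, "%0", nchar(digits[1]), "d")
  sprintf(fmt, seq(ends[1], ends[2]))
}

expand_operon("tauABCD")
expand_tags("YPO1853-YPO1856")
```

Expanding ranges this way is what lets a filter on a single locus tag, such as `YPO1855`, match a table row that only mentions the range.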
See the vignette for more details including code to parse XML documents using the xml2 package. The PMC FTP vignette has details on parsing XML files at the Europe PMC FTP site.
This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. Feedback, bug reports, and feature requests are welcome here.