knitr::opts_chunk$set( collapse = TRUE, comment = "# " )
The Europe PMC FTP includes 2.5 million open access articles separated into
files with 10K articles each. Download and unzip a recent series of PMC ids
and load into R using the readr
package. A sample file with the first 10
articles is included in the tidypmc
package.
library(readr) pmcfile <- system.file("extdata/PMC6358576_PMC6358589.xml", package = "tidypmc") pmc <- read_lines(pmcfile)
Find the start of the article nodes.
a1 <- grep("^<article ", pmc) head(a1) n <- length(a1) n
Read a single article by collapsing the lines into a new line separated string.
options(width=100) library(xml2) x1 <- paste(pmc[2:29], collapse="\n") doc <- read_xml(x1) doc
Loop through the articles and save the metadata and text below. All 10K articles takes about 10 minutes to run on a Mac laptop and returns 1.7M sentences.
library(tidypmc) a1 <- c(a1, length(pmc)) met1 <- vector("list", n) txt1 <- vector("list", n) for(i in seq_len(n)){ doc <- read_xml(paste(pmc[a1[i]:(a1[i+1]-1)], collapse="\n")) m1 <- pmc_metadata(doc) id <- m1$PMCID message("Parsing ", i, ". ", id) met1[[i]] <- m1 txt1[[i]] <- pmc_text(doc) }
Combine the list of metadata and text into tables.
options(width=100) library(dplyr) met <- bind_rows(met1) names(txt1) <- met$PMCID txt <- bind_rows(txt1, .id="PMCID") met txt
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.