Parsing Europe PMC FTP files
In tidypmc: Parse Full Text XML Documents from PubMed Central


The Europe PMC FTP site includes 2.5 million open access articles, split into files of 10K articles each. Download and unzip a file covering a recent series of PMC IDs, then load it into R with the readr package. A sample file with the first 10 articles is included in the tidypmc package.

library(readr)
pmcfile <- system.file("extdata/PMC6358576_PMC6358589.xml", package = "tidypmc")
pmc <- read_lines(pmcfile)
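
As a minimal sketch, assuming the downloaded file is gzip-compressed, the separate unzip step can be skipped because readr decompresses files ending in `.gz` on its own. The file written below is a toy stand-in created in a temporary directory (its name and contents are hypothetical, not real Europe PMC data).

```r
library(readr)

# Toy stand-in for a downloaded file (hypothetical content, not real PMC data)
toy <- c("<articles>", "<article id=\"a\">", "</article>", "</articles>")
toy_path <- file.path(tempdir(), "toy_articles.xml.gz")

# Write the toy lines through a gzip connection
con <- gzfile(toy_path, "w")
writeLines(toy, con)
close(con)

# read_lines decompresses .gz files based on the file extension
toy_lines <- read_lines(toy_path)
length(toy_lines)
```

Reading the compressed file directly saves disk space, at the cost of decompressing again on every read.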

Find the start of the article nodes.

a1 <- grep("^<article ", pmc)
head(a1)
n <- length(a1)
n
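
The start positions are enough to work out where each article ends. The sketch below uses toy lines and assumes the layout the later loop relies on: a root tag on the first line, one `<article ...>` opening tag per article at the start of a line, and a closing root tag on the last line. The `<articles>` root tag name is an assumption for illustration.

```r
# Toy lines mimicking the assumed layout: root tag, three articles, closing root tag
toy <- c("<articles>",
         "<article id=\"a\">", "<p>one</p>", "</article>",
         "<article id=\"b\">", "</article>",
         "<article id=\"c\">", "<p>two</p>", "<p>three</p>", "</article>",
         "</articles>")

# The trailing space in the pattern keeps "<articles>" from matching
starts <- grep("^<article ", toy)

# Each article ends one line before the next one starts; the last article
# ends one line before the closing root tag on the final line
ends <- c(starts[-1], length(toy)) - 1

data.frame(start = starts, end = ends, n_lines = ends - starts + 1)
```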

Read a single article by collapsing its lines into one newline-separated string.

options(width=100)
library(xml2)
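# Lines of the first article: from its start up to the line before the second article
x1 <- paste(pmc[a1[1]:(a1[2] - 1)], collapse = "\n")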
doc <- read_xml(x1)
doc
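
Once a string is parsed, the xml2 functions can be used to look inside the document. The following is a small sketch on a toy article; the element names follow the general shape of a journal article XML file but the content is made up for illustration.

```r
library(xml2)

# Toy article (hypothetical content), collapsed into one string as above
toy_article <- paste(c(
  "<article article-type=\"research-article\">",
  "<front><article-meta><title-group>",
  "<article-title>A toy title</article-title>",
  "</title-group></article-meta></front>",
  "<body><p>First paragraph.</p><p>Second paragraph.</p></body>",
  "</article>"), collapse = "\n")

toy_doc <- read_xml(toy_article)

xml_name(toy_doc)                                     # name of the root element
xml_attr(toy_doc, "article-type")                     # attribute on the root element
xml_text(xml_find_first(toy_doc, "//article-title"))  # first matching node
xml_text(xml_find_all(toy_doc, "//body/p"))           # all body paragraphs
```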

Loop through the articles and save the metadata and text of each one. Processing all 10K articles takes about 10 minutes on a Mac laptop and returns 1.7M sentences.

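library(tidypmc)
# Append the last line number so the final article also has an end boundary
a1 <- c(a1, length(pmc))
# Preallocate one list slot per article
met1 <- vector("list", n)
txt1 <- vector("list", n)
for (i in seq_len(n)) {
  # Article i runs from its start line to the line before the next boundary
  doc <- read_xml(paste(pmc[a1[i]:(a1[i + 1] - 1)], collapse = "\n"))
  m1 <- pmc_metadata(doc)
  id <- m1$PMCID
  message("Parsing ", i, ". ", id)
  met1[[i]] <- m1
  txt1[[i]] <- pmc_text(doc)
}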
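
On a long run, one malformed article would stop the whole loop. One possible guard, shown here as a sketch rather than as part of tidypmc, is to wrap the parsing step in `tryCatch` and record which articles failed. The toy strings below are hypothetical, and the second one is malformed on purpose.

```r
library(xml2)

toy_articles <- c(
  "<article><p>ok</p></article>",
  "<article><p>unclosed</article>",   # malformed on purpose
  "<article><p>also ok</p></article>")

# lapply keeps a NULL placeholder for every article that fails to parse
parsed <- lapply(seq_along(toy_articles), function(i) {
  tryCatch(
    read_xml(toy_articles[i]),
    error = function(e) {
      message("Skipping article ", i, ": ", conditionMessage(e))
      NULL
    })
})

# Indices of the articles that could not be parsed
failed <- vapply(parsed, is.null, logical(1))
which(failed)
```

Using `lapply` instead of assigning into a preallocated list avoids a pitfall: assigning `NULL` to a list element with `[[<-` removes that element rather than storing a placeholder.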

Combine the lists of metadata and text into tables.

options(width=100)
library(dplyr)
met <- bind_rows(met1)
names(txt1) <- met$PMCID
txt <- bind_rows(txt1, .id="PMCID")
met
txt
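
The `.id` argument is what carries the article identifier into the combined table: `bind_rows` takes the names of the list and writes them into a new column. A toy illustration follows, with made-up identifiers and hypothetical column names standing in for the real output.

```r
library(dplyr)

# Named list of small tables, one per article (hypothetical IDs and columns)
toy_txt <- list(
  PMC0000001 = tibble(section = c("Title", "Abstract"), text = c("A", "B")),
  PMC0000002 = tibble(section = "Title", text = "C"))

# The list names become the values of the new PMCID column
toy_tbl <- bind_rows(toy_txt, .id = "PMCID")
toy_tbl

# Number of rows contributed by each article
count(toy_tbl, PMCID)
```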


tidypmc documentation built on Aug. 1, 2019, 5:05 p.m.
