In ropensci/pubchunks: Fetch Sections of XML Scholarly Articles

knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  cache.path = "inst/cache/"
)

pubchunks

Get chunks of XML articles

Package API

cat(paste(" -", paste(getNamespaceExports("pubchunks"), collapse = "\n - ")))

The main workhorse function is pub_chunks(). It pulls sections out of articles from many different publishers (see the next section) without requiring you to know how to parse or navigate XML. XML has a steep learning curve, and working out how to reach a particular part of an XML document can take a fair amount of searching.
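
To give a sense of the manual XML work that pub_chunks() is meant to spare you, here is a minimal sketch that uses the xml2 package directly. The XML is a made-up, JATS-like fragment, and the XPath expressions are assumptions that fit only this toy document. Real publisher XML varies, and this is not a description of how pubchunks works internally.

```r
library(xml2)

# A made-up, JATS-like fragment (hypothetical; not taken from a real article)
doc <- read_xml('<article>
  <front>
    <article-meta>
      <title-group><article-title>A toy article</article-title></title-group>
      <abstract><p>Short abstract text.</p></abstract>
    </article-meta>
  </front>
</article>')

# Every section needs its own XPath, and the paths differ between publishers
title <- xml_text(xml_find_first(doc, "//title-group/article-title"))
abstract <- xml_text(xml_find_first(doc, "//abstract"))

title
abstract
```

With pub_chunks(), the same request is expressed by section name ("title", "abstract") rather than by XPath.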

The other main function is pub_tabularize(), which takes the output of pub_chunks() and coerces it into a data.frame for easier downstream processing.

Supported publishers/sources

eLife
PLOS
Entrez/Pubmed
Elsevier
Hindawi
Pensoft
PeerJ
Copernicus
Frontiers
F1000 Research

If you know of other publishers or sources that provide XML, let us know by opening an issue.

We'll continue adding publishers.

Installation

Stable version

install.packages("pubchunks")

Development version from GitHub

remotes::install_github("ropensci/pubchunks")

Load library

library('pubchunks')

Working with files

x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml", 
  package = "pubchunks")

pub_chunks(x, "abstract")
pub_chunks(x, "title")
pub_chunks(x, "authors")
pub_chunks(x, c("title", "refs"))

The output of pub_chunks() is a list with the S3 class pub_chunks, which makes internal work in the package easier. You can see the underlying list structure with unclass().
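
The effect of unclass() on a classed list can be illustrated in base R without the package. In this sketch the class name toy_chunks, its print method, and the field values are all hypothetical stand-ins, not part of pubchunks.

```r
# A stand-in for classed output: a named list carrying an S3 class.
# "toy_chunks" and the values below are hypothetical.
res <- structure(
  list(title = "A toy article", doi = "10.0000/example"),
  class = "toy_chunks"
)

# A custom print method summarises the object and hides the raw structure
print.toy_chunks <- function(x, ...) {
  cat("<toy chunks>\n  sections:", paste(names(x), collapse = ", "), "\n")
  invisible(x)
}

print(res)             # dispatches to print.toy_chunks
plain <- unclass(res)  # drops the class attribute, leaving a plain list
str(plain)             # shows the underlying list elements
```

The same idea applies to a pub_chunks object: unclass() removes the class attribute, so the default list printing takes over.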

Working with XML already in a string

xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")

Working with an xml2 class object

xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")

Working with output of fulltext::ft_get()

install.packages("fulltext")

library("fulltext")
x <- fulltext::ft_get('10.1371/journal.pone.0086169')
pub_chunks(fulltext::ft_collect(x), sections="authors")

Coerce pub_chunks output into data.frames

x <- system.file("examples/elife_1.xml", package = "pubchunks")
res <- pub_chunks(x, c("doi", "title", "keywords"))
pub_tabularize(res)
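
As a rough illustration of why a tabular form helps downstream, the base R sketch below coerces a named list with one multi-valued field into a data.frame. The list contents are made up, and this is only an analogy: it is not how pub_tabularize() is implemented, and the layout of its output may differ.

```r
# Hypothetical extracted sections: two single values and one multi-valued field
chunks <- list(
  doi = "10.0000/example",
  title = "A toy article",
  keywords = c("xml", "parsing", "text mining")
)

# data.frame() recycles the length-one elements to match the longest element,
# giving one row per keyword
df <- data.frame(chunks, stringsAsFactors = FALSE)
df
```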

Get a random XML article

library(rcrossref)
library(dplyr)

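# Ask Crossref for CC BY 4.0 works that advertise an XML full-text link
res <- cr_works(filter = list(
    full_text_type = "application/xml",
    license_url = "http://creativecommons.org/licenses/by/4.0/"))
# Keep only the links whose content type is XML
links <- bind_rows(res$data$link) %>% filter(content.type == "application/xml")
# Download a few of the articles to temporary files and run pub_chunks() on each
# (the indices assume the query returned at least 20 XML links)
download.file(links$URL[1], (i <- tempfile(fileext = ".xml")))
pub_chunks(i)
download.file(links$URL[13], (j <- tempfile(fileext = ".xml")))
pub_chunks(j)
download.file(links$URL[20], (k <- tempfile(fileext = ".xml")))
pub_chunks(k)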

unlink(i)
unlink(j)
unlink(k)
