pub_chunks: Extract chunks of data from articles
In ropensci/pubchunks: Fetch Sections of XML Scholarly Articles

Description

pub_chunks makes it easy to extract sections of an article. You can extract just authors across all articles, or all references sections, or the complete text of each article. Then you can pass the output downstream for visualization and analysis.

Usage

pub_chunks(x, sections = "all", provider = NULL, extract = "xml_text")

Arguments

`x`	one of the following: a file path for an XML file, a character string of XML, a list (of file paths, XML in character strings, or `xml_document` objects), or an object of class `fulltext::ft_data`, the output from a call to `fulltext::ft_get()`
`sections`	(character) The elements to get; can be one or more, in a vector or list. See `pub_sections()` for options. Optional. Default is to get all sections. See Details.
`provider`	(character) A single publisher name. See `pub_providers()` for options. Optional. If you select the wrong provider for the XML you have, you may or may not get what you need :). By default this is `NULL` and we use `pub_guess_publisher()` to guess the publisher; we may get it wrong. You can override our guessing by passing in a name.
`extract`	(character) One of 'xml_text' (default) or 'as.character'. The final step of extracting each part of an article is converting to a character string. By default, we'll use `xml2::xml_text()`, but if you prefer you can use `as.character()`. The latter can be useful if the chunk being extracted has HTML tags in it that you do not want removed.
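
To make the `provider` and `extract` arguments more concrete, here is a minimal sketch using an example file shipped with the package. The provider name "hindawi" is an assumption; the code checks it against `pub_providers()` before using it. The variable names `plain`, `tagged`, and `prov` are illustrative only.

```r
library(pubchunks)

# Example file shipped with the package (also used in the Examples section)
x <- system.file("examples/hindawi_1.xml", package = "pubchunks")

# Default extraction: xml2::xml_text() drops markup inside the chunk
plain <- pub_chunks(x, "abstract")$abstract

# Alternative extraction: as.character() keeps the tags of the chunk
tagged <- pub_chunks(x, "abstract", extract = "as.character")$abstract

# Compare the two results by character count
nchar(plain)
nchar(tagged)

# Override publisher guessing; "hindawi" is an assumed provider name,
# so only use it if pub_providers() actually lists it
prov <- "hindawi"
if (prov %in% pub_providers()) {
  pub_chunks(x, "title", provider = prov)
}
```

Leaving `provider = NULL` is usually the simpler choice; passing a name is mainly useful when the guess turns out to be wrong for your XML.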

Details

Options for the sections parameter:

front - Publisher, journal and article metadata elements
body - Body of the article
back - Back of the article, acknowledgments, author contributions, references
title - Article title
doi - Article DOI
categories - Publisher's categories, if any
authors - Authors
aff - Affiliation (includes author names)
keywords - Keywords
abstract - Article abstract
executive_summary - Article executive summary
refs - References
refs_dois - References DOIs - if available
publisher - Publisher name
journal_meta - Journal metadata
article_meta - Article metadata
acknowledgments - Acknowledgments
permissions - Article permissions
history - Dates: received, published, accepted, etc.
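
Because unrecognized section names are not returned, it can help to check names before calling `pub_chunks()`. The sketch below assumes `pub_sections()` returns a character vector of the accepted names listed above; `wanted`, `unknown`, and `valid` are illustrative variable names, and "summary" is a deliberately invalid entry.

```r
library(pubchunks)

# Section names we would like; "summary" is intentionally not a valid name
wanted <- c("title", "authors", "refs_dois", "summary")

# Names that the package does not list as accepted sections
unknown <- setdiff(wanted, pub_sections())
if (length(unknown) > 0) {
  message("Ignoring unknown sections: ", paste(unknown, collapse = ", "))
}

# Keep only the accepted names and extract those
valid <- intersect(wanted, pub_sections())
x <- system.file("examples/elife_1.xml", package = "pubchunks")
res <- pub_chunks(x, sections = valid)
names(res)
```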

Value

A list, named by the sections selected. Sections that are not found, or that are not in the accepted list, return NULL or a zero-length list. A ".publisher" list element is attached to each list output, even when no data is found. When fulltext::ft_get output is passed in, the list is named by publisher, and within each publisher is a list of articles named by their identifiers (e.g., DOIs).
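
The structure described above can be inspected as in the following sketch, which assumes a single file path as input. The variable name `res` is illustrative, and whether the example file actually contains keywords is not assumed, so the code checks for an empty result.

```r
library(pubchunks)

x <- system.file("examples/elsevier_1.xml", package = "pubchunks")
res <- pub_chunks(x, c("title", "keywords"))

# Element names: the requested sections plus ".publisher"
names(res)

# The publisher element attached to the output
res$.publisher

# A section that was not found comes back as NULL or zero length,
# so test its length before using it downstream
if (length(res$keywords) == 0) {
  message("No keywords found in this article")
} else {
  print(res$keywords)
}
```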

Examples

# a file path to an XML file
x <- system.file("examples/elsevier_1.xml", package = "pubchunks")
pub_chunks(x, "title")
pub_chunks(x, "authors")
pub_chunks(x, "acknowledgments")
pub_chunks(x, "refs")
pub_chunks(x, c("title", "refs"))

## Not run: 
# works the same with the xml already in a string
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")

# also works if you've already read in the XML (with xml2 pkg)
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")

# Hindawi
x <- system.file("examples/hindawi_1.xml", package = "pubchunks")
pub_chunks(x, "abstract")$abstract
pub_chunks(x, "abstract", extract="as.character")$abstract
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "title")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("abstract", "title", "authors", "refs"))

# Pensoft
x <- system.file("examples/pensoft_1.xml", package = "pubchunks")
pub_chunks(x, "abstract")
pub_chunks(x, "aff")
pub_chunks(x, "title")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("abstract", "title", "authors", "refs"))

# Peerj
x <- system.file("examples/peerj_1.xml", package = "pubchunks")
pub_chunks(x, "abstract")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "title")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("abstract", "title", "authors", "refs"))

# Frontiers
x <- system.file("examples/frontiers_1.xml", package = "pubchunks")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("doi", "abstract", "title", "authors", "refs"))

# eLife
x <- system.file("examples/elife_1.xml", package = "pubchunks")
pub_chunks(x, "authors")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("doi", "title", "authors", "refs"))

# f1000research
x <- system.file("examples/f1000research_3.xml", package = "pubchunks")
pub_chunks(x, "title")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
pub_chunks(x, c("doi", "title", "authors", "keywords", "refs"))

# Copernicus
x <- system.file("examples/copernicus_1.xml", package = "pubchunks")
pub_chunks(x, c("doi", "abstract", "title", "authors", "refs"))
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs

# MDPI
x <- system.file("examples/mdpi_1.xml", package = "pubchunks")
x <- system.file("examples/mdpi_2.xml", package = "pubchunks")
pub_chunks(x, "title")
pub_chunks(x, "aff")
pub_chunks(x, "refs")$refs
vv <- pub_chunks(x, c("doi", "title", "authors", "keywords", "refs", 
  "abstract", "categories"))
vv$doi
vv$title
vv$authors
vv$keywords
vv$refs
vv$abstract
vv$categories

# Many inputs at once
x <- system.file("examples/frontiers_1.xml", package = "pubchunks")
y <- system.file("examples/elife_1.xml", package = "pubchunks")
z <- system.file("examples/f1000research_1.xml", package = "pubchunks")
pub_chunks(list(x, y, z), c("doi", "title", "authors", "refs"))

# non-XML files/content are xxx?
# pub_chunks('foo bar')

# Pubmed brief XML files (abstract only)
x <- system.file("examples/pubmed_brief_1.xml", package = "pubchunks")
pub_chunks(x, "title")

# Pubmed full XML files
x <- system.file("examples/pubmed_full_1.xml", package = "pubchunks")
pub_chunks(x, "title")

# using output of fulltext::ft_get()
if (requireNamespace("fulltext", quietly = TRUE)) {
  library("fulltext")

  # single
  x <- fulltext::ft_get('10.7554/eLife.03032')
  pub_chunks(fulltext::ft_collect(x), sections="authors")

  # many
  dois <- c('10.1371/journal.pone.0086169', '10.1371/journal.pone.0155491', 
    '10.7554/eLife.03032')
  x <- fulltext::ft_get(dois)
  pub_chunks(fulltext::ft_collect(x), sections="authors")

  # as.ft_data() function
  x <- ft_collect(as.ft_data())
  names(x)
  x$cached
  pub_chunks(x, "title")
  # nested call avoids needing the %>% pipe from another package
  pub_tabularize(pub_chunks(x, "title"))
}

## End(Not run)

ropensci/pubchunks documentation built on Sept. 14, 2022, 7:48 a.m.