pmc_text: Split section paragraphs into sentences



Description

Splits the paragraph tags in each section into a table of subsection titles and sentences, using tokenize_sentences.

Usage

pmc_text(doc)

Arguments

doc

xml_document from PubMed Central

Value

a tibble with section, paragraph and sentence number and text

Note

Subsections may be nested to arbitrary depth; this function returns the full path to each subsection title as a delimited string such as "Results; Predicted functions; Pathogenicity". Tables, figures and formulas nested in section paragraphs are removed, superscripted references are replaced with brackets, and any other superscripts or subscripts are separated with ^ and _.
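
The delimited section path described above can be split back into its levels when only the top-level section or the innermost title is needed. The following is a minimal sketch in base R, not part of tidypmc. It assumes the "; " delimiter shown in the Note and uses an invented toy data frame in place of real pmc_text() output. The column names top_section, subsection and depth are hypothetical.

```r
# Toy stand-in for pmc_text() output; these rows are invented for illustration.
txt <- data.frame(
  section = c(
    "Abstract",
    "Results; Predicted functions",
    "Results; Predicted functions; Pathogenicity"
  ),
  paragraph = c(1L, 2L, 3L),
  sentence = c(1L, 1L, 1L),
  text = c("First sentence.", "Second sentence.", "Third sentence."),
  stringsAsFactors = FALSE
)

# Split each path on the "; " delimiter (assumed from the Note's example).
parts <- strsplit(txt$section, "; ", fixed = TRUE)

# Hypothetical helper columns: top-level section, innermost title, nesting depth.
txt$top_section <- vapply(parts, function(p) p[1], character(1))
txt$subsection <- vapply(parts, function(p) p[length(p)], character(1))
txt$depth <- lengths(parts)

txt[, c("top_section", "subsection", "depth")]
```

A title that itself contains "; " would be split incorrectly by this approach, so the result is worth checking against the original section column.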

Author(s)

Chris Stubben

Examples

# doc <- pmc_xml("PMC2231364")
doc <- xml2::read_xml(system.file("extdata/PMC2231364.xml",
  package = "tidypmc"
))
txt <- pmc_text(doc)
txt
dplyr::count(txt, section, sort = TRUE)
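
Because the result holds one sentence per row, it can be searched with ordinary pattern matching. The sketch below uses base R on an invented toy data frame rather than the packaged article. The column names section, paragraph, sentence and text are assumed from the Value description, and the keyword is arbitrary.

```r
# Toy stand-in for pmc_text() output; these rows are invented for illustration.
txt <- data.frame(
  section = c("Abstract", "Results; Pathogenicity", "Results; Pathogenicity"),
  paragraph = c(1L, 2L, 2L),
  sentence = c(1L, 1L, 2L),
  text = c(
    "We sequenced the genome.",
    "Several virulence genes were identified.",
    "Most were located on the plasmid."
  ),
  stringsAsFactors = FALSE
)

# Keep sentences in any "Results" section that mention the keyword.
in_results <- grepl("^Results", txt$section)
has_keyword <- grepl("virulence", txt$text, ignore.case = TRUE)
hits <- txt[in_results & has_keyword, ]

hits
```

The same filter could be written with dplyr::filter() if dplyr is already loaded for the count() call above.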

tidypmc documentation built on Aug. 1, 2019, 5:05 p.m.