doc2keywords: Document to Resolved keywords


Description

Run the term extractor on a document

Usage

doc2keywords(doc.file, ecoextract = getEcoExtractPyScript(),
  results.dir = character(), results.file = file.path(results.dir,
  gsub("xml$", "rds", basename(doc.file))), cache.dir = character(),
  cache.file = file.path(cache.dir, gsub("xml$", "rds", basename(doc.file))),
  section.text = load_text(doc.file, cache.file, cache.dir))

Arguments

doc.file

a file to parse, either XML or PDF

ecoextract

file path to the ecoextract.py script

results.dir

optional, directory to store the results as a rds file. If not specified, no results will be saved. If the directory does not currently exist, it will be created.

results.file

optional, file name to use for the results. Defaults to the basename of doc.file with a trailing "xml" replaced by "rds", placed in results.dir.

cache.dir

optional, directory in which to cache the intermediate text results from ReadPDF::getSectionText. If not specified, no caching is performed.

cache.file

optional, file name to use for the cached section text

section.text

a list, with one element per section to be processed
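
The defaults for results.file and cache.file in the Usage signature can be reproduced in base R. The helper name default_out_file and the file paths below are hypothetical, used only for illustration. The sketch suggests two things worth knowing: a PDF input keeps its pdf extension, because only a trailing "xml" is replaced, and an unspecified directory yields an empty path.

```r
# Hypothetical helper mirroring the default expressions for
# results.file and cache.file in the Usage signature.
default_out_file <- function(doc.file, dir = character()) {
  file.path(dir, gsub("xml$", "rds", basename(doc.file)))
}

# XML input: the trailing "xml" becomes "rds".
xml_out <- default_out_file("papers/smith2019.xml", "results")

# PDF input: the pattern "xml$" does not match, so the name is unchanged.
pdf_out <- default_out_file("papers/smith2019.pdf", "results")

# No directory given: file.path() returns a zero-length character vector.
no_dir <- default_out_file("papers/smith2019.xml")

print(xml_out)
print(pdf_out)
print(length(no_dir))
```

If an rds extension is wanted for PDF input, passing results.file or cache.file explicitly appears to be the safer choice.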

Details

This function runs the term extractor (based on EpiTator, https://github.com/ecohealthalliance/EpiTator) on a document. The document can be either XML generated by pdftohtml or a PDF, which is converted to XML internally. Alternatively, the raw text can be supplied directly via section.text. The results and the intermediate text, split by section, can optionally be saved.
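
A minimal sketch of saving both the results and the cached section text under one output directory. The wrapper name extract_to_dir, the "results" and "cache" subdirectory names, and the assumption that doc2keywords is exported from the SpilloverDA package are all illustrative, not part of the documented interface.

```r
# Hypothetical wrapper. The default `extractor` assumes doc2keywords is
# exported from SpilloverDA; it is only looked up when the wrapper is
# called without an explicit `extractor`.
extract_to_dir <- function(doc.file, out.dir,
                           extractor = SpilloverDA::doc2keywords) {
  results.dir <- file.path(out.dir, "results")
  cache.dir <- file.path(out.dir, "cache")
  # Only results.dir is documented as being created on demand,
  # so the cache directory is created explicitly here.
  dir.create(cache.dir, recursive = TRUE, showWarnings = FALSE)
  extractor(doc.file, results.dir = results.dir, cache.dir = cache.dir)
}
```

With the package installed, a call such as extract_to_dir("papers/smith2019.xml", "output") (a hypothetical path) would pass both directories to doc2keywords. Making the extractor an argument keeps the wrapper easy to exercise with a stub function.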

Value

a list with one element per section; each element contains the resolved keywords for that section, arranged as a nested list.

Author(s)

Matt Espe and Duncan Temple Lang

Examples

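# Supply raw text directly instead of a document file
txt = "This mentions China"
ans = doc2keywords(section.text = txt)
# Retrieve the locations found among the resolved keywords
getLocation(ans)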

dsidavis/SpilloverDA documentation built on June 1, 2019, 2:55 p.m.