process_document: Extract and process text from a document
In arete: Automated REtrieval from TExt

process_document

R Documentation

Extract and process text from a document

This function extracts text embedded in a .pdf or .txt file and processes it so it can be safely used by LLM APIs.
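
The documentation does not say which processing steps are applied. As a rough illustration only, the sketch below shows one plausible kind of cleanup in base R: re-encoding to UTF-8, replacing control characters, and collapsing whitespace. The helper `clean_text_sketch` is hypothetical and is not part of arete; the package's actual processing may differ.

```r
# Hypothetical helper, NOT part of arete: an assumed example of text cleanup
# that makes extracted text safer to send to an LLM API.
clean_text_sketch <- function(text) {
  # Join all lines into a single string
  text <- paste(text, collapse = " ")
  # Re-encode from the current locale to UTF-8; unconvertible bytes become spaces
  text <- iconv(text, from = "", to = "UTF-8", sub = " ")
  # Replace control characters (tabs, form feeds, newlines) with spaces
  text <- gsub("[[:cntrl:]]", " ", text)
  # Collapse runs of whitespace, then trim both ends
  text <- gsub("[[:space:]]+", " ", text)
  trimws(text)
}

raw <- c("Tricholathys  spiralis\twas found", "", "  in  two sites.\f")
clean_text_sketch(raw)
```

The result is a single character string, matching the return type described for `process_document`.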

process_document(path, extra_measures = NULL)

`path`	character. Path to the desired .pdf or .txt file.
`extra_measures`	character. To be implemented. Some documents are especially difficult for LLMs to process because of issues such as size and formatting. `extra_measures` tries to improve performance by cropping the document to only the central passage mentioning a specific species. `"header"` and, by extension, `"both"` require an mmd file produced by nougatOCR.
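
Since `extra_measures` is marked as not yet implemented, the following is only a sketch of what cropping to a passage around a species mention could look like. The helper `crop_to_mention_sketch`, its `window` parameter, and the choice of the middle matching paragraph as the "central" mention are all assumptions made for illustration, not arete's behaviour.

```r
# Hypothetical helper, NOT part of arete: keeps the paragraph containing the
# middle mention of a species plus `window` paragraphs on either side.
crop_to_mention_sketch <- function(text, species, window = 1L) {
  # Split the text into paragraphs on blank lines
  paragraphs <- strsplit(text, "\n[[:space:]]*\n")[[1]]
  # Indices of paragraphs that mention the species (literal match)
  hits <- which(grepl(species, paragraphs, fixed = TRUE))
  # No mention found: return the text unchanged
  if (length(hits) == 0L) return(text)
  # Assumption: treat the middle hit as the "central" mention
  centre <- hits[ceiling(length(hits) / 2)]
  keep <- seq(max(1L, centre - window), min(length(paragraphs), centre + window))
  paste(paragraphs[keep], collapse = "\n\n")
}

doc <- paste(
  c("Introduction.", "Methods.",
    "Tricholathys spiralis was collected here.",
    "Discussion.", "References."),
  collapse = "\n\n"
)
crop_to_mention_sketch(doc, "Tricholathys spiralis")
```

Here the call keeps the "Methods.", species, and "Discussion." paragraphs and drops the rest; a text with no mention of the species is returned unchanged.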

character. Fully processed text.

path = arete_data("holzapfelae")
process_document(path)

extra_measures = list("mention", "Tricholathys spiralis")

arete documentation built on Nov. 5, 2025, 6:31 p.m.
