process_document: Extract and process text from a document

View source: R/03_process_document.R

process_document    R Documentation

Extract and process text from a document

Description

This function extracts the text embedded in a .pdf or .txt file and processes it so that it can be safely passed to LLM APIs.

Usage

process_document(path, extra_measures = NULL)

Arguments

path

character. Path to the .pdf or .txt file to process.

extra_measures

character. To be implemented. Some documents are especially difficult for LLMs to process because of issues such as size and formatting. extra_measures is intended to improve later performance by cropping the given document to only the central passage that mentions a specific species. "header" and, by extension, "both" require an mmd file produced by nougatOCR.

Value

character. Fully processed text.

Examples

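# Path to the example document "holzapfelae" returned by arete_data()
path <- arete_data("holzapfelae")
process_document(path)

# extra_measures is marked "To be implemented"; this value is defined but not passed on
extra_measures <- list("mention", "Tricholathys spiralis")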
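
The Value section documents a character result, so it can be useful to check the output before passing it to an LLM API. The following is a minimal sketch rather than part of the package documentation. It assumes the arete package is installed and uses only the two functions shown on this page; the variable name txt is illustrative.

```r
library(arete)  # assumption: the arete package is installed

# Example path returned by arete_data(), as in the example above
path <- arete_data("holzapfelae")

# Extract and process the text of the document
txt <- process_document(path)

# The documented return type is character; stop early if that does not hold
stopifnot(is.character(txt))

# Inspect the size of the result (nchar counts characters, not LLM tokens)
nchar(txt)
```

Checking the type first means a failed extraction surfaces before any API request is made.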

arete documentation built on Nov. 5, 2025, 6:31 p.m.