ARETE_package: Summary of methods in the arete package
In arete: Automated REtrieval from TExt

arete_package

R Documentation


Description

Package arete aims to provide easy Automated REtrieval of species data from TExt. To do this, it processes user-supplied text, breaks it into API requests to LLM services, and processes the output through a variety of machine learning and rule-based methods to deliver species data in a machine-readable format, ready for analysis. For a short and sweet use case of arete, try our vignette(package = "arete").

In broad terms, functions in arete can be placed in 6 different categories:

———————————————————————————————————————

1. Setting up arete

`arete_setup`	Define a default virtual environment and install external dependencies
`install_python_packages`	Install or update python dependencies after setup
`install_OCR_packages`	Install or update the dependencies for our optional OCR utilities
---------------------------	------------------------------------------------------------------------------------------

2. Prepare annotation data

`labels`	Extract the labels and relations in a WebAnno TSV 3.3 file to an easy, machine-readable format ready for machine learning projects
`labels_unique`	Extract all unique labels and relations in a WebAnno TSV 3.3 file
`webanno_open`	Read the contents of a WebAnno TSV 3.3 file and create a `webanno` object, a format for annotated text containing named entities and relations
`webanno_summary`	Summarize the contents of a group of WebAnno TSV files by counting the labels and relations present
`create_training_data`	Open files with text and annotated data and build training data for large language models in a variety of formats
`file_comparison`	Detect differences between two WebAnno files of the same text for annotation monitoring
---------------------------	------------------------------------------------------------------------------------------

3. Prepare text data

`process_document`	Extract text embedded in a `.pdf` or `.txt` file and process it so it can be safely sent to LLM APIs
`OCR_document`	Optional OCR utilities based on `tesseract` and `nougatOCR`
`check_lang`	Check if a given string is mostly (75% of the document) in English
---------------------------	------------------------------------------------------------------------------------------

4. Clean data

`string_to_coords`	Rule-based conversion of character strings containing geographic coordinates to sets of numeric values
`process_species_names`	Standardize species names and fix common typos and mistakes introduced by OCR
---------------------------	------------------------------------------------------------------------------------------
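
To make the idea of rule-based coordinate conversion concrete, here is a minimal sketch in base R. It does not reproduce `string_to_coords` or its interface; `dms_to_decimal` is a hypothetical helper written for illustration, and it assumes a single coordinate in degrees, minutes and seconds with an optional trailing hemisphere letter.

```r
# Hypothetical helper, NOT part of arete: convert one degrees-minutes-seconds
# string to decimal degrees using simple rules.
dms_to_decimal <- function(x) {
  # Rule 1: collect every number in order (degrees, minutes, seconds)
  nums <- as.numeric(regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?", x))[[1]])
  if (length(nums) == 0 || length(nums) > 3) {
    stop("expected 1 to 3 numeric components in: ", x)
  }
  # Rule 2: missing minutes or seconds count as zero
  nums <- c(nums, rep(0, 3 - length(nums)))
  decimal <- nums[1] + nums[2] / 60 + nums[3] / 3600
  # Rule 3: a trailing S or W marks the southern or western hemisphere
  if (grepl("[SsWw][[:space:]]*$", x)) decimal <- -decimal
  decimal
}

lat <- dms_to_decimal("38°42'30\"N")  # degrees, minutes and seconds
lon <- dms_to_decimal("9°08'W")       # degrees and minutes only
c(lat = lat, lon = lon)
```

Real coordinate strings are far messier (decimal minutes, leading hemisphere letters, OCR noise), which is why a dedicated function is preferable to a three-rule sketch like this one.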

5. Fine-tune and extract data

`get_geodata`	Call a Large Language Model (LLM) to extract species geographic data
`gazetteer`	Extract geographic coordinates from strings containing location names, using an online index
---------------------------	------------------------------------------------------------------------------------------

6. Evaluate model performance

`performance_report`	Produce a detailed report on the discrepancies between data extracted by an LLM and human-annotated data
`compare_IUCN`	Calculate the extent of occurrence (EOO) for two sets of coordinates for a practical assessment of data proximity
---------------------------	------------------------------------------------------------------------------------------
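
As a rough illustration of comparing two coordinate sets through their extent of occurrence, the sketch below computes the area of the minimum convex polygon around each set in base R. It is not the implementation of `compare_IUCN`: `eoo_planar` is a hypothetical helper, the toy data are invented, and the coordinates are assumed to be already projected onto a plane (for example, in kilometres), so no geodesic correction is applied.

```r
# Hypothetical helper, NOT part of arete: planar area of the minimum
# convex polygon around a set of points.
eoo_planar <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 3)
  hull <- grDevices::chull(x, y)  # indices of the hull vertices, in order
  hx <- x[hull]
  hy <- y[hull]
  # Pair each vertex with the next one, wrapping around to the first
  nx <- c(hx[-1], hx[1])
  ny <- c(hy[-1], hy[1])
  abs(sum(hx * ny - nx * hy)) / 2  # shoelace formula
}

# Invented toy data: human-annotated points versus model-extracted points
human <- data.frame(x = c(0, 10, 10, 0, 5), y = c(0, 0, 10, 10, 5))
model <- data.frame(x = c(0, 10, 10, 0), y = c(0, 0, 8, 8))

eoo_human <- eoo_planar(human$x, human$y)
eoo_model <- eoo_planar(model$x, model$y)
eoo_model / eoo_human  # one possible summary: ratio of the two areas
```

A ratio close to 1 suggests the extracted records span a similar area to the annotated ones, although two sets can share an area without sharing locations, so this is only a coarse check.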

Contributors

The methods and functions in this package were written by Vasco Branco, with code contributions by Vaughn Shirey and Thomas Merrien. Code revision by Pedro Cardoso.

arete documentation built on Nov. 5, 2025, 6:31 p.m.