ARETE_package: Summary of methods in the arete package

arete_package    R Documentation

Summary of methods in the arete package

Description

Package arete provides Automated REtrieval of species data from TExt. It processes user-supplied text, splits it into API requests to LLM services, and passes the output through a variety of machine learning and rule-based methods to deliver species data in a machine-readable format, ready for analysis. For a short and sweet use case of arete, see the vignettes listed by vignette(package = "arete"). In broad terms, functions in arete fall into six categories:

———————————————————————————————————————

1. Setting up arete

arete_setup                 Define a default virtual environment and install external dependencies
install_python_packages     Install or update Python dependencies after setup
install_OCR_packages        Install or update the dependencies for the optional OCR utilities
--------------------------- ------------------------------------------------------------------------------------------

2. Prepare annotation data

labels                      Extract the labels and relations in a WebAnno TSV 3.3 file to a simple, machine-readable format ready for machine learning projects
labels_unique               Extract all unique labels and relations in a WebAnno TSV 3.3 file
webanno_open                Read the contents of a WebAnno TSV 3.3 file and create a webanno object, a format for annotated text containing named entities and relations
webanno_summary             Summarize the contents of a group of WebAnno TSV files by counting the labels and relations present
create_training_data        Open files with text and annotated data and build training data for large language models in a variety of formats
file_comparison             Detect differences between two WebAnno files of the same text, for annotation monitoring
--------------------------- ------------------------------------------------------------------------------------------

3. Prepare text data

process_document            Extract text embedded in a .pdf or .txt file and process it so it can be safely sent to LLM APIs
OCR_document                Optional OCR utilities based on tesseract and nougatOCR
check_lang                  Check whether a given string is mostly in English (75% of the document)
--------------------------- ------------------------------------------------------------------------------------------

4. Clean data

string_to_coords            Rule-based conversion of character strings containing geographic coordinates to sets of numeric values
process_species_names       Standardize species names and fix typos and mistakes that commonly occur due to OCR
--------------------------- ------------------------------------------------------------------------------------------
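
As an illustration of what rule-based coordinate conversion involves, the sketch below turns a degrees-minutes-seconds string into decimal degrees. It is not the implementation behind string_to_coords, whose rules and return format are not described on this page; dms_to_decimal is a hypothetical helper written only for this example, and it assumes one coordinate per string with the hemisphere letter at the end.

```r
# Hypothetical helper, NOT part of arete: convert a single coordinate string
# (degrees, optional minutes and seconds, optional trailing N/S/E/W) to
# decimal degrees.
dms_to_decimal <- function(x) {
  s <- trimws(x)
  # Extract every number in the string, in order of appearance.
  nums <- as.numeric(regmatches(s, gregexpr("[0-9]+(\\.[0-9]+)?", s))[[1]])
  if (length(nums) == 0 || length(nums) > 3) return(NA_real_)
  # Treat missing minutes and seconds as zero.
  nums <- c(nums, 0, 0)[1:3]
  value <- nums[1] + nums[2] / 60 + nums[3] / 3600
  # South and west are negative, as is an explicit leading minus sign.
  negative <- grepl("[SW]$", toupper(s)) || startsWith(s, "-")
  if (negative) -value else value
}

# "\u00b0" is the degree sign, written as an escape to keep the source ASCII.
round(dms_to_decimal("38\u00b042'30\"N"), 4)  # 38.7083
round(dms_to_decimal("9\u00b008'W"), 4)       # -9.1333
```

Real-world strings are messier (mixed separators, OCR errors, latitude and longitude in one string), which is why a dedicated function is worth having.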

5. Fine-tune and extract data

get_geodata                 Call a Large Language Model (LLM) to extract species geographic data
gazetteer                   Extract geographic coordinates from strings containing location names, using an online index
--------------------------- ------------------------------------------------------------------------------------------

6. Evaluate model performance

performance_report          Produce a detailed report on the discrepancies between data extracted by an LLM and human-annotated data
compare_IUCN                Calculate the extent of occurrence (EOO) for two sets of coordinates, for a practical assessment of data proximity
--------------------------- ------------------------------------------------------------------------------------------
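
For readers unfamiliar with EOO: under the IUCN definition it is the area of the minimum convex polygon enclosing all occurrence points. The sketch below shows the idea using base R only. It is not how compare_IUCN is implemented; hull_area is a hypothetical helper, and it assumes the coordinates are already planar (for example, metres in a projected coordinate system) rather than longitude and latitude.

```r
# Hypothetical helper, NOT part of arete: area of the convex hull of a set of
# points given as planar x/y coordinates.
hull_area <- function(x, y) {
  idx <- grDevices::chull(x, y)   # indices of the hull vertices, in order
  if (length(idx) < 3) return(0)  # fewer than 3 vertices enclose no area
  hx <- x[idx]
  hy <- y[idx]
  nxt <- c(2:length(idx), 1)      # next vertex for each vertex, wrapping around
  # Shoelace formula; abs() makes the result independent of vertex direction.
  abs(sum(hx * hy[nxt] - hx[nxt] * hy)) / 2
}

# Two made-up point sets, e.g. human-annotated vs. model-extracted records.
annotated <- hull_area(c(0, 10, 10, 0, 5), c(0, 0, 10, 10, 5))  # 100
extracted <- hull_area(c(0, 10, 10, 0),    c(0, 0, 5, 5))       # 50
extracted / annotated                                           # 0.5
```

Comparing the two areas gives a rough, practical sense of how far the extracted records are from the annotated ones, which is the kind of assessment the table entry describes.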

Contributors

The methods and functions in this package were written by Vasco Branco, with code contributions by Vaughn Shirey and Thomas Merrien. Code revision by Pedro Cardoso.


arete documentation built on Nov. 5, 2025, 6:31 p.m.