| arete_package | R Documentation |
Package arete seeks to provide easy Automated REtrieval of species data from TExt. To do this, it processes user-supplied text, breaks it into API requests to LLM services, and processes the output
through a variety of machine learning and rule-based methods to deliver species data in a machine-readable format, ready for analysis. For a short and sweet use case of arete, try our vignette(package = "arete").
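The step of breaking text into API requests can be pictured with a minimal sketch. The source does not describe how arete splits documents, so the function name `chunk_text`, the `max_chars` limit, and the whitespace-based splitting below are all assumptions made for illustration, written in Python rather than taken from the package's own code.

```python
def chunk_text(text, max_chars=2000):
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper, not part of arete. Breaks only on whitespace,
    so a single word longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        # Try to append the next word to the chunk being built.
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            # The chunk is full: store it and start a new one.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("a b c d e", max_chars=3)
```

Each element of `chunks` would then be sent as one request; a real pipeline would more likely split on sentence or paragraph boundaries so that species names and localities are not separated.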
In broad terms, functions in arete can be placed in 6 different categories:
| --------------------------- | ------------------------------------------------------------------------------------------ |
arete_setup | Define a default virtual environment and install external dependencies |
install_python_packages | Install or update python dependencies after setup |
install_OCR_packages | Install or update the dependencies for our optional OCR utilities |
| --------------------------- | ------------------------------------------------------------------------------------------ |
labels | Extract the labels and relations in a WebAnno TSV 3.3 file to an easy, machine-readable format ready for machine learning projects |
labels_unique | Extract all unique labels and relations in a WebAnno TSV 3.3 file |
webanno_open | Read the contents of a WebAnno TSV 3.3 file and create a webanno object, a format for annotated text containing named entities and relations |
webanno_summary | Summarize the contents of a group of WebAnno TSV files by counting the labels and relations present |
create_training_data | Open files with text and annotated data and build training data for large language models in a variety of formats |
file_comparison | Detect differences between two WebAnno files of the same text for annotation monitoring |
| --------------------------- | ------------------------------------------------------------------------------------------ |
process_document | Extract text embedded in a .pdf or .txt file and process it so it can be safely used with LLM APIs |
OCR_document | Optional utilities based on tesseract and nougatOCR |
check_lang | Check if a given string is mostly (75% of the document) in English |
| --------------------------- | ------------------------------------------------------------------------------------------ |
string_to_coords | Rule-based conversion of character strings containing geographic coordinates to sets of numeric values |
process_species_names | Standardize species names and fix common typos and mistakes introduced by OCR |
| --------------------------- | ------------------------------------------------------------------------------------------ |
get_geodata | Call a Large Language Model (LLM) to extract species geographic data |
gazetteer | Extract geographic coordinates from strings containing location names, using an online index |
| --------------------------- | ------------------------------------------------------------------------------------------ |
performance_report | Produce a detailed report on the discrepancies between data extracted by an LLM and human-annotated data |
compare_IUCN | Calculate the extent of occurrence (EOO) for two sets of coordinates for a practical assessment of data proximity |
| --------------------------- | ------------------------------------------------------------------------------------------ |
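To make the idea of a rule-based conversion of coordinate strings concrete, here is a small sketch of one such rule: turning degrees-minutes-seconds text into signed decimal degrees. It is not the implementation behind string_to_coords, whose rules and signature are not given here; the name `dms_to_decimal` and the regular expression are assumptions, and the example is in Python rather than R.

```python
import re

# Hypothetical pattern, not taken from arete: degrees, optional minutes,
# optional seconds, then a hemisphere letter (N, S, E or W).
_DMS = re.compile(
    r"""(?P<deg>\d{1,3})\s*°\s*
        (?:(?P<min>\d{1,2}(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW])""",
    re.VERBOSE,
)


def dms_to_decimal(text):
    """Return every coordinate found in text as signed decimal degrees."""
    values = []
    for m in _DMS.finditer(text):
        deg = float(m.group("deg"))
        minutes = float(m.group("min") or 0)
        seconds = float(m.group("sec") or 0)
        value = deg + minutes / 60 + seconds / 3600
        # Southern and western hemispheres are negative by convention.
        if m.group("hem") in "SW":
            value = -value
        values.append(value)
    return values


coords = dms_to_decimal("38°42'30\"N, 9°08'15\"W")
```

Here `coords` holds a latitude of about 38.7083 and a longitude of exactly -9.1375. Real-world strings vary far more (decimal minutes without symbols, hemisphere letters in front, OCR-damaged degree signs), which is why a rule-based converter usually needs many such patterns.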
The methods and functions in this package were written by Vasco Branco, with code contributions by Vaughn Shirey and Thomas Merrien. Code revision by Pedro Cardoso.