View source: R/06_performance_report.R
performance_report: R Documentation
Produce a detailed report on the discrepancies between LLM-extracted data and human-annotated data for the same collection of files.
performance_report(
  human_data,
  model_data,
  full_locations = "coordinates",
  string_distance = "levenshtein",
  verbose = TRUE,
  rmds = TRUE,
  path = NULL
)
human_data
matrix. Ground-truth dataset against which to compare the data extracted by an LLM.

model_data
matrix. Dataset of location data, following the description under

full_locations
character. Defines the dataset structure. If

string_distance
character. Selects the method used to calculate the proximity between two strings, from those available under

verbose
logical. Determines whether output should be printed.

rmds
logical. Determines whether more extensive R Markdown files should be created at

path
character. Directory to which the output of the function is saved.
Four main metrics are calculated to report on the model's performance for coordinates:
Accuracy, \frac{TP}{TP + FP + FN}, defined this way here because the system has no True Negatives.
Recall, \frac{TP}{TP + FN}, Kent et al. (1955).
Precision, \frac{TP}{TP + FP}, Kent et al. (1955).
F1 score, \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}, van Rijsbergen (1979).
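Under the definitions above, all four metrics follow from the raw confusion-matrix counts. A minimal sketch in R (the helper name `coord_metrics` is illustrative, not part of arete):

```r
# Illustrative helper (not arete's internal code): the four main
# metrics from raw counts, in a system without True Negatives.
coord_metrics <- function(TP, FP, FN) {
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  c(
    accuracy  = TP / (TP + FP + FN),
    recall    = recall,
    precision = precision,
    f1        = 2 / (1 / precision + 1 / recall)
  )
}

coord_metrics(TP = 8, FP = 2, FN = 2)
# accuracy = 8/12; recall, precision and f1 all equal 0.8
```

Note that when precision and recall are equal, the F1 score (their harmonic mean) equals both, as in the example above.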
Additional metrics are also calculated. These include a distance-weighted confusion matrix, in which each type of error (False Negatives and False Positives) is summed using weights derived from the mean Euclidean distance of that data point to all others. This way, errors that are close to existing data for that species count less than those farther away, e.g. a hallucinated data point that was close to existing data, or a missed data point that is already represented in the data. This adjusted confusion matrix is presented alongside versions of the four main metrics calculated from the weighted values.

To report on the performance of locations, by default the minimum Levenshtein distance (Levenshtein, 1966) between a term and all other terms is calculated, which is defined as:
lev(a, b) = \begin{cases}
  |a| & \text{if } |b| = 0, \\
  |b| & \text{if } |a| = 0, \\
  lev(tail(a), tail(b)) & \text{if } head(a) = head(b), \\
  1 + \min \begin{cases}
    lev(tail(a), b) \\
    lev(a, tail(b)) \\
    lev(tail(a), tail(b))
  \end{cases} & \text{otherwise.}
\end{cases}
In short, the minimum number of single-character edits needed to turn string a into string b.
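The recursion above translates directly into R. This snippet is illustrative only (it is not arete's implementation, and the naive recursion is exponential without memoisation); base R's adist() computes the same quantity and serves as a check:

```r
# Direct transcription of the recursive Levenshtein definition
# (illustrative only; fine for short strings).
lev <- function(a, b) {
  if (nchar(b) == 0) return(nchar(a))
  if (nchar(a) == 0) return(nchar(b))
  tail_a <- substring(a, 2)
  tail_b <- substring(b, 2)
  if (substring(a, 1, 1) == substring(b, 1, 1)) return(lev(tail_a, tail_b))
  1 + min(lev(tail_a, b), lev(a, tail_b), lev(tail_a, tail_b))
}

lev("kitten", "sitting")    # 3 edits (k->s, e->i, insert g)
adist("kitten", "sitting")  # base R agrees: 3
```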
list. A confusion matrix is returned for every species per document, plus one for the entire process.
Kent, A. et al. (1955). "Machine literature searching VIII. Operational criteria for designing information retrieval systems", American Documentation, 6(2), pp. 93–101. doi:10.1002/asi.5090060209.
van Rijsbergen, C.J. (1979). "Information Retrieval", Architectural Press. ISBN: 978-0408709293.
Levenshtein, V.I. (1966). "Binary codes capable of correcting deletions, insertions, and reversals", Soviet Physics-Doklady, 10(8), pp. 707–710 [Translated from Russian].
trial_data <- arete::arete_data("holzapfelae-extract")
# Convert the coordinate strings in column 3 to numeric columns
trial_data <- cbind(trial_data[, 1:2],
                    arete::string_to_coords(trial_data[, 3])[2:1],
                    trial_data[, 4:5])
# Split into the human-annotated (ground truth) and model-extracted subsets
trial_data <- list(
  GT = trial_data[trial_data$Type == "Ground truth", 1:5],
  MD = trial_data[trial_data$Type == "Model", 1:5]
)
# make sure you run arete_setup() beforehand!
performance_report(
  trial_data$GT,
  trial_data$MD,
  full_locations = "both",
  verbose = FALSE,
  rmds = FALSE
)