Package workflow

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(arete)

Data extraction

Let's say you want to extract data from a paper. Normally you'd run something that looks like this:

geotest = arete::get_geodata(
  path = file_path,
  user_key = list(key = "your key here!", premium = TRUE),
  model = "gpt-4o",
  outpath = "/your/path/here"
  )

As the extraction process depends on an internet connection and your own personal user key, this chunk won't run here; instead, we will open a CSV with pre-run results (but feel free to try it yourself!). get_geodata() generates one CSV file per PDF in its input parameter. In our example data we have already collected all CSVs into a single table.
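If you do run the extraction yourself, the per-PDF CSVs can be combined into a single table with base R. A minimal sketch, assuming the files sit in the illustrative outpath used above:

```r
# Gather every CSV that get_geodata() wrote to the (illustrative) output folder
# and stack them into one table.
files <- list.files("/your/path/here", pattern = "\\.csv$", full.names = TRUE)
geotest <- do.call(rbind, lapply(files, read.csv))
```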

geotest = arete::arete_data("holzapfelae-extract")

kableExtra::kable(geotest)

In this case we will be as careful as possible and handle outliers separately from get_geodata(). This is a good example of the limitations of the process: get_geodata() can automatically do the next step for you, but when coordinates are written in the text as latitude, longitude instead of longitude, latitude, some outlier detection methods (env, svm) will fail.
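One quick sanity check for swapped coordinates is to look for impossible latitudes. The snippet below is a toy sketch (the pts data frame and its column names are made up for illustration); note it only catches swaps where the misplaced value exceeds 90 in absolute value:

```r
# Toy data: the second row has longitude and latitude written in the wrong order.
pts <- data.frame(lon = c(170.5, -45.2), lat = c(-43.5, 120.3))

swapped <- abs(pts$lat) > 90                      # latitudes must lie in [-90, 90]
pts[swapped, c("lon", "lat")] <- pts[swapped, c("lat", "lon")]

pts  # row 2 is now lon = 120.3, lat = -45.2
```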

Process coordinates

Let's start by converting all of the coordinates from text to numeric values.

geocoords = string_to_coords(geotest$Coordinates)

kableExtra::kable(geocoords)

Process species names

Species names in human-extracted and model-extracted data will often not match, for example because humans used a species' abbreviated name rather than its full name. Additionally, models will sometimes erratically add characters that might go undetected, especially if OCR-extracted text was used. To get a good idea of model performance, it is therefore often important to standardize species names. Here is an example for paper 1 in our dataset:

geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
  )

mismatch = which(geonames$human_names != geonames$model_names)
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "red")

geonames

By using process_species_names() we standardize our species names, and matching records are correctly recognized as referring to the same species.

geotest$Species = process_species_names(geotest$Species)

geonames = data.frame(
  human_names = geotest[geotest$ID == 1 & geotest$Type == "Ground truth", "Species"],
  model_names = geotest[geotest$ID == 1 & geotest$Type == "Model", "Species"]
  )
geonames = kableExtra::kable(geonames)
geonames = kableExtra::row_spec(geonames, mismatch, color = "green")

geonames

Process outliers

Often it pays off to be suspicious of data generated automatically through machine learning (one could argue this is true of human-generated data as well). For this we'll use the utilities in the package gecko, which arete calls. In order for it to work, gecko needs to be set up, which we recommend you do after reading the documentation of the functions gecko::gecko.setDir() and gecko::gecko.worldclim(). Setup requires a one-time, potentially heavy download of an environmental dataset, WorldClim.

The function gecko::outliers.detect() will use this data to determine which points are likely outliers through different methods, including calculating the environmental and geographic distance between points and training a support vector machine model on the supplied data. The outcomes of these methods are collected in separate columns, and the total number of methods flagging a given point as an outlier is shown in the column possible.outliers. We then have:

geoout = gecko::outliers.detect(geocoords[2:1])

kableExtra::kable(geoout)
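A common follow-up is to set aside points flagged by more than one method for manual inspection. A minimal sketch building on the geoout table above (the threshold of 2 is an arbitrary choice for illustration, not an arete default):

```r
# Keep points that at least two detection methods agree are suspicious.
suspect <- geoout[geoout$possible.outliers >= 2, ]
suspect
```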

Create performance reports

Finally, we can determine how our model performed by processing all of our data through the function performance_report(). This function takes two tables with identical formatting, one of human-extracted data and one of model-extracted data, and computes a series of metrics that help show where mistakes might be found.

geotest = cbind(geotest[,1:2], geocoords, geotest[,4:5])

geotest = list(
  GT = geotest[geotest$Type == "Ground truth", 1:5],
  MD = geotest[geotest$Type == "Model", 1:5]
)

geo_report = performance_report(geotest$GT, geotest$MD, full_locations = "both", verbose = FALSE, rmds = FALSE)

For locations, the Levenshtein distance is calculated between terms. For coordinates, one confusion matrix is created for every species shared between the sets. These are composed of True Positives (TP, coordinates matching perfectly between both tables), False Positives (FP, coordinates appearing only in the model-extracted data) and False Negatives (FN, coordinates appearing only in the human-extracted data). True Negatives are assumed not to apply. Several metrics are then calculated from the confusion matrix, including accuracy, precision, recall and the F1 score; the details can be found in the documentation of performance_report(). An additional global confusion matrix is created which also includes errors (FP and FN) resulting from species unique to each set. More metrics appear in the extended reports created with rmds = TRUE, including versions of the metrics already mentioned that are weighted by the degree of error shown: if the model hallucinates a data point close to existing points, its weight as a False Positive is lower than if it hallucinated a point completely different from all other points.

geo_report
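To make the confusion matrix metrics concrete, here is a toy computation with made-up counts (the numbers are purely illustrative; the formulas follow the standard definitions, with no True Negative term):

```r
# Hypothetical counts for one species: 8 perfect matches, 2 model-only points,
# 1 human-only point.
TP <- 8; FP <- 2; FN <- 1

precision <- TP / (TP + FP)                                 # 0.8
recall    <- TP / (TP + FN)                                 # ~0.889
f1        <- 2 * precision * recall / (precision + recall)  # ~0.842
accuracy  <- TP / (TP + FP + FN)                            # ~0.727, no TN term
```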



arete documentation built on Nov. 5, 2025, 6:31 p.m.