estimate_performance: Evaluate and report classification performance given an Annotation file

View source: R/Analysis.R


Evaluate and report classification performance given an Annotation file

Description

This function estimates the Sensitivity and Efficiency (the latter as "Work saved over random classification", WSoR) of the classification process (i.e., both the automatic classification and the manual review). A robust estimate of the total number of relevant (positive) records in the whole data set is produced to compute these statistics.

Usage

estimate_performance(
  records,
  model = NULL,
  preds = NULL,
  plot = TRUE,
  quants = getOption("baysren.probs", c(0.05, 0.5, 0.95)),
  nsamples = min(2500, sum(model$fit@sim$n_save)),
  seed = 23797297,
  save_preds = FALSE,
  save_model = FALSE
)

Arguments

records

An Annotation data set produced by enrich_annotation_file() or a file path to it.

model

A brms model built using estimate_positivity_rate_model(). If NULL, the model is created from records.

preds

A matrix of posterior predictions as produced by brms::posterior_predict(). If supplied, the predictions must be derived from the same model passed in model.

plot

Whether to plot the cumulative number of observed positive matches together with the posterior predictive distribution computed by model, truncated at the observed number of positive matches.

quants

Quantiles defining the point estimate and the boundaries of the posterior distributions, used in the results and in the plot.

nsamples

Number of samples used to build the posterior distribution, capped at the number of draws saved in the fitted model.

seed

A seed to reproduce the results.

save_preds

Whether to save the posterior prediction matrix, which can later be passed as preds.

save_model

Whether to save the model, which can later be passed as model.

Details

For this purpose, estimate_positivity_rate_model() is employed; it uses a Bayesian logistic model to estimate the probability that a record is relevant given the lower boundary of the posterior predictive distribution (PPD) produced by the classification model, fitted on the records whose label was manually reviewed. This model does not take into account other record characteristics, providing a simple, maximum-uncertainty model.
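As a rough illustration of the shape of this surrogate model (the package fits it as a Bayesian model via brms through estimate_positivity_rate_model(); a classical glm() on simulated data is used here as a simple non-Bayesian analogue, and all variable names are hypothetical):

```r
# Illustrative sketch only, NOT package code: a logistic regression of the
# manually reviewed label on the lower boundary of each record's PPD.
set.seed(42)

n <- 200
ppd_lower <- runif(n)                    # hypothetical lower PPD boundary per reviewed record
prob_rel  <- plogis(-4 + 6 * ppd_lower)  # simulated true probability of relevance
relevant  <- rbinom(n, 1, prob_rel)      # simulated manual review labels (0/1)

# Non-Bayesian analogue of the surrogate positivity-rate model.
surrogate <- glm(relevant ~ ppd_lower, family = binomial())

# Predicted probability of relevance for an unreviewed record with a
# given lower PPD boundary.
p_new <- predict(surrogate, newdata = data.frame(ppd_lower = 0.8),
                 type = "response")
```

In the package, the Bayesian version additionally yields a full posterior over these probabilities, which is what propagates uncertainty into the Sensitivity and Efficiency estimates below.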

The model is used to predict the distribution of the number of relevant matches missed among the unreviewed records. This number is then used to compute the expected Sensitivity (i.e., the ratio of the observed positive matches to the theoretical total) and Efficiency (i.e., one minus the ratio of the number of reviewed records to the number that would need to be reviewed at random to find the same number of relevant matches, according to the hypergeometric distribution).
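As a plain-number illustration of these two definitions (hypothetical values, not package code; the package works with full posterior distributions rather than point estimates), the computations could be sketched as:

```r
# Hypothetical numbers, for illustration only.
N          <- 1000  # total records in the Annotation file
n_reviewed <- 250   # records manually reviewed
obs_pos    <- 95    # positive matches found among the reviewed records
pred_pos   <- 100   # estimated theoretical total of positives

# Sensitivity: observed positives over the estimated theoretical total.
sensitivity <- obs_pos / pred_pos

# Expected number of random draws (without replacement) needed to find
# obs_pos positives out of pred_pos in N records: the mean of the
# negative hypergeometric distribution, k * (N + 1) / (K + 1).
exp_random_reviews <- obs_pos * (N + 1) / (pred_pos + 1)

# Efficiency (WSoR): one minus the proportion of reviewed records over
# the number needed under random screening.
efficiency <- 1 - n_reviewed / exp_random_reviews
```

With these numbers, reviewing 250 records instead of the roughly 942 a random screening would require yields an Efficiency of about 0.73.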

Finally, several summary statistics are reported, describing the observed results of the classification (i.e., number of reviewed records, number of positives found) and the statistics computed using the surrogate logistic model (i.e., Sensitivity, Efficiency and the R^2 of the surrogate model), including their uncertainty intervals.

Optionally, a plot is produced showing the observed cumulative number of positive matches together with its posterior predictive distribution according to the surrogate model.

Value

A data frame with the following columns:

obs_positives

the observed number of positive matches.

pred_positives

the quantiles of the predicted distribution of the number of positive matches.

mod_r2

the surrogate model fit (R^2).

n_reviewed

the number of records reviewed.

total_records

the total number of records in the Annotation file.

used_prop

the posterior distribution of the proportion of reviewed records over the number needed with random classification (1 - WSoR).

efficiency

the posterior distribution of one minus the proportion of reviewed records over the number needed with random classification (WSoR).

Sensitivity

the posterior distribution of the Sensitivity, computed over the predicted number of positives according to the surrogate model.

Examples

## Not run: 

annotation_file <- get_session_files("Session1")$Annotations %>% last()

analysis <- estimate_performance(annotation_file)

## End(Not run)


bakaburg1/BaySREn documentation built on March 30, 2022, 12:16 a.m.