estimate_performance | R Documentation |
This function estimates Sensitivity and Efficiency (the latter as "Work saved over random classification", WSoR) of the classification process (i.e., both the automatic and the human classification). A robust estimate of the total number of relevant (positive) records in the whole data set is produced to compute these statistics.
estimate_performance(
  records,
  model = NULL,
  preds = NULL,
  plot = TRUE,
  quants = getOption("baysren.probs", c(0.05, 0.5, 0.95)),
  nsamples = min(2500, sum(model$fit@sim$n_save)),
  seed = 23797297,
  save_preds = FALSE,
  save_model = FALSE
)
records |
An Annotation data set produced by |
model |
A |
preds |
A matrix of posterior predictions as produced by
|
plot |
Whether to plot the cumulative number of positive matches plus
the posterior predictive distribution as computed by |
quants |
Point estimate and boundaries of the posterior distributions to use in the results and in the plot. |
nsamples |
Number of samples to use to build the posterior distribution,
lower bounded at the number used to fit the |
seed |
A seed to reproduce the results. |
save_preds |
Whether to save the posterior prediction matrix. Can be
passed to |
save_model |
Whether to save the model. Can be passed to |
To estimate the total number of relevant records, estimate_positivity_rate_model()
is employed, which uses
a Bayesian logistic model to estimate the probability of a relevant record
given the lower boundaries of the PPD produced by the classification model
for the records whose label was manually reviewed. This model does not take
into account records' other characteristics, providing a simple, maximum
uncertainty model.
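The idea of the surrogate model can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it swaps the Bayesian logistic model for a maximum likelihood `glm()` fit (so parameter uncertainty is ignored), and the data, the variable names (`lower_bound`, `label`, `reviewed`) and the review rule are all hypothetical.

```r
set.seed(42)

# Hypothetical data: for each record, the lower boundary of the PPD from the
# classification model (as a probability) and its true relevance.
n <- 2000
lower_bound <- rbeta(n, 0.5, 5)
is_relevant <- rbinom(n, 1, lower_bound)

# Assumption: records with a high lower boundary were manually reviewed,
# plus a small random sample of the others.
reviewed <- lower_bound > 0.2 | runif(n) < 0.1

reviewed_df <- data.frame(
  lower_bound = lower_bound[reviewed],
  label = is_relevant[reviewed]
)
unreviewed_df <- data.frame(lower_bound = lower_bound[!reviewed])

# Surrogate model: relevance explained by the PPD lower boundary only.
# glm() is a frequentist stand-in for the Bayesian logistic model.
fit <- glm(label ~ lower_bound, family = binomial(), data = reviewed_df)

# Predicted probability of relevance for each unreviewed record.
p_unreviewed <- predict(fit, newdata = unreviewed_df, type = "response")

# Simulate the number of missed relevant records among the unreviewed ones
# and add the observed positives to get the predicted total.
missed <- replicate(2000, sum(rbinom(length(p_unreviewed), 1, p_unreviewed)))
pred_positives <- sum(reviewed_df$label) + missed

pred_quants <- quantile(pred_positives, probs = c(0.05, 0.5, 0.95))
print(pred_quants)
```

Because only the predicted labels of the unreviewed records are simulated here, the resulting intervals are narrower than those of a fully Bayesian model that also propagates the uncertainty in the coefficients.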
The model is used to predict the distribution of the number of missed relevant matches among the unreviewed records. This number is then used to compute the expected Sensitivity (i.e., the ratio of the observed positive matches to the theoretical ones) and Efficiency (i.e., one minus the ratio of the number of reviewed records to the number of records one would need to review at random to find the same number of relevant matches, according to the hypergeometric distribution).
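The two statistics can be illustrated with a few lines of base R. This is a minimal sketch with made-up numbers, not the function's actual code: the vector `pred_positives` stands in for posterior draws of the total number of relevant records, and the number of records needed at random is approximated by the mean of the negative hypergeometric distribution, which is an assumption of this example.

```r
# Hypothetical observed results of a classification session.
obs_positives <- 100    # relevant records found so far
n_reviewed <- 800       # records manually reviewed
total_records <- 10000  # records in the whole data set

# Hypothetical posterior draws of the total number of relevant records
# (observed positives plus predicted missed ones).
pred_positives <- c(100, 102, 105, 110, 120)

# Sensitivity: observed positives over the predicted total of positives.
sensitivity <- obs_positives / pred_positives

# Expected number of records to review at random (without replacement) to
# find obs_positives out of pred_positives relevant records: the mean of the
# negative hypergeometric distribution, r * (N + 1) / (K + 1).
n_needed_random <- obs_positives * (total_records + 1) / (pred_positives + 1)

# Proportion of work actually used (1 - WSoR) and Efficiency (WSoR).
used_prop <- n_reviewed / n_needed_random
efficiency <- 1 - used_prop

# Summarise the draws at the default quantiles.
quants <- c(0.05, 0.5, 0.95)
summary_stats <- data.frame(
  quantile = quants,
  sensitivity = unname(quantile(sensitivity, quants)),
  used_prop = unname(quantile(used_prop, quants)),
  efficiency = unname(quantile(efficiency, quants))
)
print(summary_stats)
```

Note that a larger predicted number of relevant records lowers both statistics: Sensitivity drops because more matches were missed, and Efficiency drops because random review would have found the same number of matches sooner.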
Finally, several summary statistics are reported, describing the observed results of the classification (i.e., number of reviewed records, number of positives found) and the statistics computed using the surrogate logistic model (i.e., Sensitivity, Efficiency and the R^2 of the surrogate model), including their uncertainty intervals.
Optionally, a plot is produced showing the observed cumulative number of positive matches plus its posterior predictive distribution according to the surrogate model.
A data frame with the following columns:
obs_positives |
the observed number of positive matches; |
pred_positives |
the quantiles of the predicted distribution of the number of positive matches. |
mod_r2 |
the surrogate model fit (R^2). |
n_reviewed |
the number of records reviewed. |
total_records |
the total number of records in the Annotation file; |
used_prop |
the posterior distribution of the proportion of reviewed records over the number needed with random classification (1 - WSoR). |
efficiency |
the posterior distribution of one minus the proportion of reviewed records over the number needed with random classification (WSoR). |
Sensitivity |
the posterior distribution of the Sensitivity computed over the predicted number of positives according to the surrogate model. |
## Not run:
annotation_file <- get_session_files("Session1")$Annotations %>% last()
analysis <- estimate_performance(annotation_file)
## End(Not run)