get_roc_stats: Generate ROC statistics

View source: R/roc.R

get_roc_stats    R Documentation

Generate ROC statistics

Description

Computes the statistics needed to draw a basic ROC (Receiver Operating Characteristic) curve: the confusion matrix counts and the resulting rates at each threshold, plus the area under the curve.

Usage

get_roc_stats(df, pred_col, label_col, direction = "<")

Arguments

df

a data.frame with at least two columns: one with prediction values and one with class labels, matched row by row. The next two parameters describe the values each column should hold.

pred_col

string. The name of the column of the df data.frame that holds the prediction values. These can be any numeric values (negative, positive or zero). Only their ranking matters, and the direction parameter specifies how that ranking is interpreted.

label_col

string. The name of the column of the df data.frame that holds the true (observed) class labels for the prediction values. This column must contain only the values 1 and 0, representing the positive and the negative class respectively.

direction

string. Either "<" (the default) or ">". Specifies how the prediction values rank with respect to the positive class at a given threshold: use "<" if smaller prediction values indicate the positive class, and ">" if larger prediction values indicate the positive class (e.g. when the predictions are probabilities of the positive class).
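
To illustrate what the direction setting means, here is a minimal sketch of labelling predictions at a single threshold. The helper classify_at_threshold is hypothetical and not part of usefun, and the strict comparison is an assumption: the package may handle values equal to the threshold differently.

```r
# Hypothetical helper (not part of usefun): label predictions as
# positive (1) or negative (0) at one threshold, given a direction.
# Assumption: strict comparison; ties may be handled differently in usefun.
classify_at_threshold <- function(pred, threshold, direction = "<") {
  stopifnot(direction %in% c("<", ">"))
  if (direction == "<") {
    as.integer(pred < threshold)  # smaller values indicate the positive class
  } else {
    as.integer(pred > threshold)  # larger values indicate the positive class
  }
}

scores <- c(-2.5, -0.3, 0, 1.2, 4)

classify_at_threshold(scores, threshold = 0.5, direction = "<")
# 1 1 1 0 0
classify_at_threshold(scores, threshold = 0.5, direction = ">")
# 0 0 0 1 1
```

The same scores yield opposite labels under the two directions, which is why the setting has to match how the predictions were produced.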

Value

A list with two elements:

  • roc_stats: a tibble with the thresholds of the ROC curve and the confusion matrix statistics for each threshold: TP (number of True Positives), FN (number of False Negatives), TN (number of True Negatives), FP (number of False Positives), FPR (False Positive Rate, the x-axis values of the ROC curve) and TPR (True Positive Rate, the y-axis values of the ROC curve). Also included are dist_from_chance (the vertical distance of the corresponding (FPR,TPR) point from the chance line, i.e. the positive diagonal) and dist_from_0_1 (the Euclidean distance of the corresponding (FPR,TPR) point from (0,1)).

  • AUC: a number representing the Area Under the (ROC) Curve.
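
As a rough illustration of how these quantities relate, the sketch below derives the rates and distances from made-up confusion matrix counts, following the definitions in the list above. The trapezoidal rule used for the AUC is an assumption chosen for illustration, and the package may compute the area differently.

```r
# Made-up confusion matrix counts at three thresholds
# (4 positives and 4 negatives in total).
roc <- data.frame(
  TP = c(0, 3, 4),
  FN = c(4, 1, 0),
  TN = c(4, 3, 0),
  FP = c(0, 1, 4)
)

# Rates: the y-axis and x-axis values of the ROC curve
roc$TPR <- roc$TP / (roc$TP + roc$FN)
roc$FPR <- roc$FP / (roc$FP + roc$TN)

# Vertical distance from the chance line (the diagonal TPR = FPR)
roc$dist_from_chance <- roc$TPR - roc$FPR

# Euclidean distance from the perfect-classification point (0,1)
roc$dist_from_0_1 <- sqrt(roc$FPR^2 + (1 - roc$TPR)^2)

# AUC by the trapezoidal rule (an assumption, see above)
ord <- order(roc$FPR, roc$TPR)
fpr <- roc$FPR[ord]
tpr <- roc$TPR[ord]
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
# 0.75
```

With these counts the middle threshold has the largest dist_from_chance (0.5) and the smallest dist_from_0_1 (about 0.354).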

The returned results provide an easy way to compute two optimal cutpoints (thresholds) that dichotomize the predictions into positive and negative. The first maximizes the Youden index: it is the point of the ROC curve with the maximum vertical distance from the chance line (the positive diagonal). The second is the point of the ROC curve closest to (0,1), the point of perfect discrimination. See the examples below.

Examples

# load libraries
library(usefun) # provides get_roc_stats
library(readr)
library(dplyr)

# load test tibble
test_file = system.file("extdata", "test_df.tsv", package = "usefun", mustWork = TRUE)
test_df = readr::read_tsv(test_file, col_types = "di")

# get ROC stats
res = get_roc_stats(df = test_df, pred_col = "score", label_col = "observed")

# Plot ROC with a legend showing the AUC value
plot(x = res$roc_stats$FPR, y = res$roc_stats$TPR,
  type = 'l', lwd = 2, col = '#377EB8', main = 'ROC curve',
  xlab = 'False Positive Rate (FPR)', ylab = 'True Positive Rate (TPR)')
legend('bottomright', legend = round(res$AUC, digits = 3),
  title = 'AUC', col = '#377EB8', pch = 19)
grid()
abline(a = 0, b = 1, col = '#FF726F', lty = 2)

# Get two possible cutoffs
youden_index_df = res$roc_stats %>%
  filter(dist_from_chance == max(dist_from_chance))
min_classification_df = res$roc_stats %>%
  filter(dist_from_0_1 == min(dist_from_0_1))


bblodfon/usefun documentation built on April 29, 2024, 12:36 p.m.