In da-zar/ROCket: Simple and Fast ROC Curves

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)

library(ROCket)
library(data.table)

n <- 100000
data <- data.table(score = c(rnorm(n), rnorm(n, 1)), pos = rep(0:1, each = n))[sample(2*n)]
data_agg <- data[, .(totals = .N, positives = sum(pos)), by = .(score = round(score, 0))]

ROCket

Intro

ROCket was primarily build for ROC curve estimation in the presence of aggregated data. Nevertheless, it can also handle raw samples. In general, aggregating data can be very beneficial when dealing with large datasets. Whenever a dataset consists of categorical or discretized continuous features, the data size can be effectively reduced by calculating sufficient statistics for each constellation of feature values. This saves memory and reduces the time needed to train classification models. Also model accuracy assessment can be done on aggregated data, which yields similar benefits. To this end, ROCket provides functions for ROC curve estimation and AUC calculation.

Installation

# From CRAN
install.packages("ROCket")

# From GitHub
# install.packages("devtools")
devtools::install_github("da-zar/ROCket")

Example

Data preparation

The easiest way to get started is to prepare a dataset containing all distinct predicted score values together with their count and the number of positive cases. Your dataset could look like this:

nrow(data_agg)
head(data_agg)

You can now pass this data to the rkt_prep function in order to create an object that will be later used for estimating ROC curves (possibly with several different algorithms).

prep_data_agg <- rkt_prep(
  scores = data_agg$score, 
  positives = data_agg$positives, 
  totals = data_agg$totals
)

It is not necessary to use an aggregated dataset. It's also possible to have each single observation in a separate row -- in this case the positives argument is the regular indicator (a numeric vector is required) for positive observations and the totals argument is not needed anymore (default is 1).

You can print the object, to get some information about the content, or plot it:

prep_data_agg
plot(prep_data_agg)

ROC curves

Estimates of the ROC curve can be calculated with the rkt_roc function. It takes two arguments. The first one is the rkt_prep object, which contains all the needed data, and the second one is an integer saying which method of estimation should be used. A list of implemented methods can be retrieved with the show_methods function.

show_methods()

In ROCket, we distinguish two types of ROC curve representations:

parametric curves -- TPR and FPR are functions of a parameter (the score),
functions -- TPR is a function of FPR.

In the first case we estimate two functions, the x and y coordinates of the ROC curve (FPR, TPR). These two functions are returned as a list. In the second case the output is a regular function.

Let's now calculate estimates of the ROC curve using all available methods.

roc_list <- list()
for (i in 1:4){
  roc_list[[i]] <- rkt_roc(prep_data_agg, method = i)
}

The output of rkt_roc can be used to plot the ROC curve estimates and calculate the AUC.

par(mfrow = c(2, 2))
for (i in 1:4){
  plot(
    roc_list[[i]], 
    main = show_methods()[i, desc], 
    sub = sprintf('AUC: %f', auc(roc_list[[i]]))
  )
}