Applicability domain methods for binary data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(ggplot2)
theme_set(theme_bw())
# TODO
#- Mention different input data types: data.frame, recipes, matrix, etc.
#- Maybe make a (better) conclusion?
#- Explain the reason why the training set is diverse.

Introduction

library(applicable)

Similarity statistics can be used to compare data sets where all of the predictors are binary. One of the most common measures is the Jaccard index.

For a training set of size n, there are n similarity statistics for each new sample. These can be summarized via the mean statistic or a quantile. In general, we want similarity to be low within the training set (i.e., a diverse training set) and high for new samples to be predicted.

To analyze the Jaccard metric, applicable provides the following methods:

Example

The example data is from two QSAR data sets where binary fingerprints are used as predictors.

data(qsar_binary)

Let us construct the model:

jacc_sim <- apd_similarity(binary_tr)
jacc_sim

As we can see below, this is a fairly diverse training set:

library(ggplot2)

# Plot the empirical cumulative distribution function for the training set
autoplot(jacc_sim)

We can compare the similarity between new samples and the training set:

# Summarize across all training set similarities
mean_sim <- score(jacc_sim, new_data = binary_unk)
mean_sim

Samples 3 and 5 are definitely extrapolations based on these predictors. In other words, the new samples are not similar to the training set and so predictions on them may not be very reliable.



Try the applicable package in your browser

Any scripts or data that you put into this service are public.

applicable documentation built on Aug. 21, 2022, 1:06 a.m.