```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)
```
cutpointr is an R package for tidy calculation of "optimal" cutpoints. It supports several methods for calculating cutpoints and includes several metrics that can be maximized or minimized by selecting a cutpoint. Some of these methods are designed to be more robust than the simple empirical optimization of a metric. Additionally, cutpointr can automatically bootstrap the variability of the optimal cutpoints and return out-of-bag estimates of various performance metrics.
You can install cutpointr from CRAN using the menu in RStudio or simply:
```r
install.packages("cutpointr")
```
For example, the optimal cutpoint for the included data set is 2 when maximizing the sum of sensitivity and specificity.
```r
library(cutpointr)
data(suicide)
head(suicide)
cp <- cutpointr(suicide, dsi, suicide, method = maximize_metric,
                metric = sum_sens_spec)
```

```r
summary(cp)
```

```r
plot(cp)
```
When considering the optimality of a cutpoint, we can only make a judgement based on the sample at hand. Thus, the estimated cutpoint may not be optimal within the population or on unseen data, which is why we sometimes put the "optimal" in quotation marks.
cutpointr makes assumptions about the direction of the dependency between `class` and `x` if `direction` and/or `pos_class` or `neg_class` are not specified. The same result as above can be achieved by manually defining `direction` and the positive/negative classes, which is slightly faster since the classes and direction don't have to be determined:
```r
opt_cut <- cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
                     neg_class = "no", method = maximize_metric, metric = youden)
```
`opt_cut` is a data frame that returns the input data and the ROC curve (and optionally the bootstrap results) in a nested tibble. Methods for summarizing and plotting the data and results are included (e.g. `summary`, `plot`, `plot_roc`, `plot_metric`).
To inspect the optimization, the function of metric values per cutpoint can be plotted using `plot_metric`, if an optimization function was used that returns a metric column in the `roc_curve` column. For example, the `maximize_metric` and `minimize_metric` functions do so:
```r
plot_metric(opt_cut)
```
Predictions for new data can be made using `predict`:

```r
predict(opt_cut, newdata = data.frame(dsi = 0:5))
```
The included methods for calculating cutpoints are:

- `maximize_metric`: Maximize the metric function
- `minimize_metric`: Minimize the metric function
- `maximize_loess_metric`: Maximize the metric function after LOESS smoothing
- `minimize_loess_metric`: Minimize the metric function after LOESS smoothing
- `maximize_gam_metric`: Maximize the metric function after smoothing via Generalized Additive Models
- `minimize_gam_metric`: Minimize the metric function after smoothing via Generalized Additive Models
- `maximize_boot_metric`: Bootstrap the optimal cutpoint when maximizing a metric
- `minimize_boot_metric`: Bootstrap the optimal cutpoint when minimizing a metric
- `oc_manual`: Specify the cutoff value manually
- `oc_mean`: Use the sample mean as the "optimal" cutpoint
- `oc_median`: Use the sample median as the "optimal" cutpoint
- `oc_youden_kernel`: Maximize the Youden-Index after kernel smoothing the distributions of the two classes
- `oc_youden_normal`: Maximize the Youden-Index parametrically, assuming normally distributed data in both classes

The included metrics to be used with the minimization and maximization methods are:

- `accuracy`: Fraction correctly classified
- `abs_d_sens_spec`: The absolute difference of sensitivity and specificity
- `abs_d_ppv_npv`: The absolute difference between positive predictive value (PPV) and negative predictive value (NPV)
- `roc01`: Distance to the point (0, 1) in ROC space
- `cohens_kappa`: Cohen's Kappa
- `sum_sens_spec`: sensitivity + specificity
- `sum_ppv_npv`: The sum of positive predictive value (PPV) and negative predictive value (NPV)
- `prod_sens_spec`: sensitivity * specificity
- `prod_ppv_npv`: The product of positive predictive value (PPV) and negative predictive value (NPV)
- `youden`: Youden- or J-Index = sensitivity + specificity - 1
- `odds_ratio`: (Diagnostic) odds ratio
- `risk_ratio`: Risk ratio (relative risk)
- `p_chisquared`: The p-value of a chi-squared test on the confusion matrix
- `misclassification_cost`: The sum of the misclassification costs of false positives and false negatives. Additional arguments: `cost_fp`, `cost_fn`
- `total_utility`: The total utility of true/false positives/negatives. Additional arguments: `utility_tp`, `utility_tn`, `cost_fp`, `cost_fn`
- `F1_score`: The F1-score, (2 * TP) / (2 * TP + FP + FN)
- `metric_constrain`: Maximize a selected metric given a minimal value of another selected metric
- `sens_constrain`: Maximize sensitivity given a minimal value of specificity
- `spec_constrain`: Maximize specificity given a minimal value of sensitivity
- `acc_constrain`: Maximize accuracy given a minimal value of sensitivity

Furthermore, the following functions are included which can be used as metric functions but are more useful for plotting purposes, for example in `plot_cutpointr`, or for defining new metric functions: `tp`, `fp`, `tn`, `fn`, `tpr`, `fpr`, `tnr`, `fnr`, `false_omission_rate`, `false_discovery_rate`, `ppv`, `npv`, `precision`, `recall`, `sensitivity`, and `specificity`.
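For example (a minimal sketch using the `suicide` data from above), any method and metric from these lists can be combined freely:

```r
# Minimize the distance to the point (0, 1) in ROC space instead of
# maximizing the sum of sensitivity and specificity
cutpointr(suicide, dsi, suicide, method = minimize_metric, metric = roc01)
```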
The inputs to the arguments `method` and `metric` are functions, so that user-defined functions can easily be supplied instead of the built-in ones.
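For instance, an anonymous function can be passed directly to `metric` (a minimal sketch computing the Youden-Index inline; per the metric specification later in this document, the inputs arrive as vectors):

```r
# Inline metric: sensitivity + specificity - 1 (the Youden-Index)
cutpointr(suicide, dsi, suicide,
          metric = function(tp, fp, tn, fn, ...) {
            tp / (tp + fn) + tn / (tn + fp) - 1
          })
```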
Cutpoints can be estimated separately on subgroups that are defined by a third variable, `gender` in this case. Additionally, if `boot_runs` is larger than zero, `cutpointr` will carry out the usual cutpoint calculation on the full sample, just as before, and additionally on `boot_runs` bootstrap samples. This offers a way of gauging the out-of-sample performance of the cutpoint estimation method. If a subgroup is given, the bootstrapping is carried out separately for every subgroup, which is also reflected in the plots and output.
```r
set.seed(12)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 1000)
opt_cut
```
The returned object has the additional column `boot`, which is a nested tibble that includes the cutpoints per bootstrap sample along with the metric calculated using the function in `metric` and various default metrics. The metrics are suffixed by `_b` to indicate in-bag results or `_oob` to indicate out-of-bag results:

```r
opt_cut$boot
```
The summary and plots include additional elements that summarize or display the bootstrap results:
```r
summary(opt_cut)
plot(opt_cut)
```
Using `foreach` and `doRNG`, the bootstrapping can be parallelized easily. The doRNG package is used to make the bootstrap sampling reproducible:

```r
if (suppressPackageStartupMessages(require(doParallel) & require(doRNG))) {
  cl <- makeCluster(2) # 2 cores
  registerDoParallel(cl)
  registerDoRNG(12) # Reproducible parallel loops using doRNG
  opt_cut <- cutpointr(suicide, dsi, suicide, gender, pos_class = "yes",
                       direction = ">=", boot_runs = 1000, allowParallel = TRUE)
  stopCluster(cl)
  opt_cut
}
```
It has been shown that bagging can substantially improve the performance of a wide range of model types in regression as well as classification tasks. This method is available for cutpoint estimation via the `maximize_boot_metric` and `minimize_boot_metric` functions. If one of these functions is used as `method`, `boot_cut` bootstrap samples are drawn, the cutpoint optimization is carried out in each one, and a summary (e.g. the mean) of the resulting optimal cutpoints on the bootstrap samples is returned as the optimal cutpoint in `cutpointr`. Note that if bootstrap validation is run, i.e. if `boot_runs` is larger than zero, an outer bootstrap will be executed. In the bootstrap validation routine, `boot_runs` bootstrap samples are generated and each one is again bootstrapped `boot_cut` times. This may lead to long run times, so activating the built-in parallelization may be advisable.

The advantages of bootstrapping the optimal cutpoint are that the procedure has no parameters that have to be tuned, unlike LOESS smoothing; that it does not rely on distributional assumptions, unlike the Normal method; and that it is applicable to any metric that can be used with `minimize_metric` or `maximize_metric`, unlike the Kernel method. Furthermore, just as random forests cannot be overfit by increasing the number of trees, the bootstrapped cutpoints cannot be overfit by running an excessive number of `boot_cut` repetitions.
```r
set.seed(100)
cutpointr(suicide, dsi, suicide, gender, method = maximize_boot_metric,
          boot_cut = 200, summary_func = mean, metric = accuracy, silent = TRUE)
```
When using `maximize_metric` and `minimize_metric`, the optimal cutpoint is selected by searching the maximum or minimum of the metric function. For example, we may want to minimize the misclassification cost. Since false negatives (a suicide attempt was not anticipated) can be regarded as much more severe than false positives, we can set the cost of a false negative `cost_fn`, for example, to ten times the cost of a false positive:
```r
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
```

```r
plot_metric(opt_cut)
```
As this "optimal" cutpoint may depend on minor differences between the possible cutoffs, smoothing of the function of metric values by cutpoint value might be desirable, especially in small samples. The `minimize_loess_metric` and `maximize_loess_metric` functions can be used to smooth the function so that the optimal cutpoint is selected based on the smoothed metric values. Options to modify the smoothing, which is implemented using `loess.as` from the fANCOVA package, include:

- `criterion`: The criterion for automatic smoothing parameter selection: "aicc" denotes the bias-corrected AIC criterion, "gcv" denotes generalized cross-validation.
- `degree`: The degree of the local polynomials to be used. It can be 0, 1 or 2.
- `family`: If "gaussian", fitting is by least squares; if "symmetric", a re-descending M-estimator is used with Tukey's biweight function.
- `user.span`: The user-defined parameter which controls the degree of smoothing.

Using parameters for the LOESS smoothing of `criterion = "aicc"`, `degree = 2`, `family = "symmetric"`, and `user.span = 0.7`, we get the following smoothed versions of the above metrics:
```r
opt_cut <- cutpointr(suicide, dsi, suicide, gender,
                     method = minimize_loess_metric, criterion = "aicc",
                     family = "symmetric", degree = 2, user.span = 0.7,
                     metric = misclassification_cost, cost_fp = 1, cost_fn = 10)
```

```r
plot_metric(opt_cut)
```
The optimal cutpoint for the female subgroup changes to 3. Note, though, that there are no reliable rules for selecting the "best" smoothing parameters. Notably, the LOESS smoothing is sensitive to the number of unique cutpoints. A large number of unique cutpoints generally leads to a more volatile curve of metric values by cutpoint value, even after smoothing, so the curve tends to be undersmoothed in that scenario. The unsmoothed metric values are returned in `opt_cut$roc_curve` in the column `m_unsmoothed`.
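They can be inspected alongside the smoothed values like this (a minimal sketch; `opt_cut` is grouped by `gender`, so `[[1]]` picks the first subgroup's nested ROC curve):

```r
# Inspect the unsmoothed metric values for the first subgroup
head(opt_cut$roc_curve[[1]][, c("x.sorted", "m_unsmoothed")])
```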
In a similar fashion, the function of metric values per cutpoint can be smoothed using Generalized Additive Models with smooth terms. Internally, `mgcv::gam` carries out the smoothing, which can be customized via the arguments `formula` and `optimizer`; see `help("gam", package = "mgcv")`. Most importantly, the GAM can be specified by altering the default formula; for example, the smoothing function could be configured to apply cubic regression splines (`"cr"`) as the smooth term. As the `suicide` data has only very few unique cutpoints, it is not very suitable for showcasing the GAM smoothing, so we will use two classes of the `iris` data here. In this case, the purely empirical method and the GAM smoothing lead to identical cutpoints, but in practice the GAM smoothing tends to be more robust, especially with larger data. An attractive feature of the GAM smoothing is that the default values tend to work quite well and usually require no tuning, eliminating researcher degrees of freedom.
```r
library(ggplot2)
exdat <- iris
exdat <- exdat[exdat$Species != "setosa", ]
opt_cut <- cutpointr(exdat, Petal.Length, Species,
                     method = minimize_gam_metric,
                     formula = m ~ s(x.sorted, bs = "cr"),
                     metric = abs_d_sens_spec)
plot_metric(opt_cut)
```
The Normal method in `oc_youden_normal` is a parametric method for maximizing the Youden-Index or, equivalently, the sum of $Se$ and $Sp$. It relies on the assumption that the predictor for both the negative and positive observations is normally distributed. In that case it can be shown that

$$c^* = \frac{(\mu_P \sigma_N^2 - \mu_N \sigma_P^2) - \sigma_N \sigma_P \sqrt{(\mu_N - \mu_P)^2 + (\sigma_N^2 - \sigma_P^2) \log(\sigma_N^2 / \sigma_P^2)}}{\sigma_N^2 - \sigma_P^2}$$

where the negative class is normally distributed as $N(\mu_N, \sigma_N^2)$ and the positive class independently normally distributed as $N(\mu_P, \sigma_P^2)$ provides the optimal cutpoint $c^*$ that maximizes the Youden-Index. If $\sigma_N$ and $\sigma_P$ are equal, the expression can be simplified to $c^* = \frac{\mu_N + \mu_P}{2}$. However, the `oc_youden_normal` method in cutpointr always assumes unequal standard deviations. Since this method does not select a cutpoint from the observed predictor values, it is questionable which values for $Se$ and $Sp$ should be reported. Here, the Youden-Index can be calculated as

$$J = \Phi\left(\frac{c^* - \mu_N}{\sigma_N}\right) - \Phi\left(\frac{c^* - \mu_P}{\sigma_P}\right)$$

if the assumption of normality holds. However, since there exist several methods that do not select cutpoints from the available observations, and to unify the reporting of metrics for these methods, cutpointr reports all metrics, e.g. $Se$ and $Sp$, based on the empirical observations.
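To make the cutpoint formula above concrete, here is a minimal sketch (a hypothetical helper, not the package's implementation) that computes $c^*$ from class-wise means and standard deviations:

```r
# Hypothetical helper: Normal-method cutpoint from the formula above
# (assumes sigma_N != sigma_P)
normal_cutpoint <- function(mu_n, sd_n, mu_p, sd_p) {
  var_n <- sd_n^2
  var_p <- sd_p^2
  ((mu_p * var_n - mu_n * var_p) -
      sd_n * sd_p * sqrt((mu_n - mu_p)^2 + (var_n - var_p) * log(var_n / var_p))) /
    (var_n - var_p)
}
# Example with made-up class parameters
normal_cutpoint(mu_n = 0, sd_n = 1, mu_p = 2, sd_p = 1.5)
```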
```r
cutpointr(suicide, dsi, suicide, gender, method = oc_youden_normal)
```
A nonparametric alternative is the Kernel method [@fluss_estimation_2005]. Here, the empirical distribution functions are smoothed using Gaussian kernel functions, $\hat{F}_N(t) = \frac{1}{n} \sum_{i=1}^n \Phi(\frac{t - y_i}{h_y})$ and $\hat{G}_P(t) = \frac{1}{m} \sum_{i=1}^m \Phi(\frac{t - x_i}{h_x})$, for the negative and positive classes respectively. Following Silverman's plug-in "rule of thumb", the bandwidths are selected as $h_y = 0.9 \cdot \min\{s_y, iqr_y/1.34\} \cdot n^{-0.2}$ and $h_x = 0.9 \cdot \min\{s_x, iqr_x/1.34\} \cdot m^{-0.2}$, where $s$ is the sample standard deviation and $iqr$ is the interquartile range. It has been demonstrated that AUC estimation is rather insensitive to the choice of the bandwidth procedure [@faraggi_estimation_2002] and thus the plug-in bandwidth estimator has also been recommended for cutpoint estimation. The `oc_youden_kernel` function in cutpointr uses a Gaussian kernel and the direct plug-in method for selecting the bandwidths. The kernel smoothing is done via the `bkde` function from the KernSmooth package [@wand_kernsmooth:_2013].
Again, there is a way to calculate the Youden-Index from the results of this method [@fluss_estimation_2005], namely

$$\hat{J} = \max_c \{\hat{F}_N(c) - \hat{G}_P(c)\}$$

but as before we prefer to report all metrics based on applying the cutpoint that was estimated using the Kernel method to the empirical observations.
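To illustrate the idea, here is a minimal sketch of the kernel-smoothed Youden-Index maximization using the rule-of-thumb bandwidths from the formulas above. It is only an illustration and differs from `oc_youden_kernel`, which uses `KernSmooth::bkde` with direct plug-in bandwidths:

```r
# Gaussian-smoothed ECDF with Silverman's rule-of-thumb bandwidth
smooth_ecdf <- function(t, obs) {
  h <- 0.9 * min(sd(obs), IQR(obs) / 1.34) * length(obs)^(-0.2)
  sapply(t, function(ti) mean(pnorm((ti - obs) / h)))
}
x_neg <- suicide$dsi[suicide$suicide == "no"]
x_pos <- suicide$dsi[suicide$suicide == "yes"]
cuts <- seq(min(suicide$dsi), max(suicide$dsi), length.out = 500)
# J(c) = F_N(c) - G_P(c); pick the cutpoint that maximizes it
j_hat <- smooth_ecdf(cuts, x_neg) - smooth_ecdf(cuts, x_pos)
cuts[which.max(j_hat)]
```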
```r
cutpointr(suicide, dsi, suicide, gender, method = oc_youden_kernel)
```
When running `cutpointr`, a ROC curve is returned by default in the column `roc_curve`. This ROC curve can be plotted using `plot_roc`. Alternatively, if only the ROC curve is desired and no cutpoint needs to be calculated, the ROC curve can be created using `roc()` and plotted using `plot_cutpointr`. The `roc` function, unlike `cutpointr`, does not determine `direction`, `pos_class`, or `neg_class` automatically.
```r
roc_curve <- roc(data = suicide, x = dsi, class = suicide,
                 pos_class = "yes", neg_class = "no", direction = ">=")
auc(roc_curve)
head(roc_curve)
plot_roc(roc_curve)
```
So far (the default in `cutpointr`) we have considered all unique values of the predictor as possible cutpoints. An alternative could be to use a sequence of equidistant values instead, for example, in the case of the `suicide` data, all integers in $[0, 10]$. However, with very sparse data and small intervals between the candidate cutpoints (i.e. a "dense" sequence like `seq(0, 10, by = 0.01)`), this leads to the uninformative evaluation of large ranges of cutpoints that all result in the same metric value. A more elegant alternative, not only for the case of sparse data, that is supported by cutpointr is to use the mean of the optimal cutpoint and the next highest (if `direction = ">="`) or the next lowest (if `direction = "<="`) predictor value in the data. The result is an optimal cutpoint that is equal to the cutpoint that would be obtained using an infinitely dense sequence of candidate cutpoints, and it is usually more efficient computationally. This behavior can be activated by setting `use_midpoints = TRUE` (the default is `FALSE`). With this setting, we obtain an optimal cutpoint of 1.5 for the complete sample of the `suicide` data instead of 2 when maximizing the sum of sensitivity and specificity.
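A quick check reproduces the cutpoint of 1.5 mentioned above:

```r
cutpointr(suicide, dsi, suicide, metric = sum_sens_spec,
          use_midpoints = TRUE)
```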
Assume the following small data set:
```r
dat <- data.frame(outcome = c("neg", "neg", "neg", "pos", "pos", "pos", "pos"),
                  pred = c(1, 2, 3, 8, 11, 11, 12))
```
Since the distance of the optimal cutpoint (8) to the next lowest observation (3) is rather large, we arrive at a range of possible cutpoints that all maximize the metric. With this kind of sparseness it might, for example, be desirable to classify a new observation with a predictor value of 4 as belonging to the negative class. If `use_midpoints` is set to `TRUE`, the mean of the optimal cutpoint and the next lowest observation is returned as the optimal cutpoint if direction is `>=`, and the mean of the optimal cutpoint and the next highest observation is returned if `direction = "<="`.
```r
opt_cut <- cutpointr(dat, x = pred, class = outcome, use_midpoints = TRUE)
plot_x(opt_cut)
```
A simulation demonstrates more clearly that setting `use_midpoints = TRUE` avoids biasing the cutpoints. To simulate the bias of the metric functions, the predictor values of both classes were drawn from normal distributions with a constant standard deviation of 10, a constant mean of the negative class of 100, and mean values of the positive class selected such that optimal Youden-Index values of 0.2, 0.4, 0.6, and 0.8 result in the population. Samples of 9 different sizes were drawn and the cutpoints that maximize the Youden-Index were estimated. The simulation was repeated 10000 times. As can be seen from the mean error, `use_midpoints = TRUE` eliminates the bias that is otherwise introduced by selecting the value of an observation as the optimal cutpoint. If `direction = ">="`, as in this case, the observation that represents the optimal cutpoint is the highest possible cutpoint that leads to the optimal metric value, and thus the biases are positive. The methods `oc_youden_normal` and `oc_youden_kernel` are always unbiased, as they do not select a cutpoint based on the ROC curve or the function of metric values per cutpoint.

```r
plotdat_nomidpoints <- structure(list(sim_nr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, 32L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 34L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L), method = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("emp", "normal", "loess", "boot", "spline", "spline_20", "kernel", "gam" ), class = "factor"), n = c(30, 30, 30, 30, 30, 30, 30, 30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100, 100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, 150, 150, 150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, 500, 500, 500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, 750, 750, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30, 30, 30, 30, 30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100, 100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, 150, 150, 150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, 500, 500, 500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, 750, 750, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30, 30, 30, 30, 30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100, 100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, 150, 
150, 150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, 500, 500, 500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, 750, 750, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 30, 30, 30, 30, 30, 30, 30, 30, 50, 50, 50, 50, 50, 50, 50, 50, 75, 75, 75, 75, 75, 75, 75, 75, 100, 100, 100, 100, 100, 100, 100, 100, 150, 150, 150, 150, 150, 150, 150, 150, 250, 250, 250, 250, 250, 250, 250, 250, 500, 500, 500, 500, 500, 500, 500, 500, 750, 750, 750, 750, 750, 750, 750, 750, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000), mean_err = c(0.532157164015659, 0.0344907054484091, 1.09430750651166, 0.847845162156675, 1.72337372126503, 0.893756658507988, 0.0430309247027736, 0.785821459035346, 0.368063404388512, 0.0256197760404459, 0.54480529648463, 0.54385597929651, 0.657325657699579, 0.578611116865437, 0.0400491342691897, 0.515688005217413, 0.256713912589642, 0.0444582875885996, 0.326975493112402, 0.371128780921122, 0.473515115741104, 0.389519558405289, 0.105044360789378, 0.301924717299333, 0.207750921776918, -0.00318128936770314, 0.215170156089776, 0.27218780048926, 0.260519564021842, 0.236792923882582, 0.0209319074923902, 0.232957055204834, 0.0726605917614469, -0.00282823355849125, 0.0753216783313991, 0.147121931849656, 0.0986417724955371, 0.10048009778446, -0.0117861260923649, 0.0650845904350442, 0.0985144485083747, 0.00601003227249322, 0.107439979908118, 0.120777421732797, 0.098470489820427, 0.0940946984227826, 0.0340166854141625, 0.107851118082414, 0.0249685210781582, -0.00275219600614378, 0.0258069390207201, 0.0303381972274654, -0.000994602151869198, 0.00196854833683764, -0.0172319489562159, 0.0230957871932473, 0.00787486424680835, -0.018438041997315, -0.000808033567394628, -0.00151904153864496, -0.0258118523805697, -0.020984156892953, -0.0411584927473141, -0.0462075435919094, 0.0481149217843661, -0.0115241085997692, 0.0278045708419731, 0.0358588316426625, 0.0424473909450939, 0.0379773233328197, 0.0298772985879321, 0.0494939492036379, 0.561286668337778, 0.0210874502384648, 0.607711822769155, 0.944733906256477, 1.32069801051061, 0.623604782428556, 0.0138075109769806, 0.640859854412358, 0.284873604303057, -0.0170357985365701, 0.273426417633118, 0.524432737895336, 0.355003110979807, 0.312837607951434, -0.0316296929553873, 0.270109834098986, 0.174110335581819, 0.0253101719615279, 0.199956222742702, 0.375485416120771, 0.278956745944806, 0.244245525945888, 0.0325126314233263, 0.253352659868514, 0.154840760004461, 0.00231589709639472, 0.154412480165179, 0.264847742842386, 0.196572744185608, 0.182934783774992, 0.0207139021497755, 0.17351041376412, 0.14190910348156, -0.000766834484010096, 0.159975205214477, 0.191222128926019, 0.119768252112669, 0.12033372914036, -0.00429047209392067, 0.120982527821078, 0.0756304869484406, -0.00890884219048113, 0.0727782693168392, 0.118690444738942, 0.0814898789647033, 0.0799724348001957, 0.0182926240912726, 0.0887155007804252, 0.00799604720502299, 0.000599148388616836, -0.00567769035990384, 0.0358412514670032, 0.0308474979074875, 0.0341668723768997, -0.000180318451026095, 0.0180733341290925, 0.00456876236626807, 0.00150574966485876, 0.011152095953916, 0.0176039119729626, 0.00608274255434991, 0.0146257828313115, -0.0108877417404102, 0.00341198000323035, -0.00198459880370283, 0.0026551895445694, 0.00199040664534129, 0.0150165794544221, 0.00646287144368147, 0.00999205240904708, -0.00850278571195971, 0.000833666619266177, 0.714730067273087, -0.00916546079360956, 0.662799490366986, 1.18552468844156, 1.25901933062308, 0.672701515532179, 0.0311066140197676, 
0.699068058809396, 0.451908043813962, 0.0131716226592205, 0.429551369887738, 0.697928133235757, 0.445367768988423, 0.408463448982185, 0.0318707241721211, 0.406284953982951, 0.257736307754364, -0.00525924423719458, 0.236977000055322, 0.429144726596141, 0.291381752107184, 0.267557606428613, 0.0103657879176852, 0.254728590646094, 0.187783398487578, -0.00216064381479362, 0.209025103860707, 0.318293592390017, 0.216751610346408, 0.195630579126633, -0.00355723971644246, 0.174111826408428, 0.151010324964235, 0.0152409223899092, 0.159002511320467, 0.214643583389694, 0.136211731513269, 0.138948149207635, 0.00736196817594524, 0.115637867729083, 0.0491348055596302, -0.00133957946235943, 0.0507437758212659, 0.103956325245849, 0.0641182216839426, 0.0721933081297794, -0.0124376134651938, 0.0632317888879588, 0.0322195712438111, 0.00170122889182022, 0.0287526766624194, 0.0589662164030242, 0.0348535721709848, 0.039527944642463, -0.00617539706415593, 0.0274246010641889, 0.0325877909680824, 0.00530528253248245, 0.0221776555499961, 0.0389702052631117, 0.0221602091288215, 0.0254478639695596, -0.0016189234058987, 0.0197144417326668, -0.00632485262604172, -0.00364979854195596, -0.00276076468388984, 0.0126267527301874, 0.0123592498266038, 0.0154921632247644, -0.00591512196680815, 0.0098685016547149, 1.19276916750486, -0.0296831640401583, 0.99406393593888, 1.95758669445116, 1.45842790978446, 0.916899913902239, -0.0240222410217233, 1.00771193034927, 0.748151091428865, 0.001671855025917, 0.665180535306263, 1.1777049634557, 0.578603609273264, 0.546625362141714, 0.0292152981607387, 0.615230814912951, 0.417886753756131, 0.00324593885807739, 0.406076310942717, 0.732191741449251, 0.352684769616612, 0.326901376027897, -0.000759357576337989, 0.350075431324921, 0.310927617707656, -0.0107255472998434, 0.28102101085112, 0.514683023356017, 0.24913510139508, 0.235155452507568, -0.0220885572014814, 0.243370611433649, 0.209652330609093, -0.00502865663759991, 0.2172246261346, 0.356540958804122, 0.172121720418057, 0.17487914828986, 0.00365942442127361, 0.176594681455494, 0.126956927057327, -0.00270525933073803, 0.120116234221594, 0.210827536708082, 0.101520409193932, 0.101379097920023, 0.00238043252144371, 0.113027315928011, 0.0598624378953727, -0.00538838415690431, 0.0568400730102315, 0.0978115258288965, 0.0454207684906316, 0.0473140143579152, -0.00165813015281622, 0.0521772135812508, 0.0530224090961669, -0.000416993405198653, 0.0353236911458531, 0.0605493601241619, 0.0316204159297213, 0.0344789374555544, -0.00446984887315054, 0.0328807595695966, 0.0396438546423947, -0.00331466369719113, 0.0379029847219126, 0.0572435100638761, 0.0253269328104989, 0.0235663211070417, 0.00220241478536399, 0.0307132312422208), youden = c(0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 
0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8 )), row.names = c(NA, -288L), group_sizes = c(8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L ), biggest_group_size = 8L, class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(sim_nr = 1:36, .rows = list(1:8, 9:16, 17:24, 25:32, 33:40, 41:48, 49:56, 57:64, 65:72, 73:80, 81:88, 89:96, 97:104, 105:112, 113:120, 121:128, 129:136, 137:144, 145:152, 153:160, 161:168, 169:176, 177:184, 185:192, 193:200, 201:208, 209:216, 217:224, 225:232, 233:240, 241:248, 249:256, 257:264, 265:272, 273:280, 281:288)), row.names = c(NA, -36L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))
```

```r
library(dplyr)
ggplot(plotdat_nomidpoints,
       aes(x = n, y = mean_err, color = method, shape = method)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ youden, scales = "fixed") +
  scale_shape_manual(values = 1:nlevels(plotdat_nomidpoints$method)) +
  scale_x_log10(breaks = c(30, 50, 75, 100, 150, 250, 500, 750, 1000)) +
  ggtitle("Bias of all methods when use_midpoints = FALSE",
          "normally distributed data, 10000 repetitions of simulation")
```
By default, most packages return only the "best" cutpoint and disregard other cutpoints with quite similar performance, even if the performance differences are minuscule. cutpointr makes this process more explicit via the `tol_metric` argument. For example, if all cutpoints are of interest that achieve an accuracy within `0.05` of the optimally achievable accuracy, `tol_metric` can be set to `0.05` and those cutpoints will be returned as well. In the case of the `suicide` data, when maximizing the sum of sensitivity and specificity, the cutpoints 2 and 3 empirically lead to quite similar performances. If `tol_metric` is set to `0.05`, both will be returned:
```r
opt_cut <- cutpointr(suicide, dsi, suicide, metric = sum_sens_spec,
                     tol_metric = 0.05, break_ties = c)
library(tidyr)
opt_cut %>% 
  select(optimal_cutpoint, sum_sens_spec) %>% 
  unnest(cols = c(optimal_cutpoint, sum_sens_spec))
```
Using the `oc_manual` function, the optimal cutpoint is not determined based on, for example, a metric but is instead set manually via the `cutpoint` argument. This is useful for supplying and evaluating cutpoints that were found in the literature or in other external sources.
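For instance, a cutpoint taken from an external source can be evaluated like this (the value 4 is hypothetical, chosen only for illustration):

```r
# Evaluate a hypothetical literature cutpoint of 4 on the suicide data
cutpointr(suicide, dsi, suicide, method = oc_manual, cutpoint = 4)
```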
The `oc_manual` function could also be used to set the cutpoint to the sample mean via `cutpoint = mean(data$x)`. However, this may introduce a bias into the bootstrap validation procedure, since the actual mean of the population is not known; the mean to be used as the cutpoint should instead be determined automatically in every resample. To do so, the `oc_mean` and `oc_median` functions can be used:
```r
set.seed(100)
opt_cut_manual <- cutpointr(suicide, dsi, suicide, method = oc_manual,
                            cutpoint = mean(suicide$dsi), boot_runs = 30)
set.seed(100)
opt_cut_mean <- cutpointr(suicide, dsi, suicide, method = oc_mean,
                          boot_runs = 30)
```
The arguments to `cutpointr` do not need to be enclosed in quotes. This is possible thanks to nonstandard evaluation of the arguments, which are evaluated within `data`. Functions that use nonstandard evaluation are often not suitable for programming with; their use may lead to scoping problems and subsequent obvious as well as possibly subtle errors. cutpointr uses tidyeval internally, and accordingly the same rules as for programming with dplyr apply. Arguments can be unquoted with `!!`:
```r
myvar <- "dsi"
cutpointr(suicide, !!myvar, suicide)
```
Alternatively, we can map `cutpointr` to several column names. If `direction` and/or `pos_class` and `neg_class` are unspecified, these parameters will automatically be determined by cutpointr so that the AUC values for all variables will be $> 0.5$. We could do this manually, e.g. using `purrr::map` (see the sketch below), but `multi_cutpointr` makes this task more convenient: it maps multiple predictor columns to `cutpointr`, by default all numeric columns except for the class column.
```r
mcp <- multi_cutpointr(suicide, class = suicide, pos_class = "yes",
                       use_midpoints = TRUE, silent = TRUE)
summary(mcp)
```
The object returned by `cutpointr` is of the classes `cutpointr`, `tbl_df`, `tbl`, and `data.frame`, so it can be handled like a usual data frame. The columns `data`, `roc_curve`, and `boot` consist of nested data frames, which means that these are list columns whose elements are data frames. They can be accessed either using `[` or using functions from the tidyverse. If subgroups were given, the output contains one row per subgroup, and the function that accesses the data should be mapped to every row or the data should be grouped by subgroup.
```r
# Extracting the bootstrap results
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)
# Using base R to summarise the result of the bootstrap
summary(opt_cut$boot[[1]]$optimal_cutpoint)
summary(opt_cut$boot[[2]]$optimal_cutpoint)
# Using dplyr and tidyr
library(tidyr)
opt_cut %>% 
  group_by(subgroup) %>% 
  select(boot) %>% 
  unnest(boot) %>% 
  summarise(sd_oc_boot = sd(optimal_cutpoint),
            m_oc_boot = mean(optimal_cutpoint),
            m_acc_oob = mean(acc_oob))
```
By default, the output of `cutpointr` includes the optimized metric and several other metrics. The `add_metric` function adds further metrics. Here, we add the negative predictive value (NPV) and the positive predictive value (PPV) at the optimal cutpoint per subgroup:
```r
cutpointr(suicide, dsi, suicide, gender, metric = youden, silent = TRUE) %>% 
  add_metric(list(ppv, npv)) %>% 
  select(subgroup, optimal_cutpoint, youden, ppv, npv)
```
In the same fashion, additional metric columns can be added to a `roc_cutpointr` object:
```r
roc(data = suicide, x = dsi, class = suicide, pos_class = "yes",
    neg_class = "no", direction = ">=") %>% 
  add_metric(list(cohens_kappa, F1_score)) %>% 
  select(x.sorted, tp, fp, tn, fn, cohens_kappa, F1_score) %>% 
  head()
```
User-defined functions can be supplied to `method`, which is the function that is responsible for returning the optimal cutpoint. To define a new method function, create a function that may take as input(s):

- `data`: A `data.frame` or `tbl_df`
- `x`: (character) The name of the predictor variable
- `class`: (character) The name of the class variable
- `metric_func`: A function for calculating a metric, e.g. accuracy. Note that the method function does not necessarily have to accept this argument
- `pos_class`: The positive class
- `neg_class`: The negative class
- `direction`: `">="` if the positive class has higher x values, `"<="` otherwise
- `tol_metric`: (numeric) In the built-in methods, all cutpoints will be returned that lead to a metric value in the interval [m_max - tol_metric, m_max + tol_metric], where m_max is the maximum achievable metric value. This can be used to return multiple decent cutpoints and to avoid floating-point problems
- `use_midpoints`: (logical) In the built-in methods, if TRUE (default FALSE) the returned optimal cutpoint will be the mean of the optimal cutpoint and the next highest observation (for direction = ">") or the next lowest observation (for direction = "<"), which avoids biasing the optimal cutpoint
- `...`: Further arguments that are passed to `metric` or that can be captured inside of `method`

The function should return a data frame or tibble with one row, the column `optimal_cutpoint`, and an optional column with an arbitrary name containing the metric value at the optimal cutpoint.
For example, a function for choosing the cutpoint as the mean of the independent variable could look like this:
```r
mean_cut <- function(data, x, ...) {
  oc <- mean(data[[x]])
  return(data.frame(optimal_cutpoint = oc))
}
```
If a `method` function does not return a metric column, the default `sum_sens_spec`, the sum of sensitivity and specificity, is returned as the extra metric column in addition to accuracy, sensitivity, and specificity.

Some `method` functions that make use of the additional arguments (captured by `...`) are already included in cutpointr; see the list at the top. Since these functions are arguments to `cutpointr`, their code can be accessed by simply typing their name, see for example `oc_youden_normal`.
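For instance, typing the name prints the function body:

```r
oc_youden_normal
```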
User-defined `metric` functions can be used as well. They are mainly useful in conjunction with `method = maximize_metric`, `method = minimize_metric`, or one of the other minimization and maximization functions. With a different `method` function, `metric` will only be used as the main out-of-bag metric when plotting the result. The `metric` function should accept the following inputs as vectors:

- `tp`: Vector of true positives
- `fp`: Vector of false positives
- `tn`: Vector of true negatives
- `fn`: Vector of false negatives
- `...`: Further arguments

The function should return a numeric vector, a matrix, or a `data.frame` with one column. If the column is named, the name will be included in the output and plots. Avoid using names that are identical to the column names that are returned by cutpointr by default, as such names will be prefixed by `metric_` in the output.
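As an illustration (a minimal sketch, not shipped with the package), a metric function that recomputes the sum of sensitivity and specificity might look like this:

```r
# User-defined metric: sensitivity + specificity; the column name is
# carried into cutpointr's output and plots
sens_plus_spec <- function(tp, fp, tn, fn, ...) {
  res <- matrix(tp / (tp + fn) + tn / (tn + fp), ncol = 1)
  colnames(res) <- "sens_plus_spec"
  res
}
cutpointr(suicide, dsi, suicide, metric = sens_plus_spec)
```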
The code of the included metric functions can be accessed by simply typing their name. For example, this is the `misclassification_cost` metric function:
```r
misclassification_cost
```
cutpointr includes several convenience functions for plotting data from a `cutpointr` object. These include:

- `plot_cutpointr`: General-purpose plotting function for cutpointr or roc_cutpointr objects
- `plot_cut_boot`: Plot the bootstrapped distribution of optimal cutpoints
- `plot_metric`: If `maximize_metric` or `minimize_metric` was used, this function plots all possible cutoffs on the x-axis vs. the respective metric values on the y-axis. If bootstrapping was run, a confidence interval based on the bootstrapped distribution of metric values at each cutpoint can be displayed. To display no confidence interval, set `conf_lvl = 0`.
- `plot_metric_boot`: Plot the distribution of out-of-bag metric values
- `plot_precision_recall`: Plot the precision-recall curve
- `plot_sensitivity_specificity`: Plot all cutpoints vs. sensitivity and specificity
- `plot_roc`: Plot the ROC curve
- `plot_x`: Plot the distribution of the predictor variable

```r
set.seed(102)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, method = minimize_metric,
                     metric = abs_d_sens_spec, boot_runs = 200, silent = TRUE)
opt_cut
plot_cut_boot(opt_cut)
plot_metric(opt_cut, conf_lvl = 0.9)
plot_metric_boot(opt_cut)
plot_precision_recall(opt_cut)
plot_sensitivity_specificity(opt_cut)
plot_roc(opt_cut)
```
All plot functions, except for the standard plot method that returns a composed plot, return `ggplot` objects that can be further modified. For example, changing labels, title, and the theme can be achieved this way:

```r
p <- plot_x(opt_cut)
p + ggtitle("Distribution of dsi") +
  theme_minimal() +
  xlab("Depression score")
```
Using `plot_cutpointr`, any metric can be chosen to be plotted on the x- or y-axis, and results of `cutpointr()` as well as `roc()` can be plotted. If a `cutpointr` object is to be plotted, it is thus irrelevant which metric function was chosen for cutpoint estimation. Any metric that can be calculated based on the ROC curve can subsequently be plotted, as only the true/false positives/negatives over all cutpoints are needed. That way, not only the above plots can be produced, but also any combination of two metrics (or metric functions) and/or cutpoints. The built-in metric functions as well as user-defined or anonymous functions can be supplied to `xvar` and `yvar`. If bootstrapping was run, confidence intervals can be plotted around the y-variable. This is especially useful if the cutpoints, available in the `cutpoints` function, are placed on the x-axis. Note that confidence intervals can only be correctly plotted if the values of `xvar` are constant across bootstrap samples. For example, confidence intervals for TPR by FPR (a ROC curve) cannot be plotted easily, as the values of the false positive rate vary per bootstrap sample.
```r
set.seed(500)
oc <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
                metric = sum_ppv_npv) # metric irrelevant for plot_cutpointr
plot_cutpointr(oc, xvar = cutpoints, yvar = sum_sens_spec, conf_lvl = 0.9)
plot_cutpointr(oc, xvar = fpr, yvar = tpr, aspect_ratio = 1, conf_lvl = 0)
plot_cutpointr(oc, xvar = cutpoint, yvar = tp, conf_lvl = 0.9) + geom_point()
```
Since `cutpointr` returns a `data.frame` with the original data, bootstrap results, and the ROC curve in nested tibbles, these data can be conveniently extracted and plotted manually. The relevant nested tibbles are in the columns `data`, `roc_curve`, and `boot`. The following is an example of accessing and plotting the grouped data:

```r
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000)
opt_cut %>% 
  select(data, subgroup) %>% 
  unnest(cols = data) %>% 
  ggplot(aes(x = suicide, y = dsi)) +
  geom_boxplot(alpha = 0.3) +
  facet_grid(~subgroup)
```
To offer a comparison to established solutions, cutpointr 1.0.0 will be benchmarked against `optimal.cutpoints` from OptimalCutpoints 1.1-4, ThresholdROC 2.7, and custom functions based on ROCR 1.0-7 and pROC 1.15.0. By generating data of different sizes, the benchmarks will offer a comparison of the scalability of the different solutions.
Using `prediction` and `performance` from the ROCR package and `roc` from the pROC package, we can write functions for computing the cutpoint that maximizes the sum of sensitivity and specificity. pROC has a built-in function to optimize a few metrics:
```r
# Return cutpoint that maximizes the sum of sensitivity and specificity
# ROCR package
rocr_sensspec <- function(x, class) {
  pred <- ROCR::prediction(x, class)
  perf <- ROCR::performance(pred, "sens", "spec")
  sens <- slot(perf, "y.values")[[1]]
  spec <- slot(perf, "x.values")[[1]]
  cut <- slot(perf, "alpha.values")[[1]]
  cut[which.max(sens + spec)]
}
# pROC package
proc_sensspec <- function(x, class) {
  r <- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = "<")
  pROC::coords(r, "best", ret = "threshold", transpose = FALSE)[1]
}
```
The benchmarking will be carried out using the microbenchmark package and randomly generated data. The values of the `x` predictor variable are drawn from a normal distribution, which leads to many more unique values than were encountered before in the `suicide` data. Accordingly, the search for an optimal cutpoint is much more demanding, if all possible cutpoints are evaluated.
Benchmarks are run for sample sizes of 100, 1000, 1e4, 1e5, 1e6, and 1e7. For low sample sizes, cutpointr is slower than the other solutions. While this should be of low practical importance, cutpointr scales more favorably with increasing sample size. The speed disadvantage in small samples that leads to the lower limit of around 25 ms is mainly due to the nesting of the original data and the results, which makes the compact output of cutpointr possible. This observation is emphasized by the fact that `cutpointr::roc` is quite fast also in small samples. For sample sizes > 1e5, `cutpointr` is a little faster than the functions based on ROCR and pROC. Both of these solutions are generally faster than OptimalCutpoints and ThresholdROC, with the exception of small samples. OptimalCutpoints and ThresholdROC had to be excluded from benchmarks with more than 1e4 observations due to high memory requirements and/or excessive run times, rendering the use of these packages in larger samples impractical.
```r
library(OptimalCutpoints)
library(ThresholdROC)

n <- 100
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
x_pos <- dat$x[dat$y == 1]
x_neg <- dat$x[dat$y == 0]
bench_100 <- microbenchmark::microbenchmark(
  cutpointr(dat, x, y, pos_class = 1, neg_class = 0, direction = ">=",
            metric = youden, break_ties = mean),
  rocr_sensspec(dat$x, dat$y),
  proc_sensspec(dat$x, dat$y),
  optimal.cutpoints(X = "x", status = "y", tag.healthy = 0,
                    methods = "Youden", data = dat),
  thres2(k1 = x_neg, k2 = x_pos, rho = 0.5, method = "empirical", ci = FALSE),
  times = 100, unit = "ms"
)

n <- 1000
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
x_pos <- dat$x[dat$y == 1]
x_neg <- dat$x[dat$y == 0]
bench_1000 <- microbenchmark::microbenchmark(
  cutpointr(dat, x, y, pos_class = 1, neg_class = 0, direction = ">=",
            metric = youden, break_ties = mean),
  rocr_sensspec(dat$x, dat$y),
  proc_sensspec(dat$x, dat$y),
  optimal.cutpoints(X = "x", status = "y", tag.healthy = 0,
                    methods = "Youden", data = dat),
  thres2(k1 = x_neg, k2 = x_pos, rho = 0.5, method = "empirical", ci = FALSE),
  times = 100, unit = "ms"
)

n <- 10000
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
x_pos <- dat$x[dat$y == 1]
x_neg <- dat$x[dat$y == 0]
bench_10000 <- microbenchmark::microbenchmark(
  cutpointr(dat, x, y, pos_class = 1, neg_class = 0, direction = ">=",
            metric = youden, break_ties = mean, silent = TRUE),
  rocr_sensspec(dat$x, dat$y),
  optimal.cutpoints(X = "x", status = "y", tag.healthy = 0,
                    methods = "Youden", data = dat),
  proc_sensspec(dat$x, dat$y),
  thres2(k1 = x_neg, k2 = x_pos, rho = 0.5, method = "empirical", ci = FALSE),
  times = 100
)

n <- 1e5
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1e5 <- microbenchmark::microbenchmark(
  cutpointr(dat, x, y, pos_class = 1, neg_class = 0, direction = ">=",
            metric = youden, break_ties = mean),
  rocr_sensspec(dat$x, dat$y),
  proc_sensspec(dat$x, dat$y),
  times = 100, unit = "ms"
)

n <- 1e6
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1e6 <- microbenchmark::microbenchmark(
  cutpointr(dat, x, y, pos_class = 1, neg_class = 0, direction = ">=",
            metric = youden, break_ties = mean),
  rocr_sensspec(dat$x, dat$y),
  proc_sensspec(dat$x, dat$y),
  times = 30, unit = "ms"
)

n <- 1e7
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1e7 <- microbenchmark::microbenchmark(
  cutpointr(dat, x, y, pos_class = 1, neg_class = 0, direction = ">=",
            metric = youden, break_ties = mean),
  rocr_sensspec(dat$x, dat$y),
  proc_sensspec(dat$x, dat$y),
  times = 30, unit = "ms"
)

results <- rbind(
  data.frame(time = summary(bench_100)$median,
             Solution = summary(bench_100)$expr, n = 100),
  data.frame(time = summary(bench_1000)$median,
             Solution = summary(bench_1000)$expr, n = 1000),
  data.frame(time = summary(bench_10000)$median,
             Solution = summary(bench_10000)$expr, n = 10000),
  data.frame(time = summary(bench_1e5)$median,
             Solution = summary(bench_1e5)$expr, n = 1e5),
  data.frame(time = summary(bench_1e6)$median,
             Solution = summary(bench_1e6)$expr, n = 1e6),
  data.frame(time = summary(bench_1e7)$median,
             Solution = summary(bench_1e7)$expr, n = 1e7)
)
results$Solution <- as.character(results$Solution)
results$Solution[grep(pattern = "cutpointr", x = results$Solution)] <- "cutpointr"
results$Solution[grep(pattern = "rocr", x = results$Solution)] <- "ROCR"
results$Solution[grep(pattern = "optimal", x = results$Solution)] <- "OptimalCutpoints"
results$Solution[grep(pattern = "proc", x = results$Solution)] <- "pROC"
results$Solution[grep(pattern = "thres", x = results$Solution)] <- "ThresholdROC"
results$task <- "Cutpoint Estimation"
```
```r
# These are the original results on our system
# dput(results)
results <- structure(list(
  time = c(4.5018015, 1.812802, 0.662101, 2.2887015, 1.194301, 4.839401,
           2.1764015, 0.981001, 45.0568005, 36.2398515, 8.5662515, 5.667101,
           2538.612001, 4.031701, 2503.8012505, 45.384501, 43.118751,
           37.150151, 465.003201, 607.023851, 583.0950005, 5467.332801,
           7850.2587, 7339.356101),
  Solution = c("cutpointr", "ROCR", "pROC", "OptimalCutpoints", "ThresholdROC",
               "cutpointr", "ROCR", "pROC", "OptimalCutpoints", "ThresholdROC",
               "cutpointr", "ROCR", "OptimalCutpoints", "pROC", "ThresholdROC",
               "cutpointr", "ROCR", "pROC", "cutpointr", "ROCR", "pROC",
               "cutpointr", "ROCR", "pROC"),
  n = c(100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 10000, 10000,
        10000, 10000, 10000, 1e+05, 1e+05, 1e+05, 1e+06, 1e+06, 1e+06,
        1e+07, 1e+07, 1e+07),
  task = rep("Cutpoint Estimation", 24)),
  row.names = c(NA, -24L), class = "data.frame")
```
```r
# ROCR package
rocr_roc <- function(x, class) {
  pred <- ROCR::prediction(x, class)
  perf <- ROCR::performance(pred, "sens", "spec")
  return(NULL)
}
# pROC package
proc_roc <- function(x, class) {
  r <- pROC::roc(class, x, algorithm = 2, levels = c(0, 1), direction = "<")
  return(NULL)
}
```
```r
n <- 100
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_100 <- microbenchmark::microbenchmark(
  cutpointr::roc(dat, "x", "y", pos_class = 1, neg_class = 0, direction = ">="),
  rocr_roc(dat$x, dat$y),
  proc_roc(dat$x, dat$y),
  times = 100, unit = "ms"
)

n <- 1000
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1000 <- microbenchmark::microbenchmark(
  cutpointr::roc(dat, "x", "y", pos_class = 1, neg_class = 0, direction = ">="),
  rocr_roc(dat$x, dat$y),
  proc_roc(dat$x, dat$y),
  times = 100, unit = "ms"
)

n <- 10000
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_10000 <- microbenchmark::microbenchmark(
  cutpointr::roc(dat, "x", "y", pos_class = 1, neg_class = 0, direction = ">="),
  rocr_roc(dat$x, dat$y),
  proc_roc(dat$x, dat$y),
  times = 100, unit = "ms"
)

n <- 1e5
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1e5 <- microbenchmark::microbenchmark(
  cutpointr::roc(dat, "x", "y", pos_class = 1, neg_class = 0, direction = ">="),
  rocr_roc(dat$x, dat$y),
  proc_roc(dat$x, dat$y),
  times = 100, unit = "ms"
)

n <- 1e6
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1e6 <- microbenchmark::microbenchmark(
  cutpointr::roc(dat, "x", "y", pos_class = 1, neg_class = 0, direction = ">="),
  rocr_roc(dat$x, dat$y),
  proc_roc(dat$x, dat$y),
  times = 30, unit = "ms"
)

n <- 1e7
set.seed(123)
dat <- data.frame(x = rnorm(n), y = sample(c(0:1), size = n, replace = TRUE))
bench_1e7 <- microbenchmark::microbenchmark(
  cutpointr::roc(dat, "x", "y", pos_class = 1, neg_class = 0, direction = ">="),
  rocr_roc(dat$x, dat$y),
  proc_roc(dat$x, dat$y),
  times = 30, unit = "ms"
)

results_roc <- rbind(
  data.frame(time = summary(bench_100)$median,
             Solution = summary(bench_100)$expr, n = 100),
  data.frame(time = summary(bench_1000)$median,
             Solution = summary(bench_1000)$expr, n = 1000),
  data.frame(time = summary(bench_10000)$median,
             Solution = summary(bench_10000)$expr, n = 10000),
  data.frame(time = summary(bench_1e5)$median,
             Solution = summary(bench_1e5)$expr, n = 1e5),
  data.frame(time = summary(bench_1e6)$median,
             Solution = summary(bench_1e6)$expr, n = 1e6),
  data.frame(time = summary(bench_1e7)$median,
             Solution = summary(bench_1e7)$expr, n = 1e7)
)
results_roc$Solution <- as.character(results_roc$Solution)
results_roc$Solution[grep(pattern = "cutpointr", x = results_roc$Solution)] <- "cutpointr"
results_roc$Solution[grep(pattern = "rocr", x = results_roc$Solution)] <- "ROCR"
results_roc$Solution[grep(pattern = "proc", x = results_roc$Solution)] <- "pROC"
results_roc$task <- "ROC curve calculation"
```
```r
# Our results
results_roc <- structure(list(
  time = c(0.7973505, 1.732651, 0.447701, 0.859301, 2.0358515, 0.694802,
           1.878151, 5.662151, 3.6580505, 11.099251, 42.8208515, 35.3293005,
           159.8100505, 612.471901, 610.4337005, 2032.693551, 7806.3854515,
           7081.897251),
  Solution = c("cutpointr", "ROCR", "pROC", "cutpointr", "ROCR", "pROC",
               "cutpointr", "ROCR", "pROC", "cutpointr", "ROCR", "pROC",
               "cutpointr", "ROCR", "pROC", "cutpointr", "ROCR", "pROC"),
  n = c(100, 100, 100, 1000, 1000, 1000, 10000, 10000, 10000, 1e+05, 1e+05,
        1e+05, 1e+06, 1e+06, 1e+06, 1e+07, 1e+07, 1e+07),
  task = rep("ROC curve calculation", 18)),
  row.names = c(NA, -18L), class = "data.frame")
```
```r
results_all <- dplyr::bind_rows(results, results_roc)
ggplot(results_all, aes(x = n, y = time, col = Solution, shape = Solution)) +
  geom_point(size = 3) +
  geom_line() +
  scale_y_log10(breaks = c(0.5, 1, 2, 3, 5, 10, 25, 100, 250, 1000,
                           5000, 1e4, 15000)) +
  scale_x_log10(breaks = c(100, 1000, 1e4, 1e5, 1e6, 1e7)) +
  ylab("Median Time (milliseconds, log scale)") +
  xlab("Sample Size (log scale)") +
  theme_bw() +
  theme(legend.position = "bottom", legend.key.width = unit(0.8, "cm"),
        panel.spacing = unit(1, "lines")) +
  facet_grid(~task)
```
```r
res_table <- tidyr::spread(results_all, Solution, time) %>% 
  arrange(task)
knitr::kable(res_table)
```