knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(ggplot2) library(seqbtests) library(scmamp) library(rNPBST) library(tidyverse)
This Vignette provides a brief and simple introduction to the seqbtests
package. The purpose of the package is to provide a flexible and easy to use
framework to evaluate the performance of benchmark results.
This package implements statistical testing in order to assess statistical differences between the performances of Machine Learning (ML) algorithms for classification. The package and vignette are mainly based on @benavoli2017. Benavoli et al. argue against the use of null hypothesis significance testing, due to some drawbacks, like not accounting for magnitude of the effects or uncertainty in the estimates. Further, the estimation of the p-value depends on sampling intentions and the sample size. Instead they are promoting the use of Bayesian testing.
The seqbtests
package provides both, frequentist and Bayesian, tests for
comparison of algorithms on one or multiple data sets. As a new element, the
package offers a sequential approach to Bayesian testing. Since, evaluation of
ML algorithms can be quite time consuming, the approach offers a new possibility
to optimize the evaluation process and make it more effective. After each
iteration of evaluation, the sequential tests compare the estimated posterior
probabilities to a defined threshold and stop further testing, if a decision
based on the probabilities can be made. The package offers the possibility to
evaluate the algorithms performance, while simultaneously testing the results,
thus, in the case of early stopping this procedure can considerably save
evaluation time.
This vignette is divided in four parts. The first introduces the data structure that needs to be provided for testing. The second and third part respectively describe the included frequentist and Bayesian tests, as well as their graphical visualization. The last section introduces the sequential approach to Bayesian testing.
For all tests and additional functions that are provided in the package, the data must be in a particular format. The following table shows how such a data frame should look like. The specified columns are mandatory.
| problem | algorithm | replication | measure_* | |:------:|:------:|:------:|:------:| | problem_a | algo_1 | 1 | 0.4150943 | | problem_a | algo_2 | 1 | 0.7240566 | | problem_a | algo_1 | 2 | 0.4292453 | | problem_a | algo_2 | 2 | 0.7240566 | | problem_a | algo_1 | 3 | 0.4056604 | | problem_a | algo_2 | 3 | 0.7476415 | | ... | ... | ... | ... |
For the sake of simplicity, in the following, the data frame containing the
evaluation data and used to test the algorithms against one another is
referred to as data frame, while the data sets used to evaluate the classifiers
performance are referred to as problem sets and are stored in the problem
column.
The algorithm
column contains the names of each algorithm. replication
describes the number of the replication, i.e. the iteration of performance
evaluation. The performance measures are displayed in the
column named measure_*
. The data format allows including more than one
performance measure in the data frame by setting a specification after
measure_
. For example one can include measure_acc
and measure_f1
for
accuracy and F1 Score, the harmonic mean of precision and recall. However, for
each test or plot one specific measure column has to be defined since it is not
possible to compare multiple measures at once.
The package offers the advantage that there is no need to provide a complete
data set when using the sequential tests. The user can define the function
get_replication
to build the data frame. To be able to save time when
evaluating the algorithms the sequential tests are built in loops. In each loop
the tests calls a new iteration through get_replication
to test the algorithms
performances. Each run of evaluation is only called if the predefined threshold
that determines if an algorithm works better than the competing one is not
reached, and further evaluation needs to be done. Thus, if a decision based on
the estimated posterior probabilities can be made, the evaluation for this
algorithm is stopped, which can save the user a lot of time.
To run the tests the data frame must not contain any NAs. If a complete data
frame is used, the package provides a function to search for and delete groups
containing NAs based on either algorithm
or problem
.
data_with_na <- test_benchmark_small data_with_na$measure_col[1] <- NA na_check(df = data_with_na, check_var = "algorithm")
For each value of the defined check_var
(either problem
or algorithm
)
na_check
shows the number and ratio of NAs in the measure
column. To perform
any of the tests for each problem the same number of algorithms and replications
must be included in the data frame. Thus, when comparing the results of
na_check
one can decide rather to drop the
specific values of either problem
or algorithm
. To remove one specific group
of values the package provides the function na_drop
. Via the same check_var
the user can specify to either drop all values containing NAs for either
problem
or algorithm
. It is sensible to select the variable that the least
observations are lost by.
complete_data <- na_drop(df = data_with_na, check_var = "algorithm")
To test if the data frame is in the right format, one can check the
structure with check_structure
. If the data provided by the user is
in the specified format, the function returns TRUE. If there are any problems
concerning the format, a corresponding error message will be issued.
check_structure(complete_data)
When the frequentist or Bayesian tests are run, the correct format and whether all necessary columns are available is checked internally.
\pagebreak
The package provides data that show examples for performance measures of different algorithms evaluated on a number of different problem sets. They are used to introduce the package functions.
data("test_benchmark_small")
To assess statistical differences between the performance measures obtained by different algorithms the package provides several frequentist hypothesis tests.
One can differentiate between parametric and non-parametric tests. The advantage of non-parametric tests is that no assumptions need to be made. If the assumptions for the parametric tests hold, those tests are more powerful, if, however, the assumptions do not apply one should rather use non-parametric tests since in this case they are more powerful.
Assumptions:
Before testing the differences one should always have a further look on the
algorithms by visualizing their aggregated performance on all problem sets with plot_densities
.
plot_densities(test_benchmark_small)
The function plots the density of all algorithms performances in the considered
data frame. As an argument the user has to define the data frame (df
). If the
data frame contains more than one measure column the user should define which
column to use (per default the first column is used). Based on the densities one
can already determine whether an algorithm is likely to be normally distributed.
For further visualization of the data the mean performance of the algorithms can
be observed using the function plot_boxplot
.
plot_boxplot(df = test_benchmark_small)
The box plots show the mean performances of the algorithms across all problem sets in the data frame. Thus, it's easy to observe strong outliers.
After the visualization, statistical differences in the performances of the algorithms can be tested by three frequentist tests implemented in the package, namely, the correlated t-test and the non-parametric Friedman and Wilcoxon signed ranks test. The null hypothesis for these tests states that there are no differences among the algorithms, thus all algorithm score the same performance.
| 2 algorithms x 1 problem set | 2 algortihms x n problem sets | n algorithms x n problem sets| |:------:|:------:|:------:| | correlated t-test | Wilcoxon signed rank test | Friedman test | | | Nemenyi pos-hoc test | |
The correlated t-test is an extension of the common t-test that additionally takes into account that the algorithms performances are evaluated on overlapping trainings sets and are therefore correlated. It is used, for the analysis of performance results on a single problem set.
results_corr <- corr_t_test(df= test_benchmark_small, problem = "problem_a", baseline = "algo_1", algorithm = "algo_2")
results_corr$data_frame
The common practice in frequentist hypothesis testing is to assume a significance level of $\alpha = 0.05$. If the calculated p-value is below this threshold ($p \le \alpha$) the accuracy of the considered algorithms is significantly different in the specific problem set and the null hypothesis can be rejected.
For the test a baseline algorithm needs to be defined (baseline
). The baseline
is tested either against another defined algorithm (algorithm
) or, if no
algorithm
is defined, against all other algorithms in the data frame. Note
that, for the frequentist hypothesis tests, one has to take multiple testing
problems into account and may adjust the significance level via correction
procedures such as Bonferroni correction.
The Wilcoxon signed ranks test is recommended by @demsar2006 for the comparison of two classifiers on multiple problem sets. It is a non-parametric test that ranks and compares the performances of two classifiers, based on their ranks.
results_wilcoxon <- wilcoxon_signed_test(df = test_benchmark, baseline = "algo_1", algorithm = "algo_2", problem = "problem_a")
results_wilcoxon$data_frame
The null hypothesis states that the algorithms performances on the considered problem set are the same. If the computed p-value is smaller than 0.05 the null hypothesis can be rejected, i.e. there are significant differences among the compared algorithms.
The Friedman test is a non-parametric test that checks whether differences among the algorithms appear. It has been recommended by @demsar2006 together with the corresponding Nemenyi post-hoc test for pairwise comparison of classifiers among multiple problem sets. The Friedman test checks if there is one or multiple algorithms that perform significantly different from the others. It is the non-parametric equivalent to the ANOVA test.
results_friedman <- friedman_test(test_benchmark)
results_friedman$data_frame
The null hypothesis states that all algorithms are equal. The p-value shows if the null hypothesis can safely be rejected, i.e. revealing that there are significant differences among the algorithms. In case of rejection one can proceed with a post-hoc test.
The post-hoc tests are pairwise comparison of all algorithms. When there are significant differences among the algorithms a post-hoc test finally shows among which algorithms the differences appear.
If the null hypothesis that states that all algorithms are equal is rejected one can proceed with a post-hoc test. This package provides one post-hoc test, the Nemenyi post-hoc test, which is recommended by @demsar2006.
The Nemenyi test is close to the Tukey test for the ANVOA, it compares each algorithm to one another. If the average ranks of two algorithms differ by at least the value of Critical Difference (CD) the two algorithms are significantly different.
results_nemenyi <- nemenyi_test(test_benchmark_small)
results_nemenyi$data_frame results_nemenyi$matrix
The performance of all algorithms can be graphically visualized via the Critical Differences (CD) plots, introduced by @demsar2006.
plot_cd(test_benchmark_small)
In the plot the compared algorithms are lined up according to their ranks. Algorithms that do not differ significantly are connected by horizontal lines.
It is possible that the Friedman test finds significant differences that post-hoc tests are not able to detect. This case happens quite rarely and is due to the post-hoc tests lower power. When this case occurs one can only state that there significant differences between the algorithms appear.
\pagebreak
While in many other scientific fields a shift from frequentist tests towards Bayesian testing has taken place for some time now, the analysis of benchmark results is still based on null hypothesis significance testing (i.e. frequentist analysis). This package is based on the paper of @benavoli2017, which proposed the use of Bayesian tests instead of frequentist tests for comparing the performance of machine learning algorithms. They emphasize the many advantages of the Bayesian approach:
While the null hypothesis of the frequentist test states that there are no differences between the algorithms performances, this hypothesis can only be rejected. In Bayesian testing further information is available by querying the posterior distribution. First, a prior distribution is set up that expresses the prior beliefs about the parameters of interest, when no data is yet taken into account. The posterior distribution is calculated using the prior distribution about a parameter and a likelihood model providing actual information about the given parameter through the observed data. Using the posterior distribution different probabilities can be computed: the probability that the baseline algorithms is better than the compared algorithm (P(baseline >> algorithm)), the probability that the algorithm performs better than the baseline (P(baseline << algorithm)). Additional, an interval around the null value is established to be able to tell if two algorithms are practically equivalent (P(baseline = algorithm)). This interval is referred to as the ROPE (Region of Practical Equivalence). Usually two algorithms are practically equivalent if the mean differences of accuracy are less than 1%.
The package provides four Bayesian tests, which have been summarized and recommended in @benavoli2017. The Bayesian correlated t test, is used for the comparison of two algorithms on one data set. The other three tests, namely the Bayesian sign test, the Bayesian signed rank test and the Bayesian hierarchical correlated t-test are used to test algorithms on multiple problem sets.
| 2 algorithms x 1 problem set | 2 algorithms x n problem sets | |:------:|:------:| | Bayesian correlated t-test | Bayesian sign test | | | Bayesian signed rank test | | | Bayesian hierarchical correlated t-test |
The Bayesian correlated t-test compares the performance of two competing
classifiers on a single problem set. It is the Bayesian counterpart to the
frequentist correlated t-test and thus takes correlation due to overlapping
trainings data into account. Like the frequentist test, the
Bayesian counterpart is a parametric test, which assumes Gaussian distribution
of the data. If the correlation, denoted by rho
, is 0 the Bayesian correlated
t-test and the frequentist correlated t-test are numerically equivalent. Yet,
the approaches inferences are different. Thus, the interpretation of the same
numerical values differs. The default value for rho
is 0.01.
results_b_corr <- b_corr_t_test(df= test_benchmark_small, problem = "problem_a", baseline = "algo_1", algorithm = "algo_2")
results_b_corr$data_frame
The correlation rho
has to be between 0 and 1, the variance needs to be > 0.
The baseline algorithm can be tested either against a specific algorithm, or
against all algorithms, if no specification for algorithm
is made. The test
returns a data frame with each row containing the results of one comparison. The
column labeled algorithm
displays the algorithms names that are tested against
the baseline. The posterior probabilities are shown in the columns left
,
rope
and right
.
The column labeled left
defines the posterior probability of the mean
differences of accuracy being below the bound of the ROPE
(P(baseline << algorithm)). rope
determines the posterior probability of the
classifiers being practically equivalent (P(baseline = algorithm)). Thus, the
right
column shows the probability of the mean differences being above the
upper bound (P(baseline >> algorithm)).
The column probabilities
returns the
decisions that can be made based on the posterior probabilities. The decisions
are made according to the probability (prob
) defined by the user (per default
95% is assumed). They also depend on whether the user has determined that the
algorithms should be tested to perform either better than or at least as good as
the other. This can be defined by the argument compare
by specifying either
better
or equal
. This is especially important when it comes to comparing two
almost similar algorithms, which differ in their computational effort. In this
case it can often be helpful to know that the computationally less complex
algorithm performs at least as well as the other one. If no decision could be
made the column states no decision
for the specific algorithm.
The decisions based on the posterior probabilities can be outlined as follows:
| Decisions | |:------:| | $algo_1 > algo_2$ if right probability > 0.95 | | $algo_1 = algo_2$ if rope probability > 0.95 | | $algo_1 < algo_2$ if left probability > 0.95 | | $algo_1 \ge algo_2$ if right + rope probability > 0.95 | | $algo_1 \le algo_2$ if left + rope probability > 0.95 |
The posterior probability can be plotted by the function plot_posterior
.
plot_posterior(results_b_corr, method = "b_corr_t_test")
To plot the posterior distribution the user needs to define the applied Bayesian
test (method
). In addition, the test results that should be visualized needs to
be labeled. Even though, the Bayesian tests allow the comparison of multiple
algorithms against the baseline, to plot the results one algorithm needs to be
defined. The plot visualizes the posterior probabilities, as well as the ROPE.
P(baseline = algorithm) is the integral of the posterior distribution in-between
the ROPE, which is defined by the two vertical lines. The width of the plots
corresponds to the uncertainty in the data. In case of high uncertainty, the
posterior has a larger width and encloses more parameter values. With decreasing
uncertainty, the plot becomes more narrow.
The probability mass that is in the interval (-$\infty$, -0.01) represents the probability of algo_2 being better than algo_1. For the probability mass in the interval (0.01, $\infty$) this equals the probability of algo_1 being better than algo_2. The mass falling in the interval (-0.01, 0.01) can be interpreted as the probability of both algorithms being practically equivalent. If all the probability mass is inside the ROPE one can conclude that the Baseline and compared algorithm are practically equivalent.
The Bayesian sign and signed ranks tests compare classifiers on multiple problem sets. They are respectively the Bayesian counterparts to the non-parametric sign and Wilcoxon signed ranks test and assume a Dirichlet Process (DP) prior. Different than the Bayesian and frequentist correlated t-test these tests don't take correlation into account. However, due to their non-parametric nature, they do not assume normality of the samples mean, are robust regarding outliers and do not assume commensurability.
The Bayesian sign test is less powerful than the Bayesian signed rank test, since the signed ranks test does not only take the sign into account, but also the magnitude of differences. It ranks the absolute values of differences and compares these ranks, while considering the signs.
results_b_sign <- b_sign_test(df= test_benchmark_small, problem = "problem_a", baseline = "algo_1", algorithm = "algo_2")
results_b_sign$data_frame
results_b_signed <- b_signed_rank_test(df= test_benchmark_small, baseline = "algo_1", algorithm = "algo_2")
results_b_signed$data_frame
The results are displayed and can be interpreted identically as for the Bayesian correlated t-test.
\pagebreak
The graphical visualization can be called the same way as for correlated
t-tests by plot_posterior
, while adjusting the applied method
.
plot_posterior(results_b_sign, method = "b_sign_test") plot_posterior(results_b_signed, method = "b_signed_rank_test")
Different to the correlated t-test the plot now corresponds to a large triangle, which is divided in three different regions. The posterior samples are represented by the cloud of points in the triangle. The regions correspond to three probabilities. If all the points fall in one region, one can conclude that such hypothesis is true, with a probability of almost 100%.
\pagebreak
The Bayesian hierarchical correlated t-test is an extension of the correlated t-test that works for the comparison of classifiers on multiple problem sets and still takes correlation into account. For every problem set a multivariate normal distribution is assumed for the means. The aim is to estimate the mean difference of accuracies on all considered problem sets. In Bayesian analysis hierarchical models are particularly powerful and flexible. Computations are obtained by Markov-Chain Monte Carlo sampling.
results_b_hierarchical <- b_hierarchical_test(df= test_benchmark_small, baseline = "algo_1", algorithm = "algo_3", rho=0.1, rope=c(-0.01, 0.01))
The hierarchical approach is recommended over the Bayesian signed test by @benavoli2017. Yet, it is quite time consuming, which is inconvenient if tests needs to be run often.
The graphical visualization shows a similar triangle plot as the Bayesian sign and signed rank test.
plot_posterior(results_b_hierarchical, method = "b_hierarchical_test")
\pagebreak
While all the former tests have been implemented in similar forms in other packages, this package is trying to optimize the process of performance evaluation. The procedure of assessing the algorithms performances mostly is time-consuming. To counteract this issue this package provides a sequential approach for comparison while using Bayesian tests.
In sequential testing, the sample size is not fixed in advance, but data are evaluated as they are collected. An optional stopping rule is applied, that terminates further evaluation and testing, if a pre-defined level of evidence or a maximum number of evaluations is reached.
The stopping rule can be outlined as follows:
The threshold probability can be defined by the user by adjusting prob
. If
either of the stopping rules applies, further testing is stopped. If the
baseline is compared to multiple algorithms evaluation and testing is only
stopped for the specific algorithm, for which the stopping rule is reached.
The procedure continues for the other algorithms.
Thus, the sequential procedure can be outlined as follows:
To control the evaluation scope, the user can specify the maximum number of
replications with max_repls
. If a complete record is provided max_repls
should correspond to the number of replications in the record.
Two main issues need to be considered for sequential testing, namely multiple testing problems and bias through early stopping. In frequentist hypothesis tests the probability of incorrectly rejecting the null hypothesis increases (Type 1 error). Even though Bayesian testing does not rely on Type 1 errors, multiple testing should be taken into account. @gelman2012 argues that multiple testing problems can be mitigated by using multilevel modeling, to achieve more reliable estimates, and by using the ROPE, which decreases the asymptotic false alarm rate.
Extensive statistical experiments have shown that a possible bias caused through
early stopping can be mitigated by setting a minimum number of replications,
that need to be evaluated before the optional stopping rule is activated. The
user can determine this number by min_repls
. The default for min_repls
is
five replications, since simulation studies have shown that the number is
sufficient to achieve valid results.
results_seq_corr <- seq_b_corr_t_test(df = test_benchmark_small, rho=0.1, problem = "problem_c", baseline = "algo_1", compare = "better", max_repls = 10)
\pagebreak
results_seq_corr$data_frame
results_seq_sign <- seq_b_sign_test(df = test_benchmark_small, baseline = "algo_1", max_repls = 10) results_seq_ranks <- seq_b_signed_rank_test(df = test_benchmark_small, baseline = 'algo_1', max_repls = 10) results_seq_hierarchical <- seq_b_hierarchical_test(df = test_benchmark_small, baseline = 'algo_1', algorithm = "algo_3", min_repls = 8, max_repls = 10)
The results of the sequential tests are structured and can be interpreted the
same way as the non-sequential tests. The output also specifies the
number of replications that were considered. If no decision concerning the
probabilities could be made, the replications were examined up to max_repls
.
To be able to save time through sequential testing, no complete data set needs
to be provided, but the data can be evaluated within the tests. The user can
define the function get_replication
to build each replication of evaluation.
In each loop of the tests, a new replication is generated and the data is
tested.
If the user provides a complete data frame, there is no need to define
get_replication
but the default function can be used. The complete data frame
can be called by the tests argument df
.
In the following an exemplary function for get_replication
is built to replace
the default function. The default function needs to be unlocked in order to
lock the replacement.
sepcific_get_replication <- function(i){ set.seed(1234) data <- data.frame(algorithm = rep(c("algo_a", "algo_b"), length = i*2), measure_accuracy = rnorm(n = i*2, mean = c(0.4, 0.7), sd = 0.05)) data$problem <- "problem_1" for (i in unique(data$algorithm)) { data$replication[data$algorithm == i] <- seq_len(sum(data$algorithm == i)) } return(data) } unlockBinding("get_replication", as.environment("package:seqbtests")) assignInNamespace("get_replication", sepcific_get_replication, ns="seqbtests", envir=as.environment("package:seqbtests")) assign("get_replication", sepcific_get_replication, envir=as.environment("package:seqbtests")) lockBinding("get_replication", as.environment("package:seqbtests")) results <- seq_b_corr_t_test(problem = "problem_1", baseline = "algo_a", algorithm = "algo_b")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.