Since their very beginning, the use of statistical tests has been surrounded by controversy. For decades statisticians have highlighted the weak points of these tools and, in recent years, concerns about their misuse in scientific work have been added. Indeed, in different domains researchers have started to move from classical statistical tests to other methods that provide further insight into the data analysed.
As its title suggests, with @benavoli2017 this recommendation has also reached the empirical comparison of algorithms (in particular, classification algorithms). That paper proposes several Bayesian approaches to compare pairs of classifiers both on individual datasets and across groups of datasets.
We have included the methods proposed in @benavoli2017 in scmamp, and this tutorial will show you how to make use of them. The code included is an adaptation of that provided by the authors of the paper, which is available on GitHub.
As an example, we have also included in the package a set of results that can be used to test the methods, so first we will load the required packages and the data.
library("scmamp") library("ggplot2") data(data_kcv_example)
The example dataset contains the AUC values obtained by four algorithms (AlgA to AlgD) in all the folds of a 10-times 10-fold cross-validation on 10 different datasets. The first three columns identify the dataset, the repetition and the fold, respectively.
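If you want a quick look at how the data are organized, standard R inspection (nothing specific to scmamp) works on the loaded data frame:

head(data.kcv.example)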
Before we start with the analysis, we will explore the data to see how the samples are distributed. For that, we can use the plotDensities function (remember that this function makes use of the ggplot2 package, @wickham2009, and thus you can further manipulate the output to change the look of the plot).
algorithms <- names(data.kcv.example)[4:7]
db <- 5
plotDensities(data=data.kcv.example[data.kcv.example$DB==db, algorithms], size=1.1)
The code above plots the densities of the four algorithms in the fifth dataset. The last three algorithms have densities that could be roughly considered normal, but we can see that the density of the first algorithm, AlgA, is quite skewed. You can check the rest of the densities by modifying the db variable.
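For example, to inspect another dataset you only need to change db and rerun the plotting call; a quick sketch for dataset 9, which we will come back to later as an example of non-normal results:

db <- 9
plotDensities(data=data.kcv.example[data.kcv.example$DB==db, algorithms], size=1.1)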
Before going into the details, in this section we will briefly review the basic idea behind the Bayesian approach followed here. For further information you can check @benavoli2017.
Although in Bayesian inference it is possible to test hypotheses, in this package we have included the estimation approach presented in @benavoli2017, where the methods model (or approximate) the posterior probability density function of the parameters. Then, this posterior is used to compute/estimate different probabilities of interest.
In particular, we will be interested in answering the question: is Algorithm A better than, equal to, or worse than Algorithm B? Of special interest is the concept of "equal", which is closely related to the null hypothesis in hypothesis testing. However, from a Bayesian point of view, a strict definition of equality is useless: for any continuous parameter, the probability of it taking exactly the same value in two populations is, in practical terms, zero.
To solve this problem, and in order to provide a sensible definition of equality, in @benavoli2017 the authors recommend using the concept of rope (region of practical equivalence), which is simply a segment of the possible values of the parameter where we regard both algorithms as equal. In all the cases covered in this vignette, the parameter of interest is the performance difference between two algorithms, which typically ranges between -1 and 1. In such a context, a possible rope would be [-0.01, 0.01], meaning that any difference smaller than 0.01 in magnitude is regarded as irrelevant. Of course, this rope can be adapted to the particularities of any application.
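As a toy illustration of the idea (not part of the package; the simulated differences below are purely hypothetical), the three quantities we will keep coming back to are just the posterior mass below, inside and above the rope:

# Hypothetical posterior samples of the performance difference
rope <- c(-0.01, 0.01)
diffs <- rnorm(5000, mean=0.02, sd=0.03)
# Probability mass to the left of, inside, and to the right of the rope
c(Left=mean(diffs < rope[1]),
  Rope=mean(diffs >= rope[1] & diffs <= rope[2]),
  Right=mean(diffs > rope[2]))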
The most basic situation is the comparison of two given algorithms on a single dataset. When the samples come from a cross-validation, we face two problems. The first is that the samples are not independent, due to the cross-validation scheme. The second is the assumption of a parametric distribution for the data: the only way we have to account for the correlation in the analysis is to assume that the data follow a Gaussian distribution. In that case, we can make use of the Bayesian equivalent of the correlated t-test proposed in @nadeau2003.

If we cannot assume normality for the data, then we need to move to a non-parametric alternative, after getting rid of the correlation by averaging the results of the cross-validation. In the following subsections we will explore these alternatives.
Let us assume that we regard the densities plotted above as "reasonably" normal. We can perform the comparison of any two algorithms (say, AlgA and AlgB) using the Bayesian alternative to the correlated t-test, implemented in the function bCorrelatedTtest, as follows:
db <- 5
sample.a <- data.kcv.example[data.kcv.example$DB==db, "AlgA"]
sample.b <- data.kcv.example[data.kcv.example$DB==db, "AlgB"]
results <- bCorrelatedTtest(x=sample.a, y=sample.b, rho=0.1, rope=c(-0.01, 0.01))
results$posterior.probabilities
The function returns a list with a number of elements. The most important one is posterior.probabilities, which contains the probability of the difference between the two algorithms (x-y) being below the lower bound of the rope (column labelled Left), inside the rope (column labelled Rope) and above the upper bound of the rope (column labelled Right). In this particular case we can see that most of the probability mass (round(results$posterior.probabilities["Right"], 3)) is in the right region, meaning that we can be quite certain that Algorithm A is better (in terms of performance) than Algorithm B.
The result of this function also includes the posterior density function of the average difference (results$posterior), so we can use it to visually assess our uncertainty about the conclusion. For that, you can use the function plotPosterior as follows:
plotPosterior(results, plot.rope=TRUE)
For this particular case, the relevant parameters of the function are plot.rope, which indicates whether the rope should be plotted, and num.points, which indicates the number of points used to plot the function (1000 by default).
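For instance, a smoother version of the same plot can be obtained by increasing num.points (just a usage sketch with the same results object):

plotPosterior(results, plot.rope=TRUE, num.points=2000)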
In the plot we can see that there is considerable uncertainty about the difference (most of the density spreads between 0.5 and 2.5), although it is quite clear that the difference is above 0.01. In other cases the decision may not be so clear, as in the following example.
db <- 5
sample.a <- data.kcv.example[data.kcv.example$DB==db, "AlgC"]
sample.b <- data.kcv.example[data.kcv.example$DB==db, "AlgD"]
results <- bCorrelatedTtest(x=sample.a, y=sample.b, rho=0.1, rope=c(-0.01, 0.01))
results$posterior.probabilities
Now there is not a clear answer; the uncertainty is too high to conclude anything. We can, again, see this graphically:
plotPosterior(results, plot.rope=TRUE)
This is one of the differences with respect to classical frequentist analysis. With a statistical test we can only say either that there are differences or that there is not enough evidence to say so; with the Bayesian analysis of the results we can have high certainty about a given answer, or too much uncertainty to conclude anything. Moreover, we can have high certainty about several different situations: for instance, that one algorithm is better than the other, that both are practically equivalent (the probability mass concentrates inside the rope), or that one algorithm is not worse than the other (the mass is split between the rope and one of the outer regions).
In the previous subsection we have assumed that the difference in performance follows a Gaussian distribution. In some cases this is hardly true (e.g., plot the samples for datasets 9 or 10) and, thus, we should move to a non-parametric approach. The downside is that we cannot account for the correlation when no particular parametric distribution is assumed. Instead, we have to get rid of it by averaging the values obtained in the k folds. Unfortunately, this also reduces the sample size and, thus, the reduction of the uncertainty will be smaller.
As an example, we will analyse the results obtained with the ninth dataset. As a first step, we have to average the results of the 10 folds.
db <- 9
summarized.data <- aggregate(data.kcv.example[, algorithms], by=data.kcv.example[, 1:2], FUN=mean)
sample.a <- summarized.data[summarized.data$DB==db, "AlgC"]
sample.b <- summarized.data[summarized.data$DB==db, "AlgD"]
Now we can proceed with the analysis. In this case, as we have mentioned, we will apply the alternative to the signed-rank test presented in @benavoli2017. This analysis is accessible through the function bSignedRankTest:
results <- bSignedRankTest(x=sample.a, y=sample.b, rope=c(-0.01, 0.01))
results$posterior.probabilities
As can be seen from the results, we can be quite confident that, in the ninth dataset, Algorithms C and D have an almost identical behaviour (i.e., the differences are almost surely inside the rope). Actually, the interpretation in this case is somewhat different from the previous example, as now the estimation is based on sampling (see @benavoli2017 for more details). In particular, the method samples the probabilities of being below, inside or above the rope, as in the previous example. These probabilities are also reported by the function, in the posterior element of the list (now we do not have a parametric function to represent the posterior distribution).
head(results$posterior)
From these results we can get the expected probability of each region simply by averaging the columns:
colMeans(results$posterior)
Since we have the distribution of the triplets of probabilities, we can in this case compute a different probability, namely, the probability of each region being the one with the highest probability. That is, for each sample we can identify which region is the most probable one and then estimate the probability of the right, rope and left segments being the most probable; these are the probabilities collected in results$posterior.probabilities.
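The following sketch reproduces that computation by hand (assuming, as the output of head(results$posterior) suggests, that results$posterior has one column per region):

# For each sampled triplet, find the most probable region ...
most.probable <- colnames(results$posterior)[apply(results$posterior, 1, which.max)]
# ... and estimate how often each region is the most probable one
table(factor(most.probable, levels=colnames(results$posterior))) / nrow(results$posterior)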
In order to clarify these probabilities, we can show them graphically using a Simplex plot. You can produce it using the plotSimplex function.
plotSimplex(results, A="Algorithm C", B="Algorithm D")
This plot represents all the samples obtained from the posterior distribution of the probabilities of the three segments (i.e., the triplets in results$posterior). The plot is in barycentric coordinates, with each vertex corresponding to one of the columns of the results$posterior matrix. As can be seen, the triangle is divided into three equal areas, one per vertex. All the points inside a given area are closer to the corresponding vertex than to the other two, which in turn means that, for those points, the region with the highest probability is the one associated with that vertex.

The plot, therefore, represents the empirical distribution of the triplets, while the probabilities in results$posterior.probabilities correspond to the probability of the triplet falling into each area.
In this particular case the expected probability of falling in the rope is the highest, but the expected probability of falling in the right region is not negligible. However, there is very little uncertainty about the rope being the region with the highest probability (results$posterior.probabilities["Rope"]) and, thus, we can conclude that both algorithms are practically equal in this dataset.
If we repeat the analysis with the results of the eighth dataset the conclusions are different:
db <- 8
summarized.data <- aggregate(data.kcv.example[, algorithms], by=data.kcv.example[, 1:2], FUN=mean)
sample.a <- summarized.data[summarized.data$DB==db, "AlgC"]
sample.b <- summarized.data[summarized.data$DB==db, "AlgD"]
results <- bSignedRankTest(x=sample.a, y=sample.b, rope=c(-0.01, 0.01))
results$posterior.probabilities
colMeans(results$posterior)
plotSimplex(results, plot.density=FALSE, A="Algorithm C", B="Algorithm D", posterior.label=TRUE)
In this case the expected probabilities for the rope (round(colMeans(results$posterior)["Rope"], 3)) and the right (round(colMeans(results$posterior)["Right"], 3)) regions are quite similar and, as a consequence, there is more uncertainty about which region has the highest probability (the example above also shows how to include this information in the plot, through the posterior.label argument). Therefore, in this case there is more uncertainty about C being better than or equal to D, but we can conclude that Algorithm C is not worse than Algorithm D.
In the previous section we have seen that the conclusions we can draw may differ from dataset to dataset. In some cases it may be of interest to know the behaviour of the algorithms in each dataset, but quite often we are looking for a global answer to the question of which algorithm (if any) is better. We can answer this question by comparing the results obtained in all the datasets at the same time.
The particularity of this analysis is that, even if we can assume normality for the cross-validated results within each dataset, merging all of them together makes no sense. For that reason, in @benavoli2017 the authors propose two methodologies: a non-parametric one (the method presented at the end of the previous section) and a hierarchical model that allows us to take into account, separately, the differences inside each dataset and among datasets, as well as the correlation due to the cross-validation. In the following subsections we will show you how to run the proposed analyses.
Using and interpreting the Bayesian equivalent of the signed-rank test is very similar to what we did in the previous section, with the exception that now we will average the results for each dataset and use these averages as the samples for the analysis.
First, we produce the summaries of the data:
summarized.data <- aggregate(data.kcv.example[, algorithms], by=data.frame(DB=data.kcv.example[, 1]), FUN=mean)
sample.a <- summarized.data[, "AlgC"]
sample.b <- summarized.data[, "AlgD"]
Now we can proceed with the analysis using the function bSignedRankTest:
results <- bSignedRankTest(x=sample.a, y=sample.b, rope=c(-0.01, 0.01))
results$posterior.probabilities
From the results we can see that, according to the 10 datasets in our data, either Algorithms C and D are equivalent or C is better than D; it is certainly very unlikely that Algorithm D outperforms Algorithm C. However, there is too much uncertainty to be sure that C outperforms D. We can see this graphically:
plotSimplex(results, A="Algorithm C", B="Algorithm D", plot.density=FALSE, alpha=0.5)
Conversely, if we compare algorithms A and B, the conclusion is quite clear.
summarized.data <- aggregate(data.kcv.example[, algorithms], by=data.frame(DB=data.kcv.example[, 1]), FUN=mean)
sample.a <- summarized.data[, "AlgA"]
sample.b <- summarized.data[, "AlgB"]
results <- bSignedRankTest(x=sample.a, y=sample.b, rope=c(-0.01, 0.01))
results$posterior.probabilities
We can visually assess the uncertainty using the Simplex plot:
plotSimplex(results, A="Algorithm A", B="Algorithm B", plot.density=FALSE, alpha=0.5)
In the plot we can clearly see that Algorithm A is better than Algorithm B. Moreover, all the points lie on the rope-A edge of the triangle, meaning that the probability of Algorithm B being better than A is very small. We can verify this with the expected probabilities, computed as in the previous section:
colMeans(results$posterior)
In the analyses above we used the averaged values for each dataset. We did so to get rid of the correlation between samples. In this section we will use a hierarchical model that allows us to consider each observation individually, modelling the correlation within each dataset.
Briefly, the model assumes that the samples from each dataset follow a Gaussian distribution with a certain mean and variance. The priors for these parameters are a Student's t distribution for the means (i.e., the model assumes that all the individual means are independent samples from a Student's t distribution) and a uniform distribution for the variances. The model goes further up in the hierarchy, modelling the priors for the parameters of the Student's t distribution. For further information, please see @benavoli2017.
We can use this model through the function bHierarchicalTest. In this case, as we are not averaging the results and each dataset is modelled separately, we need to build two sample matrices where each row corresponds to a dataset. Let us prepare the data to compare Algorithms C and D. Note that, in our data, the results are ordered by dataset; if the arrangement of your data is different, you will need to build the matrices in a different way (a possible alternative is sketched after the code below).
sample.a <- matrix(data.kcv.example$AlgC, byrow=TRUE, nrow=10)
sample.b <- matrix(data.kcv.example$AlgD, byrow=TRUE, nrow=10)
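If your results are not already sorted by dataset, a more arrangement-agnostic construction could look like the following sketch (not part of the vignette; it assumes every dataset has the same number of rows and that the rows within a dataset are already in the desired repetition/fold order):

# Build one row per dataset by explicitly selecting its rows
dbs <- unique(data.kcv.example$DB)
sample.a <- t(sapply(dbs, function(d) data.kcv.example$AlgC[data.kcv.example$DB == d]))
sample.b <- t(sapply(dbs, function(d) data.kcv.example$AlgD[data.kcv.example$DB == d]))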
Now we proceed with the analysis. As in the non-parametric approach, we cannot analytically compute the posterior of the parameters and, thus, we need to simulate (sample) it. In this case, due to the complexity of the model, we need to use MCMC methods to obtain the samples. The function we will use is based on a Stan program, so you need to have the rstan package installed. The sampling can be quite slow, depending on the problem, so be patient ...
results <- bHierarchicalTest(sample.a, sample.b, rho=0.1, rope=c(-0.01, 0.01), nsim=2000, nchains=5)
As we can see, in this function we again have the rho parameter, which is the correlation factor used in the model of each dataset. Additionally, we have two parameters, nsim and nchains, which control the MCMC simulation: the first is the number of samples drawn in each chain and the second is the number of chains used. By default, half of the samples in each chain are used as burn-in (warmup in the output of the Stan program). Therefore, the number of samples we finally obtain is nchains*nsim/2, 5000 in our example.
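As a quick sanity check (this assumes, as in the signed-rank case, that the retained samples are stored in results$posterior; the element name could differ), the number of rows should match 5 * 2000 / 2 = 5000:

nrow(results$posterior)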
To assess the results, we can directly plot the samples on the Simplex, together with the probabilities associated with each region.
plotSimplex(results, A="Alg. C", B="Alg. D", posterior.label=TRUE, alpha=0.5)
In the plot above we can see that we cannot be sure whether Algorithm C is equal to or better than D, but it is quite certainly not worse.
In addition to this global information, we can analyse the results per dataset. This information is contained in the additional element of the results list.
results$additional$per.dataset
The results shown in the table above are similar to those obtained with the Bayesian version of the correlated t-test (plus the expected mean difference under the posterior distribution). However, there is a subtle but important difference: in the hierarchical model the ten mean differences are assumed to come from a common, higher-level distribution. The consequence is a shrinkage in the estimation of the per-dataset mean differences (see @benavoli2017 for more details).
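To see the shrinkage effect informally, you could compare the raw per-dataset mean differences with the estimates reported by the model (a sketch under the assumption, to be checked against the actual output, that results$additional$per.dataset contains one row per dataset with the estimated mean difference):

# Raw mean difference (C - D) per dataset, directly from the cross-validation results
raw.diff <- rowMeans(sample.a - sample.b)
raw.diff
# Compare with the (shrunk) estimates reported by the hierarchical model
results$additional$per.dataset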