One of the latest additions to the package is a Bayesian model to analyse rankings of algorithms (@calvo2018; @calvo2019). In this vignette we give a brief introduction to this model and some basic ideas about the kind of inference we can do with it.
The Bayesian model implemented in the package is based on the Plackett-Luce (PL) model for rankings (@Plackett1975TheAO). In the Plackett-Luce model there is one parameter, a weight, for each element in the ranking; we will denote by $w_i$ the weight of the $i$-th element. The model assumes that the probability of one item being ranked before another is independent of the rest of the items.
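In particular, under this assumption the probability that item $i$ is ranked before item $j$ depends only on their two weights: $$ P(i \text{ ranked before } j)=\frac{w_i}{w_i+w_j} $$ We will use this property later when comparing pairs of algorithms.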
For a given ranking (or permutation) $\sigma=(\sigma_1,\ldots,\sigma_n)$ and a given set of weights $\mathbf{w}=(w_1, \ldots,w_n)$, the probability of $\sigma$ is:
$$ P(\sigma|\mathbf{w})=\prod_{i=1}^n \frac{w_{\sigma_i}}{\sum_{j=i}^n w_{\sigma_j}} $$ As can be seen in the equation above, multiplying all the weights by a common factor does not affect the probabilities. For that reason we can assume, without loss of generality, that the weights sum to 1. This gives the weights a direct interpretation: $w_i=P(\sigma_1=i)$, that is, the probability that the $i$-th element is the top-ranked one.
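As a small illustration (this function and the example weights are ours, not part of the package), the following R code computes $P(\sigma|\mathbf{w})$ for a given ranking and weight vector:

plProbability <- function(sigma, w) {
  # sigma: a permutation of 1..n, with sigma[1] the top-ranked element
  # w: a vector of positive weights
  prod(sapply(seq_along(sigma),
              function(i) w[sigma[i]] / sum(w[sigma[i:length(sigma)]])))
}
w <- c(0.5, 0.3, 0.2)
plProbability(c(1, 2, 3), w)   # 0.5 * (0.3/0.5) * 1 = 0.3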
Although there are other Bayesian models based on the PL model, in this package we have implemented a Bayesian model where the prior distribution of the parameters is a Dirichlet distribution. Thus, the hyperparameters of the model are the parameters $(\alpha_1,\ldots,\alpha_n)$ of the prior Dirichlet distribution.
The model cannot be solved analytically, but samples from the posterior distribution of the weights can be obtained using MCMC methods.
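For intuition about the prior (this sketch is ours and is not how the package is implemented; we assume a uniform prior with all $\alpha_i=1$), Dirichlet samples can be drawn in base R by normalising independent Gamma draws:

rDirichlet <- function(nsamples, alpha) {
  # Each row is one sample: independent Gamma(alpha_j, 1) draws, normalised
  g <- matrix(rgamma(nsamples * length(alpha), shape = alpha),
              ncol = length(alpha), byrow = TRUE)
  g / rowSums(g)
}
prior.samples <- rDirichlet(1000, alpha = rep(1, 8))
colMeans(prior.samples)   # roughly 1/8 for each of the 8 weights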
This model is based on rankings and, thus, our input data have to be rankings. The standard PL model does not consider ties and, therefore, no ties are allowed in the current version of the package. To overcome this limitation, ties can be broken at random, as shown below.
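For example, ties can be broken at random when the rankings are built from raw results (the values below are made up for illustration):

set.seed(1)
results <- c(A = 10.2, B = 10.2, C = 9.7)   # A and B are tied
rank(-results, ties.method = "random")      # rank 1 = best when maximising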
Usually our starting point will be a matrix containing the results of a number of algorithms on a number of instances of a problem. Note that, in order to transform these data into rankings, the results have to be paired. If they are not (e.g., if we have independent repetitions), the data should be processed so that there is one measure per instance and algorithm.
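As a hypothetical illustration (the data frame and column names are made up), independent repetitions could be averaged so that there is a single value per instance and algorithm:

raw.data <- data.frame(Instance  = rep(c("i1", "i2"), each = 4),
                       Algorithm = rep(c("A", "B"), times = 4),
                       Value     = runif(8))
# One measure per (instance, algorithm) pair, obtained by averaging repetitions
paired.data <- aggregate(Value ~ Instance + Algorithm, data = raw.data, FUN = mean)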
As a running example we will use the data.blum.2015 dataset. It contains the performance of 8 algorithms on randomly generated instances: 30 instances were generated with each of 30 random generators, giving 900 instances in total, so we can treat the data as paired. Moreover, we can analyse subsets of these instances according, for example, to their size. More details about the dataset can be found in @blum2015.
First, we have to load the data:
library(scmamp)
# Load the dataset included in the package
data("data.blum.2015")
head(data.blum.2015)
The function that implements the model is bPlackettLuceModel. We just need to provide the data, the criterion used to rank the data (either minimising or maximising) and, optionally, the details of the MCMC simulation (number of chains and number of samples per chain). In our example the goal is maximising the objective function and, thus, the algorithm with the highest value should get rank 1; to do so, we set min to FALSE. In this example we will analyse the results when the instance size is 100.
data.100 <- subset(data.blum.2015, subset=data.blum.2015$Size==100)
results.100 <- bPlackettLuceModel(x.matrix=data.100[, -c(1, 2)], min=FALSE,
                                  nsim=2000, nchains=10, parallel=TRUE)
The method returns a list with several elements. The single most important one is posterior.weights, as this table contains the weights sampled from the posterior distribution and any useful distribution or estimation can be obtained from this sample. The result also includes two particular estimations: the expected probability of each algorithm being the best and the expected mode ranking (i.e., the expected rank of each algorithm in the ranking with highest probability).
results.100$expected.win.prob
results.100$expected.mode.rank
What we can see in the output above is that, with an expected probability of 0.94, FrogCOL is the best algorithm for problems of size 100. Moreover, in the most probable ranking (the ranking mode), we expect to find FrogCOL in first position, FrogMIS in second position and FruitFly in third position.
These are point estimations, but the true advantage of using Bayesian methods is that they provide a natural way to assess the uncertainty of these estimations. For example, we can see the posterior distribution of the probability of FrogCOL being the overall best algorithm simply by plotting its sampled weight.
hist(results.100$posterior.weights[,"FrogCOL"], main="", xlab="Prob. FrogCOL being the best")
From this analysis we can say that we are almost sure that the probability of FrogCOL being the best algorithm for problems of size 100 is above 0.9.
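This statement can be quantified directly from the posterior sample, for instance as the proportion of samples in which the weight of FrogCOL exceeds 0.9 (the 0.9 threshold is just the one used in the sentence above):

# Posterior probability that FrogCOL's weight (its probability of being the best) exceeds 0.9
mean(results.100$posterior.weights[, "FrogCOL"] > 0.9)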
There is more information we can extract from the analysis. To illustrate it, let us move to a different scenario: problems of size 1000.
data.1000 <- subset(data.blum.2015, subset=data.blum.2015$Size==1000)
results.1000 <- bPlackettLuceModel(x.matrix=data.1000[, -c(1, 2)], min=FALSE,
                                   nsim=2000, nchains=10, parallel=TRUE)
For this new comparison we will visualise the distribution of the probability of being the best for all the algorithms. We can do this with a simple boxplot:
boxplot(results.1000$posterior.weights)
Again, as in the previous analysis, the three best algorithms are FrogCOL, FrogMIS and FruitFly. Now, in order to simplify the analysis, we can focus our attention on these three algorithms. Note that, due to the properties of the Plackett-Luce model, we can take just the weights of these algorithms to compute probabilities and, since a common factor does not affect those computations, we can normalise the weights so they sum to 1. We need to do this to produce the following plot:
# Keep only the columns corresponding to FruitFly, FrogCOL and FrogMIS
weights <- results.1000$posterior.weights[, c(1, 7, 8)]
# Renormalise so the three weights sum to 1 in each posterior sample
weights <- weights / rowSums(weights)
plotBarycentric(weights)
We can see that there is not much uncertainty about FrogCOL being the best algorithm. We can put this in numbers by analysing the probability of FrogCOL being better than FrogMIS and of FrogCOL being better than FruitFly. Under the PL model, the probability of one algorithm being ranked before another is $w_i/(w_i+w_j)$, so these probabilities can be estimated directly from the sampled weights:
# Columns of "weights": 1 = FruitFly, 2 = FrogCOL, 3 = FrogMIS
fc.better.fm <- weights[, 2] / (weights[, 2] + weights[, 3])
fc.better.ff <- weights[, 2] / (weights[, 2] + weights[, 1])
exp.fc.vs.fm <- mean(fc.better.fm)
exp.fc.vs.ff <- mean(fc.better.ff)
hist(fc.better.fm, main=paste("Expected probability =", round(exp.fc.vs.fm, 3)),
     xlab="Probability of FrogCOL better than FrogMIS")
hist(fc.better.ff, main=paste("Expected probability =", round(exp.fc.vs.ff, 3)),
     xlab="Probability of FrogCOL better than FruitFly")
The probability of FrogCOL being better than FruitFly could also be estimated directly from the data:
mean(data.1000[,"FrogCOL"]>data.1000[,"FruitFly"])
As we can see, the two estimations disagree quite a lot. The explanation for this big difference has to do with the rest of the algorithms. It is important to remember that the model is fitted to account for all the preferences and, thus, all the algorithms have a potential effect on the comparison of any two of them. In the estimation above we are only considering the cases in which FrogCOL is better than FruitFly, but if we look at what happens with, say, Ikeda, we can see that FrogCOL is always better than Ikeda, while FruitFly only outperforms Ikeda in 90% of the experiments.
mean(data.1000[, "FrogCOL"] > data.1000[, "Ikeda"])
mean(data.1000[, "FrogMIS"] > data.1000[, "Ikeda"])
mean(data.1000[, "FruitFly"] > data.1000[, "Ikeda"])
This information is incorporated into the model, resulting in a worse assessment of FruitFly with respect to FrogCOL. As a consequence, if we limit our experiment to a subset of the algorithms, the resulting model is not (necessarily) the same and, thus, the estimations may change.
# Columns 3, 9 and 10 contain the results of FruitFly, FrogCOL and FrogMIS
data.1000.sub <- subset(data.blum.2015[, c(3, 9, 10)],
                        subset=data.blum.2015$Size==1000)
results.1000.sub <- bPlackettLuceModel(x.matrix=data.1000.sub, min=FALSE,
                                       nsim=2000, nchains=10, parallel=TRUE)
Now we analyse the results.
weights.sub <- results.1000.sub$posterior.weights
plotBarycentric(weights.sub)
# Columns of "weights.sub": 1 = FruitFly, 2 = FrogCOL, 3 = FrogMIS
fc.better.fm <- weights.sub[, 2] / (weights.sub[, 2] + weights.sub[, 3])
fc.better.ff <- weights.sub[, 2] / (weights.sub[, 2] + weights.sub[, 1])
exp.fc.vs.fm <- mean(fc.better.fm)
exp.fc.vs.ff <- mean(fc.better.ff)
hist(fc.better.fm, main=paste("Expected probability =", round(exp.fc.vs.fm, 3)),
     xlab="Probability of FrogCOL better than FrogMIS")
hist(fc.better.ff, main=paste("Expected probability =", round(exp.fc.vs.ff, 3)),
     xlab="Probability of FrogCOL better than FruitFly")
One may think that it makes no sense to obtain such different estimations depending on the set of algorithms analysed, but we have to remember that these analyses focus on the ranking of the algorithms, not on the magnitude of the differences. In that regard, additional algorithms can compensate for situations where some of the algorithms have very bad results in certain cases (as happens with FruitFly).
To sum up, regarding the experimental design, it is important to keep in mind that this analysis involves a model and, thus, a simplification of reality. Also, when using a model for the ranking of a set of algorithms, the conclusions are linked to that particular set of algorithms (to compare just two algorithms we can use other types of methods).