Statistical Assessment of the Differences

This vignette shows how the package scmamp can be used to assess the statistical differences between the results obtained by a number of algorithms on different problems. This is a typical task in areas such as Machine Learning or Optimization, where algorithms are usually compared by measuring their performance on different problem instances, datasets, etc. However, a similar procedure may be used in other contexts.

The package and this vignette are based mainly on the papers @garcia2010 and @garcia2008, which extend Demšar's work (@demsar2006).

If you are familiar with these papers and want a quick guide, jump to the last section of this document (Summary). Then, you can review the rest of the vignette for more details.

This vignette is divided into three parts. The first reviews the global analysis that checks whether any algorithm behaves differently from the rest, and the other two are concerned with the post-hoc tests run when not all the algorithms have the same performance. The second part shows how all pairwise tests can be conducted and the third how comparisons with respect to a control algorithm can be done. The choice of the comparison will depend on the type of experimentation and the conclusions we want to draw.

As guiding examples, we will use the results included in @garcia2008 (Table 2) and part of the results in @blum2015. These data are available in the package and can be loaded by typing:

library("scmamp")
library("ggplot2")
library("Rgraphviz")
data(data_blum_2015)
data(data_gh_2008)
head(data.blum.2015)
head(data.gh.2008)

These data represent the accuracy obtained by several algorithms on different datasets. Any other data can be used, provided that it is a named matrix or data.frame. For more details about how to load experimental results, please see the vignette concerning the loading and manipulation of data.
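As a minimal sketch of the expected format (the algorithm names and values below are invented for illustration), each row corresponds to a problem and each named column to an algorithm:

# Hypothetical toy example: one row per problem, one named column per algorithm
toy.results <- data.frame(AlgA=c(0.81, 0.75, 0.90),
                          AlgB=c(0.79, 0.80, 0.88),
                          AlgC=c(0.84, 0.77, 0.91))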

Parametric vs. non-parametric

One of the very first things we need to decide is whether we can safely use parametric tests to assess the differences between algorithms. This is quite a tricky question, as using parametric tests when the assumptions hold yields a more powerful test, but the opposite may be true if they do not hold.

The classical parametric tests assume that the data follow a Gaussian distribution and, in most cases, that the variance is the same for all the samples. When this is true we can use these tests to gain power (compared with non-parametric tests). Although there are statistical tests to check both normality and homoscedasticity ---many of them available in R, e.g., shapiro.test and bartlett.test---, they are not very powerful and this, together with the typically small samples, renders them rather ineffective tools.
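Should we still want to run them, a brief sketch of how these base R tests could be applied to the example data follows (their outcome should be taken with care, for the reasons given above):

# Shapiro-Wilk normality test applied to each algorithm (column)
apply(data.gh.2008, MARGIN=2, FUN=function(x) shapiro.test(x)$p.value)
# Bartlett test of homogeneity of variances across algorithms
bartlett.test(as.list(as.data.frame(data.gh.2008)))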

For this reason, the package includes a couple of functions to visually assess the assumptions of normality and homoscedasticity. Note that, in some cases, there is no need to test this, as the data may clearly not be unimodal (and, hence, not Gaussian). An example of such a situation is the data.blum.2015 dataset, where we have different types of problems, each with values on a different scale---you can check this by using the functions below to visualize the data.

The first plot we can create is a density plot, using the function plotDensities. This function uses a kernel density estimation (KDE) of the distribution of each sample to visualize it.

plotDensities (data=data.gh.2008, size=1.1)

The first and only mandatory argument is the matrix that includes the results for each algorithm. The plots are created with ggplot2, a powerful tool to create plots in R. Moreover, the result of this function is an object that can be further modified, as we will see in other plots. The function also accepts additional parameters that are passed directly to ggplot2's geom_line function, which is the one that actually draws the lines; the size = 1.1 argument is an example of this.
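For instance, since the returned object is a regular ggplot object, further layers can be added to it (a small sketch; the title text is arbitrary):

dens.plot <- plotDensities(data=data.gh.2008, size=1.1)
dens.plot + labs(title="Density estimation of the results of each algorithm")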

In this plot we can see that most of the samples can hardly be regarded as normal, mainly due to their lack of symmetry and unimodality. Not only does the assumption of normality not hold, but the assumption of equal variances does not seem to be true either.

An additional kind of plot we can use to visually check the goodness of fit of the samples is the classical quantile-quantile plot, which represents the empirical and theoretical quantiles---assuming a Gaussian distribution. When all the points lie on the diagonal of the plot, the theoretical and empirical quantiles are equal and, thus, we can assume that the data can be approximated by a Gaussian distribution. We can create these plots for each column using the qqplotGaussian function.

qqplot <- qqplotGaussian(data.gh.2008[,"k-NN(k=1)"], size=5, col="orchid")
qqplot + theme_classic()

As can be seen in the plot, there are regions where the sample points lie away from the diagonal. This is particularly evident in the left part of the plot, due to the relatively long left tail of the empirical distribution. Additionally, the example shows one of the possible ways in which the result of the function can be modified to change its appearance. For the interested reader, there is an excellent book covering the use and the philosophy behind ggplot2 (@wickham2009).

As a conclusion, in @demsar2006 the author argues against the use of parametric tests in the context of Machine Learning experiment analysis (see @demsar2006, page 10); similar arguments apply to the evaluation of optimization algorithms.

Testing for differences

Once the parametric/non-parametric question is settled, the next step is to use a statistical test to check whether there are differences among the algorithms or not. In other words, to determine whether there are one or more algorithms whose performance can be regarded as significantly different.

In case the required assumptions are reasonably satisfied, the F-test for K population means (ANOVA) can be used to assess the differences between the algorithms (the package includes the function anovaTest for this). However, as we have seen, in our running example it is clear that we cannot assume normality in the data. Therefore, in this example we will restrict ourselves to the non-parametric methods in the package.
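Just for completeness, and only if the assumptions held, the parametric analysis could be run as follows (a sketch, assuming the first argument is the data matrix, as in the rest of the tests; we do not rely on its result here):

anovaTest(data.gh.2008)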

This package includes two non-parametric methods to compare multiple algorithms: the classical Friedman test (@friedman1937) and the modification by Iman and Davenport (@iman1980). Although R's base installation includes the former, we have reimplemented it in this package as explained in @demsar2006, page 11. These tests are available through the functions friedmanTest and imanDavenportTest. In addition, the Friedman's Aligned Ranks test and the Quade test presented in @garcia2010 have been implemented---note that this paper includes some errors in the computations due to bugs in the authors' code; for this reason, the results obtained with this package may not match those in the paper.

friedmanTest(data.gh.2008)
imanDavenportTest(data.gh.2008)
friedmanAlignedRanksTest(data.gh.2008)
quadeTest(data.gh.2008)

The obtained p-values indicate that we can safely reject the null hypothesis that all the algorithms perform the same. Therefore, we can proceed with the post-hoc test.

We have two options: comparing all the algorithms with each other, or comparing all of them with a control. The latter is the typical situation where we compare a new proposal with the state-of-the-art algorithms, while the former fits better in a review of existing methods.

Pairwise differences

Once we have verified that not all the performances of the algorithms are the same, the next step is analyzing which ones are different. For that, we have different possibilities.

Nemenyi post hoc test

In @demsar2006 the author proposes the use of the Nemenyi test, which compares all the algorithms pairwise. It is the non-parametric equivalent of the Tukey post-hoc test for ANOVA (which is also available through the tukeyPost function), and it is based on the absolute difference of the average rankings of the classifiers. For a significance level $\alpha$ the test determines the critical difference (CD); if the difference between the average rankings of two algorithms is greater than CD, then the null hypothesis that the algorithms have the same performance is rejected. The function nemenyiTest computes the critical difference and all the pairwise differences.
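For reference, the critical difference used by the test is computed as (see @demsar2006)

$$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}}$$

where $k$ is the number of algorithms, $N$ is the number of datasets and $q_{\alpha}$ is a critical value based on the Studentized range statistic divided by $\sqrt{2}$.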

test <- nemenyiTest (data.gh.2008, alpha=0.05)
test
test$diff.matrix
abs(test$diff.matrix) > test$statistic

As the code above shows, with a significance level of $\alpha = 0.05$, any two algorithms whose average rankings differ by more than the critical difference stored in test$statistic will be regarded as significantly different. The test also returns a matrix with all the pairwise differences, so it can be used to check for which pairs the null hypothesis is rejected. As an example, the performance of C4.5 and 1NN is different, but we cannot state that C4.5 and Naive Bayes have a different behaviour.

In @demsar2006 the author also proposes a plot to visually check the differences, the critical difference plot. This kind of plot can be created using the plotCD function, which has two main parameters, the data matrix and the significance level. In the plot, those algorithms that are not joined by a line can be regarded as different.

plotCD (data.gh.2008, alpha=0.05, cex=1.25)
plotCD (data.gh.2008, alpha=0.01, cex=1.25)

Note that the text in the plot is defined in absolute size, while the rest is relative to the size of the plot. The default size (0.75) is tuned for a plot width of, roughly, 7 inches. In case the dimensions of the plot need to be bigger, the default size can be changed with the cex option, as in the example above (the dimensions of these plots are 12x4 inches).

This procedure is, among those implemented in the package, the most conservative one---i.e., the one with the least statistical power. However, it provides an intuitive way to visualize the results.

Corrected pairwise tests

The second approach consists in using a classical test to assess all the pairwise differences between algorithms and then correcting the p-values for multiple testing. In a parametric context the typical choice would be a paired t-test but, given that we cannot assume normality, we should use a non-parametric test, such as the Wilcoxon signed-rank test or the post-hoc tests corresponding to the Friedman, Friedman's Aligned Ranks and Quade tests (see @garcia2008, Section 2.1 and @garcia2010, Section 5).

The package includes the implementations of the post hoc tests mentioned in @garcia2010 through functions friedmanPost, friedmanAlignedRanksPost and quadePost.

friedmanPost(data=data.gh.2008, control=NULL)
quadePost(data=data.gh.2008, control=NULL)
pv.matrix <- friedmanAlignedRanksPost(data=data.gh.2008, control=NULL)

For the sake of flexibility, there is a special wrapper function, customPost, that allows applying any test. This function has a special argument, test, which must be a function that performs the desired test and takes, at least, two arguments, x and y. For more information, type ?customPost.
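As an illustrative sketch, a paired Wilcoxon test could be plugged in as shown below. Note that this assumes that customPost shares the data and control arguments of the other post-hoc functions and that the supplied function has to return a p-value; check ?customPost to confirm the exact contract.

# Hypothetical wrapper around the paired Wilcoxon signed-rank test;
# we assume here that customPost expects the p-value to be returned
wilcoxon.pvalue <- function(x, y, ...) {
  wilcox.test(x, y, paired=TRUE, ...)$p.value
}
pv.wilcoxon <- customPost(data=data.gh.2008, control=NULL, test=wilcoxon.pvalue)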

The chosen test is applied to the $\frac{k(k-1)}{2}$ pairwise comparisons, where $k$ is the number of algorithms. Due to the multiple application of the test, some p-value correction method has to be used in order to control the familywise error rate.
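For instance, with $k=5$ algorithms this already amounts to $\frac{5 \cdot 4}{2} = 10$ comparisons, each producing a raw p-value that has to be corrected.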

There are many general methods to correct these p-values, such as the well-known Bonferroni procedure or Holm's step-down method (@holm1979). However, these methods do not take into account the particular situation of pairwise comparisons, where not every combination of null hypotheses can be true at the same time. As an example, suppose that we know that algorithms A and B are equal and, simultaneously, that A and C are also equal. Then, we cannot reject the hypothesis that B and C are equal.

This problem was tackled by Juliet P. Shaffer (@shaffer1986). According to this paper, there are two procedures to correct the p-values. In the first one (sometimes called Shaffer static) the particular ordering of the null hypotheses is not taken into account and only the maximum number of simultaneously true hypotheses is considered. The second one further limits the number of possible hypotheses by considering which particular hypotheses have already been rejected. This increases the power of the method, but it is computationally very expensive. Instead of this procedure, in @garcia2008 the authors propose to use Bergmann and Hommel's method (@bergmann1988).

These procedures can be applied to a matrix of raw p-values using functions adjustShaffer and adjustBergmannHommel.

pv.matrix
adjustShaffer(pv.matrix)
pv.adj <- adjustBergmannHommel(pv.matrix)
pv.adj

The package also includes other correction methods, as we will see in the comparisons with a control algorithm. However, as these do not take into account the particular interactions between hypotheses, they are more conservative approaches.

Bergmann and Hommel's correction is extremely expensive in computational terms. However, the structures required to perform the correction are stored on disk and, thus, the procedure is computationally tractable for up to 9 algorithms.

Graphical representations

In contrast to what happens with the Nemenyi test, it makes no sense to draw a critical difference plot, since the critical differences are not constant across the comparisons. In the absence of this intuitive plot, the package includes two types of plots to graphically display the results.

The first function is drawAlgorithmGraph, which plots a graph where the algorithms are the nodes and two nodes are linked if the null hypothesis that they are equal cannot be rejected. This function makes use of the Rgraphviz package, which therefore has to be installed. The package is available from Bioconductor, so it can be installed as follows.

source("http://www.bioconductor.org/biocLite.R")
biocLite("Rgraphviz")

The plot can incorporate information about each algorithm. In this case we will print the average ranking, similarly to what is shown in the critical difference plot.

r.means <- colMeans(rankMatrix(data.gh.2008))
drawAlgorithmGraph(pvalue.matrix=pv.adj, mean.value=r.means, alpha=0.05,
                 font.size=10, node.width=3, node.height=1)

In the code above we can see that there is a parameter called font.size, which can be used to change the font size to adapt it to the size of the plot (similarly to what happens with the critical difference plot). In addition to this, there are a number of parameters that allow the user to customize the plot; see the function's help page for the complete list of options.

The Rgraphviz package offers a number of layouts that can be used to plot graphs ('dot', the one used by default, 'twopi', 'neato', 'circo' and 'fdp'). These layouts can be selected by passing their name right after the first two parameters.

r.means <- colMeans (rankMatrix(data.gh.2008))
drawAlgorithmGraph (pvalue.matrix=pv.adj, mean.value=r.means, alpha=0.05, 'fdp',
                    highlight.color="red", node.color="white", font.color="black",
                    font.size=10, node.width=2, node.height=1)

This graph is the one corresponding to Bergmann and Hommel's dynamic procedure. Comparing it with the previous one, we can check its increased power, as with the same $\alpha$ it rejects two more hypotheses, namely, that CN2 is equal to Naive Bayes and to C4.5.

The second plot can be used to directly visualize the p-value matrix generated when doing all the pairwise comparisons. The function that creates such a plot is plotPvalues.

plt <- plotPvalues(pvalue.matrix=pv.adj, 
                   alg.order=order(r.means, decreasing=FALSE))
plt + 
  labs(title="Corrected p-values using Bergmann and Hommel procedure") + 
  scale_fill_gradientn("Corrected p-values", colours=c("skyblue4", "orange"))

The code above also shows how to modify some aesthetic aspects using ggplot2.

Comparison with a control

In some experimental settings we are interested in comparing a set of algorithms with a control one---typically, our proposal. All the tests presented in the previous section can also be used in this case, fixing the control parameter to one of the algorithms in the data. When this parameter is not set ---or set to NULL---, all the pairwise comparisons are performed, but when it takes a (valid) value, all the algorithms are compared with that reference.

friedmanAlignedRanksPost(data.gh.2008, control = "NaiveBayes")
pv <- quadePost(data.gh.2008, control = 2)

As can be seen in the code above, the reference can be set either using the column name or its index. The values computed in this way can be corrected to cope with the problem of multiple testing. However, in this case, using the Shaffer and Bergmann and Hommel procedures makes no sense, as we do not have all the pairwise comparisons. Instead, we can use any of the methods listed in @garcia2010. Some of these are implemented in the package and others are available through R's p.adjust function. In particular, the methods implemented in the package are Holland, Finner, Rom and Li:

adjustHolland(pvalues=pv)
adjustFinner(pvalues=pv)
adjustRom(pvalues=pv, alpha=0.05)
adjustLi(pvalues=pv)

Comparisons by groups of problems

In some empirical evaluations we may be interested in analyzing the results obtained in different groups of instances. For example, in the case of the data from @blum2015 we may be interested in evaluating the algorithms for each problem size (100, 1000 and 5000). Computing the p-values in such a scenario is as simple as with a single group, but the correction of the p-values is by no means trivial, as all the comparisons should be considered jointly. This is particularly complex for the Shaffer and the Bergmann and Hommel corrections, as the information about the multiple pairwise comparisons has to be taken into account.

Global functions

With the aim of further simplifying the use of the package, we have defined two wrapper functions: multipleComparisonTest, to perform the multiple comparison tests, and postHocTest, to run the individual comparisons. Both can be used either grouping the problems or using the whole dataset. Note that, when the problems are grouped, the number of tests performed increases and, thus, the global number of tests should be considered when the p-values are adjusted.

In the next section there are some examples of the use of these functions, so here we will only briefly describe the main common arguments. Further information about the functions can be obtained from their help pages.

The postHocTest function also accepts some additional parameters, which are described in its help page (?postHocTest).

Regarding the output of the functions, it depends on whether the problems are grouped or not. In the case of the multipleComparisonTest function, if the data is not grouped the result is an htest object, as with any of the functions that perform this type of test. If the data is grouped, then the output is a matrix with the p-values (raw and adjusted) obtained for each group. In the case of postHocTest, in both cases the function outputs the summarized data (grouped or not), the raw p-values and the corrected p-values. If the data is grouped and all the pairwise comparisons are performed, then the p-values are stored in a three-dimensional array, where the last dimension indexes the group to which the p-values correspond.
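A quick way to get an overview of the structure of any of these result objects is R's str function (a sketch; post.results stands for the output of postHocTest, as in the examples of the next section):

# Inspect the top-level components of a postHocTest result
str(post.results, max.level=1)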

Summary

This section shows a couple of examples of typical comparisons done in the context of algorithm comparison. In the first one all the data is included in a single comparison, while in the second the data is grouped according to the problem features.

The typical sequence of analysis starts by testing for the presence of any algorithm that behaves differently, using a test that compares all the algorithms simultaneously. Then, provided that the null hypothesis is rejected, a post-hoc test can be conducted. If we can designate a control method, the rest are tested against that control; otherwise, all the pairwise comparisons are performed.

For the first example we will use the dataset from @garcia2008.

alpha <- 0.05
data <- data.gh.2008

friedmanTest(data)

Alternatively, we can use any of the other methods implemented (e.g., imanDavenportTest or quadeTest), or the wrapper function multipleComparisonTest:

multipleComparisonTest(data=data, test="iman")

Provided that the p-value obtained is below $\alpha$, if we have no control method we can proceed with all the pairwise comparisons using the postHocTest wrapper function ---alternatively, you can directly use the functions that implement the tests and corrections. In this case we can select any test for the comparisons. For the p-value correction, any method can be used, but in this particular case it is advisable to use Bergmann and Hommel's procedure if the number of algorithms to compare is 9 or fewer, and Shaffer's method when there are 10 or more. The reason is that these methods exploit the particularities of the pairwise comparisons in order to perform a less conservative correction, leading to statistically more powerful methods.

post.results <- postHocTest(data=data, test="aligned ranks", correct="bergmann", 
                            use.rank=TRUE)
post.results

alg.order <- order(post.results$summary)
plt <- plotPvalues(post.results$corrected.pval, alg.order=alg.order) 
plt + labs(title="Corrected p-values using Bergmann and Hommel procedure")
drawAlgorithmGraph(post.results$corrected.pval, mean.value=post.results$summary, 
                   alpha=alpha,  font.size=10)

For the second example we will use the dataset data.blum.2015, which contains two columns, Size and Radius, that allow us to group the problems. First, we will search for differences in each combination of size and radius. Then, given that in all the cases the null hypothesis can be safely rejected, we will proceed with the comparison of all the methods against a control, the FrogCOL algorithm.

data <- data.blum.2015
group.by <- c("Size","Radius")
multipleComparisonTest(data=data, group.by=group.by, 
                       test="quade", correct="finner")

control <- "FrogCOL"
post.results <- postHocTest(data=data, group.by=group.by, control=control, 
                            test="aligned ranks", correct="rom", use.rank=FALSE)

The results can be used to create a LaTeX table where the results without significant differences with respect to our control are highlighted in italic and the best results in bold font.

avg.val <- post.results$summary
best <- apply(avg.val, MARGIN=1, 
              FUN=function(x){
                m <- max(x[-(1:2)])
                return(c(FALSE, FALSE, x[-(1:2)]==m))
              })
best <- t(best)
no.diff <- post.results$corrected.pval > alpha
# Set the Size and Radius columns to FALSE, as they are not algorithms
no.diff[,1:2] <- FALSE
no.diff[is.na(no.diff)] <- FALSE
writeTabular(table=avg.val, format='f', bold=best, italic=no.diff, 
             hrule=c(0, 10, 20, 30), vrule=2, digits=c(0, 3, rep(2, 8)), 
             print.row.names = FALSE)

As an alternative analysis, we will compare, for each graph size, all the algorithms. Note that, in this case, as the data contains the Radius column, which should not be included in the comparison, we have to specify the columns that contain the algorithms---or, alternatively, remove that column from the data.

control <- NULL
group.by <- "Size"
post.results <- postHocTest(data=data, algorithms=3:10, group.by=group.by, 
                            control=control, test="aligned ranks", correct="holland", 
                            use.rank=TRUE)

# Plot the matrix for the first group
i <- 1
alg.order <- order(post.results$summary[i,-1])
plotPvalues(post.results$corrected.pval[, , i], alg.order=alg.order) 

# Plot the matrix for the second group
i <- 2
alg.order <- order(post.results$summary[i,-1])
plotPvalues(post.results$corrected.pval[, , i], alg.order=alg.order) 

# Plot the matrix for the third group
i <- 3
alg.order <- order(post.results$summary[i,-1])
plotPvalues(post.results$corrected.pval[, , i], alg.order=alg.order) 

References


