A package for efficient computation of standard clustering comparison measures.
The stable version is available on CRAN:

```r
install.packages("aricode")
```
The development version is available via:

```r
devtools::install_github("jchiquet/aricode")
```
Computation of measures for clustering comparison (ARI, AMI, NID and even the $\chi^2$ distance) is usually based on the contingency table. Traditional implementations (e.g., the function `adjustedRandIndex` of package mclust) are in $\Omega(n + u v)$, where $u$ and $v$ are the numbers of clusters in the two classifications being compared. In aricode we propose an implementation, based on radix sort, that is in $\Theta(n)$ in both time and space. Importantly, the complexity does not depend on $u$ and $v$. Our implementation of the ARI, for instance, is one to two orders of magnitude faster than standard R implementations.
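As a quick illustration, here is a minimal usage sketch (the label vectors are made up for the example; `set.seed` only makes it reproducible):

```r
library(aricode)

set.seed(42)
c1 <- sample(1:5, 1000, replace = TRUE)        # a classification into 5 groups
c2 <- c1
c2[1:100] <- sample(1:5, 100, replace = TRUE)  # perturb 10% of the labels

ARI(c1, c2)  # high, but below 1, since the two classifications mostly agree
ARI(c1, c1)  # identical classifications give an ARI of exactly 1
```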
The functions included in aricode are:

- `ARI`: computes the adjusted Rand index
- `Chi2`: computes the chi-square statistic
- `MARI`/`MARIraw`: computes the modified adjusted Rand index (Sundqvist et al., in preparation)
- `NVI`: computes the normalized variation of information
- `NID`: computes the normalized information distance
- `NMI`: computes the normalized mutual information
- `AMI`: computes the adjusted mutual information
- `expected_MI`: computes the expected mutual information
- `entropy`: computes the conditional and joint entropies
- `clustComp`: computes all clustering comparison measures at once

Here are some timings comparing the cost of computing the adjusted Rand index with aricode and with the commonly used function `adjustedRandIndex` of the mclust package; the cost of the latter can be prohibitive for large vectors:
```r
library(aricode)
library(mclust)
library(ggplot2)

time.aricode <- function(times, c1, c2) {
  replicate(times, system.time(ARI(c1, c2))[3])
}
time.mclust <- function(times, c1, c2) {
  replicate(times, system.time(mclust::adjustedRandIndex(c1, c2))[3])
}
time.method <- function(times, c1, c2, n) {
  rbind(
    data.frame(time = time.aricode(times, c1, c2), expr = "aricode", n = n),
    data.frame(time = time.mclust(times, c1, c2), expr = "mclust", n = n)
  )
}

# with similar classifications, the number of classes grows with n
sim.timings <- function(n, times = 10) {
  c1 <- sample(1:(n/200), n, replace = TRUE)
  c2 <- c1
  i_change <- sample(1:n, n/50, replace = FALSE)
  c2[i_change] <- c2[rev(i_change)]
  out <- time.method(times, c1, c2, n)
  data.frame(time = out$time, method = out$expr, n = n)
}
```
```r
ns <- sort(c(200 * 2^(3:14), 150 * 2^(3:15)))
timings <- do.call("rbind", lapply(ns, sim.timings))
```
```r
p.timings <- ggplot(timings, aes(x = n, y = time, colour = method)) +
  geom_smooth(data = dplyr::filter(timings, n > 1e4), method = "lm") +
  geom_point(size = 0.25, alpha = 0.9) +
  labs(y = "time (sec.)") +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  scale_y_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  annotation_logticks()
p.timings + ggtitle("number of classes grows with n") + theme_bw()
```
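Beyond individual measures, the functions listed above can be obtained in a single call. Here is a small sketch using `clustComp` on simulated labels (the vectors and seed are illustrative only):

```r
library(aricode)

set.seed(1)
c1 <- sample(1:4, 500, replace = TRUE)          # a classification into 4 groups
c2 <- c1
c2[sample(1:500, 50)] <- sample(1:4, 50, replace = TRUE)  # perturb some labels

measures <- clustComp(c1, c2)  # all comparison measures at once
str(measures)                  # a list including ARI, NID, NMI, AMI, ...
```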