In xhtai/clusterTruster: Benchmarking Minimax Linkage

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(clusterTruster)

This vignette demonstrates how to reproduce the results in our paper, Benchmarking Minimax Linkage in Hierarchical Clustering (https://arxiv.org/abs/1906.03336), using the clusterTruster package. The following description is brief; for full details refer to the paper.

Data sets

For the benchmark study, we use a total of 7 data sets. For the observations in each data set, we need to define a distance metric in order to run hierarchical clustering. The data sets and distance metrics used are shown below.

| Data set | Distance metric | |:---------|:------| | Olivetti faces | $\ell_2$ distance | | Colon cancer | Correlation | | Prostate cancer | Correlation | | Iris | $\ell_2$ distance | | NBIDE | Maximum cross-correlation | | FBISW | Maximum cross-correlation | | Simulations: | | | - Spherical | $\ell_1, \ell_2$ distance | | - Elliptical | $\ell_1, \ell_2$ distance | | - Outliers | $\ell_1, \ell_2$ distance |

Perform clustering and generate comparisons between different linkage methods

The following is code used for each of the above data sets. The code shows the source of the data as well as how to generate comparison metrics.

Olivetti faces

library(RnavGraphImageData)
data("faces")

# https://rdrr.io/github/jlmelville/snedata/src/R/olivetti.R
olivetti_faces <- function() {
  if (!requireNamespace("RnavGraphImageData", quietly = TRUE,
                        warn.conflicts = FALSE)) {
    stop("olivetti_faces function requires 'RnavGraphImageData' package")
  }
  faces <- NULL
  utils::data("faces", envir = environment())

  df <- as.data.frame(t(faces))
  npeople <- 40
  nposes <- 10
  colnames(df) <- sapply(seq(1, 4096), function(x) { paste0("px", x)})
  rownames(df) <- apply(expand.grid(seq(1, nposes), seq(1, npeople)), 1,
                        function(x) { paste(x[2], x[1], sep = "_") })
  df$Label <-  factor(as.numeric(cut(1:nrow(df), npeople)))

  df
}
faces <- olivetti_faces()
myPairwise <- genPairwise(datasetName = faces, labelColNum = 4097)
myPairwise <- genSimDiff(datasetName = faces, featureCols = 1:4096, allPairwise = myPairwise, pairColNums = 1:2, measure = "l2dist") 

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l2dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/faces", "_", linkages[l], ".Rdata"))
}

Colon cancer

library(HiDimDA)
data(AlonDS)

myPairwise <- genPairwise(datasetName = AlonDS, labelColNum = 1)
myPairwise <- genSimDiff(datasetName = AlonDS, featureCols = 2:2001, myPairwise, pairColNums = 1:2, "correlation")

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "correlation", myDist = FALSE)
  save(outMetrics, file = paste0("./outMetrics/AlonDS", "_", linkages[l], ".Rdata"))
}

Prostate cancer

There are various possible sources of these data. The version that we use can be accessed from within the package using data(singh2002). It is downloaded from https://stat.ethz.ch/~dettling/bagboost.html. Do not use the data from the sda package in R, as this is scaled differently.

data(singh2002) # 102x6034

myPairwise <- genPairwise(datasetName = singh2002, labelColNum = 6034)
myPairwise <- genSimDiff(datasetName = singh2002, featureCols = 1:6033, myPairwise, pairColNums = 1:2, "correlation")

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "correlation", myDist = FALSE)
  save(outMetrics, file = paste0("./outMetrics/singh2002", "_", linkages[l], ".Rdata"))
}

Iris

This is the famous iris data set that is included within R.

data(iris)
myPairwise <- genPairwise(datasetName = iris, labelColNum = 5)
iris[, 1:4] <- scale(iris[, 1:4])
myPairwise <- genSimDiff(datasetName = iris, featureCols = 1:4, allPairwise = myPairwise, pairColNums = 1:2, measure = "l2dist")

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l2dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/iris", "_", linkages[l], ".Rdata"))
}

NBIDE

This data set is of 144 images of cartridge cases, taken from NIST's Ballistics Toolmarks Research Database. There are 12 images each taken from 12 different guns, from the NBIDE study. Pairwise comparisons have been pre-computed using the cartridges3D package, and stored in removedDups.

In this case, we do not need to use the functions genPairwise() or genSimDiff(), and can simply input the appropriate data and column information into getMetrics, as follows.

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = removedDups, pairColNums = 1:2, matchColNum = "match", distSimCol = "corrMax", linkage = linkages[l], myDist = FALSE)
  save(outMetrics, file = paste0("./outMetrics/NBIDE", "_", linkages[l], ".Rdata"))
}

FBISW

Similarly to NBIDE, data from FBISW have been precomputed, and are stored in FBISW.

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = FBISW, pairColNums = 1:2, matchColNum = "match", distSimCol = "corrMax", linkage = linkages[l], myDist = FALSE)
  save(outMetrics, file = paste0("./outMetrics/FBISW", "_", linkages[l], ".Rdata"))
}

Simulations

### 1. spherical
set.seed(7)
c1 <- MASS::mvrnorm(n = 100, rep(0, 10), Sigma = diag(10))
c2 <- MASS::mvrnorm(n = 100, c(2, 2, rep(0, 8)), Sigma = diag(10))
c3 <- MASS::mvrnorm(n = 100, c(0, 2, 2, rep(0, 7)), Sigma = diag(10))
spherical <- data.frame(clust = rep(1:3, each = 100), rbind(c1, c2, c3))

myPairwise <- genPairwise(datasetName = spherical, labelColNum = 1)
myPairwise <- genSimDiff(datasetName = spherical, featureCols = 2:11, allPairwise = myPairwise, pairColNums = 1:2, measure = "l2dist")
myPairwise <- genSimDiff(datasetName = spherical, featureCols = 2:11, allPairwise = myPairwise, pairColNums = 1:2, measure = "l1dist")

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l2dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/sphericall2", "_", linkages[l], ".Rdata"))

  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l1dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/sphericall1", "_", linkages[l], ".Rdata"))
}

### 2. elliptical
set.seed(7)
c1 <- MASS::mvrnorm(n = 100, rep(0, 10), Sigma = diag(x = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)))
c2 <- MASS::mvrnorm(n = 100, c(2, 2, rep(0, 8)), Sigma = diag(x = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)))
c3 <- MASS::mvrnorm(n = 100, c(0, 2, 2, rep(0, 7)), Sigma = diag(x = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1)))
elliptical <- data.frame(clust = rep(1:3, each = 100), rbind(c1, c2, c3))

myPairwise <- genPairwise(datasetName = elliptical, labelColNum = 1)
myPairwise <- genSimDiff(datasetName = elliptical, featureCols = 2:11, allPairwise = myPairwise, pairColNums = 1:2, measure = "l2dist")
myPairwise <- genSimDiff(datasetName = elliptical, featureCols = 2:11, allPairwise = myPairwise, pairColNums = 1:2, measure = "l1dist")

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l2dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/ellipticall2", "_", linkages[l], ".Rdata"))

  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l1dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/ellipticall1", "_", linkages[l], ".Rdata"))
}


### 3. outliers
set.seed(7)
c1 <- MASS::mvrnorm(n = 100, rep(0, 10), Sigma = diag(10))
c2 <- MASS::mvrnorm(n = 98, c(2, 2, rep(0, 8)), Sigma = diag(10))
c2out <- MASS::mvrnorm(n = 2, c(5, 5, rep(0, 8)), Sigma = diag(10))
c3 <- MASS::mvrnorm(n = 98, c(0, 2, 2, rep(0, 7)), Sigma = diag(10))
c3out <- MASS::mvrnorm(n = 2, c(0, 5, 5, rep(0, 7)), Sigma = diag(10))
outliers <- data.frame(clust = rep(1:3, each = 100), rbind(c1, c2, c2out, c3, c3out))

myPairwise <- genPairwise(datasetName = outliers, labelColNum = 1)
myPairwise <- genSimDiff(datasetName = outliers, featureCols = 2:11, allPairwise = myPairwise, pairColNums = 1:2, measure = "l2dist")
myPairwise <- genSimDiff(datasetName = outliers, featureCols = 2:11, allPairwise = myPairwise, pairColNums = 1:2, measure = "l1dist")

linkages <- c("single", "complete", "average", "centroid", "minimax")
for (l in 1:length(linkages)) {
  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l2dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/outliersl2", "_", linkages[l], ".Rdata"))

  outMetrics <- getMetrics(allPairwise = myPairwise, pairColNums = 1:2, matchColNum = 3, linkage = linkages[l], distSimCol = "l1dist", myDist = TRUE)
  save(outMetrics, file = paste0("./outMetrics/outliersl1", "_", linkages[l], ".Rdata"))
}

Plots

This is some example code for generating plots, assuming that metrics have been generated using each linkage method and stored in a Rdata files with their corresponding names, as demonstrated above. Assuming we have a folder called outMetrics/ with the following contents: faces_average.Rdata, faces_centroid.Rdata, faces_complete.Rdata, faces_minimax.Rdata, faces_single.Rdata, we can run the following code.

plotResultsGG("faces", plot_type = "minimax")
plotResultsGG("faces", plot_type = "misclass")
plotResultsGG("faces", plot_type = "pr")
plotResultsGG("faces", write_plot = FALSE, plot_type = "all")

For the NBIDE data set, we can run the following code, if we have outMetrics/ with the following contents: NBIDE_average.Rdata, NBIDE_centroid.Rdata, NBIDE_complete.Rdata, NBIDE_minimax.Rdata, NBIDE_single.Rdata.

plotResultsGG("NBIDE", correlation = TRUE, write_plot = FALSE, plot_type = "all")

xhtai/clusterTruster documentation built on May 22, 2020, 10:56 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

xhtai/clusterTruster
Benchmarking Minimax Linkage

In xhtai/clusterTruster: Benchmarking Minimax Linkage

Data sets

Perform clustering and generate comparisons between different linkage methods

Olivetti faces

Colon cancer

Prostate cancer

Iris

NBIDE

FBISW

Simulations

Plots

R Package Documentation

Browse R Packages

We want your feedback!

xhtai/clusterTruster Benchmarking Minimax Linkage

In xhtai/clusterTruster: Benchmarking Minimax Linkage

Data sets

Perform clustering and generate comparisons between different linkage methods

Olivetti faces

Colon cancer

Prostate cancer

Iris

NBIDE

FBISW

Simulations

Plots

R Package Documentation

Browse R Packages

We want your feedback!

xhtai/clusterTruster
Benchmarking Minimax Linkage