require(ggplot2, quietly = TRUE)
library(png)
library(grid)
knitr::opts_chunk$set(collapse = TRUE,
                      comment = "#>",
                      cache = FALSE,
                      echo = FALSE)
theme_set(
  theme_bw() %+replace%
  theme(
    legend.title = element_text(size = rel(0.8), face = "bold"),
    legend.margin = margin(0, 0, 0, 0, "cm"),
    legend.position = "right",
    legend.key.size = unit(2, "lines"),
    legend.text = element_text(size = 8),
    axis.title.y = element_text(angle = 90),
    axis.text = element_text(size = rel(0.7)),
    plot.margin = margin(0, 0.5, 1, 0, "lines"), 
    axis.title = element_text(size = rel(0.8),
                              face = "bold"),
    title = element_text(size = rel(0.9))
  ) 
)
rebuild <- FALSE
par(mar = c(0.3, 0.3, 0.3, 0.3))
require(largeVis, quietly = TRUE)

Version 0.1.10 of largeVis adds two features that were not part of the original LargeVis paper:

*   The SGD phase of the LargeVis algorithm can be accelerated using momentum training.
*   The method for calculating weights in the negative sampling phase of the algorithm is now a parameter.

In addition, the package includes implementations of the DBSCAN, OPTICS, and HDBSCAN clustering algorithms, optimized to take advantage of the neighbor data generated by largeVis.

Negative Sampling

During the stochastic gradient descent phase of largeVis, M negative samples (nodes that do not share an edge with the current point) are drawn along with each edge. In the original paper, the negative nodes are selected by sampling weighted according to their degree raised to the $\frac{3}{4}$ power. In the reference implementation, however, nodes are weighted according to the sum of the weights of their incident edges: $P_n(j) \propto \left(\sum_{i : j \in N_i} w_{ij}\right)^{0.75}$.

Versions of largeVis prior to 0.1.8 (i.e., before the reference implementation became available) used degree. The results produced using degree may be more aesthetically pleasing to many people. Version 0.1.10 therefore exposes the choice as a parameter, useDegree. The default is to use edge weights, as in the reference implementation.

The difference is that with edge weights, the resulting clusters tend to be more plainly convex, while degree encourages a wider variety of irregular shapes. The effect is imperceptible at small data sizes, subtly noticeable at around 50,000 nodes, and pronounced on datasets of more than 1 million nodes.
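To make the two schemes concrete, the following sketch (illustrative only, not the package's internal code) computes both weight vectors from a sparse, symmetric edge-weight matrix such as the one stored in the wij slot of a largeVis object:

library(Matrix)
# Random symmetric sparse weight matrix standing in for a $wij slot
wij <- rsparsematrix(1000, 1000, density = 0.01, symmetric = TRUE,
                     rand.x = function(n) abs(rnorm(n)))
weightBased <- colSums(wij)^0.75       # default: sum of incident edge weights
degreeBased <- colSums(wij != 0)^0.75  # useDegree = TRUE: count of incident edges
# Normalizing either vector gives each node's probability of being
# drawn as a negative sample
head(weightBased / sum(weightBased))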

if (rebuild) {
    load("../../largeVisData/mnist/test.RData")
    load("../../largeVisData/mnist/train.RData")
    # Combine the train and test sets; center pixel values on zero
    mnist <- t(rbind(trainData, testData)) - 0.5
    preVis <- largeVis(mnist, K = 100, tree_threshold = 200, n_trees = 100,
                       sgd_batches = 100, verbose = TRUE)
    set.seed(1974)
    # Recover digit labels 0-9 from the one-hot label matrices
    labels <- rbind(trainLabels, testLabels)
    labels <- apply(labels, 1, function(x) which(x == 1))
    labels <- labels - 1
}
if (rebuild) {
    noDegree <- projectKNNs(preVis$wij, seed = 1974, verbose = TRUE)
    degree <- projectKNNs(preVis$wij, useDegree = TRUE, verbose = TRUE)
    degreeCoords <- data.frame(rbind(t(noDegree), t(degree)),
                               label = factor(c(labels, labels)),
                               degree = factor(rep(c("weights", "degree"),
                                                   each = ncol(degree))))
}
if (rebuild) {
    degreeplot <- ggplot(degreeCoords, aes(x = X1, y = X2, color = label)) + 
        geom_point(size = 0.02, alpha = 0.1) + 
        facet_grid(. ~ degree) +
        scale_x_continuous("", labels = NULL, breaks = NULL) + 
        scale_y_continuous("", labels = NULL, breaks = NULL) +
        guides(color = FALSE) +
        ggtitle("Effect of useDegree")
    ggsave(degreeplot,
           filename = system.file(package = "largeVis", "vignettedata/degreeplot.png"),
           width = 5, height = 4)
}
img <- readPNG(system.file(package = "largeVis", "vignettedata/degreeplot.png"))
grid.raster(img)

Momentum Training

Momentum training is a method for optimizing stochastic gradient descent to encourage more rapid convergence. The technique is common in deep learning.

Momentum alters the gradient step so that, on each iteration, the update for each parameter takes into account the prior update to the same parameter. If the gradient for a parameter points in the same direction on one iteration as on the previous one, momentum causes the parameter to move further in that direction. If the gradient reverses, the update is smaller than it would otherwise be.

Without momentum, each iteration updates the low-dimensional embedding according to the equation $\theta_{t + 1} = \theta_t + \rho\frac{dO}{d\theta_t}$.

With momentum, this becomes $\theta_{t + 1} = \theta_t + \mu_t$, where $\mu_t = \rho\frac{dO}{d\theta_t} + \lambda\mu_{t - 1}$.

The momentum parameter, $\lambda$, controls the rate at which prior updates decay.
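The update rule can be written out directly. The following is a minimal sketch in plain R (the package implements it in C++); grad is a hypothetical function returning $\frac{dO}{d\theta}$:

momentumUpdate <- function(theta, grad, rho, lambda, iterations) {
    mu <- rep(0, length(theta))                # mu_0 = 0
    for (t in seq_len(iterations)) {
        mu <- rho * grad(theta) + lambda * mu  # mu_t = rho * dO/dtheta_t + lambda * mu_{t-1}
        theta <- theta + mu                    # theta_{t+1} = theta_t + mu_t
    }
    theta
}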

Adding momentum can make it possible to reduce the number of SGD batches (the sgd_batches parameter), in some cases substantially, without reducing the visual quality of the result.

if (rebuild) {
    starterCoords <- matrix(runif(n = 2 * ncol(mnist)) - 0.5, ncol = 2)
    firstCoords <- data.frame(
        t(projectKNNs(preVis$wij, coords = t(starterCoords), sgd_batches = 0.1, verbose = TRUE)),
        lambda = 0, batches = 0.1, label = labels)
    # Baseline: no momentum, with increasing numbers of batches
    for (batches in c(0.3, 0.8)) {
        newCoords <- data.frame(t(projectKNNs(preVis$wij,
                                              verbose = TRUE,
                                              sgd_batches = batches,
                                              coords = t(starterCoords))))
        newCoords$lambda <- 0
        newCoords$batches <- batches
        newCoords$label <- labels
        firstCoords <- rbind(firstCoords, newCoords)
    }
    # Momentum: vary lambda and the number of batches
    for (lambda in c(0.4, 0.9)) {
        for (batches in c(0.1, 0.3, 0.5)) {
            newCoords <- data.frame(t(projectKNNs(preVis$wij,
                                                  verbose = TRUE,
                                                  sgd_batches = batches,
                                                  momentum = lambda,
                                                  coords = t(starterCoords))))
            newCoords$lambda <- lambda
            newCoords$batches <- batches
            newCoords$label <- labels
            firstCoords <- rbind(firstCoords, newCoords)
        }
    }
    momentumCoords <- firstCoords
    momentumCoords$label <- factor(momentumCoords$label)
}
if (rebuild) {
    momentumPlot <- ggplot(momentumCoords, aes(x = X1, y = X2, color = label)) + 
        geom_point(size = 0.01, alpha = 0.1) + 
        facet_grid(batches ~ lambda, scales = "free",
                   labeller = label_bquote(cols = lambda == .(lambda),
                                           rows = b == .(batches))) +
        scale_x_continuous("", limits = c(-40, 40), labels = NULL, breaks = NULL) + 
        scale_y_continuous("", limits = c(-40, 40), labels = NULL, breaks = NULL) +
        guides(color = FALSE) +
        ggtitle("Effect of Momentum and Reduced Training Batches")
    ggsave(momentumPlot,
           filename = system.file(package = "largeVis", "vignettedata/momentumplot.png"),
           width = 6, height = 4)
}
img <- readPNG(system.file(package = "largeVis", "vignettedata/momentumplot.png"))
grid.raster(img)

Clustering

largeVis also includes three clustering algorithms: DBSCAN, OPTICS, and HDBSCAN.

The implementations here are not intended for general use. Rather, the LargeVis algorithm calculates, and largeVis retains, nearest neighbor information for its inputs. Much of the computation time of the three clustering algorithms is spent identifying neighbors and calculating distances. The intent of these implementations is, by reusing the neighbor data generated by largeVis, to make it practical to apply these computationally expensive algorithms to very large datasets.

Overview of Density-Based Clustering

All three algorithms attempt to cluster points by linking each point to its nearest neighbors, using a modified definition of distance. The core distance of a point is the distance to its Kth nearest neighbor; it is an (inverse) measure of the density of the space around the point. The reachability distance between two points is the greater of the distance between them and their core distances. When the neighbors of a point $p$ are identified in order to link it with its nearest neighbors, no point $q$ can therefore be closer to $p$ than $p$'s core distance. Reachability distance thus combines both distance and density. (Differences in the definitions of core distance and reachability distance among the algorithms are not addressed here.)
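These definitions can be expressed in a few lines. The sketch below assumes a dense distance matrix d for clarity; the package's implementations instead work from the sparse neighbor data:

# Core distance: distance from point i to its Kth nearest neighbor
coreDistance <- function(d, i, K) {
    sort(d[i, -i])[K]
}
# Mutual reachability distance as used by HDBSCAN; OPTICS uses the
# one-sided variant max(coreDistance(p), d[p, q])
reachabilityDistance <- function(d, p, q, K) {
    max(d[p, q], coreDistance(d, p, K), coreDistance(d, q, K))
}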

DBSCAN

DBSCAN [@Ester96adensity-based] attempts to achieve a flat clustering using a consistent density threshold, $\epsilon$. Points with more than minPts neighbors in their $\epsilon$-neighborhoods are core points. Where the $\epsilon$-neighborhoods of core points overlap, the neighborhoods are merged into a single cluster. DBSCAN thus takes two hyperparameters, $\epsilon$ and minPts.

The objects returned by the largeVis DBSCAN implementation are compatible with the dbscan objects produced by the dbscan package [@hahsler].

The following chart illustrates the effect of the $\epsilon$ and minPts parameters on DBSCAN results.

load(system.file(package = "largeVis", "vignettedata/spiral.Rda"))
dat <- spiral
vis <- largeVis(t(dat), K = 20, save_edges = TRUE, save_neighbors = TRUE, sgd_batches = 1)
set <- do.call(rbind, lapply(c(5, 10, 20), function(y) {
    do.call(rbind, lapply(c(1, 3, 5), function(x) {
        clust <- lv_dbscan(vis, eps = x, minPts = y)$cluster
        data.frame(cluster = clust, eps = x, minPts = y)
    }))
}))
set$x <- rep(dat[, 1], 9)
set$y <- rep(dat[, 2], 9)
set$cluster <- factor(set$cluster)
set$eps <- ordered(set$eps, labels = paste("epsilon == ", c(1, 3, 5), sep = ""))
set$minPts <- ordered(set$minPts, labels = paste("minPts == ", c(5, 10, 20), sep = ""))
ggplot(data = set, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 0.5, alpha = 0.7) +
    facet_grid(minPts ~ eps, labeller = label_parsed) + 
    scale_x_continuous("", breaks = NULL) +
    scale_y_continuous("", breaks = NULL) +
    guides(color = FALSE) +
    ggtitle("Effect of eps and minPts on DBSCAN results")

OPTICS

OPTICS [@Ankerst:1999:OOP:304181.304187] applies logic similar to DBSCAN's to build a hierarchical clustering. Beginning with a seed point, OPTICS repeatedly finds the point with the lowest reachability distance (up to the cutoff, $\epsilon$) to any point already in the graph, and adds it. The result is an ordering of points by decreasing density. When no remaining point has a reachability distance to the graph of less than $\epsilon$, OPTICS begins again with a new seed point.

This implementation of OPTICS has a hyperparameter, useQueue, which controls the order in which points are processed and, in effect, which points become the first points of newly discovered clusters. The first reachability plot below uses useQueue = FALSE, the behavior of the dbscan package and the ELKI implementation. If useQueue is set to TRUE, the algorithm uses points in denser regions of the space as the seeds for new clusters.

This is illustrated in the following reachability plots for the spiral dataset:

optClust <- lv_optics(vis, eps = 5, useQueue = FALSE, minPts = 5)
optClust2 <- lv_optics(vis, eps = 5, useQueue = TRUE, minPts = 5)
ggplot(data.frame(
    o = c(optClust$order, optClust2$order), 
    d = c(optClust$reachdist, optClust2$reachdist), 
    useQueue = rep(c("No Queue", "Use Queue"), each = length(optClust$order))
), aes(x = o, y = d)) + 
        geom_point(size = 0.1) +
        geom_bar(stat = "identity", alpha = 0.3) +
        facet_grid(useQueue ~ .) +
        scale_x_continuous("Order") + 
        scale_y_continuous("Reachability Distance")

OPTICS can be thought of as a hierarchical generalization of DBSCAN. OPTICS clusterings are readily convertible to DBSCAN clusterings by cutting clusters where the reachability distance exceeds a threshold. This can be done with the extractDBSCAN function of the dbscan package. The resulting clusters are identical to those that would be obtained by running DBSCAN with the same parameters, except in the assignment of noise points.

suppressWarnings(opticsPoints <- do.call(rbind, Map(f = function(x) {
    clust <- dbscan::extractDBSCAN(optClust, x)$cluster
    data.frame(cluster = clust, eps = x)
}, c(1, 3, 5))))
opticsPoints$cluster <- factor(opticsPoints$cluster)
opticsPoints$x <- rep(dat[, 1], 3)
opticsPoints$y <- rep(dat[, 2], 3)
opticsPoints$eps <- factor(paste0("epsilon == ", opticsPoints$eps))

ggplot(data = opticsPoints, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 0.5, alpha = 0.7) +
    facet_grid(. ~ eps, labeller = label_parsed) + 
    scale_x_continuous("", breaks = NULL) +
    scale_y_continuous("", breaks = NULL) +
    guides(color = FALSE) +
    ggtitle("OPTICS Clusters With EPS Cuts")

The dbscan package has other functions for cutting and visualizing OPTICS clusterings. The optics objects produced by largeVis are compatible with those produced by the dbscan package.

HDBSCAN

HDBSCAN [@Campello2013] aims to make the DBSCAN and OPTICS approach flexible for data with different densities in different regions, by building a hierarchical density-based clustering from which flat clusters are extracted. HDBSCAN does not need the $\epsilon$ hyperparameter, which can be difficult to set. Rather, HDBSCAN, in effect, determines different values of $\epsilon$ for different parts of the dataset.

suppressWarnings(set <- do.call(rbind, lapply(c(2, 6, 12), function(y) {
    do.call(rbind, lapply(c(6, 10, 20), function(x) {
        hdclust <- largeVis::hdbscan(vis, K = y, minPts = x)$cluster
        data.frame(cluster = as.numeric(hdclust), K = y, minPts = x)
    }))
})))
set$x <- rep(dat[, 1], 9)
set$y <- rep(dat[, 2], 9)
set$cluster <- factor(set$cluster)
set$K <- factor(paste0("K = ", set$K))
set$minPts <- factor(paste0("minPts = ", set$minPts))

ggplot(data = set, aes(x = x, y = y, color = cluster)) +
    geom_point(size = 0.5, alpha = 0.7) +
    facet_grid(minPts ~ K) + 
    scale_x_continuous("", breaks = NULL) +
    scale_y_continuous("", breaks = NULL) +
    guides(color = FALSE) +
    ggtitle("HDBSCAN Is Robust\nTo Hyperparameter Changes")

The largeVis implementation of HDBSCAN produces flat and hierarchical clusterings simultaneously. The hierarchical clusterings are compatible with hclust and other R hierarchical clustering packages and tools by means of an as.dendrogram.hdbscan function, which converts them to standard R dendrograms. The package also includes a function for visualizing the clustering.
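For example, a minimal sketch reusing the vis object built above (the parameter values here are illustrative):

hdobj <- largeVis::hdbscan(vis, K = 6, minPts = 10)
dend <- as.dendrogram(hdobj)  # dispatches to as.dendrogram.hdbscan
plot(dend)                    # inspect with base R dendrogram tools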

References


