```r
require(ggplot2, quietly = TRUE)
library(png)
library(grid)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", cache = FALSE, echo = FALSE)
theme_set(
  theme_bw() %+replace%
    theme(
      legend.title = element_text(size = rel(0.8), face = "bold"),
      legend.margin = margin(0, "cm"),
      legend.position = "right",
      legend.key.size = unit(2, "lines"),
      legend.text = element_text(size = unit(8, "points")),
      axis.title.y = element_text(angle = 90),
      axis.text = element_text(size = rel(0.7)),
      plot.margin = margin(0, 0.5, 1, 0, "lines"),
      axis.title = element_text(size = rel(0.8), face = "bold"),
      title = element_text(size = rel(0.9))
    )
)
rebuild <- FALSE
par(mar = c(0.3, 0.3, 0.3, 0.3))
require(largeVis, quietly = TRUE)
```
Version 0.1.10 of `largeVis` adds two features that were not part of the original LargeVis paper:

* The SGD phase of the LargeVis algorithm can be accelerated using momentum training.
* The method for calculating weights in the negative-sampling phase of the algorithm is now a parameter.
In addition, the package includes implementations of the DBSCAN, OPTICS, and HDBSCAN clustering algorithms, optimized to take advantage of the neighbor data generated by `largeVis`.
During the stochastic gradient descent phase of `largeVis`, $M$ nodes that do not have edges to the current point are sampled along with each edge. In the original paper, the negative nodes are selected by sampling weighted according to their degree raised to the $\frac{3}{4}$ power. In the reference implementation, however, the nodes are weighted according to $P_n(j) \propto \left(\sum_{i \colon j \in N_i} w_{ij}\right)^{0.75}$.
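The two weighting schemes can be sketched in a few lines of base R. This is a hypothetical toy example (the matrix `wij` and all names here are invented for illustration), not the package's internal implementation:

```r
# Toy symmetric edge-weight matrix for four nodes (invented for illustration)
wij <- matrix(0, 4, 4)
wij[1, 2] <- wij[2, 1] <- 1.0
wij[2, 3] <- wij[3, 2] <- 0.5
wij[3, 4] <- wij[4, 3] <- 0.25

# Degree-based negative-sampling weights (pre-0.1.8 behavior): degree^0.75
deg <- colSums(wij > 0)
p_degree <- deg^0.75 / sum(deg^0.75)

# Edge-weight-based weights (reference implementation):
# (sum of each node's incident edge weights)^0.75
strength <- colSums(wij)
p_weight <- strength^0.75 / sum(strength^0.75)
```

Note that under degree weighting the two middle nodes are sampled equally, while under edge-weight weighting the node with heavier incident edges is sampled more often.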
Versions of `largeVis` prior to 0.1.8 (i.e., before the reference implementation became available) used degree. The degree-based results may be more aesthetically pleasing to many people, so version 0.1.10 allows the weighting method to be set as a parameter. The default is to use edge weights, as in the reference implementation.
The difference is that with edge weights, the resulting clusters tend to be more plainly convex, while degree encourages a wider variety of irregular shapes. The effect is imperceptible at small data sizes, subtly noticeable at around 50,000 nodes, and pronounced on datasets of more than 1 million nodes.
```r
if (rebuild) {
  load("../../largeVisData/mnist/test.RData")
  load("../../largeVisData/mnist/train.RData")
  mnist <- t(rbind(trainData, testData)) - 0.5
  preVis <- largeVis(mnist, K = 100, tree_threshold = 200, n_trees = 100,
                     sgd_batches = 100, verbose = TRUE)
  set.seed(1974)
  labels <- rbind(trainLabels, testLabels)
  labels <- apply(labels, 1, function(x) which(x == 1))
  labels <- labels - 1
}
```
```r
if (rebuild) {
  noDegree <- projectKNNs(preVis$wij, seed = 1974, verbose = TRUE)
  degree <- projectKNNs(preVis$wij, useDegree = TRUE, verbose = TRUE)
  degreeCoords <- data.frame(
    rbind(t(noDegree), t(degree)),
    label = factor(c(labels, labels)),
    degree = factor(rep(c("weights", "degree"), each = ncol(degree))))
}
```
```r
if (rebuild) {
  degreeplot <- ggplot(degreeCoords, aes(x = X1, y = X2, color = label)) +
    geom_point(size = 0.02, alpha = 0.1) +
    facet_grid(. ~ degree) +
    scale_x_continuous("", labels = NULL, breaks = NULL) +
    scale_y_continuous("", labels = NULL, breaks = NULL) +
    guides(color = FALSE) +
    ggtitle("Effect of useDegree")
  ggsave(degreeplot,
         filename = system.file(package = "largeVis", "vignettedata/degreeplot.png"),
         width = 5, height = 4)
}
```
```r
img <- readPNG(system.file(package = "largeVis", "vignettedata/degreeplot.png"))
grid.raster(img)
```
Momentum training is a method for optimizing stochastic gradient descent to encourage more rapid convergence. The technique is common in deep learning.
Momentum alters the gradient so that on each iteration, the update for each parameter takes into account the prior update of the same parameter. If the gradient for a parameter points in the same direction on one iteration as on the previous one, momentum causes it to move further in that direction. If the gradient reverses, the update is smaller than it would otherwise be.
Without momentum, each iteration updates the low-dimensional embedding according to the equation $\theta_{t + 1} = \theta_t + \rho\frac{dO}{d\theta_t}$.

With momentum, this becomes $\theta_{t + 1} = \theta_t + \mu_t$, where $\mu_t = \rho\frac{dO}{d\theta_t} + \lambda\mu_{t - 1}$.

The momentum parameter, $\lambda$, controls the rate of decay.
Adding momentum can make it possible to reduce the number of SGD batches, in some cases substantially, without reducing the visual quality of the result.
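The two update rules above can be compared on a toy one-dimensional objective. This is an illustrative base-R sketch of the equations, not the package's SGD code; the objective and all constants are invented for the example:

```r
# Toy concave objective O(theta) = -(theta - 3)^2 / 2, maximized at theta = 3,
# so gradient ascent should drive theta toward 3
grad <- function(theta) -(theta - 3)  # dO/dtheta

rho <- 0.02    # learning rate (invented)
lambda <- 0.9  # momentum parameter (invented)
theta_plain <- theta_mom <- 0
mu <- 0
for (t in 1:200) {
  # vanilla update: theta <- theta + rho * dO/dtheta
  theta_plain <- theta_plain + rho * grad(theta_plain)
  # momentum update: mu <- rho * dO/dtheta + lambda * mu; theta <- theta + mu
  mu <- rho * grad(theta_mom) + lambda * mu
  theta_mom <- theta_mom + mu
}
```

With these constants, the momentum run ends up much closer to the optimum than the plain run after the same number of steps, illustrating why the number of batches can be reduced.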
```r
if (rebuild) {
  starterCoords <- matrix(runif(n = 2 * ncol(mnist)) - 0.5, ncol = 2)
  firstCoords <- data.frame(
    t(projectKNNs(preVis$wij, coords = t(starterCoords), sgd_batches = 0.1, verbose = TRUE)),
    lambda = 0, batches = 0.1, label = labels)
  for (batches in c(0.3, 0.8)) {
    newCoords <- data.frame(t(projectKNNs(preVis$wij, verbose = TRUE,
                                          sgd_batches = batches,
                                          coords = t(starterCoords))))
    newCoords$lambda <- 0
    newCoords$batches <- batches
    newCoords$label <- labels
    firstCoords <<- rbind(firstCoords, newCoords)
  }
  for (lambda in c(0.4, 0.9)) {
    for (batches in c(0.1, 0.3, 0.5)) {
      newtime <- system.time(
        newCoords <- data.frame(t(projectKNNs(preVis$wij, verbose = TRUE,
                                              sgd_batches = batches,
                                              momentum = lambda,
                                              coords = t(starterCoords)))))
      newCoords$lambda <- lambda
      newCoords$batches <- batches
      newCoords$label <- labels
      firstCoords <<- rbind(firstCoords, newCoords)
    }
  }
  momentumCoords <- firstCoords
  momentumCoords$label <- factor(momentumCoords$label)
}
```
```r
if (rebuild) {
  momentumPlot <- ggplot(momentumCoords, aes(x = X1, y = X2, color = label)) +
    geom_point(size = 0.01, alpha = 0.1) +
    facet_grid(batches ~ lambda, scales = "free",
               labeller = label_bquote(cols = lambda == .(lambda), rows = b == .(batches))) +
    scale_x_continuous("", limits = c(-40, 40), labels = NULL, breaks = NULL) +
    scale_y_continuous("", limits = c(-40, 40), labels = NULL, breaks = NULL) +
    guides(color = FALSE) +
    ggtitle("Effect of Momentum and Reduced Training Batches")
  ggsave(momentumPlot,
         filename = system.file(package = "largeVis", "vignettedata/momentumplot.png"),
         width = 6, height = 4)
}
```
```r
img <- readPNG(system.file(package = "largeVis", "vignettedata/momentumplot.png"))
grid.raster(img)
```
`largeVis` also includes three clustering algorithms: DBSCAN, OPTICS, and HDBSCAN.
The implementations here are not intended for general use. Rather, the LargeVis algorithm calculates, and `largeVis` retains, nearest-neighbor information for its inputs. Much of the computation time of the three clustering algorithms is spent identifying neighbors and calculating distances. By reusing the neighbor data generated by `largeVis`, these implementations aim to make it practical to apply these computationally expensive algorithms to very large datasets.
All three algorithms attempt to cluster points by linking a point to its nearest neighbors, using a modified definition of distance. The *core distance* of a point is the distance to its $K$th nearest neighbor; it is an (inverse) measure of the density of the space around a point. The *reachability distance* between two points is the greater of the distance between them and their core distances. When the neighbors of a point $p$ are identified to link it with its nearest neighbors, no point $q$ can be closer to $p$ than $p$'s core distance. Reachability distance thus combines both distance and density. (Differences in the definitions of core distance and reachability distance among the algorithms are not addressed here.)
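These definitions can be made concrete with a toy base-R example. The points, the choice of $K$, and the mutual-reachability formulation used here are invented for illustration and do not track any one of the three algorithms exactly:

```r
# Five 1-d points: a dense group near 0 and a sparse pair near 10
x <- c(0, 1, 2, 10, 11)
d <- as.matrix(dist(x))
K <- 2

# core distance: distance to the Kth nearest neighbor
# (sort(row)[1] is the self-distance 0, so the Kth neighbor is at K + 1)
core <- apply(d, 1, function(row) sort(row)[K + 1])

# mutual reachability distance: the greater of the pairwise distance
# and the two core distances
reach <- function(i, j) max(d[i, j], core[i], core[j])
```

The dense points get small core distances while the sparse pair gets large ones, so the reachability distance between two sparse points is dominated by their core distances rather than the raw gap between them.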
DBSCAN [@Ester96adensity-based] attempts to achieve a flat clustering using a consistent density threshold, $\epsilon$. Points with more than `minPts` neighbors in their $\epsilon$-neighborhoods are *core points*. Where the $\epsilon$-neighborhoods of core points overlap, the neighborhoods are merged into a single cluster. DBSCAN thus takes two hyperparameters, $\epsilon$ and `minPts`.
The objects returned by the `largeVis` DBSCAN implementation are compatible with the `dbscan` objects produced by the `dbscan` package [@hahsler].
The following chart illustrates the effect of the $\epsilon$ and `minPts` parameters on the performance of DBSCAN.
```r
load(system.file(package = "largeVis", "vignettedata/spiral.Rda"))
dat <- spiral
vis <- largeVis(t(dat), K = 20, save_edges = TRUE, save_neighbors = TRUE, sgd_batches = 1)
set <- rbind(Map(f = function(y) {
  rbind(Map(f = function(x) {
    clust <- lv_dbscan(vis, eps = x, minPts = y)$cluster
    data.frame(cluster = clust, eps = x, minPts = y)
  }, c(1, 3, 5)))
}, c(5, 10, 20)))
lbind <- function(x) do.call(rbind, x)
set <- lapply(set, FUN = lbind)
set <- lbind(set)
set$x <- rep(dat[, 1], 9)
set$y <- rep(dat[, 2], 9)
set$cluster <- factor(set$cluster)
set$eps <- ordered(set$eps, labels = paste("epsilon == ", c(1, 3, 5), sep = ""))
set$minPts <- ordered(set$minPts, labels = paste("minPts == ", c(5, 10, 20), sep = ""))
ggplot(data = set, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 0.5, alpha = 0.7) +
  facet_grid(minPts ~ eps, labeller = label_parsed) +
  scale_x_continuous("", breaks = NULL) +
  scale_y_continuous("", breaks = NULL) +
  guides(color = FALSE) +
  ggtitle("Effect of eps and minPts on DBSCAN results")
```
OPTICS [@Ankerst:1999:OOP:304181.304187] applies logic similar to DBSCAN's to build a hierarchical clustering. Beginning with a seed point, OPTICS finds the point with the lowest reachability distance (up to the cutoff, $\epsilon$) to any point already in the graph, and adds it. The result is an ordering of points by decreasing density. When no remaining point has a reachability distance to the graph of less than $\epsilon$, OPTICS begins again with a new seed point.
This implementation of OPTICS has a hyperparameter, `useQueue`, which controls the order in which points are processed; in effect, it determines which points become the first points of newly discovered clusters. `useQueue = FALSE` reproduces the behavior of the `dbscan` package and the ELKI implementation. If `useQueue` is set to `TRUE`, the algorithm uses points in denser regions of the space as the seeds for new clusters. The difference is illustrated in the following *reachability plots* for the spiral dataset:
```r
optClust <- lv_optics(vis, eps = 5, useQueue = FALSE, minPts = 5)
optClust2 <- lv_optics(vis, eps = 5, useQueue = TRUE, minPts = 5)
ggplot(data.frame(
    o = c(optClust$order, optClust2$order),
    d = c(optClust$reachdist, optClust2$reachdist),
    useQueue = rep(c("No Queue", "Use Queue"), each = length(optClust$order))
  ), aes(x = o, y = d)) +
  geom_point(stat = 'identity', size = 0.1) +
  geom_bar(stat = 'identity', alpha = 0.3) +
  facet_grid(useQueue ~ .) +
  scale_x_continuous("Order") +
  scale_y_continuous("Reachability Distance")
```
OPTICS can be thought of as a hierarchical generalization of DBSCAN. OPTICS clusterings are readily convertible to DBSCAN clusterings by cutting clusters where the reachability distance exceeds a threshold. This can be done with the `extractDBSCAN` function of the `dbscan` package. The resulting clusters are identical to those that would be obtained by running DBSCAN with the same parameters, except in the assignment of noise points.
```r
suppressWarnings(opticsPoints <- do.call(rbind, Map(f = function(x) {
  clust <- dbscan::extractDBSCAN(optClust, x)$cluster
  data.frame(cluster = clust, eps = x)
}, c(1, 3, 5))))
opticsPoints$cluster <- factor(opticsPoints$cluster)
opticsPoints$x <- rep(dat[, 1], 3)
opticsPoints$y <- rep(dat[, 2], 3)
opticsPoints$eps <- factor(paste("epsilon ==", opticsPoints$eps, sep = ""))
ggplot(data = opticsPoints, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 0.5, alpha = 0.7) +
  facet_grid(. ~ eps, labeller = label_parsed) +
  scale_x_continuous("", breaks = NULL) +
  scale_y_continuous("", breaks = NULL) +
  guides(color = FALSE) +
  ggtitle("OPTICS Clusters With EPS Cuts")
```
The `dbscan` package has other functions for cutting and visualizing OPTICS clusterings as well. The `optics` objects produced by `largeVis` are compatible with the ones produced by the `dbscan` package.
HDBSCAN [@Campello2013] aims to make the DBSCAN and OPTICS approach flexible for graphs with different densities in different regions, by building a hierarchical density-based clustering from which flat clusters are extracted. HDBSCAN does not need the $\epsilon$ hyperparameter, which can be difficult to set; rather, HDBSCAN, in effect, determines different values of $\epsilon$ for different parts of the dataset.
```r
suppressWarnings(set <- do.call(rbind, Map(f = function(y) {
  rbind(Map(f = function(x) {
    hdclust <- largeVis::hdbscan(vis, K = y, minPts = x)$cluster
    data.frame(cluster = as.numeric(hdclust), K = y, minPts = x)
  }, c(6, 10, 20)))
}, c(2, 6, 12))))
lbind <- function(x) do.call(rbind, x)
set <- lbind(set)
set$x <- rep(dat[, 1], 9)
set$y <- rep(dat[, 2], 9)
set$cluster <- factor(set$cluster)
set$K <- factor(paste("K=", set$K))
set$minPts <- factor(paste("minPts=", set$minPts))
ggplot(data = set, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 0.5, alpha = 0.7) +
  facet_grid(minPts ~ K) +
  scale_x_continuous("", breaks = NULL) +
  scale_y_continuous("", breaks = NULL) +
  guides(color = FALSE) +
  ggtitle("HDBSCAN Is Robust\nTo Hyperparameter Changes")
```
The `largeVis` implementation of HDBSCAN produces flat and hierarchical clusterings simultaneously. The hierarchical clusterings are compatible with `hclust` and other R hierarchical clustering packages and tools by means of an `as.dendrogram.hdbscan` function, which converts them to standard R dendrograms. The package also includes a function for visualizing the clustering.