UMAP and SOM from Distance Matrices"
In Mercator: Clustering and Visualizing Distance Matrices

knitr::opts_chunk$set(echo = TRUE, fig.width=6, fig.height=6, echo=TRUE)

Introduction

We recently added visualizations based on Self-Organizing Maps (SOM) and Uniform Manifold Approximation and Projection (UMAP) for objects in the Mercator package. This addition relies on the core implementations in the kohonen and umap packages, respectively. The main challenge that we faced was that both of those implementations want to receive the raw data as inputs, while Mercator objects only store a distance matrix computed from the raw data. The purpose of this vignette is, first, to describe how we worked around this restriction, and second, to illustrate how to use these methods in the package.

The Mercator Class

First we load the package.

suppressMessages( suppressWarnings( library(Mercator) ) )

Next, we load a "fake" set of synthetic continuous data that comes with the Mercator package.

set.seed(36766)
data(fakedata)
ls()
dim(fakedata)
dim(fakeclin)

UMAP

Let's create a UMAP visualization directly from the raw data.

library(umap)
udirect <- umap(t(fakedata))
plot(udirect$layout, main="Original UMAP")

We aren't going to do anything fancy with labels or colors in this plot; we just want an idea of the main structure in the data. In particular, it looks like there are seven clusters, divided into one group three and two groups of two each.

Next, we are going to create a Meractor object with a multi-dimensional scaling (MDS) visualization. Inspired by the first plot, we will arbitrarily assign seven groups. (This assignment uses the PAM algorithm internally.)

mercury <- Mercator(dist(t(fakedata)), "euclid", "mds", 7)
plot(mercury, main = "MDS")

Interestingly, here we see a possibly different number of groups. But that's a separate issue, and we don't want to distracted from our main thread.

We want to confirm that the actual raw data is not contained in mercury, just the distance matrix.

summary(mercury)
slotNames(mercury)
mercury@metric
mercury@palette
mercury@symbols
table(mercury@clusters)
class(mercury@view)
names(mercury@view)
class(mercury@view[["hclust"]])
class(D <- mercury@distance)

Next, we add a "umap" visualization to this object.

mercury <- addVisualization(mercury, "umap")
plot(mercury, view = "umap", main="UMAP from distance matrix")

Notice that the plot method knows that we want to see the internal "layout" component of the umap object. It also automatically assigns axis names that (subliminally?) remind us that we computed this visualization with UMAP. And it colors the points using the same values from the initial PAM clustering assignments.

As an aside, we point out that the PAM clustering doesn't really match the implicit UMAP clustering of the data.

Euclideanization

We obviously constructed this plot just from the distance matrix, not from the raw data. It neverheless has a structure essentially the same as the original one. But how?

Well, internally, we use this function:

Mercator:::makeEuclidean

The first four lines of code inside this function are basically the algorithm used by the cmdscale function to compute the eigenvalues need to create its dimension reduction for visualization. The count of the number of nonzero eigenvalues (which defines R) is careful to stay away from zero. We do this to avoid round-off errors. If we miss by one or two eignevalues, than the final call to cmdscale will generate a confusing warning. That final line, by the way, creates an embedding into Euclidean space that should preserve almost all of the distances between pairs of points in the original data set.

To see that, let's carry out that step explicitly.

X <- Mercator:::makeEuclidean(D)
dim(X)

Now we compute distances between pairs of points in this new embedding.

D2 <- dist(X)
delta <- as.matrix(D) - as.matrix(D2)
summary(as.vector(delta))

We see that the absolute errors between any pair of points in the two distance computations are less than $10^{-12}$. And that is the fundamental reason that we expect equivalent structures whether we apply UMAP to the original data or to the re-embedded data arising from the distance matrix.

Self-Organizing Maps

The same approach works to compute self-organizing maps from a distance matrix. Here is the result using the data matrix.

mercury <- addVisualization(mercury, "som",
                            grid = kohonen::somgrid(6, 5, "hexagonal"))
plot(mercury, view = "som")

And here is the corresponding result using the original data.

library(kohonen)
mysom <- som(t(fakedata), grid = somgrid(6, 5, "hexagonal"))
plot(mysom, type = "mapping")

This time, we will work a little harder to color the plot in the same way.

plot(mysom, type = "mapping", pchs=16, col=Mercator:::colv(mercury))

Warning

Qualitatively, most of the SOM plots should be similar. The fundamental exception, however, is the "codes" plot. This plot shows the patterns of intensities in each of the underlying dimensions (averaged over the assigned objects in each node). Because we have performed a multi-dimensional scaling analysis, we have changed the underlying coordinate system. Here are the corresponding plots.

plot(mysom, type = "codes", main="Original", shape = "straight")
plot(mercury, view = "som", type="codes", main="After MDS")

Conclusions

As we have seen, when we start with a Euclidean distance matrix, we can get qualitatively equivalent results from the original data or from creating a "realization" of the distance matrix in a large Euclidean space. The major advantage, however, arises when we start with a completely different distance metric. In that case, we cannot apply SOM or UMAP at all. But we can still embed that distance matrix into a Eucldean space and then run SOM or UMAP.