Traditionally, dimensionality reduction methods make use of the k-nearest neighbors graph, but there are some exceptions: Trimap and PaCMAP use an extended version of k-nearest neighbors, based on the "generalized" distances used in self-tuning spectral clustering, and PaCMAP goes further by also sampling the kth closest neighbors from a locally sampled batch. LocalMAP also adjusts the neighbor graph dynamically during the embedding by sampling distances of the embedded points.
For the purposes of producing better-separated clusters, Dalmia and Sia (see also the article in the UMAP docs) consider the mutual nearest neighbors (augmented with some extra neighbors).
The way PaCMAP uses its neighbors is beyond the architecture of `uwot` to handle easily, but the mutual nearest neighbors graph is something that `uwot` can deal with. As detailed in the article on `uwot`'s nearest neighbors format, we can provide as input a sparse distance matrix to represent nearest neighbors, giving substantial flexibility. However, it's not always easy to take advantage of that. Here I will show some code that can generate a suitably-massaged mutual nearest neighbors graph and see if it has any effect on UMAP.
Although in a kNN graph each item only has k neighbors, any workable dimensionality reduction method that takes the kNN graph as input will symmetrize that graph; otherwise optimization is difficult. The outcome of symmetrizing the kNN graph is that if $j$ is a neighbor of $i$ in the kNN graph, $i$ is also guaranteed to be a neighbor of $j$ in the symmetrized graph and vice versa. If one item tends to show up in the kNN lists of many other items, the symmetrization means that there can be up to $N$ edges involving that very popular item. Such items are called hubs, and not only do they make search on the graph more difficult (a hub is likely to show up in most queries) but, because for a given $k$ there are a limited number of edges to distribute over the dataset, some items don't get their fair share. This is more likely to lead to outlier points that get embedded far away from the main body of the data, as they have no particularly strong bonds to the rest of the data.
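If you want to check how hubby your own kNN graph is, a quick diagnostic is to count each item's k-occurrence, i.e. how often it appears in other items' neighbor lists. This snippet is not part of the method below, just a sketch assuming the dense neighbor format used throughout this article (where each item's first neighbor is itself):

```r
# count how often each item appears in the kNN lists of other items
# (its k-occurrence); unusually large values indicate hubs
k_occurrence <- function(nn_graph) {
  idx <- nn_graph$idx[, -1] # drop the self-neighbor column
  tabulate(as.vector(idx), nbins = nrow(nn_graph$idx))
}
# e.g. max(k_occurrence(data_knn_15)) gives the k-occurrence of the biggest hub
```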
You can think of the mutual k-nearest neighbors graph (the MNN graph) as a symmetrized kNN graph where an edge only appears if item $i$ is a neighbor of $j$ and $j$ is a neighbor of $i$: in the symmetrized kNN graph described above, the requirement is only that $i$ be a neighbor of $j$ or $j$ be a neighbor of $i$. This means that an item can have no more than $k$ mutual neighbors. Hub problem solved.
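To make the and-versus-or distinction concrete, here is a minimal sketch using the `Matrix` package. It assumes a sparse distance matrix `sp` in the format used later in this article (entry $(i, j)$ holds the distance from item $j$ to its neighbor $i$) and that distances are symmetric:

```r
library(Matrix)

spt <- Matrix::t(sp)
# "and" (mutual): the element-wise product is non-zero only where the edge
# exists in both directions; sqrt recovers the distance because the two
# entries are equal when distances are symmetric
mnn <- sqrt(sp * spt)
# "or" (the usual symmetrization): sp + spt counts mutual edges twice,
# so subtract one copy to leave each distance unchanged
knn_sym <- sp + spt - mnn
```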
The downside of the MNN graph is that, even for very well-behaved kNN graphs, the MNN graph will contain several items with no edges to any other item: while they have $k$ nearest neighbors, none of those neighbors reciprocate. UMAP requires local connectivity, so it cannot handle a dataset containing such Billy-no-mates items.
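You can count how many of these isolated items you have directly from the sparse MNN matrix, assuming `mnn` was computed as in the sketch above (or by the `sp_mutual` function below):

```r
# items whose column (and, by symmetry, row) in the MNN matrix is empty
sum(Matrix::colSums(mnn != 0) == 0)
```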
As a result, the Dalmia and Sia paper has to go to some extra lengths to augment the MNN graph with extra edges, which you can see at their repo. This involves minimum spanning trees and Dijkstra's algorithm to add new edges (with distances given by the path length).
One alternative they consider is to take any fully disconnected nodes and re-add the first nearest neighbor. They cite a paper by de Souza, Rezende and Batista for this, but there's no extra detail in that paper; the authors just note that they do this to avoid disconnected nodes.
It should be said that this "adjacent neighbor" approach to MNN augmentation does not work very well, leading to a lot of disconnected components and small clusters of points scattered over the embedding. This is not very surprising.
As I don't currently fancy implementing any of the path neighbor approach, I propose the following strategy: just add back more than one neighbor to disconnected nodes. In fact, we would probably do well to ensure every item in the dataset has at least $m$ neighbors where $m \lt k$, not just the items which are completely disconnected, as we can already see that items with only one neighbor don't fare well. Obviously adding back all the neighbors would recreate the original symmetrized kNN graph, so we want to keep $m$ small, but bigger than one.
To name this concept I am going with the "Balanced MNN graph" where "balanced" is meant to suggest that we are trying to ensure that no node is left too isolated in the MNN graph. To be specific with numbers I will refer to a ($k$, $m$)-BMNN graph. I would welcome hearing if there is existing literature with a different (probably better) name.
The basic formula for creating the BMNN graph is as follows:

1. Build the k-nearest neighbors graph.
2. Keep only the mutual edges to form the MNN graph.
3. Find all items with fewer than $m$ neighbors.
4. For each of those items, add back the edge to the next of its $k$ nearest neighbors that hasn't been considered yet.
5. Repeat from step 3, stopping when every item has at least $m$ neighbors or the first $m$ nearest neighbors of every deficient item have been considered.
The loop in steps 3-5 must terminate within $m$ steps: at worst, for an item with no mutual neighbors, we add back each of its first $m$ nearest neighbors in turn. For items with between 1 and $m - 1$ neighbors, we don't know whether the next nearest neighbor is already connected. If it isn't, adding the edge takes us one step closer to termination; if it is, adding the edge does nothing. So we can't add more than $m$ edges to any node, and no node will end up with too many extra edges.
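To see the mechanics on a made-up toy example (runnable once the `balance_mnn` function and helpers further down are defined): five points on a line at positions 0, 1, 2, 3 and 13. The fifth point considers points 4 and 3 its neighbors, but nobody reciprocates, so it is isolated in the MNN graph; balancing with $m = 2$ adds back its first nearest neighbor edge.

```r
# dense 3-nearest neighbor data for 5 points at positions 0, 1, 2, 3, 13;
# the first neighbor of each point is itself at distance zero
nn <- list(
  idx  = cbind(1:5, c(2, 1, 2, 3, 4), c(3, 3, 4, 2, 3)),
  dist = cbind(rep(0, 5), c(1, 1, 1, 1, 10), c(2, 1, 1, 2, 11))
)
bmnn <- balance_mnn(nn, m = 2)
# the mutual edges (1-2, 2-3, 3-4) survive, and the edge 4-5 (distance 10)
# is added back so that point 5 is no longer isolated
```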
Creating the MNN from the sparse kNN representation is easy, assuming your distances are symmetrical:
```r
sp_mutual <- function(sp) {
  sp_t <- Matrix::t(sp)
  res <- sqrt(sp * sp_t)
  res
}
```
Balancing the MNN is a bit more involved, especially if you want it to be efficient and not do anything disastrous like accidentally attempting to turn your sparse matrix dense. We must go from 5 lines of code to more than 100. Below is the code for your delight and delectation. I may officially export this from `uwot` some day, but for now it remains an experiment for you to copy and paste.
Input to `balance_mnn` is:

* `nn_graph` -- the dense representation of the k-nearest neighbors, i.e. a list of two dense matrices, `idx` and `dist`, both of dimensions `(n, k)`. See the article on the nearest neighbor format for more. `uwot` exports in this format, and so does `rnndescent`.
* `k` -- optionally, the number of nearest neighbors to use. If you omit this, all the neighbors in `nn_graph` will be used. Because generating nearest neighbors is time-consuming, you may wish to create a large neighbor graph once (e.g. set `k = 50` or `k = 150`) and then use a smaller subset as the need arises (e.g. 15 is typical for UMAP). The `nng` helper function below can do this. Pro-tip: I use `rnndescent::brute_force_knn` to generate exact nearest neighbors for `k = 150` and then save them once and for all. My now rather decrepit 8th gen Intel laptop still has enough juice to crank through datasets of up to 100,000 items and up to 10,000 features with 6 threads in a "reasonable" amount of time, where "reasonable" for me means anything up to a few hours, but not "going on holiday for two weeks".
* `m` -- the minimum number of neighbors to ensure each item has. This needs to be at least two.

One slightly weird thing about specifying `m` is that you will actually get back one less edge than the value you ask for. This is because in the dense format every item's nearest neighbor is itself, with a distance of zero. This disappears when we convert to the sparse format, so your dense k-nearest neighbor graph only contains $k - 1$ distinct neighbors for each item. The self-neighbor counting as one of the neighbors is how UMAP has always done it, so we won't be changing the habit of a lifetime. You will just have to remember that to reproduce the "adjacent neighbor" method you must specify `m = 2`. The function won't let you go lower than `m = 2` anyway.
```r
# create an m-balanced k-mutual nearest neighbors graph from the dense k-nearest
# neighbors graph `nn_graph`. m must be less than or equal to k. If `k` is not
# specified then all of the neighbors are used.
balance_mnn <- function(nn_graph, m, k = NULL) {
  if (m < 2) {
    stop("m must be at least 2")
  }
  if (!is.null(k)) {
    nn_graph <- nng(nn_graph, k)
  }
  k <- ncol(nn_graph$idx)
  if (m > k) {
    stop("m must be less than or equal to k")
  }

  # create the mutual neighbors as a sparse matrix
  # each item's neighbor distances are stored by column
  bmnn <- k_to_m(nn_graph)

  bmnn_adj <- sp_to_adj(bmnn)
  bmnn_edges <- colSums(bmnn_adj)
  too_few_edges_mask <- bmnn_edges < (m - 1)

  # add edges back from the knn graph to "fill up" any items with too few edges
  # do it one column at a time to avoid over-filling any one item
  for (l in seq(from = 2, to = m)) {
    if (sum(too_few_edges_mask) == 0) {
      break
    } else {
      message("Iter ", l, " cols to process: ", sum(too_few_edges_mask))
    }
    knn_idxl <- nn_graph$idx[, l]
    masked_knn_distl <- nn_graph$dist[, l] * too_few_edges_mask
    masked_knn <- vec_to_sp(knn_idxl, masked_knn_distl)
    knn_adj <- sp_to_adj(masked_knn)
    mk_adj <- bmnn_adj + knn_adj
    bmnn <- bmnn + masked_knn
    # where an edge was present in both the BMNN and the added kNN edges,
    # the distances were summed, so divide by the count to average them
    bmnn@x <- bmnn@x / mk_adj@x
    bmnn_adj <- sp_to_adj(bmnn)
    bmnn_edges <- colSums(bmnn_adj)
    too_few_edges_mask <- bmnn_edges < (m - 1)
  }
  bmnn
}

## Helper functions

# truncate a dense neighbor graph to the first k neighbors
nng <- function(graph, k) {
  list(idx = graph$idx[, 1:k], dist = graph$dist[, 1:k])
}

# convert a dense knn graph to a sparse matrix format
# NB neighbors are stored by column
nng_to_sp <- function(graph, nbrs = NULL) {
  if (!is.null(nbrs)) {
    graph <- nng(graph, nbrs)
  }
  idx <- graph$idx
  dist <- graph$dist

  n_nbrs <- ncol(idx)
  n_row <- nrow(idx)
  n_ref <- nrow(idx)

  i <- as.vector(idx)
  j <- rep(1:n_row, times = n_nbrs)
  x <- as.vector(dist)

  keep_indices <- !is.na(x)
  i <- i[keep_indices]
  j <- j[keep_indices]
  x <- x[keep_indices]

  res <- Matrix::sparseMatrix(i = i, j = j, x = x, dims = c(n_row, n_ref), repr = "C")
  Matrix::drop0(res)
}

# convert a sparse knn matrix to the mutual nearest neighbors graph
sp_mutual <- function(sp) {
  sp_t <- Matrix::t(sp)
  res <- sqrt(sp * sp_t)
  res
}

# convert a dense kNN graph to a mutual nearest neighbors graph
k_to_m <- function(graph, nbrs = NULL) {
  sp <- nng_to_sp(graph, nbrs)
  res <- sp_mutual(sp)
  res
}

# convert sparse matrix to an adjacency matrix (1 if there's an edge, 0 otherwise)
sp_to_adj <- function(sp) {
  spi <- sp
  spi@x <- rep(1, length(spi@x))
  spi
}

# convert the vector of values x to a sparse matrix where the row is given
# by the equivalent value in the vector i, and the column is the index of the
# value in the vector x. The resulting matrix is a square matrix with the same
# number of rows as the length of x.
vec_to_sp <- function(i, x) {
  n_row <- length(i)
  j <- 1:n_row

  keep_indices <- !is.na(x)
  i <- i[keep_indices]
  j <- j[keep_indices]
  x <- x[keep_indices]

  res <- Matrix::sparseMatrix(
    i = i,
    j = j,
    x = x,
    dims = c(n_row, n_row),
    repr = "C"
  )
  Matrix::drop0(res)
}
```
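As a quick sanity check after running it (this snippet is illustrative, assuming a dense neighbor graph called `data_knn_15` as in the workflow below), every item should end up with at least $m - 1$ edges:

```r
bmnn <- balance_mnn(data_knn_15, m = 5)
# minimum number of edges per item; should be at least m - 1 = 4
min(Matrix::colSums(sp_to_adj(bmnn)))
```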
Dalmia and Sia note that using a MNN graph leads to many disconnected components, which can cause issues with the default spectral initialization in UMAP. Their approach creates a connected graph to get round this problem. An alternative, which I pursue in all the results here, is to always initialize from the first two components of PCA. Even in the case of the kNN graph, you can still get disconnected components for a given value of `n_neighbors`, and `uwot` will fall back to PCA in this case anyway.
Putting this together, my recommended workflow is:
```r
library(rnndescent)
library(uwot)

# get the approximate knn: use as many threads as you want
data_knn_15 <- rnndescent::rnnd_knn(data, k = 15, n_threads = 6)
# or if you have time on your hands
# data_knn_15 <- rnndescent::brute_force_knn(data, k = 15, n_threads = 6)

pca_init <- umap2(data, nn_method = data_knn_15, init = "pca", n_epochs = 0)

# Normal UMAP
data_umap <- umap2(data, nn_method = data_knn_15, init = pca_init)

# Create the (15, 5)-balanced MNN
data_bmnn_15_5 <- balance_mnn(data_knn_15, k = 15, m = 5)

# UMAP with BMNN
data_bmnn_umap <- umap2(data, nn_method = data_bmnn_15_5, init = pca_init)
```
In the work presented by Dalmia and Sia, multiple choices of `k` are tried (from 10 to 50), along with some changes to `min_dist`. I will stick with `k = 15` for all the results shown here.
Does the BMNN graph produce results that differ from the kNN graph? Let's compare the UMAP embeddings of the MNIST digits and Fashion MNIST datasets using the kNN graph and the BMNN graph. From left to right we have the original UMAP embedding, the embedding using the (15, 2)-BMNN graph, which should give similar results to the MNN+adjacent neighbors approach used by Dalmia and Sia, and an embedding using the (15, 5)-BMNN graph, which (I hope) should give a more "connected" embedding. You can click on any of the images to see a (slightly) larger version.
| Dataset | 15-NN | (15,2)-BMNN | (15,5)-BMNN |
| :---: | --- | --- | --- |
| MNIST | | | |
| fashion | | | |
Although it's not that easy to see unless you click on the image, the (15,2)-BMNN results do show a sprinkling of outlier points throughout the embedding. This confirms the observations of Dalmia and Sia and so this doesn't seem like a promising approach. I won't bother showing results for (15,2)-BMNN graphs going forward. The (15,5)-BMNN on the other hand seems quite well behaved. Now, if you were to say "I put it to you that it doesn't look very different from the normal UMAP results" I would have to agree with you. But there is perhaps slightly more separation of the clusters and some finer detail? Possibly that's entirely within the expected variation of a given UMAP run.
Fortunately there are some datasets where we do see more of a difference.
Embedding COIL-20 and COIL-100 produces a series of loops, some of which typically get a bit tangled up together or mangled. If ever there were datasets that might benefit from focusing on the mutual neighbors, it's these two.
| Dataset | 15-NN | (15,5)-BMNN |
| :---: | --- | --- |
| coil20 | | |
| coil100 | | |
I no longer feel like I am lying to myself when I say I can see a difference between the two embeddings. The BMNN graph definitely seems to have untangled the loops a bit more.
There are some datasets for which UMAP struggles to produce useful results. These include `norb`, `macosko2015`, `20NG` and `cifar10`. `macosko2015` is a particular challenge: forming the MNN graph (with `k = 15`) results in 89% of the dataset being entirely disconnected! A corresponding problem is that you have to be quite careful with how large `m` is in the BMNN: as the vast majority of edges in the kNN graph point towards the hubs, adding even a small number of kNN edges back can create a hub in the BMNN. The symmetrized kNN graph has a hub with 10,000 edges associated with it (around 20% of the dataset). The BMNN graph with `m = 5` reduces that, but the hub still has 4,000 edges.

`macosko2015` visualizes more nicely with some PCA applied first, as the hubness is greatly reduced (I am agnostic on whether this representation more faithfully reflects the biological reality), so I also include the same dataset after reducing to 100 dimensions with PCA.

| Dataset | 15-NN | (15,5)-BMNN |
| :---: | --- | --- |
| norb | | |
| macosko2015 | | |
| macosko2015pca100 | | |
| 20NG | | |
| cifar10 | | |
OK, so it turns out that the BMNN graph doesn't really help in any of these cases. `norb` is particularly disappointing in that the BMNN noticeably makes things worse.
For `macosko2015`, pretty much nothing happens, except the big blue round cluster has gained some slight mottling where it seems like some kind of fine(ish) structure is forming, which might translate to clusters. I couldn't tell you if it is biologically relevant though. I tested whether the hubness that is present in the kNN graph and creeps back into the BMNN is a determinant of this embedding by creating a BMNN where edges were added back not in order of shortest distance, but in order of smallest k-occurrence, i.e. favoring items that appear least often in the kNN graph. This reduces the maximum edge count in the BMNN to around 200 (a nearly 20-fold drop), but the embedding looks very similar to the vanilla UMAP result:
I include this image only because it is evidence against my long-held belief that hubs in the kNN graph are a problem for UMAP, so it is a good visualization for me to stare at the next time I start going down this road.
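For the curious, here is a sketch of one way to implement the k-occurrence ordering: reorder each item's neighbor list so that the least hub-like neighbors come first, then balance as usual. This illustrates the idea rather than reproducing the exact code used for the experiment above, and `reorder_by_occurrence` is a name I've made up here:

```r
# reorder each item's kNN list by ascending k-occurrence so that
# balance_mnn adds back edges to the least hub-like neighbors first
reorder_by_occurrence <- function(nn_graph) {
  n <- nrow(nn_graph$idx)
  # how often each item appears in other items' neighbor lists
  occ <- tabulate(as.vector(nn_graph$idx[, -1]), nbins = n)
  idx <- nn_graph$idx
  dist <- nn_graph$dist
  for (r in seq_len(n)) {
    # keep the self-neighbor in column 1, sort the rest by k-occurrence
    ord <- c(1, 1 + order(occ[idx[r, -1]]))
    idx[r, ] <- idx[r, ord]
    dist[r, ] <- dist[r, ord]
  }
  list(idx = idx, dist = dist)
}
# e.g. bmnn_occ <- balance_mnn(reorder_by_occurrence(data_knn_15), m = 5)
```

Note that reordering doesn't change the MNN graph itself (the mutual edges are the same whatever the column order); it only changes which kNN edges get added back during balancing.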
`macosko2015pca100` also shows very little change.
For `20NG` and `cifar10` the only effect is to make the main blob of data more compressed in the middle of the plot, due to extra small clusters now being located at the edges. While not attractive, this could be indicating some anomalous clustering in these small subsets, which might be worth investigating further. For example, these could be duplicates or near-duplicates. I haven't actually done that investigation though, so I can't say if that's the case.
Using the BMNN to approximate the MNN is a mixed bag. In some cases it can help, but in others it either does very little, or can make things worse -- although there's always a chance that this could be a diagnostic of unusual structure in the data.
Also worth noting is that the BMNN graph is much sparser than the symmetrized kNN graph that UMAP uses by default. In the examples here the (15, 5)-BMNN graph has around 1/3-1/2 the number of edges of the kNN graph. This can give a moderate speedup in the UMAP embedding step: around 25-50%.
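If you want to verify the edge counts for your own data, you can compare the number of stored entries directly. A sketch, reusing the helpers above and the hypothetical `data_knn_15` and `data_bmnn_15_5` from the workflow:

```r
knn_sp <- nng_to_sp(data_knn_15)
# "or"-symmetrized kNN graph: union of the kNN edges in both directions
knn_sym <- knn_sp + Matrix::t(knn_sp) - sp_mutual(knn_sp)
Matrix::nnzero(knn_sym)        # edges in the symmetrized kNN graph
Matrix::nnzero(data_bmnn_15_5) # edges in the BMNN graph
```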
The BMNN results are certainly not as visibly different as those seen in the Dalmia and Sia paper. Why is this? It could be that I need to tweak `k` and `m` specifically for each dataset, although that would make the BMNN approach very unappealing. It could also be that a spectral initialization using the BMNN approach would give different results, but in order to do that, we would need a strategy to form a fully connected graph Laplacian. In general that would be a useful thing to generate from any neighbor graph, but it seems like something that is better suited to a graph or nearest neighbor package (like `rnndescent`) than to `uwot` itself.
Certainly the BMNN is no panacea for datasets that don't show much structure, but that was always a pretty forlorn hope. My advice is that if your UMAP result does show some kind of structure, then the BMNN is worth trying, as it's not difficult or time-consuming to generate.