densMAP is a variation on UMAP that attempts to preserve the differences in density in the original data by adding a new component to the cost function. It is available in the Python UMAP implementation by setting `densmap=True`. See also this excellent tutorial. There has been a request to add densMAP to `uwot` since the start of 2021. This document describes the approximate version of densMAP that is available in `uwot`, which may be good enough. But first, a bit more about densMAP.
densMAP can be seen as adding a regularization to the UMAP cost function: the "local radius" of each observation is calculated in both the input and output space, and the Pearson correlation between the two sets of radii is optimized.
The radius of observation $i$, $R_i$, is calculated as:

$$ R_i = \frac{1}{\sum_j P_{ij}} \sum_j d_{ij}^2 P_{ij} $$

where $d_{ij}$ is the distance between points $i$ and $j$ and $P_{ij}$ is the symmetrized edge weight between $i$ and $j$, i.e. the radius is the edge-weighted average of the squared distances between $i$ and each of its neighbors. Because the matrix has been symmetrized, $i$ may have a different number of neighbors than the `n_neighbors` parameter that the user specifies.
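As a concrete (toy) illustration of the formula above, with made-up edge weights and distances:

```python
import numpy as np

# Toy example: 4 points with a symmetrized edge weight matrix P and
# pairwise distances d. In densMAP, P comes from UMAP's symmetrized
# fuzzy graph; these values are invented purely for illustration.
P = np.array([
    [0.0, 0.8, 0.2, 0.0],
    [0.8, 0.0, 0.5, 0.1],
    [0.2, 0.5, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
])
d = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.5, 2.5],
    [2.0, 1.5, 0.0, 0.5],
    [3.0, 2.5, 0.5, 0.0],
])

# R_i = (sum_j d_ij^2 P_ij) / (sum_j P_ij):
# the edge-weighted mean of the squared distances to each neighbor
R = (P * d**2).sum(axis=1) / P.sum(axis=1)
```

Zero-weight pairs (non-neighbors) contribute nothing, so each $R_i$ only reflects the point's own neighborhood.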
To get a feel for what densMAP does, below are some results from the Python UMAP implementation (i.e. no `uwot` output here), with and without `densmap=True`. As recommended by the current UMAP README, I also set `n_neighbors=30`, except for `subset_clusters`, which I'll explain when I get to it. Everything else was left as the defaults.

The images below compare the output of the `fit_transform` method with the default settings and with `densmap=True`. When there were more than 1,000 points, I used `smoothScatter` to plot the results.

The first two datasets are simple simulation sets based on How to Use t-SNE Effectively, specifically sections 2 ("Cluster sizes in a t-SNE plot mean nothing") and 6 ("For topology, you may need more than one plot").
This dataset consists of two 50D Gaussians of 5,000 points each. One of the clusters has a standard deviation 10x the other. I used snedata to generate this data (via the `two_different_clusters_data` function).
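For reference, the simulation can be reproduced along these lines (a numpy sketch of my own; the original used the R package snedata, and the cluster offset here is an assumption chosen just to keep the clusters separated):

```python
import numpy as np

rng = np.random.default_rng(42)
n, dim = 5000, 50

# Two 50D Gaussian clusters; one has 10x the standard deviation of the other.
small = rng.normal(loc=0.0, scale=1.0, size=(n, dim))
big = rng.normal(loc=50.0, scale=10.0, size=(n, dim))  # offset to separate clusters
data = np.vstack([small, big])
```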
Like t-SNE, default UMAP shows both of these clusters as the same size. densMAP shows one of the clusters as smaller. A good start.
This dataset once again features two 50D Gaussians of 5,000 points each, but this time they overlap entirely. The larger cluster has a standard deviation of 50, the smaller a standard deviation of 1.
For this dataset I used `n_neighbors=150`, much larger than the `n_neighbors=30` setting used for the other plots. This was to be consistent with the plots I generated with `uwot`. With `uwot` I precalculated exact nearest neighbors for all datasets, and for this dataset, using the exact 30 nearest neighbors results in the smaller cluster being embedded at the edge of the larger cluster. Presumably 30 nearest neighbors doesn't produce enough edges between the two clusters to properly situate the smaller cluster. The approximate nearest neighbors routine used in UMAP produces edges between sufficiently far-away neighbors to create a neighborhood that reflects the global structure better. I didn't investigate this thoroughly, but it sounds plausible to me.
UMAP considers these two clusters to be the same size. With densMAP, the ring-shaped structure that is found with high perplexities with t-SNE shows up. But at least the small cluster is inside the big one. Note that if you re-run this with `n_neighbors=30`, the results are not noticeably changed.
This is a synthetic dataset I like to use: 1000 points from a fuzzy 10D simplex.
I can't say I like what densMAP has done here. There is one point that has been flung far away from the rest of the plot. You can see from the coloring that this is because it has a large local radius. This perhaps indicates that the local radius calculation can be affected by edge effects.
A swiss roll dataset of my own devising.
densMAP does a noticeably better job than UMAP of not ripping the swiss roll apart. As you would expect the local density to be fairly uniform across a 2D manifold, I don't know if there's an a priori reason to have thought densMAP would do a better job here.
A 3D S-curve with a hole, a data set used to validate the PaCMAP method (see also the GitHub repo).
The Palmer Penguins. The 's' in `spenguins` stands for scaled, because I filtered out entries with missing values and then Z-scaled the inputs.
A 3D point cloud of a mammoth at the Smithsonian, from Understanding UMAP, based on work originally done by Max Noichl.
The Olivetti Faces images.
The Frey Faces images.
The faces dataset used in Isomap (processed via this gist).
The MNIST digits images.
The orange cluster is the '1' digits, and its smaller radius is exceptionally obvious when coloring the UMAP plot by radius. It very noticeably shrinks when densMAP goes to work on it.
The Fashion MNIST images.
Similarly to the MNIST digits, the orange cluster (this time it's images of pants) is noticeably smaller than the other clusters.
The Kuzushiji MNIST images.
The Small NORB images.
You can see the largest of the blue/purple ring-like structures is broken up by UMAP, but densMAP preserves it. In fact, the densMAP plot is one of the nicer UMAP layouts of the Small NORB dataset I've seen.
The COIL-20 images.
The COIL-100 images.
The CIFAR-10 images.
The macosko2015 RNAseq data.
This dataset (an RNA sequence dataset) isn't a lot of fun to visualize with vanilla UMAP, but densMAP does not improve matters.
RNAseq data from the Allen Brain Atlas originally reported by Tasic and co-workers.
Another RNAseq dataset. densMAP has a very interesting effect on the cluster sizes, but because some of the clusters have a very low density, this has the effect of squashing the others into the center of the plot.
Another RNAseq dataset, found via the Picasso example notebook. I think the publication reference is https://doi.org/10.1038/s41586-021-03775-x (it was published on bioRxiv in 2020).
The 20 Newsgroups dataset, which I manually converted into a dense 3000D PCA result: see this gist.
densMAP noticeably has a hard time mapping the local radii to the 2D output.
What I take away from all this: where I didn't like the densMAP output with default settings, the strength of the density regularization can be adjusted via `dens_lambda`. I didn't try that, but I have no reason to doubt a satisfying value can be found for any of those datasets. But I don't currently want to implement densMAP because it involves adding lots of extra code, much of it in C++, and I am very lazy.
Here's my pitch for an approximation to densMAP, which I have named the Lightweight Estimate of Preservation of Local Density (Leopold).
For the input radii, as part of the smooth k-nearest neighbor distances routine we already calculate $\rho_i$, the distance to the nearest neighbor of each item in the dataset (subject to the `local_connectivity` constraint). Additionally, we also calculate $\sigma_i$, a normalization factor used with the neighbors at distance greater than $\rho_i$. It must have units of distance, and the larger the distances to the neighbors, the larger $\sigma_i$ gets. So I propose the following definition for the local radius:

$$ R_i = \rho_i + \sigma_i $$

The biggest differences between this definition of the local input radii and densMAP's are that I haven't bothered to square the distances (it doesn't make much difference for leopold) and that the values for $\sigma_i$ and $\rho_i$ are calculated before the symmetrization of the UMAP input edge weights. So these values are missing the influence of observations outside the initial k-nearest neighborhood graph.
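The smooth k-nearest neighbors calibration that produces $\rho$ and $\sigma$ can be sketched as follows (a simplified mimic of UMAP's routine, not its actual code: it assumes `local_connectivity = 1` and sorted distances; the search target $\log_2 k$ matches UMAP, but the tolerance and iteration count are my choices):

```python
import numpy as np

def smooth_knn_radius(knn_dists, n_iter=64, tol=1e-5):
    """Sketch of the smooth k-NN calibration for one point.

    rho is the distance to the nearest neighbor; sigma is found by
    binary search so that sum_j exp(-(d_j - rho) / sigma) ~= log2(k).
    The leopold-style local radius is then rho + sigma.
    """
    knn_dists = np.asarray(knn_dists, dtype=float)  # sorted ascending
    rho = knn_dists[0]  # assumes local_connectivity = 1
    target = np.log2(len(knn_dists))
    lo, hi, sigma = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        val = np.exp(-np.maximum(knn_dists - rho, 0.0) / sigma).sum()
        if abs(val - target) < tol:
            break
        if val > target:  # sigma too large: weights decay too slowly
            hi = sigma
        else:
            lo = sigma
        sigma = lo * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return rho + sigma

# Tighter neighborhoods give a smaller radius than diffuse ones:
r_tight = smooth_knn_radius(np.linspace(0.5, 2.0, 15))
r_diffuse = smooth_knn_radius(np.linspace(1.0, 4.0, 15))
```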
For the output radii, I know from looking at the output weight function parameters that increasing the `a` parameter makes the clusters in a UMAP plot shrink, and vice versa. So let's use that as a measure of the output density, i.e. the inverse of the local radius. As every point has its own radius, the value for a given weight between points $i$ and $j$ will use the geometric mean of the two radii:

$$ w_{ij} = 1 / \left(1 + \frac{d_{ij}^2}{\sqrt{r_i r_j}} \right) $$

where $r_i$ is a scaled version of the input radius $R_i$ suitable for the lower-dimensional space. This is a similar sort of scaling to that used by Zelnik-Manor and Perona for "self-tuning" spectral clustering, and also recommended for local scaling to reduce hubness in high-dimensional space by Schnitzer and co-workers.
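Written out as code, the per-pair weight looks like this (a sketch; the function and variable names are mine):

```python
import numpy as np

def leopold_weight(dij, r_i, r_j):
    """Output similarity between points i and j: the usual UMAP weight
    1/(1 + a * d^2) with b = 1 and a_ij = 1/sqrt(r_i * r_j), i.e. the
    inverse geometric mean of the two (scaled) local radii."""
    a_ij = 1.0 / np.sqrt(r_i * r_j)
    return 1.0 / (1.0 + a_ij * dij**2)

# Small radii (dense points) give a faster-decaying weight, pulling
# neighbors closer; large radii give a slowly decaying, diffuse weight.
w_dense = leopold_weight(1.0, 0.1, 0.1)     # a_ij = 10
w_diffuse = leopold_weight(1.0, 10.0, 10.0) # a_ij = 0.1
```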
The scaling from input to output space is not sophisticated: I have observed that for typical values of $b$ used in UMAP, a usable range of $a$ values is between 0.01 (very diffuse clusters) and 100 (very tight clusters), so we will just map the inverse of the input local radii to that range.
The final adjustment is to introduce a parameter that controls how much of the range of `a` is used in the output, and thus how big a disparity of the input radii is reflected in the output. To that end, leopold has an adjustable `dens_scale` parameter. Set it to 1 and it uses the full available range of `a`. Set it to 0 and you get plain old UMAP back, with a fixed global value of `a` as specified by the user (or via the `spread` and `min_dist` parameters). Intermediate values give intermediate results.
In my mind this scaling of the radii is straightforward, but looking back at that description maybe it's not. The procedure is to log-transform the inverse of the input radii, linearly rescale them onto a sub-range of $[\log 0.01, \log 100]$ whose width is controlled by $s$, and exponentiate to get each point's `a` parameter for the output, where $s$ is the `dens_scale` parameter.

The main advantage is for me: to implement this I don't have to write much new code. $\rho$ and $\sigma$ are already calculated, and implementing the per-observation value for `a` in the output function requires creating a new version of the UMAP gradient routine that expects a vector of values rather than a scalar. This is a lot less work than implementing densMAP.
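The radii-to-`a` scaling can be sketched as follows (my own reconstruction: log-inverse radii are min-max mapped onto $[\log 0.01, \log 100]$ and then blended with the global `a` according to `dens_scale`; `uwot`'s actual implementation may differ in its details):

```python
import numpy as np

def radii_to_a(R, a_global=1.577, dens_scale=1.0):
    """Map input local radii R to per-point output 'a' values.

    Inverse radii are log-transformed and min-max scaled onto
    [log 0.01, log 100], then blended with the global 'a' according
    to dens_scale (0 = plain UMAP, 1 = full range)."""
    log_inv = -np.log(np.asarray(R, dtype=float))
    t = (log_inv - log_inv.min()) / (log_inv.max() - log_inv.min())
    log_a_full = np.log(0.01) + t * (np.log(100.0) - np.log(0.01))
    log_a = (1.0 - dens_scale) * np.log(a_global) + dens_scale * log_a_full
    return np.exp(log_a)

R = np.array([0.1, 1.0, 10.0])  # small radius = dense neighborhood
a_vals = radii_to_a(R, dens_scale=1.0)  # densest point gets a = 100
```

With `dens_scale = 0` every point gets the global `a`, recovering plain UMAP, which matches the behavior described above.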
It's all a bit approximate. But all I really want from this technique is that it makes the dense clusters get smaller and the diffuse clusters get bigger. The exact changes seem less important to me as long as they are vaguely sensible. I don't claim to have a good sense of what I expect to see when a blob of data with an intrinsic dimensionality of $d$ and a radius of $r$ is embedded into 2D versus one with an intrinsic dimensionality of $d + 5$ but a radius of $r / 2$. Even if that was accurately embedded, would it even look sensible let alone helpful?
Here are some plots using leopold. I used exact nearest neighbors for these results, as nearest neighbor calculations are usually the slowest part of UMAP. The following deviations from default parameters were used: `min_dist = 0.1` to be closer to the Python UMAP results. For consistency with densMAP, `n_neighbors = 30`, except for `subset_clusters`, which uses `n_neighbors = 150`. Also to be consistent with densMAP, an extra 200 epochs on top of the usual default `n_epochs` was used.
The plots below show the output of running leopold; as before, I used `smoothScatter` to plot the results when there were more than 1,000 points. For a lot of these datasets, I won't have anything to say unless there is an interesting contrast or similarity with densMAP.
leopold is able to produce two different sizes for the clusters, although the size disparity is much larger than for densMAP. This is due to the very bimodal nature of the standard deviations in this dataset: one cluster gets one end of the radii scale (highly diffuse) and the other cluster gets the other end (highly compact). Turning down `dens_scale` to something like 0.2 works well for this data set. But it doesn't really matter, as this is a very artificial dataset. As long as the clusters are the right relative size I am happy.
This shows an interesting difference with densMAP: with densMAP, the outer yellow ring had a more uniform density, whereas for leopold there is a noticeable density gradient towards the center. I leave it up to you to decide if that is a good or a bad thing.
The point that got pushed far away from the rest of the data in the densMAP plot is not as far away with leopold.
Both this one and `scurvehole` seem pretty comparable with the densMAP results.
If you compare this to densMAP, the extremities there are much more diffuse than with leopold.
The main thing I hoped to see was the orange '1' cluster shrink. And it does. The yellow '2' cluster has also noticeably grown. Changes are more subtle for the other clusters, but apart from the '1' cluster their densities are all quite similar. I'm happy with this.
leopold behaves in a densMAP-like way here too by shrinking the orange cluster. The clusters which densMAP expands are also larger here, but are not as diffuse as densMAP makes them. I personally prefer the leopold layout, but I am biased so you should probably just ignore me.
This layout is fine, and maybe a bit neater than the UMAP one. But I still like the densMAP version better.
I think the layout that leopold provides is more legible than the densMAP equivalent.
Another example where leopold produces a less compressed result, because the more diffuse clusters (e.g. the clusters at the bottom of the plot) are not as expanded.
Does this layout look super-great? It does not. Nor can leopold reproduce the input radii in the output space; in fact, it seems to be doing a worse job than densMAP here. But it suffers less from outliers than densMAP, which is something.
densMAP is not available in `uwot`; it would need to be implemented. Hence leopold's creation. If you want to use leopold in `uwot`:

- Set the `dens_scale` parameter to a value between 0 (exactly like UMAP) and 1 (use the full range of useful `a` values).
- I have also run these datasets with `n_neighbors = 15` and default `n_epochs` and not noticed a very big effect compared to the results here, so I am not showing them, as there are already way too many images on this page.
- Transforming new data works when `dens_scale` is set. When UMAP transforms new data, an adjusted local connectivity is used which affects both $\sigma$ and $\rho$, so the estimated values of $R_i$ are not as precise as I would like, but the effect seems to be quite small.
- Returning the local radii (via `ret_extra = c("localr")`) and coloring the embedding by `localr` is quite informative as to which parts of the embedding you would expect to shrink or expand.
- One possible workflow: run plain UMAP, color by `localr`, and then if there are obvious clusters that are a lighter or darker hue, run with `dens_scale = 1` from the current coordinates for a couple of hundred more epochs.

Here's how you use it with `iris`:

```r
iris_leopold <- umap(iris, dens_scale = 1)
# and if you want the local radii values
iris_leopold <- umap(iris, dens_scale = 1, ret_extra = c("localr"))
```