densMAP

densMAP is a variation on UMAP that attempts to preserve the differences in density in the original data by adding a new component to the cost function. It is available in the Python UMAP implementation by setting densmap=True. See also this excellent tutorial. There has been a request to add densMAP to uwot since the start of 2021. This document describes the approximate version of densMAP that is available in uwot which may be good enough. But first, a bit more about densMAP.

densMAP can be seen as adding a regularization to the UMAP cost function in which the "local radius" of each observation in the input and output space is calculated and the Pearson correlation between the two is optimized.

The radius of observation $i$, $R_i$ is calculated as:

$$ R_i = \frac{1}{\sum_i P_{ij}} \sum_i d_{ij}^2 P_{ij} $$

where $d_{ij}$ is the distance between point $i$ and $j$ and $P_{ij}$ is the symmetrized edge weight between $i$ and $j$, i.e. the radius is the edge-weighted average of the squared distances between each $i$ and each of its neighbors. Because the matrix has been symmetrized, $i$ may have a different number of neighbors than the n_neighbors parameter that the user specifies.

To get a feel for what densMAP does, below are some results from the Python UMAP (i.e. no uwot output here), with and without densmap=True. As recommended by the current UMAP README, I also set n_neighbors=30, except for subset_clusters, which I'll explain when I get to it. Everything else was left as the defaults.

The images below are:

First two datasets are simple simulation sets based on How to Use t-SNE Effectively, specifically sections 2 ("Cluster sizes in a t-SNE plot mean nothing") and 6 ("For topology, you may need more than one plot").

two_different_clusters

This dataset consists of two 50D Gaussians of 5,000 points each. One of the clusters has a standard deviation 10x the other. I used snedata to generate this data (via the two_different_clusters_data function).

| | | :----------------------------:|:--------------------------: two_different_clusters umap|two_different_clusters ro two_different_clusters densmap|two_different_clusters densmap-ro two_different_clusters cor|

Like t-SNE, default UMAP shows both of these clusters as the same size. densMAP shows one of the clusters as smaller. A good start.

subset_clusters

This dataset once again features two 50D gaussians of 5,000 points each, but this time they overlap entirely. The larger cluster has a standard deviation of 50, the smaller a standard deviation of 1.

For this dataset I used n_neighbors=150, much larger than the n_neighbors=30 setting used for the other plots. This was to be consistent with the plots I generated with uwot. With uwot I precalculated exact nearest neighbors for all datasets and for this dataset, using the exact 30 nearest neighbors results in the smaller cluster being embedded at the edge of the larger cluster. Presumably 30 nearest neighbors doesn't result in enough edges between the two clusters to properly situate the smaller cluster. The approximate nearest neighbors routine used in UMAP results in edges between sufficiently far away neighbors to create a neighborhood to that reflects the global structure better. I didn't investigate this thoroughly, but it sounds plausible to me.

| | | :----------------------------:|:--------------------------: subset_clusters umap|subset_clusters ro subset_clusters densmap|subset_clusters densmap-ro subset_clusters cor|

UMAP considers these two clusters to be the same size. With densMAP, the ring-shaped structure that is found with high perplexities with t-SNE shows up. But at least the small cluster is inside the big one. Note that if you re-run this with n_neighbors=30 the results are not noticeably changed.

s1k

This is a synthetic dataset I like to use: 1000 points from a fuzzy 10D simplex.

| | | :----------------------------:|:--------------------------: s1k umap|s1k ro s1k densmap|s1k densmap-ro s1k cor|

I can't say I like what densMAP has done here. There is one point that has been flung far away from the rest of the plot. You can see from the coloring that this is because it has a large local radius. This perhaps indicates that the local radius calculation can be affected by edge effects.

swissroll

A swiss roll dataset of my own devising.

| | | :----------------------------:|:--------------------------: swissroll umap|swissroll ro swissroll densmap|swissroll densmap-ro swissroll cor|

densMAP does a noticeably better job at not ripping the swiss roll than UMAP does. As you would expect the local density to be fairly uniform across a 2D manifold I don't know if there's an a priori reason to have thought densMAP would do a better job here.

scurvehole

A 3D S-curve with a hole data set, used to validate the PaCMAP method (see also the github repo).

| | | :----------------------------:|:--------------------------: scurvehole umap|scurvehole ro scurvehole densmap|scurvehole densmap-ro scurvehole cor|

iris

Ronald Fisher's iris dataset.

| | | :----------------------------:|:--------------------------: iris umap|iris ro iris densmap|iris densmap-ro iris cor|

spenguins

The Palmer Penguins. The s in spenguins stands for scaled, because I filtered out entries with missing values then Z-scaled the inputs.

| | | :----------------------------:|:--------------------------: spenguins umap|spenguins ro spenguins densmap|spenguins densmap-ro spenguins cor|

mammoth

A 3D point cloud of a mammoth at the Smithsonian, from Understanding UMAP, based on work originally done by Max Noichl.

| | | :----------------------------:|:--------------------------: mammoth umap|mammoth ro mammoth densmap|mammoth densmap-ro mammoth cor|

oli

The Olivetti Faces images.

| | | :----------------------------:|:--------------------------: oli umap|oli ro oli densmap|oli densmap-ro oli cor|

frey

The Frey Faces images.

| | | :----------------------------:|:--------------------------: frey umap|frey ro frey densmap|frey densmap-ro frey cor|

isofaces

The faces dataset used in Isomap (processed via this gist).

| | | :----------------------------:|:--------------------------: isofaces umap|isofaces ro isofaces densmap|isofaces densmap-ro isofaces cor|

mnist

The MNIST digits images.

| | | :----------------------------:|:--------------------------: mnist umap|mnist ro mnist densmap|mnist densmap-ro mnist cor|

The orange cluster is the 1 digits and its smaller radius is exceptionally obvious when coloring the UMAP plot by radius. It very noticeably shrinks when densMAP goes to work on it.

fashion

The Fashion MNIST images.

| | | :----------------------------:|:--------------------------: fashion umap|fashion ro fashion densmap|fashion densmap-ro fashion cor|

Similarly to the MNIST digits, the orange cluster (this time its images of pants) is noticeably smaller than the other clusters.

kuzushiji

The Kuzushiji MNIST images.

| | | :----------------------------:|:--------------------------: kuzushiji umap|kuzushiji ro kuzushiji densmap|kuzushiji densmap-ro kuzushiji cor|

norb

The Small NORB images.

| | | :----------------------------:|:--------------------------: norb umap|norb ro norb densmap|norb densmap-ro norb cor|

You can see the largest of the blue/purple ring-like structures is broken up by UMAP, but densMAP preserves it. In fact, the densMAP plot is one of the nicer UMAP layouts of the small NORB datasets I've seen.

coil20

The COIL-20 images.

| | | :----------------------------:|:--------------------------: coil20 umap|coil20 ro coil20 densmap|coil20 densmap-ro coil20 cor|

coil100

The COIL-100 images.

| | | :----------------------------:|:--------------------------: coil100 umap|coil100 ro coil100 densmap|coil100 densmap-ro coil100 cor|

cifar10

The CIFAR-10 images.

| | | :----------------------------:|:--------------------------: cifar10 umap|cifar10 ro cifar10 densmap|cifar10 densmap-ro cifar10 cor|

macosko2015

The macosko2015 RNAseq data.

| | | :----------------------------:|:--------------------------: macosko2015 umap|macosko2015 ro macosko2015 densmap|macosko2015 densmap-ro macosko2015 cor|

This dataset (an RNA sequence dataset) isn't a lot of fun to visualize with vanilla UMAP, but densMAP does not improve matters.

tasic2018

RNAseq data from the Allen Brain Atlas originally reported by Tasic and co-workers.

| | | :----------------------------:|:--------------------------: tasic2018 umap|tasic2018 ro tasic2018 densmap|tasic2018 densmap-ro tasic2018 cor|

Another RNAseq dataset, densMAP has a very interesting effect on the cluster sizes, but because some of the clusters have a very low density, this has the effect of squashing the other into the center of the plot.

lamanno2020

Another RNASeq data, found via the Picasso example notebook. I think the publication reference is https://doi.org/10.1038/s41586-021-03775-x (was published at biorXiv in 2020).

| | | :----------------------------:|:--------------------------: lamanno2020 umap|lamanno2020 ro lamanno2020 densmap|lamanno2020 densmap-ro lamanno2020 cor|

ng20

The 20 Newsgroups dataset, which I manually converted into a dense 3000D PCA result: see this gist.

| | | :----------------------------:|:--------------------------: ng20 umap|ng20 ro ng20 densmap|ng20 densmap-ro ng20 cor|

densMAP noticeably has a hard time mapping the local radii to the 2D output.

What I take away from all this:

But I don't currently want to implement densMAP because it involves adding lots of extra code, much of it in C++ and I am very lazy.

Approximating densMAP

Here's my pitch for an approximation to densMAP, which I have named the Lightweight Estimate of Preservation of Local Density (Leopold).

For the input radii, as part of the smooth k-nearest neighbor distances routine we already calculate $\rho$ which is the distance to the nearest neighbor of each item in the dataset (subject to the local_connectivity constraint). Additionally, we also calculate $\sigma$ which is a normalization factor used with the neighbors at distance greater than $\rho$. It must have units of distance, and the larger the distances to the neighbors, the larger $\sigma$ gets. So I propose the following definition for the local radius:

$$ R_i = \rho + \sigma $$

The biggest difference between this definition of the local input radii is that I haven't bothered to square the distances (it doesn't make much difference for leopold) and that the values for $\sigma$ and $\rho$ are calculated before the symmetrization of the UMAP input edge weights. So these values are missing the influence of observations outside the initial k-nearest neighborhood graph.

For the output radii, I know from looking at the output weight function parameters, that increasing the a parameter makes the clusters in a UMAP plot shrink, and vice versa. So let's use that as measure of the output density, i.e. the inverse of the local radius. As every point has its own radius, the value for a given weight between points $i$ and $j$ will be the geometric mean of the two radii:

$$ w_{ij} = 1 / \left(1 + \frac{d_{ij}^2}{\sqrt{r_i r_j}} \right) $$

where $r_i$ is a scaled version of the input radius $R_i$ suitable for the lower dimensional space. This is a similar sort of scaling to that used by Zelnik-Manor and Perona for "self-tuning" spectral clustering, and also recommended for local scaling to reduce hubness in high dimensional space by Schnitzer and co-workers.

The scaling from input to output space is not sophisticated: I have observed that for typical values of $b$ used in UMAP, a usable range of $a$ values are between 0.01 (very diffuse clusters) and 100 (very tight clusters), so we will just map the inverse of the input local radii to that range.

The final adjustment to this is that it would be good to introduce a parameter that controls how much of the range of a is used in the output and thus how big a disparity of the input radii is reflected in the output. To that end, there is an adjustable parameter for leopold: a dens_scale parameter. Set it to 1 and it uses all the available range of a. Set it to 0 and you get plain old UMAP back, with a fixed global value of a as specified by the user (or via the spread and min_dist parameters). Intermediate values get intermediate results.

In my mind this scaling of the radii is straightforward, but looking up at that description maybe it's not. The procedure is:

  1. Define the minimum log density as $a \times 10^{-2s}$ where $a$ is the UMAP a parameter for the output and $s$ is the den_scale parameter.
  2. Define the maximum log density as $a \times 10^{2s}$.
  3. Range scale $\log \left(1 / R_i \right)$ between the minimum and maximum log density. Call that range scaled log density $\Delta_i$
  4. $a_i = \sqrt{ \exp \left( \Delta_i \right) }$.
  5. The leopold output weight function is $w_{ij} = 1 / \left( 1 + a_i a_j d_{ij}^2 \right)$.

The Advantage of leopold

The main advantage is for me: to implement this I don't have to write much new code: $\rho$ and $\sigma$ are already calculated and implementing the per-observation value for a in the output function requires creating a new version of the umap gradient routine that expects a vector of values rather than a scalar. This is a lot less work than implementing densMAP.

The Disadvantage of leopold

It's all a bit approximate. But all I really want from this technique is that it makes the dense clusters get smaller and the diffuse clusters get bigger. The exact changes seem less important to me as long as they are vaguely sensible. I don't claim to have a good sense of what I expect to see when a blob of data with an intrinsic dimensionality of $d$ and a radius of $r$ is embedded into 2D versus one with an intrinsic dimensionality of $d + 5$ but a radius of $r / 2$. Even if that was accurately embedded, would it even look sensible let alone helpful?

Results

Here are some plots using leopold. I used exact nearest neighbors for these results as nearest neighbor calculations are usually the slowest part of UMAP. The following deviations from default parameters were used: min_dist = 0.1 to be closer to the Python UMAP results. For consistency with densMAP, n_neighbors = 30 except for subset_clusters which uses n_neighbors = 150. Also to be consistent with densMAP, an extra 200 epochs on top of the usual default n_epochs was used.

In the plots below:

top left: the output of running leopold,

For a lot of these datasets, I won't have anything to say unless there is an interesting contrast or similarity with densMAP.

two_different_clusters

| | | :----------------------------:|:--------------------------: two_different_clusters leopold|two_different_clusters leopold-ro two_different_clusters leopold-cor|two_different_clusters leopold-dro-cor

leopold is able to produce two different sizes for the cluster, although the size disparity is much larger than for densMAP. This is due to the very bimodal nature of the standard deviations in this dataset. One cluster gets one end of the radii scale (highly disperse) and the second cluster gets the other (highly compact). Turning down dens_scale to something like 0.2 works well for this data set. But it doesn't really matter as this is a very artificial dataset. As long as the clusters are the right relative size I am happy.

subset_clusters

| | | :----------------------------:|:--------------------------: subset_clusters leopold|subset_clusters leopold-ro subset_clusters leopold-cor|subset_clusters leopold-dro-cor

This shows an interesting difference with densMAP: with densMAP, the outer yellow ring had a more uniform density, whereas for leopold there is a noticeable density gradient towards the center. I leave it up to you to decide if that is a good or a bad thing.

s1k

| | | :----------------------------:|:--------------------------: s1k leopold|s1k leopold-ro s1k leopold-cor|s1k leopold-dro-cor

The point that got pushed far away from the rest of the data in the densMAP plot is not as far away with leopold.

swissroll

| | | :----------------------------:|:--------------------------: swissroll leopold|swissroll leopold-ro swissroll leopold-cor|swissroll leopold-dro-cor

Both this one and scurvehole seem pretty comparable with the densMAP results.

scurvehole

| | | :----------------------------:|:--------------------------: scurvehole leopold|scurvehole leopold-ro scurvehole leopold-cor|scurvehole leopold-dro-cor

iris

| | | :----------------------------:|:--------------------------: iris leopold|iris leopold-ro iris leopold-cor|iris leopold-dro-cor

spenguins

| | | :----------------------------:|:--------------------------: spenguins leopold|spenguins leopold-ro spenguins leopold-cor|spenguins leopold-dro-cor

mammoth

| | | :----------------------------:|:--------------------------: mammoth leopold|mammoth leopold-ro mammoth leopold-cor|mammoth leopold-dro-cor

If you compare this to densMAP, the extremities are much more diffuse than with leopold.

oli

| | | :----------------------------:|:--------------------------: oli leopold|oli leopold-ro oli leopold-cor|oli leopold-dro-cor

frey

| | | :----------------------------:|:--------------------------: frey leopold|frey leopold-ro frey leopold-cor|frey leopold-dro-cor

isofaces

| | | :----------------------------:|:--------------------------: isofaces leopold|isofaces leopold-ro isofaces leopold-cor|isofaces leopold-dro-cor

mnist

| | | :----------------------------:|:--------------------------: mnist leopold|mnist leopold-ro mnist leopold-cor|mnist leopold-dro-cor

The main thing I hoped to see was the orange '1' cluster shrink. And it does. The yellow '2' cluster has also noticeably grown. Changes are more subtle for the other clusters, but apart from the '1' cluster their densities are all quite similar. I'm happy with this.

fashion

| | | :----------------------------:|:--------------------------: fashion leopold|fashion leopold-ro fashion leopold-cor|fashion leopold-dro-cor

leopold behaves in a densMAP-like way here too by shrinking the orange cluster. The clusters which densMAP expands are also larger here, but are not as diffuse as densMAP makes them. I personally prefer the leopold layout, but I am biased so you should probably just ignore me.

kuzushiji

| | | :----------------------------:|:--------------------------: kuzushiji leopold|kuzushiji leopold-ro kuzushiji leopold-cor|kuzushiji leopold-dro-cor

norb

| | | :----------------------------:|:--------------------------: norb leopold|norb leopold-ro norb leopold-cor|norb leopold-dro-cor

This layout is fine, and maybe a bit neater than the UMAP one. But I still like the densMAP version better.

coil20

| | | :----------------------------:|:--------------------------: coil20 leopold|coil20 leopold-ro coil20 leopold-cor|coil20 leopold-dro-cor

coil100

| | | :----------------------------:|:--------------------------: coil100 leopold|coil100 leopold-ro coil100 leopold-cor|coil100 leopold-dro-cor

cifar10

| | | :----------------------------:|:--------------------------: cifar10 leopold|cifar10 leopold-ro cifar10 leopold-cor|cifar10 leopold-dro-cor

macosko2015

| | | :----------------------------:|:--------------------------: macosko2015 leopold|macosko2015 leopold-ro macosko2015 leopold-cor|macosko2015 leopold-dro-cor

I think the layout that leopold provides is more legible than the densMAP equivalent.

tasic2018

| | | :----------------------------:|:--------------------------: tasic2018 leopold|tasic2018 leopold-ro tasic2018 leopold-cor|tasic2018 leopold-dro-cor

Another example where leopold produces a less compressed result, because the more diffuse clusters (e.g. the clusters at the bottom of the plot) are not as expanded.

lamanno2020

| | | :----------------------------:|:--------------------------: lamanno2020 leopold|lamanno2020 leopold-ro lamanno2020 leopold-cor|lamanno2020 leopold-dro-cor

ng20

| | | :----------------------------:|:--------------------------: ng20 leopold|ng20 leopold-ro ng20 leopold-cor|ng20 leopold-dro-cor

Does this layout look super-great? It does not. And nor can leopold reproduce the input radii in the output space, it seems in fact to be doing a worse job than densMAP. But it suffers less from outliers than densMAP which is something.

Conclusions

If you want to use leopold in uwot:

Here's how you use it with iris:

iris_leopold <- umap(iris, dens_scale = 1)
# and if you want the local radii values
iris_leopold <- umap(iris, dens_scale = 1, ret_extra = c("localr"))


jlmelville/uwot documentation built on April 25, 2024, 5:20 a.m.