This page is a gallery of images comparing the output of uwot
version 0.1.3
to the Python UMAP
package version 0.3.8.
In an ideal world, this wouldn't be a very exciting document, as it would exist
to demonstrate that uwot
results resemble those from the Python
implementation.
However, there are a couple of examples that highlight some differences between
the implementations. It's still not that exciting, but there may be some
information of use to anyone using uwot
and possibly some implications for
users of t-SNE packages. To distinguish between the general technique of UMAP
and the its specific implementation in the Python package of the same name, I'll
refer to the Python implementation with the formatting
UMAP
.
For details on the datasets, see the
UMAP examples gallery.
The uwot
results have been re-run for the images here using different random
seeds, but they should resemble the uwot
output on that page.
The datasets originated in R, mostly being loaded via the snedata package. I then exported them to CSV, before reading them into Python. UMAP version 0.3.8 was used to generate the embeddings, using all default settings. The embeddings were then written to CSV and read back into R for visualization.
The uwot
output also used default settings with the following exceptions:
pca = 100
was used to reduce the dimensionality to a maximum of 100 columns
(data was mean-centered as part of this process, but not rescaled), because
the Annoy nearest neighbor search can be very slow for high-dimensional data.
UMAP
's
nearest neighbor search seems less
affected by dimensionality. Also, min_dist = 0.1
, which is the default for
the Python UMAP. uwot
uses a default of min_dist = 0.01
. Why? I would love
to tell you, but I think I just made a mistake when I set the uwot
defaults.
Therefore, the defaults will probably change in some later version of uwot
.
Fortunately, this doesn't make a perceptible practical difference, but I will
use the Python default of min_dist = 0.1
in the following results.
There is also a potential difference in initialization. Both packages use a
spectral initialization if possible, but differ if the input graph contains more
than one component: UMAP
attempts a meta-embedding of the two separate
components, where uwot
abandons a spectral approach and falls back to using
PCA (followed by rescaling the output to a standard deviation of 1 for each
axis). This affects the following datasets: iris
, coil20
, coil100
,
norb
and tasic2018
.
set.seed(42) iris_umap <- umap(iris, pca = 100, min_dist = 0.1)
# iris X data read from CSV iris_umap = umap.UMAP().fit_transform(iris_x)
Below the results on the left are for the uwot
output. The image on the right
is from UMAP
. I present most of these results without comment because
they seem to be very similar to each other. There are two exceptions, for
the small NORB dataset (norb
) and the gene expression dataset macosko2015
,
which will be discussed in their section.
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
Here's a result that seems worth commenting on. The uwot
results show slightly
less structure, a large blue loop seems to have been broken up.
I suspect that the PCA preprocessing to 100 dimensions as used by uwot
is
too aggressive here and might be throwing away too much of the variance in the
dataset. Compared to the ground truth 15 nearest neighbors using
FNN, Annoy results with pca = 100
are 82% accurate. Without PCA, this accuracy goes up to 99% and the uwot
results are much closer to the UMAP
results. This is the image below, on the
left. Meanwhile, if you use the PCA results as input to UMAP
, the output
now looks a bit more like the uwot
results. That's the image on the lower
right.
| | | :----------------------------:|:--------------------------: |
Unfortunately, the very high dimensionality of norb
makes running Annoy on it
directly very slow indeed: with a single-thread, it took 4 and a half hours,
versus around 5 minutes in the PCA case. We will revisit nearest neighbor
accuracies later.
Note also that norb
is one of the datasets where the graph contains two
separate components, so the initialization of the output coordinates is
different between uwot
and UMAP
, with UMAP
's "meta-embedding" approach
better at retaining local structure.
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
This result shows the largest deviation of uwot
from UMAP
. That
big cyan cluster is of an obviously different shape in the two plots. The source
of the difference here seems to be nearest neighbor results. It turns out that
both uwot
and UMAP
have trouble find good approximations to the nearest
neighbors with this dataset, but UMAP
does a lot better. We will get into
more detail about the nearest neighbor accuracies across the datasets in
the next section. For now, below are two further plots. The left-hand plot
is the uwot
results when it uses the more accurate nearest neighbor data
calculated by UMAP
: the uwot
results now resemble the UMAP
results,
which is reassuring. The right hand plot uses exact nearest neighbor results
calculated via the FNN
.
| | | :----------------------------:|:--------------------------: |
Even though the UMAP
nearest neighbor data is better than that which uwot
produced, we can still see an obvious difference in the two plots, so it's clear
that this data set presents a challenge for the approximate nearest neighbor
methods considered here.
| | | :----------------------------:|:--------------------------: |
Based on results with this dataset in the
UMAP examples gallery,
I wasn't expecting these results to be a feast for the eyes. After seeing the
difference in results with macosko2015
, I am just relieved that these two
results look disappointing in a similar way.
The norb
and macosko2015
results suggest that the use of PCA and the nearest
neighbor search methods present the biggest source of difference between uwot
and UMAP
. Here are some more details.
For small datasets, which in both implementations means less than 4,096
observations, both uwot
and UMAP
calculate the exact nearest neighbors. For
larger datasets, uwot
uses the Annoy
method, and UMAP
uses pynndescent,
which uses random projection trees, followed by nearest neighbor descent.
For the larger datasets, I used the
FNN package to generate the exact 15
nearest neighbors and then compared results for uwot
and UMAP
. For UMAP
,
I used default settings. For uwot
, I applied pca = 100
as that is what I
currently use for high dimensional datasets. The third column shows the amount
of variance explained by only keeping 100 components.
datasets | uwot
| UMAP
| PCA 100 var
----------------|--------|--------|------------
coil100
| 89.1% | 99.7% | 88.4%
mnist
| 86.2% | 97.5% | 91.5%
fashion
| 74.3% | 97.9% | 91.2%
kuzushiji
| 78.3% | 95.2% | 84.6%
norb
| 81.7% | 98.3% | 95.6%
tasic2018
| 48.0% | 97.7% | 67.7%
macosko2015
| 11.2% | 41.8% | 37.3%
cifar10
| 67.4% | 84.1% | 90.1%
UMAP
does a pretty incredible job with nearly every dataset, and it's fast
too. And it doesn't need to use PCA to reduce the dimensionality. The obvious
exception is macosko2015
. I was unable to find a combination of parameters
with pynndescent
that produced good results, e.g. by increasing the number
of trees in the RP tree initialization or the number of iterations in the
nearest neighbor descent phase. With n_trees=250
, I was able to get to 54.9%
accuracy, but I started to get out of memory errors much beyond this setting,
and nearest neighbor descent tends to converge after 5-6 iterations anyway.
Doing better will may require more manipulation of the sampling rate and
candidate list size used in nearest neighbor descent, which can quickly cause
a large increase in memory and run time if you aren't careful. Also, you can't
fiddle with these parameters in UMAP
version 0.3.8 anyway.
uwot
results... well, they're less impressive than UMAP
's. The tasic2018
results are pretty bad, and the macosko2015
results can be fairly described as
horrendous. The accuracies show a rough correlation with the percentage of
variance explained by 100 components. Results for tasic2018
and especially
macosko2015
show that 100 components may not be enough to extract the
meaningful variation from the datasets, which might explain the inability to
find the true nearest neighbors. The uwot
defaults for the Annoy search are
somewhat arbitrary, having been copied from
LargeVis. That package used nearest
neighbor descent to improve the accuracy, and increasing Annoy's n_trees
parameter can give a small improvement to some of the datasets, but the
increase in computation time is probably not worth the small increase in
accuracy I've seen when trying this. It seems that in cases where the current
defaults don't work well, it's because too much information is thrown away
by the PCA pre-processing.
tasic2018
and macosko2015
are both transcriptomics datasets and clearly
represent an interesting challenge to Annoy, while macosko2015
troubles
even pynndescent. The other datasets are image datasets. Possibly
the way the data has been scaled and prepared for these biology datasets makes
life harder for approximate nearest neighbors? It's hard to know where to
start when it comes to unpicking what causes macosko2015
to perform so
differently, but here is a histogram of the nearest neighbor distances for
macosko2015
and tasic2018
(top row), which misbehave with Annoy, and on
the bottom row are the histograms for mnist
and fashion
, which are better
behaved:
| | | :----------------------------:|:--------------------------: | |
The biomodal distribution of macosko2015
and tasic2018
do seem more
similar to each other than they do to mnist
or fashion
. That said,
tasic2018
behaves perfectly well with the pynndescent method, so I see no
obvious answers here.
The good news is that for most datasets, the differences between uwot
's UMAP
implementation and that of the Python UMAP
implementation makes very little
difference. The bad news is that there are a couple of cases where we see
differences. The biggest offender is macosko2015
but there neither method is
doing a perfect job. Nonetheless, UMAP
is able to calculate nearest neighbors
more accurately and more quickly than uwot
.
One source of the difference is the use of Annoy as the nearest neighbor calculation method. It does seem to be solidly slower than pynndescent, but in my experience, it can reach high accuracies. even if it takes a while. March 8 2020: On Windows, there used to be an issue where an Annoy index that was larger than 2GB on disk couldn't be read back in. Make sure you are using a version of RcppAnnoy 0.0.15 or later to avoid this.
The time to search an Annoy index seems to start scaling quite badly with
dimensionality > 1000, so if you can get away with picking 500-1000 random
features and use that, you could try that. With 1,152 random features (i.e.
1/16th the pixels innorb
), visual results aren't that much worse with this
approach and the nearest neighbor search runs 100 times faster (50 minutes vs 20
seconds):
| | | :----------------------------:|:--------------------------: |
Otherwise, if you have a high-dimensional dataset, you should consider PCA.
I don't have any solid advice for the number of components to retain if you do
use PCA for preprocessing. Less than the usual t-SNE default of 50 is probably
unwise. I use pca = 100
in most examples, but as we've seen, that can still
sufficiently perturbs the pairwise distances in the input data to give some
differences in the output. It would be nice if you could use more components to
get a better trade-off of accuracy and speed, but even with fast partial PCA
routines used by uwot
(courtesy of
irlba), extracting more than 100
components can be quite time-consuming.
uwot
is not the only dimensionality reduction to use PCA or Annoy for
approximate nearest neighbor search. LargeVis uses it, the t-SNE package
Fit-SNE has Annoy as an option and
openTSNE seems to have at least
considered it (see e.g. https://github.com/pavlin-policar/openTSNE/issues/28).
It is also standard practice in t-SNE to reduce dimensionality:
Rtsne, a wrapper around the de-facto
standard Barnes Hut t-SNE routine. will only keep 50 components via PCA by
default.
Does the PCA issue effect t-SNE? Well, a little bit. Below are results for
norb
and macosko2015
(openTSNE's documentation is where I discovered this
dataset, as it happens) with t-SNE results from Rtsne
, using
perplexity = 15
. The images on the left use PCA to reduce the initial
dimensionality down to 100, while the ones on the right use the raw input data.
| | | :----------------------------:|:--------------------------: | |
norb
results seem fairly unaffected, but we also see for macosko2015
that
there's a change to the main large cluster. Usually, you don't set perplexity
as low as 15, so are there also differences when the perplexity is set to a
more typical value, like 50? Also, the default PCA dimensionality setting
in Rtsne
is 50, not 100, so here are those t-SNE plots repeated with
perplexity = 50
and either pca = TRUE, partial_pca = TRUE, initial_dims = 50
on the left, and pca = FALSE
on the right:
| | | :----------------------------:|:--------------------------: | |
Again, the norb
dataset is less affected than macosko2015
but for the latter
there is a definite effect of preprocessing with PCA. This is something to bear
in mind before applying PCA to your data even with t-SNE. On balance, I'd
probably live with the effect of applying PCA to your data, because just like
with uwot
, Rtsne
input processing can take a few hours with very high
dimensional datasets.
Finally, some people consider results obtained with PCA to be more reliable
than those using the "raw" input, because PCA can be considered a way to denoise
data, where the lower-variance components are assumed to be irrelevant. I might
be a bit biased, but of all the UMAP plots of macosko2015
, I think the default
uwot
one shows the best separation of clusters, so maybe there's something to
it. But this is also the reason I tabulated the amount of variance 100
components explained for each dataset. Where most of the datasets keep a large
majority of the variance with 100 components, in the case of macosko2015
, you
would have to be prepared to argue why it was ok to throw away over 60% of
variance in the data. I'm sure it can be done, but you would need a solid,
ideally biology-based justification.
A future version of uwot
will hopefully contain an implementation of nearest
neighbor descent optimization to reduce the gap in nearest neighbor accuracy
with UMAP
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.