The LargeVis method (see also its github repository) is in many respects a bridge between t-SNE and UMAP. It's sufficiently close to UMAP that `uwot` also offers a LargeVis-like method, `lvish`:
```r
# perplexity, init and n_epochs values shown are the defaults
# use perplexity instead of n_neighbors to control local neighborhood size
mnist_lv <- lvish(mnist, perplexity = 50, init = "lvrand", n_epochs = 5000,
                  verbose = TRUE)
# Make hilarious Lembas bread joke
```
Although `lvish` is like the real LargeVis in terms of the input weights, output weight function and gradient, and so should give results that resemble the real thing, note that:

* Matrix input is scaled by default, unlike `umap`, where no scaling is carried out. Scaling can be controlled by the `scale` parameter.
* Nearest neighbor results are not refined via the neighbor-of-my-neighbor method used by LargeVis; the `search_k` parameter is twice as large as Annoy's default to compensate.
* The other nearest neighbor index parameter, `n_trees`, is not dynamically chosen based on data set size. In LargeVis, it ranges between 10 (for N < 100,000) and 100 (for N > 5,000,000). The `lvish` default of 50 would cover datasets up to N = 5,000,000, and combined with the default `search_k`, seems suitable for the datasets I've looked at.
* The default number of epochs is much larger than for `umap`. You may be able to get away with fewer epochs, and using the UMAP initialization of `init = "spectral"`, rather than the default Gaussian random initialization (`init = "lvrand"`), can help.

The left-hand image below is the result of running the official LargeVis
implementation on MNIST. The image on the right is that from running lvish
with its default settings (apart from setting n_threads = 8
). Given they were
both initialized from different random configurations, there's no reason to
believe they would be identical, but they look pretty similar:
*(Left: official LargeVis on MNIST; right: `lvish` with default settings. Images not shown.)*
Because the default number of neighbors is 3 times the `perplexity`, and the default `perplexity = 50`, the nearest neighbor search needs to find 150 nearest neighbors per data point, an order of magnitude larger than the UMAP defaults. This leads to a less sparse input graph and hence more edges to sample. Combined with the increased number of epochs, expect `lvish` to be slower than `umap`: with default single-threaded settings, it took about 20 minutes to embed the MNIST data under the same circumstances as described in the "Performance" section. With `n_threads = 4`, it took 7 minutes. In addition, storing those extra edges requires a lot more memory than the `umap` defaults: my R session increased by around 3.2 GB, versus 1 GB for `umap`.
As an alternative to the usual Gaussian input weight function, you can use the k-nearest neighbor graph itself, by setting `kernel = "knn"`. This will give each edge between neighbors a uniform weight equal to `1/perplexity`, which leads to each row's probability distribution having the target perplexity. This matrix will then be symmetrized in the usual way. The advantage of this is that the number of neighbors is reduced to the same as the perplexity (indeed, the `n_neighbors` parameter is ignored with this setting), which leads to less memory usage and a faster runtime. You can also get away with setting the perplexity to a much lower value than usual with this kernel (e.g. `perplexity = 15`) and get closer to UMAP's performance. If you use the default LargeVis random initialization, you will still need more epochs than UMAP, but you can still expect to see a big improvement. Something like the following works for MNIST:
```r
mnist_lv <- lvish(mnist, kernel = "knn", perplexity = 15, n_epochs = 1500,
                  init = "lvrand", verbose = TRUE)
```
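As an aside, the claim that uniform weights of `1/perplexity` hit the target perplexity exactly is easy to verify: the perplexity of a distribution is the exponential of its Shannon entropy, and a uniform distribution over k outcomes has entropy log(k). A minimal base-R sketch (the `perplexity_of` helper is purely illustrative, not part of `uwot`):

```r
# Perplexity of a discrete probability distribution: exp of its Shannon entropy
perplexity_of <- function(p) {
  p <- p[p > 0]
  exp(-sum(p * log(p)))
}

# A uniform distribution over 15 neighbors, as produced by kernel = "knn"
# with perplexity = 15: each edge gets weight 1/15
p_uniform <- rep(1 / 15, 15)
perplexity_of(p_uniform)  # exactly 15
```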
For details on the datasets, and to compare with the output of UMAP and t-SNE, see the UMAP examples gallery.
As mentioned above, by default `lvish` uses a Gaussian similarity function to determine perplexities, just like t-SNE. These results are given below.

There are two images per dataset. The left-hand image uses a perplexity of 15, which is similar to the sort of settings UMAP uses. The right-hand image is for a perplexity of 50, which is the LargeVis default. The only other non-default setting was to use `pca = 100`, which reduces the input dimensionality to 100.
```r
iris_lv15 <- lvish(iris, pca = 100, perplexity = 15)
iris_lv50 <- lvish(iris, pca = 100, perplexity = 50)
```
Note that by default `lvish` uses a random initialization and a much larger number of epochs to match the LargeVis defaults. This makes the optimization take a lot longer than UMAP. LargeVis uses multiple threads during the optimization phase, but `lvish` does not, to ensure reproducibility of results with a fixed random seed. To get multi-threaded performance like LargeVis, add the option `n_sgd_threads = "auto"`, e.g.:
```r
iris_lv15 <- lvish(iris, pca = 100, perplexity = 15, n_sgd_threads = "auto")
```
I would also suggest that you fix the number of epochs to a smaller value initially and see if that provides an adequate visualization.
```r
iris_lv15 <- lvish(iris, pca = 100, perplexity = 15, n_sgd_threads = "auto",
                   n_epochs = 500)
```
Default initialization in `lvish`, as with LargeVis and t-SNE, is from a random distribution. As with t-SNE, one issue with this is that sometimes clusters get split up by another cluster and are unable to re-merge. MNIST is the easiest example in which to see this.
In general, there's not a huge difference in the effect of increasing perplexity for larger datasets. For smaller datasets it's apparent that the resulting clusters tend to be more spread out with larger perplexity values. The `norb` (small NORB) dataset shows an obvious difference, where the `perplexity = 15` results are clearly too low, and break up the structures that are apparent at `perplexity = 50`. A similar effect is seen when using UMAP, so I don't think this is due to the random initialization of `lvish` in this case. A contributing factor is likely to be that the initial PCA dimensionality reduction to 100 dimensions is too aggressive for NORB and reduces the nearest neighbor accuracy, which is recovered at higher perplexities (as these require finding more near neighbors).
On the other hand, it's harder to tell from the static images what's going on with the `coil20` and especially the `coil100` results, but, in contrast to the `norb` results, the `perplexity = 50` value is too high here, and the loop structure of the clusters gets broken up.
The `coil100` and `coil20` results show an issue with using LargeVis (and UMAP) that isn't normally a problem with t-SNE: the extra repulsion in their cost function can often spread the data quite far apart compared to the cluster sizes. t-SNE has the opposite problem of the clusters expanding into a large circular form, which makes discerning clusters harder as the datasets get larger, but in a single static plot, I find the t-SNE results to be easier to examine. For UMAP and `lvish`, you may have to resort to more interactive means of examining the data, such as using the `embed_plotly` function in vizier.
An alternative for `lvish` is to modify the `repulsion_strength` parameter (referred to as `gamma` in LargeVis). The default value, `7`, is taken from the LargeVis paper but seems to have been chosen empirically. Here are results for `coil20` and `perplexity = 50` with the repulsion reduced to `repulsion_strength = 0.7` in the left image, and `repulsion_strength = 0.07` on the right:
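For reference, results like these can be generated with calls along the following lines (a sketch only: it assumes `coil20` is already loaded and that the other settings match the earlier examples):

```r
# reduce the repulsion from the lvish default of 7
coil20_lv_r07  <- lvish(coil20, pca = 100, perplexity = 50,
                        repulsion_strength = 0.7)
coil20_lv_r007 <- lvish(coil20, pca = 100, perplexity = 50,
                        repulsion_strength = 0.07)
```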
This helps a bit, but there are limits: the blue cluster on the right remains an outlier, and reducing the `repulsion_strength` too far causes some of the loops to shrink, as can be seen with the outlying blue and black clusters in the right-hand plot.
As an alternative to using the Gaussian perplexities, you could use the k-nearest neighbor graph directly, which involves setting the similarity of $i$ with $j$ to 1 if $i$ is in the k-nearest neighbors of $j$, and 0 otherwise. The usual t-SNE procedure of symmetrizing (but not the normalization step) is then carried out. There are some t-SNE implementations which use kNN-derived perplexities, e.g. the majorization-minimization approach of Yang and co-workers.
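The construction just described (binary kNN similarities, symmetrized by averaging with the transpose, with no further normalization) can be sketched in a few lines of base R; the variable names here are illustrative only:

```r
set.seed(42)
X <- matrix(rnorm(20), ncol = 2)  # 10 points in 2 dimensions
k <- 3
d <- as.matrix(dist(X))

# A[i, j] = 1 if j is one of the k nearest neighbors of i, 0 otherwise
A <- t(apply(d, 1, function(row) {
  nn <- order(row)[2:(k + 1)]  # positions 2..k+1: skip self at position 1
  out <- numeric(length(row))
  out[nn] <- 1
  out
}))

# symmetrize by averaging each entry with its transpose, without any
# normalization over the whole matrix
P <- (A + t(A)) / 2
isSymmetric(P)
```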
The advantage of using the kNN kernel is that you get a sparser set of edges, which, because `lvish` uses the same formula for determining the number of iterations required, results in a shorter run time.

For the results below, the kNN perplexities were used by setting `kernel = "knn"`:
```r
iris_lv15k <- lvish(iris, pca = 100, perplexity = 15, kernel = "knn")
iris_lv50k <- lvish(iris, pca = 100, perplexity = 50, kernel = "knn")
```
For comparison, the top row of images use the same settings as in the previous section with Gaussian perplexities, and the bottom row show the results for using the kNN kernel. The run time for each embedding is given in the image. I used version 0.1.3 of `uwot` from CRAN.
For smaller datasets, the kNN kernel gives noticeably different results from the Gaussian perplexity, particularly for `perplexity = 50`. For `iris`, `s1k` and `oli`, the trend seems to be that the results are more expanded with the kNN kernel. For `frey`, `coil20` and `coil100`, the clusters are more separated. For the larger datasets, the difference in behavior is less pronounced, although the `macosko2015` results show a larger cluster separation as well. The good news is that in all cases, the run times are noticeably reduced.

So for larger datasets, using `kernel = "knn"` seems to be an acceptable choice for reducing the runtime of `lvish`. It also seems that you may want to use a smaller value of the `perplexity` than you would with a Gaussian kernel, which further reduces the runtime. For smaller datasets, results are more mixed, although a smaller `perplexity` is definitely to be preferred in that case too.
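Putting those recommendations together, a reasonable starting point for a larger dataset might look like the following (a sketch only: `X` stands in for your data, and `perplexity` and `n_epochs` should be tuned for your own case):

```r
# sparser kNN kernel, low perplexity, multi-threaded optimization
X_lv <- lvish(X, kernel = "knn", perplexity = 15,
              n_sgd_threads = "auto", verbose = TRUE)
```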