knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

If you choose the UMAP curve parameters to be a = 1 and b = 1, you get back the Cauchy distribution used in t-Distributed Stochastic Neighbor Embedding (t-SNE) and LargeVis. This also happens to significantly simplify the gradient, leading to a noticeable speed-up.
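For reference, UMAP's weight for a pair of embedded points at distance d is 1 / (1 + a * d^(2 * b)), which at a = 1 and b = 1 reduces to the Cauchy kernel 1 / (1 + d^2). Here is a minimal sketch of that curve as an R function (umap_weight is just an illustrative helper, not part of uwot):

# UMAP's output weight curve as a function of the embedded distance d.
# With a = 1, b = 1 it becomes 1 / (1 + d^2), the Cauchy kernel used by
# t-SNE and LargeVis.
umap_weight <- function(d, a = 1, b = 1) {
  1 / (1 + a * d ^ (2 * b))
}

umap_weight(1)  # 0.5 at d = 1 with a = b = 1
umap_weight(2)  # 0.2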

For MNIST:

library(uwot)
# install the snedata package from GitHub
# pak::pkg_install("jlmelville/snedata")
mnist <- snedata::download_mnist()

I saw the optimization time drop from 66 seconds with UMAP:

mnist_umap <- umap(mnist, n_neighbors = 15)
knitr::include_graphics("img/tumap/umap.png")

to 18 seconds with t-UMAP:

mnist_tumap <- tumap(mnist, n_neighbors = 15)
knitr::include_graphics("img/tumap/tumap.png")

Most of the run time is still spent in the nearest neighbor search, so the difference in total time only becomes substantial with larger values of n_epochs. The trade-off, as you can see, is that you get larger, more spread-out clusters than with the typical UMAP settings (they're still more compact than in t-SNE, however). I think it's worth the trade-off.
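For example (the n_epochs value below is just an illustrative choice; timings will vary by machine), running more epochs makes the gap in total time much more obvious:

mnist_umap500 <- umap(mnist, n_neighbors = 15, n_epochs = 500)
mnist_tumap500 <- tumap(mnist, n_neighbors = 15, n_epochs = 500)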

Note that using umap(a = 1, b = 1) doesn't use the simplified gradient, so you won't see any speed-up that way.
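As a concrete sketch of that point (mnist_umap11 is just an arbitrary name), the following call uses the same Cauchy output kernel as tumap() above, but it goes through the general UMAP gradient, so it runs at ordinary umap() speed:

# Equivalent output kernel to tumap(), but without the simplified gradient
mnist_umap11 <- umap(mnist, n_neighbors = 15, a = 1, b = 1)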

Some examples comparing UMAP and t-UMAP are in the examples article.


