December 29 2018 New, better settings for t-SNE, better plots and a couple of new datasets. Removed neighborhood preservation values until I've double checked they are working correctly.

Here are some examples of the output of uwot's implementation of UMAP, compared to t-SNE output. As you will see, UMAP's output results in more compact, separated clusters compared to t-SNE.

Data preparation

For details on the datasets, follow their links. Somewhat more detail is also given in the smallvis documentation. iris you already have if you are using R. s1k is part of the sneer package. frey, oli, mnist, fashion, kuzushiji, norb and cifar10 can be downloaded via snedata. coil20 and coil100 can be fetched via coil20.

mnist <- snedata::download_mnist()

# For some functions we need to strip out non-numeric columns and convert data to matrix
x2m <- function(X) {
  if (!methods::is(X, "matrix")) {
    m <- as.matrix(X[, which(vapply(X, is.numeric, logical(1)))])
  }
  else {
    m <- X
  }
  m
}

At the time I generated this document (late December 2018), the kuzushiji dataset had some duplicate and all-black images that needed filtering. This seems to have been remedied as of early February 2019. I re-ran both UMAP and t-SNE on the fixed dataset, but the results weren't noticeably different. For the record, the clean-up routines I ran were:

# Remove all-black images in Kuzushiji MNIST (https://github.com/rois-codh/kmnist/issues/1)
kuzushiji <- kuzushiji[-which(apply(x2m(kuzushiji), 1, sum) == 0), ]
# Remove duplicate images (https://github.com/rois-codh/kmnist/issues/5)
kuzushiji <- kuzushiji[-which(duplicated(x2m(kuzushiji))), ]

UMAP settings

For UMAP, I stick with the defaults, with the exception of iris, coil20, and coil100 and norb. The spectral initialization with the default n_neighbors leads to disconnected components, which can lead to a poor global picture of the data. The Python UMAP implementation goes to fairly involved lengths to ameliorate theses issues, but uwot does not.

For these datasets, a perfectly good alternative that provides a global initialization is to use the first two components from PCA, scaled so their standard deviations are initially 1e-4 (via init = "spca"). This usually results in an embedding which isn't too different from starting via the raw PCA but is more compact, i.e. less space between clusters. For visualizations, as long as the relative orientation and rough distances between clusters are maintained, the exact distances between them are not that interesting to me.

Dimensionality was reduced to 100 by applying PCA to the data. It's commonly applied to data before or as part of t-SNE, so to avoid any differences from this pre-processing, I applied PCA to the UMAP data. I used 100 components, rather than the usual 50 just to be on the safe side. This seemed to have no major effect on the resulting visualizations.

# For iris 
iris_umap <- umap(iris, init = "spca")

# Small datasets (s1k, oli, frey)
s1k_umap <- umap(s1k)

# Big datasets (mnist, fashion, kuzushiji)
mnist_umap <- umap(mnist, pca = 100)

# norb, coil20 and coil100, tasic2018
coil20_umap <- umap(coil20, pca = 100, init = "spca")

t-SNE settings

The Rtsne package was used for the t-SNE calculations, except for the iris dataset, proving troublesome once again. This time it's because Rtsne doesn't allow for duplicates. For iris only, I used the smallvis package.

For t-SNE, I also employ the following non-defaults:

It would be nice to use the same initial coordinates for both methods, but unfortunately Rtsne doesn't apply early exaggeration with user-supplied input. Without early exaggeration, t-SNE results aren't as good, especially with larger datasets. Therefore the t-SNE plots use a random initialization.

# For iris only
iris_tsne <- smallvis::smallvis(iris, perplexity = 15, Y_init = "rand", exaggeration_factor = 4)

# Small datasets (s1k, oli, frey)
s1k_tsne <- Rtsne::Rtsne(x2m(s1k), perplexity = 15, initial_dims = 100,
                       partial_pca = TRUE, exaggeration_factor = 4)

# Big datasets (coil20, coil100, mnist, fashion, kuzushiji etc.)
mnist_tsne <- Rtsne::Rtsne(x2m(mnist), perplexity = 15, initial_dims = 100,
                       partial_pca = TRUE, exaggeration_factor = 12)

Visualization

For visualization, I used the vizier package. The plots are colored by class membership (there's an obvious choice for every dataset considered), except for frey, where the points are colored according to their position in the sequence of images.

embed_img <- function(X, Y, k = 15, ...) {
  args <- list(...)
  args$coords <- Y
  args$x <- X

  do.call(vizier::embed_plot, args)
}
embed_img(iris, iris_umap, pc_axes = TRUE, equal_axes = TRUE, alpha_scale = 0.5, title = "iris UMAP", cex = 1)

For UMAP, where non-default initialization was used, it's noted in the title of the plot (e.g. "(spca)").

iris

The standard iris dataset, known and loved by all.

| | | :----------------------------:|:--------------------------: iris UMAP (spca)|iris t-SNE

s1k

A 9-dimensional fuzzy simplex, which I created for testing t-SNE and related methods, original in the sneer package.

| | | :----------------------------:|:--------------------------: s1k UMAP|s1k t-SNE

oli

The ORL database of faces.

| | | :----------------------------:|:--------------------------: oli UMAP|oli t-SNE

frey

Images of Brendan Frey's face, as far as I know originating from a page belonging to Saul Roweis.

| | | :----------------------------:|:--------------------------: frey UMAP|frey t-SNE

isofaces

Yet more faces, this time the dataset used in Isomap, consisting of images of the same face under different rotations and lighting conditions. Unfortunately, it's no longer available at the MIT website, but it can be found via the Wayback Machine. I wrote a gist for processing the data in R.

In the images below the points are colored by the first pose angle.

| | | :----------------------------:|:--------------------------: isofaces UMAP|isofaces t-SNE

coil20

The COIL-20 Columbia Object Image Library.

| | | :----------------------------:|:--------------------------: coil20 UMAP (spca)|coil20 t-SNE

coil100

The COIL-100 Columbia Object Image Library.

| | | :----------------------------:|:--------------------------: coil100 UMAP (spca)|coil100 t-SNE

The UMAP results are rather hard to visualize on a static plot. If you could pan and zoom around, you would see that the rather indistinct blobs are mainly correctly preserved loops. This is an example where choosing a different value of the a and b parameters would be a good idea. The t-UMAP variant, available as tumap uses a = 1, b = 1 (effectively using the same Cauchy kernel as t-SNE) does a better job here and is shown below on the left. The correct choice of parameters is also important for t-SNE. On the right is the t-SNE result with a default perplexity = 50, which does not retain as much of the loop structure as the lower perplexity:

| | | :----------------------------:|:--------------------------: coil100 t-UMAP (spca)|coil100 t-SNE (perplexity 50)

swiss roll

The Swiss Roll data used in Isomap. A famous dataset, but perhaps not that representative of typical real world datasets. t-SNE is know to not handle this well, but UMAP makes an impressive go at unfolding it.

| | | :----------------------------:|:--------------------------: swiss UMAP|swiss t-SNE

mnist

The MNIST database of handwritten digits.

| | | :----------------------------:|:--------------------------: mnist UMAP|mnist t-SNE

fashion

The Fashion MNIST database, images of fashion objects.

| | | :----------------------------:|:--------------------------: fashion UMAP|fashion t-SNE

kuzushiji (KMNIST)

The Kuzushiji MNIST database, images of cursive Japanese handwriting.

| | | :----------------------------:|:--------------------------: kuzushiji UMAP|kuzushiji t-SNE

norb

The small NORB dataset, pairs of images of 50 toys photographed at different angles and under different lighting conditions.

| | | :----------------------------:|:--------------------------: norb UMAP|norb t-SNE

cifar10

The CIFAR-10 dataset, consisting of 60000 32 x 32 color images evenly divided across 10 classes (e.g. airplane, cat, truck, bird). t-SNE was applied to CIFAR-10 in the Barnes-Hut t-SNE paper, but in the main paper, only the results after passing through a convolutional neural network were published. t-SNE on the original pixel data was only given in the supplementary information (PDF) which is oddly hard to find a link to via JMLR or the article itself, and in the less widely-cited preliminary investigation into BH t-SNE.

| | | :----------------------------:|:--------------------------: cifar10 UMAP|cifar10 t-SNE

There is an outlying orange cluster (which isn't easy to see) in the top right of the UMAP plot. I see the same thing in the Python implementation, so I don't think this is a bug in uwot (although I also said that in a previous version of this page, and it turned out there was a bug in uwot. The current result really is closer to the Python version now, though). That cluster of images is of some automobiles but they all seem to be variations of the same image. The same cluster is present in the t-SNE plot (bottom left), but is more comfortably close to the rest of the data.

The existence of these near-duplicates in CIFAR-10 doesn't seem to have been widely known or appreciated until quite recently, see for instance this twitter thread and this paper by Recht and co-workers. Such manipulations are in line with the practice of data augmentation that is popular in deep learning, but you would need to be aware of it to avoid the test set results being contaminated. These images seem like a good argument for applying UMAP or t-SNE to your dataset as a way to spot this sort of thing.

We can get better results with UMAP by using the scaled PCA initialization (below on the left) and by using the t-UMAP settings (below, right):

| | | :----------------------------:|:--------------------------: cifar10 UMAP (spca)|cifar10 t-UMAP (spca)

That rogue cluster is still present (off to the lower right now), but either image is more comparable to the t-SNE result.

As the visualization of CIFAR-10 isn't very successful with UMAP or t-SNE, here are some results using the activations of a convnet, similar to that used in the BH t-SNE paper. For the convnet, I used a keras implementation, taken from the Machine Learning in Action blog. Features were Z-scaled as carried out in the blog, but I used 100 epochs and a batch size of 128 and I also used the RMSprop (PDF) optimizer favored in the Deep Learning with Python book (with lr=1e-4 and decay=1e-6). Without any data augmentation, this gave a test set error of 0.1655, slightly lower than the test set result given in the BH t-SNE paper (which used a different architecture and without the benefit of an extra 4 years of deep learning research). After retraining with all 60000 images, the flattened output of the final max-pool layer was used, giving 2048 features (the BH t-SNE paper network had 1024 output activations). UMAP and t-SNE results are below. For the UMAP results, I used the t-UMAP settings with scaled PCA initialization.

| | | :----------------------------:|:--------------------------: cifar10 activations t-UMAP|cifar10 activations t-SNE

Results don't look quite as good as those in the BH t-SNE paper, but they are still an improvement. The orange cluster of automobiles remains as an outlier, even in the activation space. You can also see it in the BH t-SNE paper in the lower image in Figure 5 (orange cluster at the bottom, slightly left of center).

tasic2018

The tasic2018 dataset is a transcriptomics dataset of mouse brain cell RNA-seq data from the Allen Brain Atlas (originally reported by Tasic and co-workers. There is gene expression data for 14,249 cells from the primary visual cortex, and 9,573 cells from the anterior lateral motor cortex to give a dataset of size n = 23,822 overall. Expression data for 45,768 genes were obtained in the original data, but the dataset used here follows the pre-processing treatment of Kobak and Berens which applied a normalization and log transformation and then only kept the top 3000 most variable genes.

The data can be generated from the Allen Brain Atlas website and processed in Python by following the instructions in this Berens lab notebook. I output the data to CSV format for reading into R and assembling into a data frame with the following extra exporting code:

np.savetxt("path/to/allen-visp-alm/tasic2018-log3k.csv", logCPM, delimiter=",")
np.savetxt("path/to/allen-visp-alm/tasic2018-areas.csv", tasic2018.areas, delimiter=",", fmt = "%d")
np.savetxt("path/to/allen-visp-alm/tasic2018-genes.csv", tasic2018.genes[selectedGenes], delimiter=",", fmt='%s')
np.savetxt("path/to/allen-visp-alm/tasic2018-clusters.csv", tasic2018.clusters, delimiter=",", fmt='%d')
np.savetxt("path/to/allen-visp-alm/tasic2018-cluster-names.csv", tasic2018.clusterNames, delimiter=",", fmt='%s')
np.savetxt("path/to/allen-visp-alm/tasic2018-cluster-colors.csv", tasic2018.clusterColors, delimiter=",", fmt='%s')

| | | :----------------------------:|:--------------------------: tasic2018 UMAP|tasic2018 t-SNE

Again, the default UMAP settings produce clusters that are bit too well-separated to clearly see in these images. So here are the results from using t-UMAP (the same as setting a = 1, b = 1), on the left, and then using a = 2, b = 2 on the right:

| | | :----------------------------:|:--------------------------: tasic2018 t-UMAP|tasic2018 UMAP (a = 2, b = 2)

macosko2015

Another transcriptomics data set, used as an example in openTSNE. This contains data for 44,808 cells from the mouse retina.

The raw data was fetched similarly to this shell script from the Hemberg Lab and then the data was prepared using the openTSNE notebook by Pavlin Policar. Similarly to the tasic2018 dataset, data was log normalized and the 3,000 most variable genes were retained.

I exported the data (without the Z-scaling and PCA dimensionality reduction), as a CSV file, e.g.:

np.savetxt("/path/to/macosko2015/macosko2015-log3k.csv", x, delimiter=",")
# Use these as column names
np.savetxt("/path/to/macosko2015/macosko2015-genenames.csv", data.T.columns.values[gene_mask].astype(str), delimiter=",", fmt = "%s")
np.savetxt("/path/to/macosko2015/macosko2015-clusterids.csv", cluster_ids.values.astype(int), delimiter=",", fmt = "%d")

| | | :----------------------------:|:--------------------------: macosko2015 UMAP|macosko2015 t-SNE

This is another result where the UMAP defaults might need a bit of fiddling with if you don't like how separated the clusters are.

The openTSNE results use Z-scaling of the inputs before applying t-SNE, so below are the results for UMAP and t-SNE with Z-scaling applied (for umap, pass scale = "Z" as an argument so you don't have to do it manually):

| | | :----------------------------:|:--------------------------: macosko2015 t-UMAP, Z-scaled|macosko2015 t-SNE Z-scaled



jlmelville/uwot documentation built on April 25, 2024, 5:20 a.m.