December 29 2018 New, better settings for t-SNE, better plots and a couple of new datasets. Removed neighborhood preservation values until I've double checked they are working correctly.
Here are some examples of the output of uwot
's implementation of UMAP,
compared to t-SNE output. As you will see, UMAP's output results in more
compact, separated clusters compared to t-SNE.
For details on the datasets, follow their links. Somewhat more detail is also
given in the
smallvis documentation.
iris
you already have if you are using R. s1k
is part of the
sneer package. frey
, oli
, mnist
,
fashion
, kuzushiji
, norb
and cifar10
can be downloaded via
snedata. coil20
and coil100
can be
fetched via coil20.
mnist <- snedata::download_mnist() # For some functions we need to strip out non-numeric columns and convert data to matrix x2m <- function(X) { if (!methods::is(X, "matrix")) { m <- as.matrix(X[, which(vapply(X, is.numeric, logical(1)))]) } else { m <- X } m }
At the time I generated this document (late December 2018), the kuzushiji
dataset had some duplicate and all-black images that needed filtering. This
seems to have been remedied as of early February 2019. I re-ran both UMAP and
t-SNE on the fixed dataset, but the results weren't noticeably different. For
the record, the clean-up routines I ran were:
# Remove all-black images in Kuzushiji MNIST (https://github.com/rois-codh/kmnist/issues/1) kuzushiji <- kuzushiji[-which(apply(x2m(kuzushiji), 1, sum) == 0), ] # Remove duplicate images (https://github.com/rois-codh/kmnist/issues/5) kuzushiji <- kuzushiji[-which(duplicated(x2m(kuzushiji))), ]
For UMAP, I stick with the defaults, with the exception of iris
, coil20
, and
coil100
and norb
. The spectral initialization with the default n_neighbors
leads to disconnected components, which can lead to a poor global picture of the
data. The Python UMAP implementation goes to fairly involved lengths to
ameliorate theses issues, but uwot
does not.
For these datasets, a perfectly good alternative that provides a global
initialization is to use the first two components from PCA, scaled so their
standard deviations are initially 1e-4 (via init = "spca"
). This usually
results in an embedding which isn't too different from starting via the raw
PCA but is more compact, i.e. less space between clusters. For visualizations,
as long as the relative orientation and rough distances between clusters are
maintained, the exact distances between them are not that interesting to me.
Dimensionality was reduced to 100 by applying PCA to the data. It's commonly applied to data before or as part of t-SNE, so to avoid any differences from this pre-processing, I applied PCA to the UMAP data. I used 100 components, rather than the usual 50 just to be on the safe side. This seemed to have no major effect on the resulting visualizations.
# For iris iris_umap <- umap(iris, init = "spca") # Small datasets (s1k, oli, frey) s1k_umap <- umap(s1k) # Big datasets (mnist, fashion, kuzushiji) mnist_umap <- umap(mnist, pca = 100) # norb, coil20 and coil100, tasic2018 coil20_umap <- umap(coil20, pca = 100, init = "spca")
The Rtsne package was used for the
t-SNE calculations, except for the iris
dataset, proving troublesome once
again. This time it's because Rtsne
doesn't allow for duplicates. For iris
only, I used the smallvis package.
For t-SNE, I also employ the following non-defaults:
perplexity = 15
, which is closer to the neighborhood size used by UMAPinitial_dims = 100
, which applies PCA to the input, keeping the first 100
components, rather than the usual 50
to minmize distortion from the initial
PCA.uwot
also makes use of irlba
, there is no reason not to use
partial_pca
.It would be nice to use the same initial coordinates for both methods,
but unfortunately Rtsne
doesn't apply early exaggeration with user-supplied
input. Without early exaggeration, t-SNE results aren't as good, especially with
larger datasets. Therefore the t-SNE plots use a random initialization.
# For iris only iris_tsne <- smallvis::smallvis(iris, perplexity = 15, Y_init = "rand", exaggeration_factor = 4) # Small datasets (s1k, oli, frey) s1k_tsne <- Rtsne::Rtsne(x2m(s1k), perplexity = 15, initial_dims = 100, partial_pca = TRUE, exaggeration_factor = 4) # Big datasets (coil20, coil100, mnist, fashion, kuzushiji etc.) mnist_tsne <- Rtsne::Rtsne(x2m(mnist), perplexity = 15, initial_dims = 100, partial_pca = TRUE, exaggeration_factor = 12)
For visualization, I used the vizier
package. The plots are colored by class membership (there's an obvious choice
for every dataset considered), except for frey
, where the points are colored
according to their position in the sequence of images.
embed_img <- function(X, Y, k = 15, ...) { args <- list(...) args$coords <- Y args$x <- X do.call(vizier::embed_plot, args) } embed_img(iris, iris_umap, pc_axes = TRUE, equal_axes = TRUE, alpha_scale = 0.5, title = "iris UMAP", cex = 1)
For UMAP, where non-default initialization was used, it's noted in the title of the plot (e.g. "(spca)").
The standard iris
dataset, known and loved by all.
| | | :----------------------------:|:--------------------------: |
A 9-dimensional fuzzy simplex, which I created for testing t-SNE and related methods, original in the sneer package.
| | | :----------------------------:|:--------------------------: |
| | | :----------------------------:|:--------------------------: |
Images of Brendan Frey's face, as far as I know originating from a page belonging to Saul Roweis.
| | | :----------------------------:|:--------------------------: |
Yet more faces, this time the dataset used in Isomap, consisting of images of the same face under different rotations and lighting conditions. Unfortunately, it's no longer available at the MIT website, but it can be found via the Wayback Machine. I wrote a gist for processing the data in R.
In the images below the points are colored by the first pose angle.
| | | :----------------------------:|:--------------------------: |
The COIL-20 Columbia Object Image Library.
| | | :----------------------------:|:--------------------------: |
The COIL-100 Columbia Object Image Library.
| | | :----------------------------:|:--------------------------: |
The UMAP results are rather hard to visualize on a static plot. If you could
pan and zoom around, you would see that the rather indistinct blobs are mainly
correctly preserved loops. This is an example where choosing a different value
of the a
and b
parameters would be a good idea. The t-UMAP variant,
available as tumap
uses a = 1
, b = 1
(effectively using the same Cauchy
kernel as t-SNE) does a better job here and is shown below on the left. The
correct choice of parameters is also important for t-SNE. On the right is
the t-SNE result with a default perplexity = 50
, which does not retain as
much of the loop structure as the lower perplexity:
| | | :----------------------------:|:--------------------------: |
The Swiss Roll data used in Isomap. A famous dataset, but perhaps not that representative of typical real world datasets. t-SNE is know to not handle this well, but UMAP makes an impressive go at unfolding it.
| | | :----------------------------:|:--------------------------: |
The MNIST database of handwritten digits.
| | | :----------------------------:|:--------------------------: |
The Fashion MNIST database, images of fashion objects.
| | | :----------------------------:|:--------------------------: |
The Kuzushiji MNIST database, images of cursive Japanese handwriting.
| | | :----------------------------:|:--------------------------: |
The small NORB dataset, pairs of images of 50 toys photographed at different angles and under different lighting conditions.
| | | :----------------------------:|:--------------------------: |
The CIFAR-10 dataset, consisting of 60000 32 x 32 color images evenly divided across 10 classes (e.g. airplane, cat, truck, bird). t-SNE was applied to CIFAR-10 in the Barnes-Hut t-SNE paper, but in the main paper, only the results after passing through a convolutional neural network were published. t-SNE on the original pixel data was only given in the supplementary information (PDF) which is oddly hard to find a link to via JMLR or the article itself, and in the less widely-cited preliminary investigation into BH t-SNE.
| | | :----------------------------:|:--------------------------: |
There is an outlying orange cluster (which isn't easy to see) in the top right
of the UMAP plot. I see the same thing in the Python implementation, so I don't
think this is a bug in uwot
(although I also said that in a previous version
of this page, and it turned out there was a bug in uwot
. The current result
really is closer to the Python version now, though). That cluster of images is
of some automobiles but they all seem to be variations of the same image. The
same cluster is present in the t-SNE plot (bottom left), but is more comfortably
close to the rest of the data.
The existence of these near-duplicates in CIFAR-10 doesn't seem to have been widely known or appreciated until quite recently, see for instance this twitter thread and this paper by Recht and co-workers. Such manipulations are in line with the practice of data augmentation that is popular in deep learning, but you would need to be aware of it to avoid the test set results being contaminated. These images seem like a good argument for applying UMAP or t-SNE to your dataset as a way to spot this sort of thing.
We can get better results with UMAP by using the scaled PCA initialization (below on the left) and by using the t-UMAP settings (below, right):
| | | :----------------------------:|:--------------------------: |
That rogue cluster is still present (off to the lower right now), but either image is more comparable to the t-SNE result.
As the visualization of CIFAR-10 isn't very successful with UMAP or t-SNE, here
are some results using the activations of a convnet, similar to that used in
the BH t-SNE paper. For the convnet, I used a keras
implementation, taken from the
Machine Learning in Action blog. Features were Z-scaled as
carried out in the blog, but I used 100 epochs and a batch size of 128 and I also
used the RMSprop (PDF)
optimizer favored in the
Deep Learning with Python book
(with lr=1e-4
and decay=1e-6
). Without any data augmentation, this gave a
test set error of 0.1655
, slightly lower than the test set result given in the
BH t-SNE paper (which used a different architecture and without the benefit of
an extra 4 years of deep learning research). After retraining with all 60000
images, the flattened output of the final max-pool layer was used, giving 2048
features (the BH t-SNE paper network had 1024 output activations). UMAP and
t-SNE results are below. For the UMAP results, I used the t-UMAP settings with
scaled PCA initialization.
| | | :----------------------------:|:--------------------------: |
Results don't look quite as good as those in the BH t-SNE paper, but they are still an improvement. The orange cluster of automobiles remains as an outlier, even in the activation space. You can also see it in the BH t-SNE paper in the lower image in Figure 5 (orange cluster at the bottom, slightly left of center).
The tasic2018
dataset is a transcriptomics dataset of mouse brain cell RNA-seq
data from the Allen Brain Atlas
(originally reported by
Tasic and co-workers.
There is gene expression data for 14,249 cells from the primary visual cortex,
and 9,573 cells from the anterior lateral motor cortex to give a dataset of size
n = 23,822 overall. Expression data for 45,768 genes were obtained in the
original data, but the dataset used here follows the pre-processing treatment of
Kobak and Berens which applied a normalization
and log transformation and then only kept the top 3000 most variable genes.
The data can be generated from the Allen Brain Atlas website and processed in Python by following the instructions in this Berens lab notebook. I output the data to CSV format for reading into R and assembling into a data frame with the following extra exporting code:
np.savetxt("path/to/allen-visp-alm/tasic2018-log3k.csv", logCPM, delimiter=",") np.savetxt("path/to/allen-visp-alm/tasic2018-areas.csv", tasic2018.areas, delimiter=",", fmt = "%d") np.savetxt("path/to/allen-visp-alm/tasic2018-genes.csv", tasic2018.genes[selectedGenes], delimiter=",", fmt='%s') np.savetxt("path/to/allen-visp-alm/tasic2018-clusters.csv", tasic2018.clusters, delimiter=",", fmt='%d') np.savetxt("path/to/allen-visp-alm/tasic2018-cluster-names.csv", tasic2018.clusterNames, delimiter=",", fmt='%s') np.savetxt("path/to/allen-visp-alm/tasic2018-cluster-colors.csv", tasic2018.clusterColors, delimiter=",", fmt='%s')
| | | :----------------------------:|:--------------------------: |
Again, the default UMAP settings produce clusters that are bit too well-separated
to clearly see in these images. So here are the results from using t-UMAP (the
same as setting a = 1, b = 1
), on the left, and then using a = 2, b = 2
on
the right:
| | | :----------------------------:|:--------------------------: |
Another transcriptomics data set, used as an example in openTSNE. This contains data for 44,808 cells from the mouse retina.
The raw data was fetched similarly to this shell script from the Hemberg Lab
and then the data was prepared using the
openTSNE notebook by Pavlin Policar.
Similarly to the tasic2018
dataset, data was log normalized and the 3,000 most variable genes were retained.
I exported the data (without the Z-scaling and PCA dimensionality reduction), as a CSV file, e.g.:
np.savetxt("/path/to/macosko2015/macosko2015-log3k.csv", x, delimiter=",") # Use these as column names np.savetxt("/path/to/macosko2015/macosko2015-genenames.csv", data.T.columns.values[gene_mask].astype(str), delimiter=",", fmt = "%s") np.savetxt("/path/to/macosko2015/macosko2015-clusterids.csv", cluster_ids.values.astype(int), delimiter=",", fmt = "%d")
| | | :----------------------------:|:--------------------------: |
This is another result where the UMAP defaults might need a bit of fiddling with if you don't like how separated the clusters are.
The openTSNE results use Z-scaling of the inputs before applying t-SNE, so below
are the results for UMAP and t-SNE with Z-scaling applied (for umap
, pass
scale = "Z"
as an argument so you don't have to do it manually):
| | | :----------------------------:|:--------------------------: |
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.