Among other things, UMAP provides two interesting extensions to its basic dimensionality reduction. First, it can do a supervised embedding, where labels (or numeric values) are leveraged so that similar points are closer together than they would otherwise be. Second, it can do metric learning, by embedding out-of-sample points based on an existing embedding.
This document shows how to do it in uwot
, but more information is
available in UMAP's
documentation.
The example dataset used is Fashion MNIST. One way to download it in uwot-ready form is:
devtools::install_github("jlmelville/snedata") fashion <- snedata::download_fashion_mnist()
The Fashion MNIST dataset contains 70,000 images of fashion items, in one of ten
classes. A factor column, Label
contains the id of each item (from 0
to 9
)
for backwards compatibility with the MNIST dataset, which Fashion MNIST is
designed to be a drop-in replacement for. A more descriptive, but entirely
equivalent, factor column, Description
provides a short text string to
describe the classes, e.g. the Description
"Coat"
and the Label
4
are
equivalent.
To produce the plots below, I used my vizier package, which can be installed using:
devtools::install_github("jlmelville/vizier")
I'll show the commands to produce the plots before they are displayed.
We'll compare the supervised result with a standard run of UMAP:
set.seed(1337) fashion_umap <- umap(fashion)
For supervised learning, provide a suitable vector of labels as the y
argument
to umap
(or tumap
):
set.seed(1337) fashion_sumap <- umap(fashion, y = fashion$Description)
Let's take a look at the results, the unsupervised embedding on the left, and the supervised version on the right:
vizier::embed_plot(fashion_umap, fashion, cex = 0.5, title = "Fashion UMAP", alpha_scale = 0.075) vizier::embed_plot(fashion_sumap, fashion, cex = 0.5, title = "Fashion Supervised UMAP", alpha_scale = 0.075)
| | | :----------------------------:|:--------------------------: |
Clearly, the supervised UMAP has done a much better job of separating out the classes, although it has also retained the relative location of the clusters pretty well, too.
It's also possible to use an existing embedding to embed new points. Fashion MNIST comes with its own suggested split into training (the first 60,000 images) and test (the remaining 10,000 images) sets, so we'll use that:
fashion_train <- head(fashion, 60000) fashion_test <- tail(fashion, 10000)
Training proceeds by running UMAP normally, but we need to return more than just
the embedded coordinates. To return enough information to embed new data, we
need to set the ret_model
flag when we run umap
. This will return a list.
The embedded coordinates can be found as the embedding
item.
For training, we shall continue to use both standard UMAP:
set.seed(1337) fashion_umap_train <- umap(fashion_train, ret_model = TRUE)
and supervised UMAP:
set.seed(1337) fashion_sumap_train <- umap(fashion_train, ret_model = TRUE, y = fashion_train$Description)
These results shouldn't be that different from the full-dataset embeddings, but let's take a look anyway:
vizier::embed_plot(fashion_umap_train$embedding, fashion_train, cex = 0.5, title = "Fashion Train UMAP", alpha_scale = 0.075) vizier::embed_plot(fashion_sumap_train$embedding, fashion_train, cex = 0.5, title = "Fashion Train Supervised UMAP", alpha_scale = 0.075)
| | | :----------------------------:|:--------------------------: |
Everything looks in order here. The standard UMAP training plot is flipped along the y-axis compared to the full dataset, but that doesn't matter.
To embed new data, use the umap_transform
function. Pass the new data and the
trained UMAP model. There's no difference between using a standard UMAP model:
set.seed(1337) fashion_umap_test <- umap_transform(fashion_test, fashion_umap_train)
or a supervised UMAP model:
set.seed(1337) fashion_sumap_test <- umap_transform(fashion_test, fashion_sumap_train)
Here are the results:
vizier::embed_plot(fashion_umap_test, fashion_test, cex = 0.5, title = "Fashion Test UMAP", alpha_scale = 0.075) vizier::embed_plot(fashion_sumap_test, fashion_test, cex = 0.5, title = "Fashion Test Supervised UMAP", alpha_scale = 0.075)
| | | :----------------------------:|:--------------------------: |
The test data results are very obviously embedded in a similar way to the training data. Of particular interest are the test results with the supervised model, where the clusters stay well separated compared to the unsupervised results, although there are some misclassifications of shirts, t-shirts, coats and pullover classes (the green, blue and red clusters on the right of the supervised UMAP plot).
To quantify this improvement, we can look at accuracy in predicting the
test set labels by using the embedded coordinates as a k-nearest neighbor
classifier. There are a variety of ways I can imagine using the information
in the model, but two obvious ones are to use the label of the nearest neighbor,
(1NN
) or take a vote using the n_neighbors
(in this case, 15) nearest
neighbors (15NN
).
For standard UMAP, the 1NN
accuracy is 71%, and the 15NN
accuracy is
77%. Using supervised UMAP, these accuracies improve to 83% and 84%,
respectively. So quantitatively, the supervised UMAP is a big help in correctly
classifying the test data.
To put these numbers in perspective, we can carry out similar calculations using
the input data directly. Here, the 1NN
accuracy is 85% and the 15NN
accuracy
is 84%. Possibly, the lack of improvement on going from 1 to 15 neighbors
indicates that a different value of the n_neighbors
parameter could improve
the embedding, but I haven't pursued that.
At any rate, it's clear that the Fashion MNIST images do not embed well in
two dimensions, although supervised UMAP gets impressively close to matching
the high dimensional results. Maybe supervised UMAP can do even better by
a suitable choice of target_weight
and n_components
on top of fiddling
with n_neighbors
.
The Fashion MNIST website contains a page that shows the accuracy using
129
scikit-learn methods,
and the 15NN
supervised UMAP accuracy puts us in the top 60, which isn't bad,
considering the only hyperparameter search I did was to look at 1NN
and
15NN
. However, although the highest accuracy reported on that page is 89.7%,
the
deep learning results achieve
90-97%.
Here's an example of using supervised UMAP with a numerical target vector. We
shall use the diamonds
dataset that comes with the
ggplot2 package, as it is of
a similar size to MNIST.
library(ggplot2) ?diamonds
There are 10 variables associated with each diamond: five numeric values
related to the geometry of the diamonds (table
, x
, y
, z
and depth
),
three factors that measure the quality of the diamond (cut
, color
and
clarity
), and the price
in dollars. The price
seems like a perfect
candidate for the sort of thing we'd want as the target vector, leaving the
other nine variables to be used for the dimensionality reduction.
uwot
's implementation of UMAP uses all numeric columns in can find in its
calculations, so to avoid including the price
in the non-supervised part of
UMAP, let's create a new data frame, initially with the geometric data:
dia <- diamonds[, c("carat", "x", "y", "z", "table")]
The depth
column is related to x
, y
and z
(albeit non-linearly) so I'm
not going to include it.
Additionally, the factors cut
, color
and clarity
are all ordinal variables,
i.e. their categories can be ordered, so we can convert these to a numeric
scale and include them as well:
dia$cut <- as.numeric(diamonds$cut) dia$color <- as.numeric(diamonds$color) dia$clarity <- as.numeric(diamonds$clarity)
We now have a dataset with 53,940 rows and 8 columns. There are 360 duplicates, but it doesn't seem to affect the results particularly.
Now, I'm not saying that this is the trickiest dataset to extract any meaning from. First, let's look at some standard unsupervised results. For starters, here's a plot of the first two principal components, using the irlba package:
dia_pca <- irlba::prcomp_irlba(dia, n = 2, scale. = TRUE) vizier::embed_plot(dia_pca$x, diamonds$price, title = "Diamonds PCA", color_scheme = "RColorBrewer::Spectral", alpha_scale = 0.1, cex = 0.5, pc_axes = TRUE)
Because the different columns have different units and meaning, I set scale. =
TRUE
to equalize their variances. The color scheme is "Spectral" palette from
ColorBrewer: red indicates a low price and blue a high price. Despite the
majority of the dataset being clumped together in the plot due to some outliers
you can't really see, the progression of prices from low to high is already
pretty well captured with two components.
Anyway, let's see what UMAP does with it. Like with PCA, the columns are all
scaled to have equal variance (scale = TRUE
):
dia_umap <- umap(dia, scale = TRUE, verbose = TRUE) vizier::embed_plot(dia_umap, diamonds$price, title = "Diamonds UMAP", color_scheme = "RColorBrewer::Spectral", alpha_scale = 0.1, cex = 0.5, pc_axes = TRUE)
Not bad. The high price diamonds are clumped together in their own little clusters in the middle of the plot. On this occasion, I prefer the layout that's initialized from the PCA results, though:
dia_umap_from_pca <- umap(dia, scale = TRUE, verbose = TRUE, init = dia_pca$x) vizier::embed_plot(dia_umap_from_pca, diamonds$price, title = "Diamonds UMAP (PCA init)", color_scheme = "RColorBrewer::Spectral", alpha_scale = 0.1, cex = 0.5, pc_axes = TRUE)
This maintains the global structure of the PCA result. Rather than have to
separately create the PCA, you can also use init = "pca"
and get the same
results (uwot
uses irlba
internally for this, so there's no loss of speed).
Onto the supervised result. Results are not particularly affected by the choice of initialization, so for simplicity we'll just use the standard spectral initialization:
dia_sumap <- umap(dia, scale = TRUE, verbose = TRUE, y = diamonds$price) vizier::embed_plot(dia_sumap, diamonds$price, title = "Diamonds Supervised UMAP", color_scheme = "RColorBrewer::Spectral", alpha_scale = 0.1, cex = 0.5, pc_axes = TRUE)
As expected, the embedding is now even more well-organized along the price of the diamonds.
There is a visible gap between the lowest price diamonds (on the right) and the
rest of the embedding. If you increase the n_epochs
parameter and allow the
optimization to proceed, this gap increases substantially, making the plot
harder to read. Adjusting the n_epochs
parameter, along with the
target_n_neighbors
and target_weight
parameters may be required to strike
the right balance. At the time of writing, I'm not aware of many examples of
supervised UMAP with a numeric vector (in fact none except this thing I just
wrote) so I cannot provide a lot of sage wisdom on this matter.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.