## block with some startup/background objects functions library(umap) plot.iris = function(x, labels, main="A UMAP visualization of the Iris dataset", colors=c("#ff7f00", "#e377c2", "#17becf"), pad=0.1, cex=0.65, pch=19, add=FALSE, legend.suffix="", cex.main=1, cex.legend=1) { layout = x if (class(x)=="umap") { layout = x$layout } xylim = range(layout) xylim = xylim + ((xylim[2]-xylim[1])*pad)*c(-0.5, 0.5) if (!add) { par(mar=c(0.2,0.7,1.2,0.7), ps=10) plot(xylim, xylim, type="n", axes=F, frame=F) rect(xylim[1], xylim[1], xylim[2], xylim[2], border="#aaaaaa", lwd=0.25) } points(layout[,1], layout[,2], col=colors[as.integer(labels)], cex=cex, pch=pch) mtext(side=3, main, cex=cex.main) labels.u = unique(labels) legend.pos = "topright" legend.text = as.character(labels.u) if (add) { legend.pos = "bottomright" legend.text = paste(as.character(labels.u), legend.suffix) } legend(legend.pos, legend=legend.text, col=colors[as.integer(labels.u)], bty="n", pch=pch, cex=cex.legend) } set.seed(123456)
Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimensional reduction proposed by McInnes and Healy.
This vignette demonstrates how to use the umap
R package to perform dimensional reduction and data trasnformations with the UMAP method. The vignette uses a small dataset as an example, but the package is suited to process larger data with many thousands of data points.
For a practical demonstration, let's use the Iris dataset, iris
.
head(iris, 3)
The first four columns contain data and the last column contains a label. It will be useful to separate those components.
iris.data = iris[, grep("Sepal|Petal", colnames(iris))] iris.labels = iris[, "Species"]
Let's now load the umap
package and apply the UMAP transformation.
library(umap) iris.umap = umap(iris.data)
The output, iris.umap
, is an object of class umap
. We can get a minimal summary of its contents by just printing it.
iris.umap
The main component of the object is layout
, which holds a matrix with coordinates.
head(iris.umap$layout, 3)
These coordinates can be used to visualize the dataset. (The custom plot function, plot.iris
, is available at the end of this vignette.)
plot.iris(iris.umap, iris.labels)
Once we have a umap
object describing a projection of one dataset into a low-dimensional layout, it becomes possible to project other data onto the same manifold.
To perform the projection, we need a second dataset with the same data features as the training data. For this demonstration, let's create such data by adding some noise to the original data.
iris.wnoise = iris.data + matrix(rnorm(150*40, 0, 0.1), ncol=4) colnames(iris.wnoise) = colnames(iris.data) head(iris.wnoise, 3)
We can now request to arrange these perturbed observations onto the same layout as before. Following R's design pattern for fitted models, this is performed via predict
.
iris.wnoise.umap = predict(iris.umap, iris.wnoise) head(iris.wnoise.umap, 3)
The output here is a matrix with coordinates. We can now visualize this new data alongside the original.
plot.iris(iris.umap, iris.labels) plot.iris(iris.wnoise.umap, iris.labels, add=T, pch=4, legend.suffix=" (with noise)")
Note the new observations lie close to their original counterparts.
The example above uses function umap
with a single argument - the input dataset - so the embedding is performed with default settings. However, the algorithm can be tuned in several ways. There are two strategies for tuning: via configuration objects and via additional arguments.
The default configuration object is called umap.defaults
. This is a list encoding default values for all the parameters used within the algorithm.
umap.defaults
umap.defaults
This object is a list with key-value pairs shown. To obtain some minimal information about each field, see the documentation in help(umap.defaults)
, or see the original publication.
To create a custom configuration, create a copy of the defaults and then update any of the fields. For example, let's change the seed for random number generation.
custom.config = umap.defaults custom.config$random_state = 123
We can observe the changed settings by inspecting the object again (try it). To perform the UMAP projection with these settings, we can run the projection again and pass the configuration object as a second argument.
iris.umap.2 = umap(iris.data, custom.config) plot.iris(iris.umap.2, iris.labels, main="Another UMAP visualization (different seed)")
The result is slightly different than before due to a new instantiation of the random number generator.
Another way to customize the algorithm is to specify the non-default parameters explicitly. To achieve equivalent results to the above, we can thus use
iris.umap.3 = umap(iris.data, random_state=123)
The coordinates in this new output object should match the ones from iris.umap.2
(check it!)
The package provides two implementations of the umap method, one written in R and one accessed via an external python module.
The implementation written in R is the default. This implementation follows the design principles of the UMAP algorithm and its running time scales better-than-quadratically with the number of items (points) in a dataset. It is thus in principle suitable for use on datasets with thousands of points. It is the default because it should be functional without extensive dependencies.
The second available implementation is a wrapper for a python package umap-learn
. To enable this implementation, specify the argument method
.
iris.umap.4 = umap(iris.data, method="umap-learn")
This command has several dependencies. You must have the reticulate
package installed and loaded (use install.packages("reticulate")
and library(reticulate)
). Furthermore, you must have the umap-learn
python package installed (see the package repo for instructions). If either of these components is not available, the above command will display an error message.
A separate vignette explains additional aspects of interfacing with umap-learn
, including handling of discrepancies between releases.
Note that it will not be possible to produce exactly the same output from the two implementations due to inequivalent random number generators in R and python, and due to slight discrepancies in the implementations.
The custom plot function used to visualize the Iris dataset:
plot.iris
Summary of R session:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.