knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(sneezy) library(SingleCellExperiment) # to display gifs inline library(gganimate)
The goal of sneezy is to allow a user to exlpore non-linear embedding algorithms such as t-SNE and UMAP in the context of global geometry of the original high dimensional space. It does this in a few ways:
It provides a unified interface to a selection of embedding methods and nearest neighbor estimation methods.
This vignette walks through basic usage of sneezy to get you familiar with the data structures and methods available. For more advanced useage with 'realistic' analysis scenarios see the pbmc3k and pdfsense vignettes.
sneezy comes preloaded with a benchmark from the multi-challenge dataset.
This data arrives as a simple data.frame
that contains a key indicating
the challenge id and an index giving the observation id for that particular
challenge, the remaining coordinates are the dimensions.
library(SingleCellExperiment) library(sneezy) head(multi)
In order to keep track of the embeddings and transformations on the
numerical parts of the data, sneezy creates a data structure called a
TourExperiment. It extends the SingleCellExperiment::SingleCellExperiment()
object with two additional slots called basisSets
and neighborSets
,
we will describe their use in more detail shortly.
A TourExperiment object can be created directly from a data.frame or a matrix as follows:
multi_te <- TourExperiment(multi, basisSets = SimpleList(), neighborSets = SimpleList(), X1:X10) multi_te
The last argument is the selection of columns that form the numeric data that we are interested in exploring the structure of, called an assay in the nonmelcature of bioinformatics. Note this is transposed so rows are variables and columns are observations.
# the x1 - x10 data assay(multi_te, "view")[, 1:5]
We will start with a simpler look of the data by selecting the keys "A" and "D". This consists of two well sepearted Gaussian clusters in 10 dimensional space (with same covariance matrix), and two 3-dimensional Gaussian clusters embeded in 10 dimensional space. One of the clusters has more granular structure and consists of three sub clusters. From the multichallenge dataset page this subset is described as follows:
The first subset consists of a Gaussian cluster and another cluster that is itself divided into three Gaussian clusters, all of them living in a three-dimensional space. This subset is used to demonstrate how an algorithm deals with different levels of granularity in cluster structures. The distance between the centers of the two main clusters, i.e. the the big cluster and the cluster that consists of the three smaller ones, is 5 times the standard deviation d of the first main cluster. The three smaller clusters are arranged around the center of the second cluster, which they themselves form, on a circle of radius 5-d equidistant from each other. The three 3 small cluster centers and the center of the large cluster lie in the same plane. The small clusters each have a standard deviation of d and one third of the number of 3 data points of the large cluster.
First we will subset the multi_te
object:
multi_te2 <- multi_te[, colData(multi_te)$key %in% c("A", "D")] colData(multi_te2)$labels <- ifelse(colData(multi_te2)$key == "A", "sub-gaussian", "10-d 2x cluster") multi_te2
We can get a view of the structure using exact principal components via
the embed_linear
function:
multi_te2 <- embed_linear(multi_te2, num_comp = nrow(multi_te2), .on = "view", center = TRUE, .engine = pca_exact()) multi_te2
This adds an element to the reducedDims slot called pca_exact()
that
stores the sample factors, and loadings for the principal components.
pcs <- reducedDim(multi_te2, "pca_exact") pcs
The resulting principal components can be viewed via ggplot2:
library(ggplot2) ggplot( data.frame(sampleFactors(pcs), labels = colData(multi_te2)$labels), aes(x = PC1, y = PC2) ) + geom_point(aes(colour = labels)) + scale_color_brewer(palette = "Dark2")
From the PCs we can clearly see the separation of the two 10-d clusters but cannot see the finer structure in the sub-gaussian clusters.
At this point we can compute exact t-SNE for a given perpelexity and exaggeration factor alpha:
set.seed(1010010) multi_te2 <- embed_nonlinear( multi_te2, num_comp = 2, .on = "view", .engine = tsne_exact(perplexity = 30, alpha = ncol(multi_te2) / 10) ) tsne <- reducedDim(multi_te2, "tsne_exact")
And we can follow the same approach as before for visualising the results:
pl <- ggplot( data.frame(sampleFactors(tsne), labels = colData(multi_te2)$labels), aes(x = Dim1, y = Dim2) ) + geom_point(aes(colour = labels)) + scale_color_brewer(palette = "Dark2") + theme(axis.title = element_blank(), axis.ticks = element_blank(), axis.text = element_blank(), panel.grid = element_blank()) pl
In this case it looks as though t-SNE has worked ok: it has identified the three subclusters of the second cluster for the orange points, and seperated the two ten dimensional clusters.
Next we can estimate the nearest neighbors on both the original dataset and in the embedding. This produces a matrix of indices for each row giving the nearest neighbor.
multi_te2 <- estimate_neighbors(multi_te2, 90, .on = "view") multi_te2 <- estimate_neighbors(multi_te2, 90, .on = "tsne_exact") dim(neighborSet(multi_te2, "tsne_exact"))
You can then view the k-nearest neighbors graph overlaid on the
scatter via the overlay_neighbors()
function:
pl + overlay_neighbors(sampleFactors(tsne)[,1], sampleFactors(tsne)[,2], neighborSet(multi_te2, "tsne_exact"), alpha = 0.01)
This can give us a direct visualisation of how the difference between the emphasis t-SNE provides on local geometry by viewing the k-NN graph from the original dataset on the t-SNE layout.
pl + overlay_neighbors(sampleFactors(tsne)[,1], sampleFactors(tsne)[,2], neighborSet(multi_te2, "view"), alpha = 0.01)
We can directly visualise the difference:
pl + overlay_difference_neighbors(sampleFactors(tsne)[,1], sampleFactors(tsne)[,2], neighborSet(multi_te2, "view"), neighborSet(multi_te2, "tsne_exact"), alpha = 0.01)
Or the intersection:
pl + overlay_intersect_neighbors(sampleFactors(tsne)[,1], sampleFactors(tsne)[,2], neighborSet(multi_te2, "view"), neighborSet(multi_te2, "tsne_exact"), alpha = 0.01)
Sometimes the graph view can be messy, especially when there's lots of neighbors. Instead we can estimate the centroids of clusters seen in the embedding by clustering the complete kNN or the shared nearest neighbor (sNN) graph. Note that this is heavily influenced by the use of
pl + overlay_snn_centroids(sampleFactors(tsne)[,1], sampleFactors(tsne)[,2], neighborSet(multi_te2, "tsne_exact"), color = "red")
We can also tour around a data space, and see how the nearest neighbours
graph from t-SNE space is preserved in high-dimensional space. We can take view the entire k-NN or s-NN graph in the embedding space.
This allows us to see how t-SNE preserves local topology. First
we generate a basis set on the for our tour, which will form part of
our TourExperiment
:
set.seed(1999) multi_te2 <- generate_bases(multi_te2, .on = "view", max_bases = 300) basisSetNames(multi_te2)
Then we can animate the tour view and overlay the neighbors with the following:
library(gganimate) pal <- c("#1B9E77","#D95F02")[as.integer(factor(colData(multi_te2)$labels))] sneezy_neighbors(multi_te2, "view", "tsne_exact", col = pal)
And look at the centroids in the original space of the nearest neighbours graph in t-SNE space:
sneezy_centroids(multi_te2, "view", "tsne_exact", col = pal)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.