In sa-lee/sneezy: Interactively explore non-linear projections like t-SNE

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(sneezy)
library(SingleCellExperiment)
# to display gifs inline
library(gganimate)

Exploring non-linear embeddings with sneezy

The goal of sneezy is to allow a user to exlpore non-linear embedding algorithms such as t-SNE and UMAP in the context of global geometry of the original high dimensional space. It does this in a few ways:

generating a tour of the original (high dimensional) data set
estimating and visualising nearest neighbor graphs in the original and embedding space
providing numerical and visual diagnostics to determine the quality of the embedding

It provides a unified interface to a selection of embedding methods and nearest neighbor estimation methods.

This vignette walks through basic usage of sneezy to get you familiar with the data structures and methods available. For more advanced useage with 'realistic' analysis scenarios see the pbmc3k and pdfsense vignettes.

The multi-challenge dataset

sneezy comes preloaded with a benchmark from the multi-challenge dataset.

This data arrives as a simple data.frame that contains a key indicating the challenge id and an index giving the observation id for that particular challenge, the remaining coordinates are the dimensions.

library(SingleCellExperiment)
library(sneezy)
head(multi)

The TourExperiment object

In order to keep track of the embeddings and transformations on the numerical parts of the data, sneezy creates a data structure called a TourExperiment. It extends the SingleCellExperiment::SingleCellExperiment() object with two additional slots called basisSets and neighborSets, we will describe their use in more detail shortly.

A TourExperiment object can be created directly from a data.frame or a matrix as follows:

multi_te <- TourExperiment(multi,
                           basisSets = SimpleList(),
                           neighborSets = SimpleList(),
                           X1:X10)
multi_te

The last argument is the selection of columns that form the numeric data that we are interested in exploring the structure of, called an assay in the nonmelcature of bioinformatics. Note this is transposed so rows are variables and columns are observations.

# the x1 - x10 data
assay(multi_te, "view")[, 1:5]

Making Shapes

We will start with a simpler look of the data by selecting the keys "A" and "D". This consists of two well sepearted Gaussian clusters in 10 dimensional space (with same covariance matrix), and two 3-dimensional Gaussian clusters embeded in 10 dimensional space. One of the clusters has more granular structure and consists of three sub clusters. From the multichallenge dataset page this subset is described as follows:

The first subset consists of a Gaussian cluster and another cluster that is itself divided into three Gaussian clusters, all of them living in a three-dimensional space. This subset is used to demonstrate how an algorithm deals with different levels of granularity in cluster structures. The distance between the centers of the two main clusters, i.e. the the big cluster and the cluster that consists of the three smaller ones, is 5 times the standard deviation d of the first main cluster. The three smaller clusters are arranged around the center of the second cluster, which they themselves form, on a circle of radius 5-d equidistant from each other. The three 3 small cluster centers and the center of the large cluster lie in the same plane. The small clusters each have a standard deviation of d and one third of the number of 3 data points of the large cluster.

First we will subset the multi_te object:

multi_te2 <- multi_te[, colData(multi_te)$key %in% c("A", "D")]
colData(multi_te2)$labels <- ifelse(colData(multi_te2)$key == "A", 
                                    "sub-gaussian", 
                                    "10-d 2x cluster")
multi_te2

We can get a view of the structure using exact principal components via the embed_linear function:

multi_te2 <- embed_linear(multi_te2, 
                          num_comp = nrow(multi_te2),
                          .on = "view",
                          center = TRUE,
                          .engine = pca_exact())
multi_te2

This adds an element to the reducedDims slot called pca_exact() that stores the sample factors, and loadings for the principal components.

pcs <- reducedDim(multi_te2, "pca_exact")
pcs

The resulting principal components can be viewed via ggplot2:

library(ggplot2)
ggplot(
  data.frame(sampleFactors(pcs), labels = colData(multi_te2)$labels),
  aes(x = PC1, y = PC2)
  ) +
  geom_point(aes(colour = labels)) +
  scale_color_brewer(palette = "Dark2")

From the PCs we can clearly see the separation of the two 10-d clusters but cannot see the finer structure in the sub-gaussian clusters.

At this point we can compute exact t-SNE for a given perpelexity and exaggeration factor alpha:

set.seed(1010010)
multi_te2 <- embed_nonlinear(
  multi_te2,
  num_comp = 2,
  .on = "view", 
  .engine = tsne_exact(perplexity = 30, alpha = ncol(multi_te2) / 10)
)

tsne <- reducedDim(multi_te2, "tsne_exact")

And we can follow the same approach as before for visualising the results:

pl <- ggplot(
  data.frame(sampleFactors(tsne), labels = colData(multi_te2)$labels),
  aes(x = Dim1, y = Dim2)
) +
  geom_point(aes(colour = labels)) +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.title = element_blank(), 
        axis.ticks = element_blank(), 
        axis.text = element_blank(),
        panel.grid = element_blank())

pl

In this case it looks as though t-SNE has worked ok: it has identified the three subclusters of the second cluster for the orange points, and seperated the two ten dimensional clusters.

Next we can estimate the nearest neighbors on both the original dataset and in the embedding. This produces a matrix of indices for each row giving the nearest neighbor.

multi_te2 <- estimate_neighbors(multi_te2, 90, .on = "view")
multi_te2 <- estimate_neighbors(multi_te2, 90, .on = "tsne_exact")

dim(neighborSet(multi_te2, "tsne_exact"))

You can then view the k-nearest neighbors graph overlaid on the scatter via the overlay_neighbors() function:

pl +
  overlay_neighbors(sampleFactors(tsne)[,1], 
                    sampleFactors(tsne)[,2], 
                    neighborSet(multi_te2, "tsne_exact"),
                    alpha = 0.01)

This can give us a direct visualisation of how the difference between the emphasis t-SNE provides on local geometry by viewing the k-NN graph from the original dataset on the t-SNE layout.

pl +
  overlay_neighbors(sampleFactors(tsne)[,1], 
                    sampleFactors(tsne)[,2], 
                    neighborSet(multi_te2, "view"),
                    alpha = 0.01)

We can directly visualise the difference:

pl +
  overlay_difference_neighbors(sampleFactors(tsne)[,1], 
                    sampleFactors(tsne)[,2], 
                    neighborSet(multi_te2, "view"),
                    neighborSet(multi_te2, "tsne_exact"),
                    alpha = 0.01)

Or the intersection:

pl +
  overlay_intersect_neighbors(sampleFactors(tsne)[,1], 
                    sampleFactors(tsne)[,2], 
                    neighborSet(multi_te2, "view"),
                    neighborSet(multi_te2, "tsne_exact"),
                    alpha = 0.01)

Sometimes the graph view can be messy, especially when there's lots of neighbors. Instead we can estimate the centroids of clusters seen in the embedding by clustering the complete kNN or the shared nearest neighbor (sNN) graph. Note that this is heavily influenced by the use of

pl +
  overlay_snn_centroids(sampleFactors(tsne)[,1], 
                        sampleFactors(tsne)[,2], 
                        neighborSet(multi_te2, "tsne_exact"),
                        color = "red")

Viewing high-dimensional space

We can also tour around a data space, and see how the nearest neighbours graph from t-SNE space is preserved in high-dimensional space. We can take view the entire k-NN or s-NN graph in the embedding space. This allows us to see how t-SNE preserves local topology. First we generate a basis set on the for our tour, which will form part of our TourExperiment:

set.seed(1999)
multi_te2 <- generate_bases(multi_te2, .on = "view", max_bases = 300)
basisSetNames(multi_te2)

Then we can animate the tour view and overlay the neighbors with the following:

library(gganimate)
pal <- c("#1B9E77","#D95F02")[as.integer(factor(colData(multi_te2)$labels))]

sneezy_neighbors(multi_te2, "view", "tsne_exact", col = pal)

And look at the centroids in the original space of the nearest neighbours graph in t-SNE space:

sneezy_centroids(multi_te2, "view", "tsne_exact", col = pal)

sa-lee/sneezy documentation built on Nov. 25, 2019, 12:38 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

sa-lee/sneezy
Interactively explore non-linear projections like t-SNE

In sa-lee/sneezy: Interactively explore non-linear projections like t-SNE

Exploring non-linear embeddings with sneezy

The multi-challenge dataset

The TourExperiment object

Making Shapes

Viewing high-dimensional space

R Package Documentation

Browse R Packages

We want your feedback!

sa-lee/sneezy Interactively explore non-linear projections like t-SNE

In sa-lee/sneezy: Interactively explore non-linear projections like t-SNE

Exploring non-linear embeddings with sneezy

The multi-challenge dataset

The TourExperiment object

Making Shapes

Viewing high-dimensional space

R Package Documentation

Browse R Packages

We want your feedback!

sa-lee/sneezy
Interactively explore non-linear projections like t-SNE