Application of 'advanced' mode on 2D t-SNE coordinates

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 5,
  fig.align = 'center'
)

library(ggplot2)
library(cowplot)

tl;dr

You can indicate the general expression levels of cells using input parameter use.advanced.sampling. If you do that,singleCellHaystack takes the general expression levels into account and finds genes (DEGs) that are different from that general expression pattern.

Preparing the input data

The data used in these examples can be found here. We recommend downloading the .rda file and loading it using the load() function. You can also download the individual data files separately.

load(url("https://ndownloader.figshare.com/files/24593213"))
ls()

This data should include the following objects:

Let's have a look at this example dataset:

# this data contains 12,030 genes and 1981 cells
dim(dat.expression)

# visualizing the cells in a t-SNE plot:
ggplot(dat.tsne, aes(x = V1, y = V2)) + labs(x = "t-SNE1", y = "t-SNE2") + geom_point()

# the t-SNE coordinates are based on the 50 PCs
dim(dat.pca) # the PCA data contains 1981 cells and 50 PCs

There are several groups of cells, although the borders between them are not clear and several might consist of additional subclusters. We can use singleCellHaystack to predict DEGs without relying on grouping of the cells into arbitrary clusters.

singleCellHaystack has two required input parameters: 1) the coordinates of cells in the input space, and 2) a table showing which genes are detected in which cells. The definition of detection is left to the user. In this example we will use the median normalized read count of each gene as threshold for defining detection. Alternatively, we could also define genes with counts > 0 as being detected, for example.

median.per.gene <- apply(dat.expression,1,median) # get the median read count per gene
head(median.per.gene) # for many genes the median read count is 0
dat.detection <- dat.expression > median.per.gene # use the medians as threshold for defining detection
dat.detection[1:5,1:5] # TRUE means detected, FALSE means not detected

Now, that we have defined gene detection, let's have a look at the t-SNE plot again, colouring cells by the number of genes detected in each cell:

general.detection = apply(dat.detection, 2, sum)
ggplot(dat.tsne, aes(x = V1, y = V2, colour = general.detection)) + labs(x = "t-SNE1", y = "t-SNE2") + 
  geom_point(size=2) + scale_color_gradient(low="dodgerblue", high="maroon2") + labs(color = "Det. genes")

As can be seen in the last plot, cells in the bottom right have in general fewer detected genes. If a gene is detected predominantly in these cells and not in others, this tendency goes against the general trend in expression. Such a trend could be of particular interest.

We can use singleCellHaystack to predict DEGs without relying on grouping of the cells into arbitrary clusters, taking also into account general gene expression trends, using the option use.advanced.sampling.

Running haystack with use.advanced.sampling

First, load the package.

library(singleCellHaystack)

Next, run haystack on the t-SNE coordinates. These are 2D coordinates, so we set 'method' to '2D'. We also give the detection values as input to 'detection'. use.advanced.sampling is set to the vector of detected genes per cell, general.detection. By doing this, haystack will calculate cell distributions and perform randomizations taking into account the general detection levels. We also set a random seed to ensure replicability.

set.seed(123)
res.tsne.adv <- haystack(x = dat.tsne, detection = dat.detection, use.advanced.sampling = general.detection, method = "2D")

Let's have a look at the most significant DEGs. The gene with the strongest differential expression is Trim10. We can plot the expression and detection of this gene using the plot_gene_haystack function.

show_result_haystack(res.haystack = res.tsne.adv, n = 5)
# plotting detection of this gene
plot_gene_haystack(x = dat.tsne, gene = "Trim10", expression = dat.detection)
# plotting log expression of this gene
plot_gene_haystack(x = dat.tsne, gene = "Trim10", expression = log10(dat.expression)) 

From these plots, we can see that Trim10 is expressed almost exclusively in the bottom right group of cells. As seen above, these cells have in general fewer detected genes, so the pattern of Trim10 is especially surprizing: it is deteced in cells that express in general few genes.

This explains why the advanced mode of singleCellHaystack judges Trim10 to be the most significant DEG.

Clustering and visualization

Next, let's take the top 1000 DEGs, and cluster them by their expression pattern in the 2D t-SNE coordinates. Here we use hclust_haystack, which uses hierarchical clustering. Alternatively, we could use kmeans_haystack for k-means clustering.

# get the top 1000 DEGs in the result
res.top <- show_result_haystack(res.haystack = res.tsne.adv, n = 1000)
# cluster DEGs by their expression pattern in the 2D plot
genes.top <- row.names(res.top)
res.hc <- hclust_haystack(x = dat.tsne, genes = genes.top, detection = dat.detection)

hclust_haystack returns as result a hclust tree, which we can cut into clusters using the cutree function. Here, we arbitrarily set the number of clusters to 5.

res.hc.clusters <- cutree(res.hc, k=5)
table(res.hc.clusters)

Let's run through the 5 clusters and plot their averaged detection pattern using plot_gene_set_haystack, which is similar to plot_gene_haystack but uses a set of genes as input instead of just 1 gene.

pl <- lapply(1:5, function(cluster) {
  gene.set <- names(res.hc.clusters)[res.hc.clusters==cluster]
  plot.title <- paste0("Cluster ", cluster)
  p <- plot_gene_set_haystack(x = dat.tsne, genes = gene.set, detection = dat.detection)
  p + ggtitle(plot.title) + theme(legend.title = element_text(size = 8))
})
plot_grid(plotlist = pl, ncol = 2)
res.hc.clusters["Trim10"] # the most significant DEG is in cluster 5

The most significant DEG, Trim10, was clustered into cluster 5. Comparing its expression pattern (see above) with that of each cluster, we can indeed see that it fits most closely with that of cluster 5. Cluster 5 genes are on average detected almost exclusively in the bottom right group of cells.



Try the singleCellHaystack package in your browser

Any scripts or data that you put into this service are public.

singleCellHaystack documentation built on March 28, 2021, 9:12 a.m.