Application of 'advanced' mode on multi-dimensional coordinates

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 5,
  fig.align = 'center'
)

library(ggplot2)
library(cowplot)

tl;dr

You can indicate the general expression levels of cells using input parameter use.advanced.sampling. If you do that,singleCellHaystack takes the general expression levels into account and finds genes (DEGs) that are different from that general expression pattern.

Preparing input data

The data used in these examples can be found here. We recommend downloading the .rda file and loading it using the load() function. You can also download the individual data files separately.

load(url("https://ndownloader.figshare.com/files/24593213"))
ls()

This data should include the following objects:

Let's have a look at this example dataset:

# this data contains 12,030 genes and 1981 cells
dim(dat.expression)

# visualizing the cells in a t-SNE plot:
ggplot(dat.tsne, aes(x = V1, y = V2)) + labs(x = "t-SNE1", y = "t-SNE2") + geom_point()

# the t-SNE coordinates are based on the 50 PCs
dim(dat.pca) # the PCA data contains 1981 cells and 50 PCs

There are several groups of cells, although the borders between them are not clear and several might consist of additional subclusters. We can use singleCellHaystack to predict DEGs without relying on grouping of the cells into arbitrary clusters.

singleCellHaystack has two required input parameters: 1) the coordinates of cells in the input space, and 2) a table showing which genes are detected in which cells. The definition of detection is left to the user. In this example we will use the median normalized read count of each gene as threshold for defining detection. Alternatively, we could also define genes with counts > 0 as being detected, for example.

median.per.gene <- apply(dat.expression,1,median) # get the median read count per gene
head(median.per.gene) # for many genes the median read count is 0
dat.detection <- dat.expression > median.per.gene # use the medians as threshold for defining detection
dat.detection[1:5,1:5] # TRUE means detected, FALSE means not detected

Now, that we have defined gene detection, let's have a look at the t-SNE plot again, colouring cells by the number of genes detected in each cell:

general.detection = apply(dat.detection, 2, sum)
ggplot(dat.tsne, aes(x = V1, y = V2, colour = general.detection)) + labs(x = "t-SNE1", y = "t-SNE2") + 
  geom_point(size=2) + scale_color_gradient(low="dodgerblue", high="maroon2") + labs(color = "Det. genes")

As can be seen in the last plot, cells in the bottom right have in general fewer detected genes. If a gene is detected predominantly in these cells and not in others, this tendency goes against the general trend in expression. Such a trend could be of particular interest.

We can use singleCellHaystack to predict DEGs without relying on grouping of the cells into arbitrary clusters, taking also into account general gene expression trends, using the option use.advanced.sampling.

Running haystack with use.advanced.sampling

First, load the package.

library(singleCellHaystack)

Next, run haystack on the first 50 principal components. Since the space is 50-dimensional, we set 'method' to 'highD'. We also give the detection values as input to 'detection'. This example dataset is relatively small, containing 1,981 cells, so running 'haystack' should take just 1 to 3 minutes to finish. We also give the detection values as input to 'detection'. use.advanced.sampling is set to the vector of detected genes per cell, general.detection. By doing this, haystack will calculate cell distributions and perform randomizations taking into account the general detection levels. We also set a random seed to ensure replicability.

set.seed(123)
res.pc50.adv <- haystack(x = dat.pca[,1:50], detection = dat.detection, method = "highD", use.advanced.sampling = general.detection)

Let's have a look at the most significant DEGs. The gene with the strongest differential expression is Tspo2. We can plot the expression and detection of this gene using the plot_gene_haystack function.

show_result_haystack(res.haystack = res.pc50.adv, n = 5)
# plotting detection of this gene
plot_gene_haystack(x = dat.tsne, gene = "Tspo2", expression = dat.detection)
# plotting log expression of this gene
plot_gene_haystack(x = dat.tsne, gene = "Tspo2", expression = log10(dat.expression)) 

From the plots we can see that Tspo2 is detected in the lower right groups of cells. As seen above, these cells have in general fewer detected genes, so the pattern of Tspo2 is especially surprizing: it is deteced in cells that express in general few genes.

This explains why the advanced mode of singleCellHaystack judges Tspo2 to be the most significant DEG.

The expression distribution of Tspo2 is similar to that of Trim10, the top DEG predicted in the default mode (see here). Indeed Trim10 ranked among the top DEGs here too (it ranks 7th). On the other hand, other top DEGs of the "default" mode are detected in cells that have in general many detected genes. Hence, they are not among the top DEGs in the "advanced" mode.

Clustering and visualization

Next, let's take the top 1000 DEGs, and cluster them by their expression pattern in the input space (first 50 PCs). Here we use hclust_haystack, which uses hierarchical clustering. Alternatively, we could use kmeans_haystack for k-means clustering.

# get the top 1000 DEGs in the result
res.top <- show_result_haystack(res.haystack = res.pc50.adv, n = 1000)
# cluster DEGs by their expression pattern in the 2D plot
genes.top <- row.names(res.top)
res.hc <- hclust_haystack(x = dat.tsne, genes = genes.top, detection = dat.detection)

hclust_haystack returns as result a hclust tree, which we can cut into clusters using the cutree function. Here, we arbitrarily set the number of clusters to 5.

res.hc.clusters <- cutree(res.hc, k=5)
table(res.hc.clusters)

Let's run through the 5 clusters and plot their averaged detection pattern using plot_gene_set_haystack, which is similar to plot_gene_haystack but uses a set of genes as input instead of just 1 gene.

pl <- lapply(1:5, function(cluster) {
  gene.set <- names(res.hc.clusters)[res.hc.clusters==cluster]
  plot.title <- paste0("Cluster ", cluster)
  p <- plot_gene_set_haystack(x = dat.tsne, genes = gene.set, detection = dat.detection)
  p + ggtitle(plot.title) + theme(legend.title = element_text(size = 8))
})
plot_grid(plotlist = pl, ncol = 2)
res.hc.clusters["Tspo2"] # the top DEG is in cluster 5

The most significant DEG, Tspo2, was clustered into cluster 5. Comparing its expression pattern (see above) with that of each cluster, we can indeed see that it fits most closely with that of cluster 5. Cluster 5 genes are on average detected almost exclusively in the bottom right group of cells.



Try the singleCellHaystack package in your browser

Any scripts or data that you put into this service are public.

singleCellHaystack documentation built on March 28, 2021, 9:12 a.m.