# Application on multi-dimensional coordinates In singleCellHaystack: Finding Needles (=differentially Expressed Genes) in Haystacks (=single Cell Data)

```knitr::opts_chunk\$set(
collapse = TRUE,
comment = "#>",
fig.width = 6,
fig.height = 5,
fig.align = 'center'
)

library(ggplot2)
library(cowplot)
```

In this article, we show the application of `singleCellHaystack` on high-dimensional coordinates. Our method takes the coordinates of the cells in a >2-dimensional space (e.g. the first few principal components, etc), and predicts differentially expressed genes (DEGs) in that space. It does so by looking at the "expression distribution" of a gene (i.e. the distribution in the input space of cells that express a gene).

# Preparing input data

The data used in these examples can be found here. We recommend downloading the .rda file and loading it using the `load()` function. You can also download the individual data files separately.

```load(url("https://ndownloader.figshare.com/files/24593213"))
ls()
```

This data should include the following objects:

• `dat.expression`: a matrix object with the expression of genes (rows) in each cell (columns).
• `dat.pca`: the output of PCA. This data.frame contains the first 50 pricipal components (PCs).
• `dat.tsne`: a data.frame with t-SNE coordinates (2D) based on the first 50 PCs.
• `dat.umap`: a data.frame with UMAP coordinates (2D) based on the first 50 PCs.

Let's have a look at this example dataset:

```# this data contains 12,030 genes and 1981 cells
dim(dat.expression)

# visualizing the cells in a t-SNE plot:
ggplot(dat.tsne, aes(x = V1, y = V2)) + labs(x = "t-SNE1", y = "t-SNE2") + geom_point()

# the t-SNE coordinates are based on the 50 PCs
dim(dat.pca) # the PCA data contains 1981 cells and 50 PCs
```

There are several groups of cells, although the borders between them are not clear and several might consist of additional subclusters. We can use `singleCellHaystack` to predict DEGs without relying on grouping of the cells into arbitrary clusters.

`singleCellHaystack` has two required input parameters: 1) the coordinates of cells in the input space, and 2) a table showing which genes are detected in which cells. The definition of detection is left to the user. In this example we will use the median normalized read count of each gene as threshold for defining detection. Alternatively, we could also define genes with counts > 0 as being detected, for example.

```median.per.gene <- apply(dat.expression,1,median) # get the median read count per gene
dat.detection <- dat.expression > median.per.gene # use the medians as threshold for defining detection
dat.detection[1:5,1:5] # TRUE means detected, FALSE means not detected
```

# Running `haystack` on multi-dimensional coordinates

```library(singleCellHaystack)
```

Next, run `haystack` on the first 50 principal components. Since the space is 50-dimensional, we set 'method' to 'highD'. We also give the detection values as input to 'detection'. This example dataset is relatively small, containing 1,981 cells, so running 'haystack' should take just 1 to 3 minutes to finish. We also set a random seed to ensure replicability.

```set.seed(123)
res.pc50 <- haystack(x = dat.pca[,1:50], detection = dat.detection, method = "highD")
```

Let's have a look at the most significant DEGs. The gene with the strongest differential expression is Trim10. We can plot the expression and detection of this gene using the `plot_gene_haystack` function. From the t-SNE plots we can see that Trim10 is detected in the bottom right group of cells. In contrast, the expression of Eef1a1 is higher in cells of the top-right and bottom-left groups of cells.

```show_result_haystack(res.haystack = res.pc50, n = 5)

# plotting detection of Trim10
plot_gene_haystack(x = dat.tsne, gene = "Trim10", expression = dat.detection)
# plotting log expression of Trim10
plot_gene_haystack(x = dat.tsne, gene = "Trim10", expression = log10(dat.expression))

# plotting detection of Eef1a1
plot_gene_haystack(x = dat.tsne, gene = "Eef1a1", expression = dat.detection)
# plotting log expression of Eef1a1
plot_gene_haystack(x = dat.tsne, gene = "Eef1a1", expression = log10(dat.expression))
```

# Clustering and visualization

Next, let's take the top 1000 DEGs, and cluster them by their expression pattern in the input space (first 50 PCs). Here we use `hclust_haystack`, which uses hierarchical clustering. Alternatively, we could use `kmeans_haystack` for k-means clustering.

```# get the top 1000 DEGs in the result
res.top <- show_result_haystack(res.haystack = res.pc50, n = 1000)
# cluster DEGs by their expression pattern in the 2D plot
genes.top <- row.names(res.top)
res.hc <- hclust_haystack(x = dat.tsne, genes = genes.top, detection = dat.detection)
```

`hclust_haystack` returns as result a `hclust` tree, which we can cut into clusters using the `cutree` function. Here, we arbitrarily set the number of clusters to 5.

```res.hc.clusters <- cutree(res.hc, k=5)
table(res.hc.clusters)
```

While some clusters contain hunderds of genes, others are quite small.

Let's run through the 5 clusters and plot their averaged detection pattern using `plot_gene_set_haystack`, which is similar to `plot_gene_haystack` but uses a set of genes as input instead of just 1 gene.

```pl <- lapply(1:5, function(cluster) {
gene.set <- names(res.hc.clusters)[res.hc.clusters==cluster]
plot.title <- paste0("Cluster ", cluster)
p <- plot_gene_set_haystack(x = dat.tsne, genes = gene.set, detection = dat.detection)
p + ggtitle(plot.title) + theme(legend.title = element_text(size = 8))
})
plot_grid(plotlist = pl, ncol = 2)
```
```res.hc.clusters["Trim10"] # the top DEG is in cluster 5
```

The most significant DEG, Trim10, was clustered into cluster 5. Comparing its expression pattern (see above) with that of each cluster, we can indeed see that it fits most closely with that cluster.

## Try the singleCellHaystack package in your browser

Any scripts or data that you put into this service are public.

singleCellHaystack documentation built on March 28, 2021, 9:12 a.m.