In Siliegia/SIGMA: A clusterability measure for scRNA-seq data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache.lazy = FALSE
)

library(SIGMA)
library(ggplot2)
library(Seurat)

The authors who have anlyzed this data already normalized the data set with the R package "scran" and determined clusters by hierachical clustering. In total they have found 22 clusters.

data("force_gr_kidney")
data("sce_kidney")

paga.coord$Group <- sce_kidney$cell.type

ggplot(paga.coord, aes(x = V1, y = V2, colour = Group)) +
  geom_point(shape = 16)

With SIGMA, we are now able to assess the variability for each cluster and see if possible sub-clusters can be found. First, we load the preprocessed SingleCellObject of the kidney data.

#Load kidney data from package

#Extract scran normalized counts and log-transform
expr.norm.log <- as.matrix(log(assay(sce_kidney, "scran")+1))

#Change the name of the rows to readable gene names
rownames(expr.norm.log) <- as.character(rowData(sce_kidney)$HUGO)
rownames(sce_kidney) <- as.character(rowData(sce_kidney)$HUGO)

In the next step we would like to exclude certain variances from appearing in the measure. For example, in this fetal kidney data set, several factors would not be of interest to cluster on: cell cycle related variances, ribosomal and mitochondrial gene expression. As, well as stress related genes, which arise during dissociation. Cycling genes, we determine here with the Seurat package, so for that we first need to create a Seurat object and normalize it. Another important factor is technical variability, for example the varying number of transcripts. It's important to also include that in the data frame.

#Creating Seurat object
cnts <- counts(sce_kidney)
colnames(cnts) <- 1:ncol(cnts)
rownames(cnts) <- as.character(rowData(sce_kidney)$HUGO)

fetalkidney <- CreateSeuratObject(cnts)
fetalkidney <- NormalizeData(fetalkidney)

#Cell cycle analysis
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes

fetalkidney <- CellCycleScoring(fetalkidney, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)

#Determining the expression of MT-genes, Rb-genes and stress genes:
data("ribosomal_genes")
data("stress_genes")

rb <- rownames(fetalkidney) %in% rb.genes 
stress.genes <- intersect(stress.genes, rownames(expr.norm.log))

#Creating the final data frame with all the factors to be excluded from considering while calculating the clusterability measure:
exclude <- data.frame(clsm = log(colSums(cnts) + 1), cellcycle = fetalkidney$G2M.Score, 
                      mt = colMeans(expr.norm.log[grep("^MT-", rownames(expr.norm.log)),]), 
                      ribosomal = colMeans(expr.norm.log[rb,]), stress = colMeans(expr.norm.log[stress.genes,]))

Now we are ready to apply the main function to determine clusterability:

#Main funcion SIGMA
out_kidney <- sigma_funct(expr.norm.log, clusters = sce_kidney$cell.type, exclude = exclude)

We can have a look at the main output of this function. For each cluster, the corresponding clusterability measure is shown.

#Evaluate the output of the measure

#plot all values for sigma
plot_sigma(out_kidney)

If you would like to go into more detail, then you can have a look at all sigmas and g-sigmas that are available per cluster.

#Plot all values for sigma and g_sigma
plot_all_sigmas(out_kidney)
plot_all_g_sigmas(out_kidney)

If you are interested in the values of all sigmas, g-sigmas and singular values of the signal matrix, then this information can be obtained with the help of this function.

#obtain the values for sigma and additional information
get_info(out_kidney, "UBCD")

Now, to determine if the clustrs with a high clusterability measure have variances that are meaningful for you to sub-cluster, have a look at the variance driving genes, which will tell you which genes cause the signal to appear. For example, if genes are only related to differentiation, then sub-clustering might not be necessary but could be of interest.

#See which genes cause variances in the data
get_var_genes(out_kidney, "UBCD")[,1:3]

You can also check out the fit of the MP distribution for each cluster.

#Check if the MP distribution fits to the data
plot_MP(out_kidney, "UBCD")

And for fruther validation, see if the singular vectors of the significant singular values look meaningful. By plotting either clusters or genes with the singular vectors.

#Plot clusters
plot_singular_vectors(out_kidney, "UBCD", colour = sce_kidney@metadata$ubcd.cluster)

#Plot variance driving genes
plot_singular_vectors(out_kidney, "UBCD", colour = "UPK1A", scaled = FALSE)