Introduction

CONCLUS is a tool for robust clustering and positive marker features selection of single-cell RNA-seq (sc-RNA-seq) datasets. It is designed to identify rare cells populations by consensus clustering. Of note, CONCLUS does not cover the preprocessing steps of sequencing files obtained following next-generation sequencing. You can find a good resource to start with here.

CONCLUS is organized into the following steps:

Getting help

Issues can be submitted directly to the Bioconductor forum using the keyword 'conclus' in the post title. To contact us directly write to christophe.lancrin@embl.it or rachedi.ilyess@gmail.com. The principles of this package were originally developed by Polina Pavlovich who is now doing her Ph.D at the Max Planck Institute of Immunobiology and Epigenetics.

Important note

Due to the stochastic aspect of tSNE, everything was generated with a random seed of 42 (default parameter of generateTSNECoordinates) to ensure reproducibility of the results. Because of the evolution of biomaRt, you might get slightly different figures. However, the marker genes for each cluster should be the same.

The package is currently limited to mouse and human. Other organisms can be added on demand. Write to christophe.lancrin@embl.it or rachedi.ilyess@gmail.com.Priority will be given to model organisms.

Standard workflow

Quick start

CONCLUS requires to start with a raw-count matrix with reads or unique molecular identifiers (UMIs). The columns of the count matrix must contain cells and the rows -- genes. CONCLUS needs a large number of cells to collect statistics, we recommend using CONCLUS if you have at least 100 cells.

In the example below, a small toy example is used to illustrate the runCONCLUS method. Real data are used later in this vignette.

library(conclus)

outputDirectory <- tempdir()
experimentName <- "Test"
species <- "mouse"

countmatrixPath <- system.file("extdata/countMatrix.tsv", package="conclus")
countMatrix <- loadDataOrMatrix(file=countmatrixPath, type="countMatrix",
                                ignoreCellNumber=TRUE)

coldataPath <- system.file("extdata/colData.tsv", package="conclus") 
columnsMetaData <- loadDataOrMatrix(file=coldataPath, type="coldata",
                        columnID="cell_ID")

sceObjectCONCLUS <- runCONCLUS(outputDirectory, experimentName, countMatrix, 
                            species, columnsMetaData=columnsMetaData,
                            perplexities=c(2,3), tSNENb=1, 
                            PCs=c(4,5,6,7,8,9,10), epsilon=c(380, 390, 400),
                            minPoints=c(2,3), clusterNumber=4)

In your "outputDirectory", the sub-folder pictures contains all tSNE with dbscan coloration (sub-folder tSNE_pictures), the cell similarity matrix ( Test_cells_correlation_X_clusters.pdf), the cell heatmap (Test_clustersX_meanCenteredTRUE_orderClustersFALSE_orderGenesFALSE markersPerCluster.pdf), and the cluster similarity matrix (Test_clusters_similarity_10_clusters.pdf). You will also find in the sub-folder Results:

Further details about how all these results are generated can be found below.

Data

In this vignette, we demonstrate how to use CONCLUS on a sc-RNA-seq dataset from Shvartsman et al. The Yolk Sac (YS) and Aorta-Gonad-Mesonephros (AGM) are two major haematopoietic regions during embryonic development. Interestingly, AGM is the only one generating haematopoietic stem cells (HSCs). To identify the difference between AGM and YS, Shvartsman et al compared them using single cell RNA sequencing between 9.5 and 11.5 days of mouse embryonic development. This vignette aims at reproducing the identification of a rare population corresponding to liver like cells in the YS at day 11.5. The number of clusters used in this vignette is lower than in the original article for sake of simplification. This vignette neither provides a description of the differences between the AGM and YS. Please refer to Shvartsman et al for a complete description.

Since endothelial cells represent a small population in these tissues, cells expressing the endothelial marker VE-Cadherin (VE-Cad, also called Cdh5) were enriched and sorted, VE-Cad Negative cells constituting the microenvironment. Therefore, four cell states are defined by the cell barcodes: YS VE-Cad minus, AGM VE-Cad minus, YS VE-Cad minus, and AGM VE-Cad minus. The E11.5 data are constituted of 1482 cells. After filtering and normalization, 1303 cells were retained.

This sc-RNA-seq experiment was performed using the SMARTer ICELL8 Single-Cell System (Click here for more info). The protocol was based on 3' end RNA sequencing where each mRNA molecule is labeled with a unique molecular identifier (UMI) during reverse transcription in every single cell.

Shvartsman et al deleted highly abundant haemoglobins having names starting with ‘Hba’ or ‘Hbb’ because they seemed to be the primary source of contamination. Additionally, they excluded poorly annotated genes that did not have gene symbols to improve the clusters annotation process. Finally, the human cell controls were removed.

Retrieving data from GEO

The code below format the count matrix and the columns meta-data. The source data are downloaded from the GEO page GSE161933. The URL of the count matrix and the cells meta-data were retrieved by right-click and 'copy link adress' on the http hyperlink of the supplementary file at the bottom of the page. If the columns metadata are not available in the supplementary file section, the name of the series matrix, containing columns meta-data can be retrieved by clicking the 'Series Matrix File(s)' link just above the count matrix. The function retrieveFromGEO used with the parameter seriesMatrixName will download all series matrices present in the GEO record however only the one of interest will be kept.

library(conclus)

outputDirectory <- tempdir()
dir.create(outputDirectory, showWarnings=FALSE)
species <- "mouse"

countMatrixPath <- file.path(outputDirectory, "countmatrix.txt")
metaDataPath <- file.path(outputDirectory, "metaData.txt")
matrixURL <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM492",
"3493&format=file&file=GSM4923493%5FE11%5F5%5Frawcounts%5Fmatrix%2Etsv%2Egz")
metaDataURL <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=",
"GSM4923493&format=file&file=GSM4923493%5FMetadata%5FE11%5F5%2Etxt%2Egz")

result <- retrieveFromGEO(matrixURL, countMatrixPath, species,
metaDataPath=metaDataPath, colMetaDataURL=metaDataURL)
countMatrix <- result[[1]]
columnsMetaData <- result[[2]]
## Correct the columns names to fit conclus input requirement
## The columns should be: cellName, state, and cellBarcode
columnsMetaData <- columnsMetaData[,c(1,3,2)]
colnames(columnsMetaData) <- c("cellName", "state", "cellBarcode")

## Removing embryonic hemoglobins with names starting with "Hba" or "Hbb"
idxHba <- grep("Hba", rownames(countMatrix))
idxHbb <- grep("Hbb", rownames(countMatrix))
countMatrix <- countMatrix[-c(idxHba, idxHbb),]

## Removing control human cells
idxHumanPos <- which(columnsMetaData$state == "Pos Ctrl")
idxHumanNeg <- which(columnsMetaData$state == "Neg Ctrl")
columnsMetaData <- columnsMetaData[-c(idxHumanPos, idxHumanNeg),]
countMatrix <- countMatrix[,-c(idxHumanPos, idxHumanNeg)]

## Removing genes not having an official symbol
idxENS <- grep("ENSMUSG", rownames(countMatrix))
countMatrix <- countMatrix[-idxENS,]

Using local data

If you executed the code above, a matrix path is stored in countMatrixPath and a file path to the meta-data is stored in metaDataPath. Just indicate the paths to your files in these two variables. Please also note that to load the metadata, you need to fill in the columnID argument with the name of the column containing the names of the cells. These names must be the same as those in the count matrix.

## countMatrixPath <- ""
## metaDataPath <- ""

countMatrix <- loadDataOrMatrix(countMatrixPath, type="countMatrix")
columnsMetaData <- loadDataOrMatrix(file=metaDataPath, type="coldata",
                                    columnID="")

## Filtering steps to add here as performed above

Test clustering

The TestClustering function runs one clustering round out of the 84 (default) rounds that CONCLUS normally performs. This step can be useful to determine if the default DBSCAN parameters are suitable for your dataset. By default, they are dbscanEpsilon = c(1.3, 1.4, 1.5) and minPts = c(3,4). If the dashed horizontal line in the k-NN distance plot lays on the “knee” of the curve (as shown below), it means that optimal epsilon is equal to the intersection of the line to the y-axis. In our example, optimal epsilon is 1.4 for 5-NN distance where 5 corresponds to MinPts.

In the "test_clustering" folder under outputDirectory, the three plots below will be saved where one corresponds to the “distance_graph.pdf”, another one to "test_tSNE.pdf" (p[[1]]), and the last one will be saved as "test_clustering.pdf" (p[[2]]).

## Creation of the single-cell RNA-Seq object
scr <- singlecellRNAseq(experimentName = "E11_5",
        countMatrix     = countMatrix,
        species         = "mouse",
        outputDirectory = outputDirectory)

## Normalization of the count matrix
scr <- normaliseCountMatrix(scr, coldata=columnsMetaData)
p <- testClustering(scr, writeOutput=TRUE, silent=TRUE)

If the dashed horizontal line in the k-NN distance plot lays on the “knee”
of the curve, it means that optimal epsilon is equal to the intersection of
the line to the
y-axis.{width="344px" height="350px"}

# saved as "outputDirectory/test_clustering/test_tSNE.pdf"
p[[1]]

One of the 14 tSNE (by default) generated by
conclus.{width="369px" height="366px"}

# saved as "outputDirectory/test_clustering/test_clustering.pdf"
p[[2]]

One of the 84 dbscan clustering solutions generated by
conclus.{width="369px" height="366px"}

CONCLUS step by step

The wrapper function runCONCLUS is organized into 7 steps:

Normalization of the counts matrix

sc-RNA-seq datasets are quite challenging notably because of sparsity (many genes are not detected consistently yielding expression matrices with many zeroes) and also because of technical noise. To facilitate analysis, one needs to perform a step of normalization which allows for the correction of unwanted technical and biological noises (click here for a complete review on normalization techniques).

CONCLUS uses Scran and Scater packages for normalization. Beforehand, the function will annotate genes creating rowData and add statistics about cells into columnsMetaData. If you already have columnsMetaData and rowData, you can give it to the function (see manual). It will keep your columns and add new ones at the end. If you do not want to lose any cell after quality metrics check, select alreadyCellFiltered = TRUE, by default it is FALSE. Before scran and scater normalization, the function will call scran::quickCluster (see manual for details). If you want to skip this step, set runQuickCluster = FALSE, by default it is TRUE. We advice to first try the analysis with this option and to set it to FALSE if no rare populations are found.

scr <- normaliseCountMatrix(scr, coldata=columnsMetaData)

The method normaliseCountMatrix returns a scRNASeq object with its sceNorm slot updated. This slot contains a SingleCellExperiment object having the normalized count matrix, the colData (table with cells informations), and the rowData (table with the genes informations). See ?SingleCellExperiment for more details.

The rowdata can help to study cross-talk between cell types or find surface protein-coding marker genes suitable for flow cytometry. The columns with the GO terms are go_id and name_1006 (see manual).

The slots can be accessed as indicated below:

## Accessing slots
originalMat <- getCountMatrix(scr)
SCEobject <- getSceNorm(scr)
normMat <- SingleCellExperiment::logcounts(SCEobject)

# checking what changed after the normalisation
dim(originalMat)
dim(normMat)

# show first columns and rows of the count matrix
originalMat[1:5,1:5]

# show first columns and rows of the normalized count matrix
normMat[1:5,1:5]

# visualize first rows of metadata (coldata)
coldataSCE <- as.data.frame(SummarizedExperiment::colData(SCEobject))
head(coldataSCE)

# visualize beginning of the rowdata containing gene information
rowdataSCE <- as.data.frame(SummarizedExperiment:::rowData(SCEobject))
head(rowdataSCE)

Generation of t-SNE coordinates

runCONCLUS generates an object of fourteen (by default) tables with tSNE coordinates. Fourteen because it will vary seven values of principal components PCs=c(4, 6, 8, 10, 20, 40, 50) and two values of perplexity perplexities=c(30, 40) in all possible combinations.

The chosen values of PCs and perplexities can be changed if necessary. We found that this combination works well for sc-RNA-seq datasets with 400-2000 cells. If you have 4000-9000 cells and expect more than 15 clusters, we recommend to use more first PCs and higher perplexity, for example, PCs=c(8, 10, 20, 40, 50, 80, 100) and perplexities=c(200, 240). For details about perplexities parameter see ‘?Rtsne’.

scr <- generateTSNECoordinates(scr, cores=2)

Results can be explored as follows:

tsneList <- getTSNEList(scr)
head(getCoordinates(tsneList[[1]]))

Clustering with DB-SCAN

Following the calculation of t-SNE coordinates, DBSCAN is run with a range of epsilon and MinPoints values which will yield a total of 84 clustering solutions (PCs x perplexities x MinPoints x epsilon). minPoints is the minimum cluster size which you assume to be meaningful for your experiment and epsilon is the radius around the cell where the algorithm will try to find minPoints dots. Optimal epsilon must lay on the knee of the k-NN function as shown in the "test_clustering/distance_graph.pdf" (See Test clustering section above).

scr <- runDBSCAN(scr, cores=2)

Results can be explored as follows:

dbscanList <- getDbscanList(scr)
clusteringList <- lapply(dbscanList, getClustering)
clusteringList[[1]][, 1:10]

Cell and cluster similarity matrix calculation

The above calculated results are combined together in a matrix called “cell similarity matrix”. runDBSCAN function returns an object of class scRNASeq with its dbscanList slot updated. The list represents 84 clustering solutions (which is equal to number of PCs x perplexities x MinPoints x epsilon). Since the range of cluster varies from result to result, there is no exact match between numbers in different elements of the list. Cells having the same number within an element are guaranteed to be in one cluster. We can calculate how many times out of 84 clustering solutions, every two cells were in one cluster and that is how we come to the similarity matrix of cells. We want to underline that a zero in the dbscan results means that a cell was not assigned to any cluster. Hence, cluster numbers start from one. clusterCellsInternal is a general method that returns an object of class scRNASeq with its cellsSimilarityMatrix slot updated.

Note that the number of clusters is set to 10 (it was set to 11 in the original study, see data section). Considering that clusters can be merged afterwards (see Supervised clustering section), we advise to keep this number. One might want to increase it if the cell similarity matrix or the tSNE strongly suggest more clusters but this number is suitable for most experiments in our hands.

scr <- clusterCellsInternal(scr, clusterNumber=10)
cci <- getCellsSimilarityMatrix(scr)
cci[1:10, 1:10]

After looking at the similarity between elements on the single-cell level, which is useful if we want to understand if there is any substructure which we did not highlight with our clustering, a "bulk" level where we pool all cells from a cluster into a representative "pseudo cell" can also be generated. This gives a clusterSimilarityMatrix:

scr <- calculateClustersSimilarity(scr)
csm <- getClustersSimilarityMatrix(scr)
csm[1:10, 1:10]

Plotting

t-SNE colored by clusters or conditions

CONCLUS generated 14 tSNE combining different values of PCs and perplexities. Each tSNE can be visualized either using coloring reflecting the results of DBScan clustering, the conditions (if the metadata contain a 'state' column) or without colors. Here plotClusteredTSNE is used to generate all these possibilities of visualization.

tSNEclusters <- plotClusteredTSNE(scr, columnName="clusters",
                returnPlot=TRUE, silentPlot=TRUE)

tSNEnoColor <- plotClusteredTSNE(scr, columnName="noColor",
                returnPlot=TRUE, silentPlot=TRUE)

tSNEstate <- plotClusteredTSNE(scr, columnName="state",
                returnPlot=TRUE, silentPlot=TRUE)                

For visualizing the 5th (out of 14) tSNE cluster:

tSNEclusters[[5]]

DBscan results on the 5th
tSNE.{width="357px" height="359px"}

This tSNE suggests that the cluster 8 corresponds to a rare population of cells. We can also appreciate that clusters 1 and 2, as clusters 9 and 10 could be considered as single clusters respectively.

For visualizing the 5th (out of 14) tSNE cluster without colors:

tSNEnoColor[[5]]

The 5th tSNE solution without
coloring.{width="364px" height="365px"}

For visualizing the 5th (out of 14) tSNE cluster colored by state:

tSNEstate[[5]]

The 5th tSNE solution colored by cell
condition.{width="355px" height="362px"}

One can see that the cluster 8 contains only cells of the YS.

\newpage

Cell similarity heatmap

The cellsSimilarityMatrix is then used to generate a heatmap summarizing the results of the clustering and to show how stable the cell clusters are across the 84 solutions.

plotCellSimilarity(scr)

Cell similarity matrix showing the conservation of clustering across the 84
solutions.{width="459px" height="465px"}

CellsSimilarityMatrix is symmetrical and its size proportional to the "number of cells x number of cells". Each vertical or horizontal tiny strip is a cell. Intersection shows the proportion of clustering iterations in which a pair of cells was in one cluster (score between 0 and 1, between blue and red). We will call this combination "consensus clusters" and use them everywhere later. We can appreciate that cellsSimilarityMatrix is the first evidence showing that CONCLUS managed not only to distinguish VE-Cad plus cells from the VE-Cad minus but also find sub-populations within these groups.

Cluster similarity heatmap

plotClustersSimilarity(scr)

Cluster similarity matrix. Each cell belonging to a particular cluster were
merged into a
pseudo-cell.{width="459px" height="460px"}

In the clusterSimilarityMatrix, we can see that all clusters have a high value of similarity across all clustering solutions. Red color on the diagonal means that the group is homogenous, and usually, it is what we want to get. The yellow on the diagonal indicates that either that group consists of two or more equal sized subgroups. The orange on the diagonal suggests that groups can be merged. Bluish color points to a cluster of dbscan "outliers" that usually surrounds dense clouds of cells in t-SNE plots.

As observed previously on the tSNE, clusters 1 and 2 as 9 and 10 are very similar. Clusters 3 and 8 are very different.

Marker genes identification

To understand the nature of the consensus clusters identified by CONCLUS, it is essential to identify genes which could be classified as marker genes for each cluster. To this aim, each gene should be "associated" to a particular cluster. This association is performed by looking at up-regulated genes in a particular cluster compared to the others (multiple comparisons). The method rankGenes performs multiple comparisons of all genes from the object and rank them according to a score reflecting a FDR power.

In summary, the method conclus::rankGenes() gives a list of marker genes for each cluster, ordered by their significance. See ?rankGenes for more details.

scr <- rankGenes(scr)
markers <- getMarkerGenesList(scr, cluster=1)

The top 10 markers by cluster (default) can be selected with:

scr <- retrieveTopClustersMarkers(scr, removeDuplicates=TRUE)
topMarkers <- getTopMarkers(scr)
topMarkers

Cluster 8 contains the marker genes A2m, Cldn1, Maob, Pcbd1, Pdzk1, Krt20, Klb, Serpind1, and F2. Alpha-2-Macroglobulin (A2m) is a large (720 KDa) plasma protein found in the blood and is mainly produced by the liver; Claudins (CLDNs) are a family of integral membrane proteins, Cldn1 being expressed by epithelial liver cells; The monoamine oxidase (Maob) is found in abundance in the liver. All these markers play a role in the liver. This sub-population is hence called liver-like hereafter.

Plot a heatmap with positive marker genes

Following the execution of the retrieveTopClustersMarkers method, CONCLUS offers the option to visualize the marker genes on a heatmap. Below we chose to show the selected 10 marker genes per cluster which should generate a heatmap with 100 genes (10 marker genes x 10 clusters). This is convenient for visualization. In practice, the number of genes in this heatmap will be less than 100 because some genes were classified as markers for more than one cluster. This can happen when several clusters correspond to similar cellular types.

After selecting the top markers with the method retrieveTopClustersMarkers, the method plotCellHeatmap is used to order clusters and genes by similarity (the same order as in the clusterSimilarityMatrix) and show mean-centered normalized data. Mean-centering allows seeing the relative expression of a gene compared to the mean.

plotCellHeatmap(scr, orderClusters=TRUE, orderGenes=TRUE)

Heatmap showing the expression of the top 10 markers for each cluster. The
values are normalized according to the mean.

One can also visualize the heatmap without mean centering and appreciate the importance of the normalization of colors by row. Indeed, the different markers are much harder to identify.

plotCellHeatmap(scr, orderClusters=TRUE, orderGenes=TRUE, meanCentered=FALSE)

Same heatmap as before without normalizing by the
mean.

Alternative order of clusters is by name or by hierarchical clustering as in the default pheatmap function.

Plot t-SNE colored by expression of a selected gene

PlotGeneExpression allows visualizing the normalized expression of one gene in a t-SNE plot. It can be useful to inspect the specificity of top markers.

Cluster 8 was identified as being a liver-like rare population. Below are examples of the expression of its marker genes:

plotGeneExpression(scr, "Maob", tSNEpicture=5)

Maob Marker gene of cluster
8.{width="363px" height="362px"}

plotGeneExpression(scr, "Pcbd1", tSNEpicture=5)

Pcbd1 Marker gene of cluster
8.{width="364px" height="366px"}

plotGeneExpression(scr, "Serpinf2", tSNEpicture=5)

Serpinf2 Marker gene of cluster
8.{width="360px" height="363px"}

plotGeneExpression(scr, "Cldn1", tSNEpicture=5)

Cldn1 Marker gene of cluster
8.{width="359px" height="362px"}

Collect publicly available info about marker genes

Collect information for the top 10 markers for each cluster

retrieveGenesInfo retrieves gene information from NCBI, MGI, and UniProt. It requires the retrieveTopMarkers method to have been run on the object.

scr <- retrieveGenesInfo(scr)
result <- getGenesInfos(scr)
head(result)

result contains the following columns:

Supervised clustering

Until now, we have been using CONCLUS in an unsupervised fashion. This is a good way to start the analysis of a sc-RNA-seq dataset. However, the knowledge of the biologist remains a crucial asset to get the maximum of the data. This is why we have included in CONCLUS, additional options to do supervised analysis (or “manual” clustering) to allow the researcher to use her/his biological knowledge in the CONCLUS workflow. Going back to the example of the Shvartsman et al. dataset above (cluster similarity heatmap), one can see that some clusters clearly belong to the same family of cells after examining the clusters_similarity matrix generated by CONCLUS.

As previously mentioned, clusters 1 and 2 as 9 and 10 are very similar. In order to figure out what marker genes are defining these families of clusters, one can use manual clustering in CONCLUS to fuse clusters of similar nature: i.e. combine clusters 1 and 2 (9 and 10) together.

## Retrieving the table indicating to which cluster each cell belongs
clustCellsDf <- retrieveTableClustersCells(scr)

## Replace "2/10" by "1/9" to merge 1/2 and 9/10
clustCellsDf$clusters[which(clustCellsDf$clusters == 2)] <- 1
clustCellsDf$clusters[which(clustCellsDf$clusters == 10)] <- 9

## Modifying the object to take into account the new classification
scrUpdated <- addClustering(scr, clusToAdd=clustCellsDf)

Now we can visualize the new results taking into account the new classification:

plotCellSimilarity(scrUpdated)

Updated cells similarity matrix with merged 1/2 and
9/10.{width="459px" height="463px"}

plotCellHeatmap(scrUpdated, orderClusters=TRUE, orderGenes=TRUE)

Updated cells heatmap with merged 1/2 and
9/10.

tSNEclusters <- plotClusteredTSNE(scrUpdated, columnName="clusters",
                                    returnPlot=TRUE, silentPlot=TRUE)
tSNEclusters[[5]]

5th tSNE solution colored by dbscan result showing the merged
clusters.{width="360px" height="362px"}

The cell heatmap above shows that Col1a1 is a good marker of cluster 1 and that Cdk8 is a good marker of cluster 9 (at the bottom). One can visualize them in the t-SNE plots below.

plotGeneExpression(scrUpdated, "Col1a1", tSNEpicture=5)

Col1a1 Marker gene for cluster 1 (mix of old clusters 1 and
2){width="359px" height="364px"}

plotGeneExpression(scrUpdated, "Cdk8", tSNEpicture=5)

Cdk8 Marker gene for clusters 9 (mix of old clusters 9 and
10){width="361px" height="363px"}

Conclusion

Here we demonstrated how to use CONCLUS and combine multiple parameters testing for sc-RNA-seq analysis. It allowed us to identify a rare population in the data of Shvartsman et al and will help gaining deeper insights into others.

Session info

sessionInfo()


ilyessr/conclus documentation built on April 8, 2022, 1:43 p.m.