ClusDec -- clustering approach to solve complete deconvolution problem

Here we introduce ClustDec, de novo deconvolution method that allows identification of individual cell types in the mixture without knowing either cell types proportions or their corresponding cell-specific markers.

The basic idea of ClusDec is to find clusters of genes with linear expression profiles that then will be used as putative signatures for DSA Algorithm described in 1.

The basic workflow consists several steps: preprocessing (including clustering), evaluating accuracies of combination of clusters and then using best combination as putative signatures for DSA algorithm (link). Lets have a quick overview.

Loading library and setting random seed for reproducibility

library(clusdec)
set.seed(31)

Loading example data. First data set is datasetLiverBrainLung. It is publicly available GEO dataset GSE19830^[Expression data from pure mixed brain, liver and lung to test feasability and sensitivity of statistical deconvolution] with microarray probes mapped to genes. This dataset is widely used for any deconvolution purposes. Second data set is proportionsLiverBrainLung. It contains actual proportions of liver, brain and lung in datasetLiverBrainLung.

data("datasetLiverBrainLung")
data("proportionsLiverBrainLung")

head(datasetLiverBrainLung[, 10:15])
head(proportionsLiverBrainLung[, 10:15])

We don't take first 9 samples: these are pure samples, and we only leave mixed samples.

mixedGed <- datasetLiverBrainLung[, 10:42]
mixedProportions <- proportionsLiverBrainLung[, 10:42]

First step of ClusDec is preprocessing. It will cluster our dataset so genes in the same cluster will have linear expression profiles.

clusteredGed <- preprocessDataset(mixedGed, k=5)
head(clusteredGed[, 1:6]) # every gene associated with cluster

Next step is evaluating accuracy of every combination of this clusters as putative sigantures for DSA algorithm. We assume our mix to consist of 3 cell types. The lower LogFrobNorm value you get -- better deconvolution goes.

accuracy <- clusdecAccuracy(clusteredGed, 3) 
head(accuracy)

Now lets choose the best combination of clusters and perform deconvolution using them as putative signatures. Lets then plot results and compare them with actual proportions.

results <- chooseBest(clusteredGed, accuracy)
plotProportions(results$H, mixedProportions[c(1, 3, 2), ],
                pnames=c("Estimated", "Actual"))

Preprocessing

First stage of ClusDec algorithm is preprocessing. Preprocessing consists of 3 major steps:

  1. If needed, rows (usually microarray probes) corresponding to the same genes are collapsed, only most expressed probe is taken for further analysis. It's common technique in microarray data analysis.
  2. If needed, only highly expressed genes are taken for further analysis. The main reason to do that is noise reduction.
  3. All genes are clustered (in linear space) using Kmeans with cosine simillarity as distance.

Function preprocessDataset takes several arguments. Most important are dataset, and k which is number of centers in Kmeans.

library(clusdec)
data("datasetLiverBrainLung")

mixedGed <- datasetLiverBrainLung[, 10:42]
head(mixedGed[, 1:5])
nrow(mixedGed)

set.seed(31)
clusteredGed <- preprocessDataset(mixedGed, k=4)
head(clusteredGed[, 1:5], 10)
nrow(clusteredGed)

What should you see after preprocessing:

  1. Every row is associated with a cluster. Profiles of genes that lay in the same cluster are linear.
  2. Number of rows now is decreased to 10000. This number is deafult value of argument topGenes of function preprocessDataset.
  3. Values of expression are now in linear scale instead of logarithmic.

Evaluating accuracy

For now lets assume we fixed number of clusters $k$ as well as number of expected cell types $c$. Genes that lay in one (linear) cluster might be considered as signatures for single cell type. Linearity is necessary but not sufficient condition. We can evaluate the accuracy of deconvolution if every combination of $k$ from $c$ clusters were assumed as signature genes for DSA algorithm. Function clusdecAccuracy takes several arguments: clustered dataset, number of expected cell types and CPU cores to perform evaluation.

accuracy <- clusdecAccuracy(clusteredGed, 3, cores=2)
accuracy

Here we evaluate 2 values for every combination of our clusters.

First value is sum to one error. It describes how good deconvolution proportions fit sum to one constraint. And second value is logarithmic frobenius norm.



ctlab/ClusDec documentation built on May 14, 2019, 12:29 p.m.