knitr::opts_chunk$set(
    collapse=TRUE,
    comment="#>"
)

Introduction

Complex tissues are dynamic eco-systems consisting of molecularly distinct yet interacting cell types. Computational deconvolution aims to dissect bulk tissue data into cell type compositions and cell-specific expressions. With few exceptions, most existing deconvolution tools exploit supervised approaches requiring various types of references that may be unreliable or even unavailable for specific tissue microenvironments. We have developed an efficient and fully unsupervised deconvolution tool – Convex Analysis of Mixtures (CAM3.0) – specifically designed for studying biologically diverse conditions. We improve and extend our previous CAM framework [@CAM2016] to introduce three new and effective algorithms, namely, radius-fixed clustering to identify reliable markers, linear programming to detect an initial scatter simplex, and a smart floating search for the optimum latent variable model.

CAM3.0 is an R package for fully unsupervised deconvolution of complex tissues. It provides basic functions to perform unsupervised deconvolution on mixture expression profiles by Convex Analysis of Mixtures (CAM) and some auxiliary functions to help understand the cell type-specific results. It also implements functions to perform supervised deconvolution based on prior knowledge of molecular markers, S matrix or A matrix. Combining molecular markers from CAM and from prior knowledge can achieve semi-supervised deconvolution of mixtures.

You can install the latest version of CAM3.0 from GitHub by

devtools::install_github("ChiungTingWu/CAM3")

or from Bioconductor by

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("CAM3")

Quick Start

Similar to the previous versions of CAM [@debCAM], the function CAM3() includes all necessary steps to decompose a matrix of mixture expression profiles. Optional steps include upstream procedures like norm-based filtering, dimension deduction, perspective projection, local outlier removal and aggregation of gene expression vectors by clustering. Each step in CAM3() can also be performed separately if you prefer a more flexible workflow. More details will be introduced in the section 'CAM3.0 Workflow'.

Starting your analysis by CAM3() where you need to specify the percentage of low/high-expressed molecules to be removed, in order to reduce the effect of noise and also to speed up the computation. dim.rdc refers to the reduced data dimension, which should also be specified and be no less than the maximum number of cell type K:

rCAM3 <- CAM3Run(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95)

Or, if the range of possible cell type is given instead of an exact number:

rCAM3 <- CAM3Run(data, K = 2:5, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95)

CAM3.0 Workflow

Here we use a data set down-sampled from GSE19830 as an example to illustrate how CAM3.0 works, as an integrated function or step by step.

library(debCAM)
library(CAM3)

data(ratMix3) 
#ratMix3$X: X matrix containing mixture expression profiles to be analyzed
#ratMix3$A: ground truth A matrix containing proportions
#ratMix3$S: ground truth S matrix containing subpopulation-specific expression profiles

data <- ratMix3$X
#10000 genes * 21 tissues
#meet the input data requirements

Format of Input

The input should be a matrix of mixture expression profiles. Data frame, SummarizedExperiment or ExpressionSet object will be internally coerced into a matrix. Each row is a gene and each column is a sample. Data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally.

CAM3.0 as an Integrated Function

The quickest way of applying CAM3.0 is to run CAM3Run() directly:

rCAM3 <- CAM3Run(data, K = 2:5, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95)

This function performs a fully unsupervised computational deconvolution to identify marker features that define each of the multiple cell types, and estimate the proportions of these cell types in the mixture tissues as well as their respective expression profiles.

For Post-CAM Analysis please go to the next section.

CAM3.0 Step by Step

Pre-processing

The integrated pre-processing function perform preprocessing for CAM, including norm-based filtering, dimension deduction, perspective projection, local outlier removal and aggregation of gene expression vectors by clustering.

# Setting hyper-parameters
K = 2:5
dim.rdc = 10
thres.low = 0.30
thres.high = 0.95
radius.thres = 0.95
sim.thres = 0.95 
cluster.num = 50 
MG.num.thres = 20 
sample.weight = NULL

# Commencing pre-processing
PrepResult <- CAM3Prep(data, dim.rdc, thres.low, thres.high,
                       cluster.method = c('Fixed-Radius' ,'K-Means'),
                       radius.thres, sim.thres, cluster.num,
                       MG.num.thres, sample.weight)

Low/high-expressed genes are filtered by their L2-norm ranks, which is controlled by thres.low and thres.high. Dimension reduction is slightly different from PCA. The first loading vector is forced to be c(1,1,...,1) with unit norm normalization. The remaining are eigenvectors from PCA in the space orthogonal to the first vector. Perspective projection is to project dimension-reduced gene expression vectors to the hyperplane orthogonal to c(1,0,...,0), i.e., the first axis in the new coordinate system. This function is controlled by dim.rdc, referring to the dimension of the reduced data. Finally, gene expression vectors are aggregated by clustering to further reduce the impact of noise/outlier and help improve the efficiency of simplex corner detection. Default clustering method, namely 'Fixed-Radius', is based on cosine similarity, which is turned by a set of hyper-parameters: radius.thres controlling the cosine radius of "Fixed-Radius" clustering, sim.thres controlling the cosine similarity threshold of cluster centers; for clusters with cosine similarity higher than the threshold, they would be merged until the number of clusters equals to cluster.num, which controls the lower bound of cluster number.

Marker Gene Selection

This function finds corner clusters as Marker Gene (MG) clusters from the pre-processed data. fast.mode Use fast mode of greedy search or not. The normal mode may give more accurate results, but computation time is much longer.

# Setting hyper-parameters 
fast.mode = TRUE

# Commencing marker gene selection
MGResult <- CAM3MGCluster(PrepResult, fast.mode)
MGResult <- MGResult[K]
names(MGResult) <- as.character(K)

Internally it provides two solutions: the forward and backward greedy search. Then the best one of certain source number is selected as the output based on reconstruction errors of all data points in original space.

A and S Matrix Estimation

This function estimates A and S matrix based on marker gene clusters detected previously, guided by Sequential Forward Floating Search (SFFS), Sequential Backward Floating Search (SBFS) and the Minimum Description Length (MDL) criterion.

# Setting hyper-parameters     
generalNMF = FALSE    

# Commencing A and S estimation
ASestResultF <- lapply(MGResult, CAM3ASest, PrepResult, data,
                       1, generalNMF)
ASestResultB <- lapply(MGResult, CAM3ASest, PrepResult, data,
                       2, generalNMF)

ASestResult <- vector("list", length(K))
for (k in seq_along(K)){
     if (ASestResultF[[k]]@mdl < ASestResultB[[k]]@mdl)
         ASestResult[[k]] <- ASestResultF[[k]]
     else
         ASestResult[[k]] <- ASestResultB[[k]]
}
names(ASestResult) <- as.character(K)

Similar to the previous step, two sets of A and S estimation are drawn from two corner clusters detection strategies: (1) minimum sum of margin-of-errors and (2) minimum sum of reconstruction errors, and the final output of A and S matrix shall be the one with lower MDL value. generalNMF, if set to be TRUE, the decomposed proportion matrix has no sum-to-one constraint for each row. Without this constraint, the scale ambiguity of each column vector in proportion matrix will not be removed.

The MDL value is based on original data with A matrix estimated by transforming dimension-reduced A matrix back to original space. The MDL value is the sum of two terms: code length of data under the model and code length of model. Both MDL value and the first term (code length of data) will be returned.

The final output consists of three parts:

   rCAM3 <- new("CAMObj",PrepResult=PrepResult, MGResult=MGResult,
                ASestResult=ASestResult)

Post-CAM Analysis

Model Selection: Determining the Number of Cell Types

As a fully unsupervised deconvolution method, CAM3.0 automatically determines the most likely number of cell-types in a bulk tissue data, and this is visualized through an MDL plot which links each number of cell-types with an MDL value.

    K = 2:5
    plot(K, MDL(rCAM3, 1)@mdls, ylab='MDL Value')

Looking for Cell Type-Specific Genes

Suppose we know that the optimal cell-type number is 3, as suggested by the previous step:

    S<-rCAM3@ASestResult$'3'@Sest
    MGlist <- cotMG(data,rCAM3@ASestResult$`3`@Sest, cos.thres = 0.99)

Thus we successfully draw a list of marker genes from the input mixed matrix where these genes are selected based on their cosine values with the exact mathematical definition of ideal marker genes.

Scatter Simplex Plot of Mixture

For CAM, the scatter simplex of mixed expressions is a rotated and compressed version of the scatter simplex of pure expressions, where the marker genes are located at each vertex. Simplex plot allows users to see how data points are located along the axis formed by the cell types, and more importantly, the markers gathered near the vertices.

    simplexplot(data, rCAM3@ASestResult$`3`@Aest, MGlist = MGlist$mg.list,
                 col = "grey", mg.col = c("red","orange","green"))

Case Study

In this section we will demonstrate one application of CAM3.0 on the real-world data, GSE 64385.

GSE 64385

As a benchmark tool for deconvolution methods, this data contains five immune cell types and one cancer cell line (six sources). In the dataset, there are 12 samples with different proportions of cell types and 54,675 RNAs.

library(GEOquery)

gsm <- getGEO("GSE64385")
data <- 2^exprs(gsm[[1]])

rCAM<-CAM3Run(data=data, K= 2:8, dim.rdc = 10, thres.low = 0.30, 
              thres.high = 0.95)

Again, we use the MDL plot to approach the estimated number of cell types:

K = 2:8
plot(K, MDL(rCAM, 1)@mdls, ylab='MDL Value')

The number of cell types estimated by CAM3.0 is consistent with the ground truth (5 immune cell types and 1 cancer cell line). Once K is determined, we can further extract the relevant cell type-specific marker genes from the data.

# Detecting cell type-specific markers
    S<-rCAM@ASestResult$'6'@Sest
    MGlist <- cotMG(data,rCAM@ASestResult$`6`@Sest, cos.thres = 0.99)

Similarly we can visualize the data in a 2D plot where the markers are highlighted and gather near the vertices of the scatter simplex:

    simplexplot(data, rCAM@ASestResult$`6`@Aest, MGlist = MGlist$mg.list,
                 col = "grey", mg.col = c("red","orange","green","purple","blue","pink"))

This example again showcase that CAM3.0 is able to accurately estimate the number of sources in the mixture, commence its deconvolution, identify and visualize the source-specific marker features.

References



ChiungTingWu/CAM3 documentation built on Feb. 14, 2024, 9:22 a.m.