# CAMTHC User Manual In CAMTHC: Convex Analysis of Mixtures for Tissue Heterogeneity Characterization

knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )  # Introduction Convex Analysis of Mixtures (CAM) is a fully unsupervised computational method to analyze tissue samples composed of unknown numbers and varying proportions of distinct subpopulations [@CAM2016]. CAM assumes that the measured expression level is the weighted sum of each subpopulation’s expression, where the contribution from a single subpopulation is proportional to the abundance and specific expression of that subpopulation. This linear mixing model can be formulated as$\mathbf{X'=AS'}$. CAM can identify molecular markers directly from the original mixed expression matrix,$\mathbf{X}$, and further estimate the constituent proportion matrix,$\mathbf{A}$, as well as subpopulation-specific expression profile matrix,$\mathbf{S}$. CAMTHC is an R package developed for tissue heterogeneity characterization by CAM algorithm. It provides basic functions to perform unsupervised deconvolution on mixture expression profiles by CAM and some auxiliary functions to help understand the subpopulation-specific results. CAMTHC also implements functions to perform supervised deconvolution based on prior knowledge of molecular markers, S matrix or A matrix. Semi-supervised deconvolution can also be achieved by combining molecular markers from CAM and from prior knowledge to analyze mixture expressions. # Quick Start The function CAM() includes all necessary steps to decompose a matrix of mixture expression profiles. There are some optional steps upstream of CAM() that downsample the matrix for reducing running time. Each step in CAM() can also be performed separately if you prefer a more flexible workflow. More details will be introduced in sections below. Starting your analysis by CAM(), you need to specify the range of possible subpopulation numbers and the percentage of low/high-expressed molecules to be removed. Typically, 30% ~ 50% low-expressed genes can be removed from gene expression data. Much less low-expressed proteins are removed, e.g. 0% ~ 10%, due to a limited number of proteins in proteomics data. The removal of high-expressed molecules has much less impact on results, usually set to be 0% ~ 10%. rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95)  # Unsupervised Deconvolution ## Datatypes and Input Format Theoretically, CAMTHC accepts any molecular expression data types as long as the expressions follow the linear mixing model. We have validated the feasibility of CAM in gene expression data (microarray, RNAseq), proteomics data and DNA methylation data. Requirements for the input expression data: • be in non-log linear space with non-negative numerical values (i.e.>=0). • be already processed by normalization and batch effect removal. • no missing values; the molecules containing missing values should be removed prior to CAM. • no all-zero expressed molecules; otherwise, all-zero expressed molecules will be removed internally. The input expression data should be stored in a matrix. Data frame, SummarizedExperiment or ExpressionSet object is also accepted and will be internally coerced into a matrix format before analysis. Each column in the matrix should be a tissue sample. Each row should be a probe/gene/protein/etc. Row names should be provided so that CAM can return the names of detected markers. Otherwise, rows will be automatically named by 1,2,3,... ## CAM Workflow We use a data set downsampled from GSE19830 as an example to show CAM workflow. This data set provides gene expression profiles of pure tissues (brain, liver, lung) and their biological mixtures with different proportions. library(CAMTHC) data(ratMix3) #ratMix3$X: X matrix containing mixture expression profiles to be analyzed
#ratMix3$A: ground truth A matrix containing proportions #ratMix3$S: ground truth S matrix containing subpopulation-specific expression profiles

#rre$Sest: re-estimated S matrix  ### Simplex Plot - 2D Fundamental to the success of CAM is that the scatter simplex of mixed expressions is a rotated and compressed version of the scatter simplex of pure expressions, where the marker genes are located at each vertex. simplexplot() function can show the scatter simplex and detected marker genes in a 2D plot. The vertices in the high-dimensional simplex will still locate at extreme points of the low-dimensional simplex. layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE)) simplexplot(data, Aest, MGlist, main=c('Initially detected markers')) simplexplot(data, Aest, MGlist.FC, main=c('fc > 10')) simplexplot(data, Aest, MGlist.FCboot, main=c(expression(bold(paste('fc(bootstrap,',alpha,'=0.05) > 10'))))) simplexplot(data, Aest, MGlist.re, main=c("fc >= 'q0.2', error <= 'q1.2'"))  The colors and the vertex order displayed in 2D plot can be changed as simplexplot(data, Aest, MGlist.FCboot, data.extra=rbind(t(ratMix3$A),t(Aest)),
corner.order = c(2,1,3), col = "blue",
mg.col = c("red","orange","green"),
ex.col = "black", ex.pch = c(19,19,19,17,17,17))
legend("bottomright", cex=1.2, inset=.01, c("Ground Truth","CAM Estimation"),
pch=c(19,17), col="black")


### Simplex Plot - 3D

We can also observe the convex cone and simplex of mixture expressions in 3D space by PCA. Note that PCA cannot guarantee that vertices are still preserved as extreme points of the dimension-reduced simplex.

Code to show convex cone:

library(rgl)
Xp <- data %*% t(PCAmat(rCAM))
plot3d(Xp[, 1:3], col='gray', size=3,
xlim=range(Xp[,1]), ylim=range(Xp[,2:3]), zlim=range(Xp[,2:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}


Code to show simplex:

library(rgl)
clear3d()
Xproj <- XWProj(data, PCAmat(rCAM))
Xp <- Xproj[,-1]
plot3d(Xp[, 1:3], col='gray', size=3,
xlim=range(Xp[,1:3]), ylim=range(Xp[,1:3]), zlim=range(Xp[,1:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}


### Validation with Ground Truth

If the ground truth A and S matrix are available, the estimation from CAM can be evaluated:

cor(Aest, ratMix3$A) cor(Sest, ratMix3$S)


Considering the presence of many co-expressed genes (housekeeping genes) may dominate the correlation coefficients between ground truth and estimated expression profiles, it is better to assess correlation coefficients over marker genes only.

unlist(lapply(seq_len(3), function(k) {
k.match <- which.max(cor(Aest[,k], ratMix3$A)); mgk <- MGlist.FCboot[[k]]; corr <- cor(Sest[mgk, k], ratMix3$S[mgk, k.match]);
names(corr) <- colnames(ratMix3$A)[k.match]; corr}))  ## Alternative CAM Workflow There major steps in CAM (data preprocessing, marker gene cluster detection, and matrix decomposition) can also be performed separately as a more flexible choice. set.seed(111) #Data preprocession rPrep <- CAMPrep(data, thres.low = 0.30, thres.high = 0.95) #Marker gene cluster detection with a fixed K rMGC <- CAMMGCluster(3, rPrep) #A and S matrix estimation rASest <- CAMASest(rMGC, rPrep, data) #Obtain A and S matrix Aest <- Amat(rASest) Sest <- Smat(rASest) #obtain marker gene list detected by CAM and used for A estimation MGlist <- MGsforA(PrepResult = rPrep, MGResult = rMGC) #obtain a full list of marker genes MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000) MGlist.FC <- lapply(seq_len(3), function(x) rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10]) MGlist.FCboot <- lapply(seq_len(3), function(x) rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10]) MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')  ## Optional Sample Clustering We have implemented PCA in CAM()/CAMPrep() to reduce data dimensions. Sample clustering, as another data dimension reduction method, is optional prior to CAM()/CAMPrep(). #clustering library(apcluster) apres <- apclusterK(negDistMat(r=2), t(data), K = 10) #You can also use apcluster(), but need to make sure the number of clusters is large than potential subpopulation number. data.clusterMean <- lapply(slot(apres,"clusters"), function(x) rowMeans(data[, x, drop = FALSE])) data.clusterMean <- do.call(cbind, data.clusterMean) set.seed(111) rCAM <- CAM(data.clusterMean, K = 2:5, thres.low = 0.30, thres.high = 0.95) # or rPrep <- CAMPrep(data.clusterMean, thres.low = 0.30, thres.high = 0.95)  We can still follow the workflow in \@ref(cam-workflow) or \@ref(alternative-cam-workflow) to obtain marker gene list and estimated S matrix. However, the estimated A matrix is for the new data composed of cluster centers. The A matrix for the original data can be obtained by Sest <- Smat(rCAM,3) MGlist <- MGsforA(rCAM, K = 3) Aest <- AfromMarkers(data, MGlist) #(Optional) alternative re-estimation of A and S matrix rre <- redoASest(data, MGlist, maxIter = 10)  ## Case Study ### GSE11058 GSE11058 run four immune cell lines on microarrays as well as their mixtures of various relative proportions. The code chunk below shows how to use CAM to blindly separate the four mixtures into four pure cell lines. Note that this data set contains 54613 probe/probesets. We can reduce running time by downsampling probe/probesets or remove more low-expressed genes (e.g. 70%). #download data and phenotypes library(GEOquery) gsm<- getGEO('GSE11058') pheno <- pData(phenoData(gsm[[1]]))$characteristics_ch1
mat <- exprs(gsm[[1]])
mat <- mat[-grep("^AFFX", rownames(mat)),]
mat.aggre <- sapply(unique(pheno), function(x) rowMeans(mat[,pheno == x]))
data <- mat.aggre[,5:8]

#running CAM
set.seed(111)
rCAM <- CAM(data, K = 4, thres.low = 0.70, thres.high = 0.95)
Aest <- Amat(rCAM, 4)
Aest

#Use ground truth A to validate CAM-estimated A matrix
Atrue <- matrix(c(2.50, 0.50, 0.10, 0.02,
1.25, 3.17, 4.95, 3.33,
2.50, 4.75, 1.65, 3.33,
3.75, 1.58, 3.30, 3.33), 4,4,
dimnames = list(c("MixA", "MixB", "MixC","MixD"),
c("Jurkat", "IM-9", "Raji", "THP-1")))
Atrue <- Atrue / rowSums(Atrue)
Atrue
cor(Aest, Atrue)


### GSE41826 (methylation)

GSE41826 conducted a reconstitution experiment by mixing neuron and glial derived DNA from a single individual from 10% ~ 90%. The code chunk below shows how to use CAM to blindly separate 9 mixtures into neuron and glial specific CpG methylation quantification. Additional requirements of running CAM on methylation data:

• Use beta-value, not M-value, as methylation quantification.
• CpG sites in X or Y chromosomes must be removed if mixture tissues are from both males and females.
• CpG sites needs to be downsampled since the huge number of CpG sites will slow the clustering process.
#download data
library(GEOquery)
gsm <- getGEO('GSE41826')
mixtureId <- unlist(lapply(paste0('Mix',seq_len(9)),
function(x) gsm[[1]]$geo_accession[gsm[[1]]$title==x]))
data <- gsm[[1]][,mixtureId]

#Remove CpG sites in sex chromosomes if tissues are from both males and females
#Not necessary in this example as mixtures are from the same subject
#gpl<- getGEO('GPL13534')
#annot<-dataTable(gpl)@table[,c('Name','CHR')]
#rownames(annot) <- annot$Name #annot <- annot[rownames(data),] #data <- data[annot$CHR != 'X' & annot$CHR != 'Y',] #downsample CpG sites featureId <- sample(seq_len(nrow(data)), 50000) #Running CAM #When input data has too many probes, lof has to be disabled due to space limit. #When cluster.num is large, quick.select can be enable to increase the speed rCAM <- CAM(data[featureId,], K = 2, thres.low = 0.10, thres.high = 0.60, cluster.num = 100, MG.num.thres = 10, lof.thres = 0, quick.select = 20) MGlist <- MGsforA(rCAM, K = 2) #Identify markers from all CpG sites MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2') #re-esitmation with methylation constraint rre <- redoASest(data, MGlist.re, maxIter = 20, methy = TRUE) #Validation using ground truth A matrix Atrue <- cbind(seq(0.1, 0.9, 0.1), seq(0.9, 0.1, -0.1)) cor(rre$Aest, Atrue)


We can also run CAM on unmethylated quantification, 1 - beta, to obtain quite similar results since the underlying linear mixing model is also applicable for unmethylated probe intensity.

rCAM <- CAM(1 - exprs(data[featureId,]),
K = 2, thres.low = 0.10, thres.high = 0.60,
cluster.num = 100, MG.num.thres = 10,
lof.thres = 0, quick.select = 20)


# Supervised/Semi-supervised Deconvolution

## Molecular markers are known

CAM algorithm can estimate A and S matrix based on blindly detected markers. Thus, we can also use part of CAM algorithm to estimate A and S matrix based on known markers.

The package provides AfromMarkers() to estimate A matrix from marker list.

Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation


Or we can use redoASest() to estimate A and S matrix from marker list and alternatively re-estimate two matrices by ALS to further reduce mean squared error. Note that allowing too many iterations of ALS may bring the risk of a significant deviation from initial values. The constraint for methylation data, $\mathbf{S\in[0,1]}$, can be imposed during re-estimation.

rre <- redoASest(data, MGlist, maxIter = 10)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation
#maxIter = 0: No re-estimation by ALS
#rre$Aest: estimated A matrix #rre$Sest: estimated S matrix


## S matrix is known

Many datasets provide expression profiles for purified cell lines or even every single cell, which can be treated as references of S matrix. Some methods use least squares techniques or support vector regression [@CIBERSORT] to estimate A matrix based on known S matrix. CAMTHC will estimate A matrix by identified markers from known S matrix, which has better performance in terms of correlation coefficient with ground truth A matrix.

data <- ratMix3$X S <- ratMix3$S #known S matrix

#Identify markers
pMGstat <- MGstatistic(S, c("Liver","Brain","Lung"))
pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x)
rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])

#Estimate A matrix
Aest <- AfromMarkers(data, pMGlist.FC)

#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, maxIter = 10)


CAMTHC also supports estimating A matrix directly from known S matrix using least squares method. Marker identification from S matrix is still needed because only markers' squared errors will be counted which facilitates (1) faster computational running time and (2) a greater signal-to-noise ratio owing to markers' discriminatory power [@CIBERSORT].

Aest <- redoASest(data, pMGlist.FC, S=S, maxIter = 0)$Aest  ## A matrix is known With known A matrix, CAMTHC estimates S matrix using non-negative least squares (NNLS) from r CRANpkg("NMF"), and further identify markers. data <- ratMix3$X
A <- ratMix3$A #known A matrix #Estimate S matrix Sest <- t(NMF::.fcnnls(A, t(data))$coef)

#Identify markers
pMGstat <- MGstatistic(data, A)
pMGlist.FC <- lapply(unique(pMGstat$idx), function(x) rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10]) #(Optional) alternative re-estimation rre <- redoASest(data, pMGlist.FC, A=A, maxIter = 10)  ## Semi-supervised Deconvolution When prior information of markers, S matrix and/or A matrix is available, semi-supervised deconvolution can also be performed by combining markers from prior information and markers identified by CAM. While supervised deconvolution cannot handle the underlying subpopulations without prior information, unsupervised deconvolution may miss the subpopulation without enough discrimination power. Therefore, semi-supervised deconvolution can take advantage of both methods. Aest <- AfromMarkers(data, MGlist) #MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation  Or rre <- redoASest(data, MGlist, maxIter = 10) #MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation #maxIter = 0: No re-estimation by ALS #rre$Aest: estimated A matrix
#rre\$Sest: estimated S matrix


# References

## Try the CAMTHC package in your browser

Any scripts or data that you put into this service are public.

CAMTHC documentation built on April 29, 2020, 2:29 a.m.