\def\T{\text{T}} \newcommand{\bf}[1]{\mathbf{#1}} \newcommand{\bcdot}[2]{\left. #1 \middle| #2 \right.} \newcommand{\bigCur}[1]{\left{#1\right}}
knitr::opts_chunk$set( collapse = TRUE,comment = "#>", echo = TRUE,cache = TRUE, dev = "png") # Fix seed set.seed(12)
Assuming all software dependencies and SMASH [@little2019associating] are installed, we can begin.
req_packs = c("devtools","ggplot2","smarter","SMASH") for(pack in req_packs){ library(package = pack,character.only = TRUE) } # List package's exported functions ls("package:SMASH")
A biopsied tumor sample could contain one population or multiple subpopulations or subclones that is commonly referred to as intra-tumor heterogeneity or ITH for short. Each subclone is assumed to be characterized by the same somatically altered genomic profile. We will introduce several vocabulary used to characterize ITH.
Suppose we know that a tumor sample consists of DNA from three cancer subclones and DNA from normal cells. The tumor purity is the proportion of the tumor specimen that is cancerous and denoted by $\phi$. Let us refer to the three cancer subclones as $A$, $B$, and $C$ and they appear in that order. If we assume a tumor population originated from one parental subclone, each subclone descends from a single parental subclone, and that no subclone can disappear, then one of two scenarios could have occurred.
1) $A \rightarrow B \rightarrow C$ 2) $A \rightarrow B$ and $A \rightarrow C$
We can express these scenarios by subclone configuration matrices where columns read from left to right represent subclones and rows correspond to how mutations arise across subclones or each mutation's possible allocation.
For example in 1) we have
$$ \bf{S}{3,1} \equiv \begin{bmatrix} 1 & 1 & 1 \ 0 & 1 & 1 \ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \bf{a}{3,1,1}^\T \ \bf{a}{3,1,2}^\T \ \bf{a}{3,1,3}^\T \end{bmatrix} $$
stored as eS[[3]][[1]]
with value
eS[[3]][[1]]
and in 2) we have
$$ \bf{S}{3,2} = \begin{bmatrix} 1 & 1 & 1 \ 0 & 1 & 0 \ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \bf{a}{3,2,1}^\T \ \bf{a}{3,2,2}^\T \ \bf{a}{3,2,3}^\T \end{bmatrix} $$
stored as eS[[3]][[2]]
with value
eS[[3]][[2]]
Currently, the SMASH
package has built-in subclone configuration matrices from one up to five subclones stored in the object eS
. The biopsying of a tumor sample is similar to sampling a variety of colored marbles from an urn. Thus the tumor sample has unknown proportions for each subclone (denoted $\eta_A, \eta_B, \eta_C$) such that these subclone proportions sum up to the tumor purity $(\phi = \eta_A + \eta_B + \eta_C)$.
In general, let $\bf{\eta} = (\eta_1,\ldots,\eta_Q)^\T$ denote the tumor's subclone proportions for $Q$ subclones with $\phi = \sum_{q=1}^Q \eta_q$. Let $\bf{q} \equiv \frac{1}{\phi} \bf{\eta}$ correspond to the cancer's subclone proportions.
Copy number aberations or CNAs is another component to characterizing tumor data and ITH. In normal cell DNA, we expect to have one allelic copy of maternal DNA and one allelic copy of paternal DNA and thus a genome-wide average total copy number or ploidy of two. In cancer DNA, short or long stretches of genome may undergo copy number change, resulting in unknown integer allelic copy numbers that deviate from the normal state. For SMASH
, these integer copy number states need to be inferred by SNP array or next generation sequencing platforms and algorithmic pipelines.
A given genomic segment can be characterized by the underlying pair of allelic copy numbers. We simply refer to the minimum of the two as the minor copy number $(C_m)$ and the maximum of the two as the major copy number $(C_M)$ and the pairing is denoted $(C_m,C_M)$. The total copy number is denoted $C_T$ and defined as $C_T = C_m + C_M$.
To simplify ITH characterization, we assume CNAs are clonal or that the copy number changes occurred in the first subclone and are carried over into all descending subclones. Somatic point mutations could occur in regions overlapping CNAs and thus a point mutation's copy number or multiplicity, denoted $m$, can vary as a function of when it occurs.
Some investigators may be curious what proportion of cancer cells harbor at least one somatic mutation. We define this notion as cellular prevalence which combines a mutation's allocation with the cancer's subclone proportions to correct for sample to sample variability in tumor purity. Here are some examples:
If a mutation in scenario 1 occurred in the second row, the cellular prevalence is $$ \frac{\eta_B + \eta_C}{\phi} = \bf{a}_{3,1,2}^\T \bf{q}. $$
If a mutation in scenario 2 occurred in the second row, the cellular prevalence is $$ \frac{\eta_B}{\phi} = \bf{a}_{3,2,2}^\T \bf{q}. $$
The $b$-th point mutation is characterized by a pair of read counts that span the allele. The reads either harbor the reference allele or non-reference or alternate allele and thus we have reference and alternate read counts, respectively, and denoted as $R_{br}$ and $R_{ba}$. The variant allele frequency or VAF is the proportion of total read counts $(R_{ba} + R_{br})$ that harbor the alternate allele.
The expected VAF is a function of a $b$-th mutation's multiplicity $(m_b)$, allocation $(\bf{a}b)$, subclone proportions, and overlapping allelic copy numbers $(C{m,b},C_{M,b})$. The current SMASH
model is
$$ \bcdot{R_{ba}}{R_{ba} + R_{br},p_b} \sim Binomial(R_a + R_r,p_b), $$
where
\begin{align} p_b &= \frac{m_b \bf{a}b^\T \bf{\eta}}{(C{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \ &= \frac{m_b \phi \bf{a}b^\T \bf{q}}{(C{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \end{align}
maxLOCI = 50 meanDP = 1e3 nCN = 3
The code below generates a R data.frame as input for SMASH
. We have selected subclone configuration matrix eS[[3]][[2]]
as the true tree structure with r maxLOCI
mutated loci with overlaying copy number alterations and r nCN
copy number pairings.
sim = gen_subj_truth( mat_eS = eS[[3]][[2]], maxLOCI = maxLOCI, nCN = nCN) class(sim) names(sim)
We can inspect the object's elements. CN_1
and CN_2
correspond to the minor and major allelic copy numbers, respectively.
# Underlying true allocation, multiplicity, and cellular prevalence per point mutation dim(sim$subj_truth) sim$subj_truth[1:5,] # Combinations of allelic copy number, allocation, and multiplicity # with resulting mean VAFs (written as MAF) and cellular prevalences uniq_states = unique(sim$subj_truth[, c("CN_1","CN_2","true_A","true_M","true_MAF","true_CP")]) rownames(uniq_states) = NULL round(uniq_states,3) # Tumor purity sim$purity # Tumor subclone proportions sim$eta # Cancer subclone proportions sim$eta / sim$purity sim$q
Next we need to generate the corresponding read counts per mutation. We will set the mean total read depth to r meanDP
. Higher read depths result in narrow measurement of the VAF per mutation and can lead to improved ITH inference. In addition, more detected mutations and help improve clustering performance.
mat_RD = gen_ITH_RD(DATA = sim$subj_truth,RD = meanDP) dim(mat_RD); mat_RD[1:5,] smart_hist(rowSums(mat_RD),breaks = 20, xlab = "Total Depth",cex.lab = 1.5) # Combine copy number and read count information input = cbind(sim$subj_truth,mat_RD) # Calculate observed VAF input$VAF = input$tAD / rowSums(input[,c("tAD","tRD")]) # Remove rows with no alternate depth input = input[which(input$tAD > 0),] dim(input) input[1:3,]
We can take a look at the observed VAF distribution and overlay the underlying expected VAF over all loci and by copy number pairing.
# All loci smart_hist(input$VAF,breaks = 30,main = "All Loci", xlab = "Observed VAF",cex.lab = 1.5) abline(v = unique(input$true_MAF),lty = 2,lwd = 2,col = "magenta") # Loci per copy number state uCN = unique(input[,c("CN_1","CN_2")]) tmp_range = range(input$VAF) for(ii in seq(nrow(uCN))){ # ii = 3 idx = which(input$CN_1 == uCN$CN_1[ii] & input$CN_2 == uCN$CN_2[ii]) smart_hist(input$VAF[idx],breaks = 20,xlim = tmp_range, main = sprintf("Loci with (CN_1,CN_2) = (%s,%s)",uCN$CN_1[ii],uCN$CN_2[ii]), xlab = "Observed VAF",cex.lab = 1.5) abline(v = unique(input$true_MAF[idx]),lty = 2,lwd = 2,col = "magenta") }
SMASH assumes the true or estimate tumor purity is provided as well as each somatic point mutation's pair of overlapping allelic copy numbers.
The code below will perform optimization for the true subclone configuration matrix.
Here we will set the mixture proportion for loci that do not fit any multiplicity and allocation combination to zero (forcing all loci to classify to a combination). In addition, we will initialize the subclone proportions to the truth.
# Calculate true_unc_q, the unconstrained subclone proportions in cancer true_unc_q = log(sim$q[-length(sim$q)] / sim$q[length(sim$q)]) true_unc_q # Optimize ith_out = ITH_optim(my_data = input, my_purity = sim$purity, pi_eps0 = 0, my_unc_q = true_unc_q, init_eS = eS[[3]][[2]]) # Estimate of unclassified loci ith_out$pi_eps # Model fit BIC ith_out$BIC # Estimate of subclone proportions in cancer nSC = length(ith_out$q) tmp_df = smart_df(SC = as.character(rep(seq(nSC),2)), CLASS = c(rep("Truth",nSC),rep("Estimate",nSC)), VALUE = c(sim$q,ith_out$q)) tmp_df ggplot(data = tmp_df,mapping = aes(x = SC,y = VALUE,fill = CLASS)) + geom_bar(width = 0.5,stat = "identity",position = position_dodge()) + xlab("Subclone") + ylab("Proportion in Cancer") + theme(legend.position = "bottom", text = element_text(size = 20)) # Compare truth vs estimated/inferred ## Variant Allele Frequency smoothScatter(input$true_MAF,ith_out$infer$infer_MAF, main = "VAF",xlab = "Truth",ylab = "Inferred",cex.lab = 1.5) abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red") smoothScatter(input$VAF,ith_out$infer$infer_MAF, main = "VAF",xlab = "Observed",ylab = "Inferred",cex.lab = 1.5) abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red") ## Cellular prevalence smoothScatter(input$true_CP, ith_out$infer$infer_CP,main = "Cellular Prevalence", xlim = c(0,1),ylim = c(0,1), xlab = "Truth",ylab = "Estimate",cex.lab = 1.5) abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red") ## Allocation smoothScatter(input$true_A,ith_out$infer$infer_A, main = "Allocation",xlab = "Truth",ylab = "Inferred", cex.lab = 1.5) abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red") if( any(is.na(ith_out$infer$infer_A)) ) cat("Some loci couldn't classify\n") ## Multiplicity smart_table(Truth = input$true_M, Inferred = ith_out$infer$infer_M)
We may or may not have achieved the global optimal solution and underlying truth in the above section. Luckily SMASH
provides a hands-off thorough grid search without any prior knowledge of the number of subclones, subclone configuration, subclone proportions, proportion of unclassified point mutations, etc. in the following code.
grid_ith = grid_ITH_optim( my_data = input, my_purity = sim$purity, list_eS = eS, trials = 50, max_iter = 4e3) names(grid_ith) # Grid of solutions grid_ith$GRID # Order solution(s) based on BIC (larger BIC correspond to better fits) gg = grid_ith$GRID gg = gg[order(-gg$BIC),] head(gg) # true cancer proportions sim$q # true entropy -sum(sim$q * log(sim$q))
From the grid of solutions (grid_ith$GRID
), we can use AIC or BIC to score the model fit per solution. Sometimes the same configuration matrix can lead to multiple solutions. The column definitions are provided.
cc
= the number of subcloneskk
= the index for a configuration matrix given cc
subclonesms
= the model size or number of parameters estimatedentropy
= the negative dot product of the subclone proportions in cancer and their logarithmLL
= solution's maximum log-likelihoodAIC
= Akaike information criterionBIC
= Bayesian information criterionq
= subclone proportions in cancer (rounded off to two decimal places)unc_q0
= unconstrained subclone proportions in cancer used to initialize the optimization, not the maximum likelihood estimates!alloc
= the number of new point mutations that characterize a subclone. pi_eps
= the proportion of inputted mutations that failed to classify to an allocation and multiplicityWe can calculate a posterior probability to compare and visualize model fits with the following formula:
$$ p_s = \frac{\exp(0.5 IC_s)}{\sum_{t=1}^T exp(0.5 IC_t)}, $$
where $IC_s$ corresponds to the $s$-th model's information criterion. The function logSumExp
from package smartr
will aid us here.
pp = vis_GRID(GRID = grid_ith$GRID) print(pp)
The above figure provides a landscape of which solutions appear more favorable to others.
Multiple metrics from SMASH
output can be extracted. One could obtain ...
gg = grid_ith$GRID idx = which(gg$BIC == max(gg$BIC)) gg[idx,] gg$cc[idx][1]
opt_entropy = gg$entropy[idx] names(opt_entropy) = sprintf("Solution %s",idx) opt_entropy
props = list() for(jj in seq(length(idx))){ opt_prop = gg$q[idx[jj]] opt_prop = as.numeric(strsplit(opt_prop,",")[[1]]) names(opt_prop) = sprintf("SC%s",seq(length(opt_prop))) props[[sprintf("Solution %s",idx[jj])]] = opt_prop } props
for(jj in seq(length(idx))){ cat(sprintf("Solution %s\n",idx[jj])) print(head(grid_ith$INFER[[idx[jj]]])) }
opt_cps = list() for(jj in seq(length(idx))){ # jj = 1 tab = table(smart_digits(grid_ith$INFER[[idx[jj]]]$infer_CP,4), grid_ith$INFER[[idx[jj]]]$infer_A) rownames(tab) = sprintf("CP = %s",rownames(tab)) colnames(tab) = sprintf("ALLOC = %s",colnames(tab)) opt_cps[[sprintf("Solution %s",idx[jj])]] = tab } opt_cps
opt = list() for(jj in seq(length(idx))){ # jj = 1 opt_cc = gg$cc[idx[jj]] opt_kk = gg$kk[idx[jj]] mat = eS[[opt_cc]][[opt_kk]] dimnames(mat) = list(sprintf("ALLOC = %s",seq(opt_cc)),sprintf("SC%s",seq(opt_cc))) opt[[sprintf("Solution %s",idx[jj])]] = mat } opt
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.