\def\T{\text{T}} \newcommand{\bf}[1]{\mathbf{#1}} \newcommand{\bcdot}[2]{\left. #1 \middle| #2 \right.} \newcommand{\bigCur}[1]{\left{#1\right}}

knitr::opts_chunk$set(
    collapse = TRUE,comment = "#>",
    echo = TRUE,cache = TRUE,
    dev = "png")

# Fix seed
set.seed(12)

Software

Assuming all software dependencies and SMASH [@little2019associating] are installed, we can begin.

req_packs = c("devtools","ggplot2","smarter","SMASH")
for(pack in req_packs){
  library(package = pack,character.only = TRUE)
}

# List package's exported functions
ls("package:SMASH")

Introduction

A biopsied tumor sample could contain one population or multiple subpopulations or subclones that is commonly referred to as intra-tumor heterogeneity or ITH for short. Each subclone is assumed to be characterized by the same somatically altered genomic profile. We will introduce several vocabulary used to characterize ITH.

Definitions

Suppose we know that a tumor sample consists of DNA from three cancer subclones and DNA from normal cells. The tumor purity is the proportion of the tumor specimen that is cancerous and denoted by $\phi$. Let us refer to the three cancer subclones as $A$, $B$, and $C$ and they appear in that order. If we assume a tumor population originated from one parental subclone, each subclone descends from a single parental subclone, and that no subclone can disappear, then one of two scenarios could have occurred.

1) $A \rightarrow B \rightarrow C$ 2) $A \rightarrow B$ and $A \rightarrow C$

We can express these scenarios by subclone configuration matrices where columns read from left to right represent subclones and rows correspond to how mutations arise across subclones or each mutation's possible allocation.

For example in 1) we have

$$ \bf{S}{3,1} \equiv \begin{bmatrix} 1 & 1 & 1 \ 0 & 1 & 1 \ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \bf{a}{3,1,1}^\T \ \bf{a}{3,1,2}^\T \ \bf{a}{3,1,3}^\T \end{bmatrix} $$

stored as eS[[3]][[1]] with value

eS[[3]][[1]]

and in 2) we have

$$ \bf{S}{3,2} = \begin{bmatrix} 1 & 1 & 1 \ 0 & 1 & 0 \ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \bf{a}{3,2,1}^\T \ \bf{a}{3,2,2}^\T \ \bf{a}{3,2,3}^\T \end{bmatrix} $$

stored as eS[[3]][[2]] with value

eS[[3]][[2]]

Currently, the SMASH package has built-in subclone configuration matrices from one up to five subclones stored in the object eS. The biopsying of a tumor sample is similar to sampling a variety of colored marbles from an urn. Thus the tumor sample has unknown proportions for each subclone (denoted $\eta_A, \eta_B, \eta_C$) such that these subclone proportions sum up to the tumor purity $(\phi = \eta_A + \eta_B + \eta_C)$.

In general, let $\bf{\eta} = (\eta_1,\ldots,\eta_Q)^\T$ denote the tumor's subclone proportions for $Q$ subclones with $\phi = \sum_{q=1}^Q \eta_q$. Let $\bf{q} \equiv \frac{1}{\phi} \bf{\eta}$ correspond to the cancer's subclone proportions.

Copy Number

Copy number aberations or CNAs is another component to characterizing tumor data and ITH. In normal cell DNA, we expect to have one allelic copy of maternal DNA and one allelic copy of paternal DNA and thus a genome-wide average total copy number or ploidy of two. In cancer DNA, short or long stretches of genome may undergo copy number change, resulting in unknown integer allelic copy numbers that deviate from the normal state. For SMASH, these integer copy number states need to be inferred by SNP array or next generation sequencing platforms and algorithmic pipelines.

A given genomic segment can be characterized by the underlying pair of allelic copy numbers. We simply refer to the minimum of the two as the minor copy number $(C_m)$ and the maximum of the two as the major copy number $(C_M)$ and the pairing is denoted $(C_m,C_M)$. The total copy number is denoted $C_T$ and defined as $C_T = C_m + C_M$.

To simplify ITH characterization, we assume CNAs are clonal or that the copy number changes occurred in the first subclone and are carried over into all descending subclones. Somatic point mutations could occur in regions overlapping CNAs and thus a point mutation's copy number or multiplicity, denoted $m$, can vary as a function of when it occurs.

Cellular prevalence

Some investigators may be curious what proportion of cancer cells harbor at least one somatic mutation. We define this notion as cellular prevalence which combines a mutation's allocation with the cancer's subclone proportions to correct for sample to sample variability in tumor purity. Here are some examples:

Variant Allele Frequency

The $b$-th point mutation is characterized by a pair of read counts that span the allele. The reads either harbor the reference allele or non-reference or alternate allele and thus we have reference and alternate read counts, respectively, and denoted as $R_{br}$ and $R_{ba}$. The variant allele frequency or VAF is the proportion of total read counts $(R_{ba} + R_{br})$ that harbor the alternate allele.

The expected VAF is a function of a $b$-th mutation's multiplicity $(m_b)$, allocation $(\bf{a}b)$, subclone proportions, and overlapping allelic copy numbers $(C{m,b},C_{M,b})$. The current SMASH model is

$$ \bcdot{R_{ba}}{R_{ba} + R_{br},p_b} \sim Binomial(R_a + R_r,p_b), $$

where

\begin{align} p_b &= \frac{m_b \bf{a}b^\T \bf{\eta}}{(C{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \ &= \frac{m_b \phi \bf{a}b^\T \bf{q}}{(C{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \end{align}

Simulation

maxLOCI = 50
meanDP = 1e3
nCN    = 3

The code below generates a R data.frame as input for SMASH. We have selected subclone configuration matrix eS[[3]][[2]] as the true tree structure with r maxLOCI mutated loci with overlaying copy number alterations and r nCN copy number pairings.

sim = gen_subj_truth(
  mat_eS = eS[[3]][[2]],
  maxLOCI = maxLOCI,
  nCN = nCN)
class(sim)
names(sim)

We can inspect the object's elements. CN_1 and CN_2 correspond to the minor and major allelic copy numbers, respectively.

# Underlying true allocation, multiplicity, and cellular prevalence per point mutation
dim(sim$subj_truth)
sim$subj_truth[1:5,]

# Combinations of allelic copy number, allocation, and multiplicity 
#   with resulting mean VAFs (written as MAF) and cellular prevalences
uniq_states = unique(sim$subj_truth[,
  c("CN_1","CN_2","true_A","true_M","true_MAF","true_CP")])
rownames(uniq_states) = NULL
round(uniq_states,3)

# Tumor purity
sim$purity

# Tumor subclone proportions
sim$eta

# Cancer subclone proportions
sim$eta / sim$purity
sim$q

Next we need to generate the corresponding read counts per mutation. We will set the mean total read depth to r meanDP. Higher read depths result in narrow measurement of the VAF per mutation and can lead to improved ITH inference. In addition, more detected mutations and help improve clustering performance.

mat_RD = gen_ITH_RD(DATA = sim$subj_truth,RD = meanDP)
dim(mat_RD); mat_RD[1:5,]
smart_hist(rowSums(mat_RD),breaks = 20,
  xlab = "Total Depth",cex.lab = 1.5)

# Combine copy number and read count information
input = cbind(sim$subj_truth,mat_RD)

# Calculate observed VAF
input$VAF = input$tAD / rowSums(input[,c("tAD","tRD")])

# Remove rows with no alternate depth
input = input[which(input$tAD > 0),]
dim(input)
input[1:3,]

We can take a look at the observed VAF distribution and overlay the underlying expected VAF over all loci and by copy number pairing.

# All loci
smart_hist(input$VAF,breaks = 30,main = "All Loci",
  xlab = "Observed VAF",cex.lab = 1.5)
abline(v = unique(input$true_MAF),lty = 2,lwd = 2,col = "magenta")

# Loci per copy number state
uCN = unique(input[,c("CN_1","CN_2")])
tmp_range = range(input$VAF)

for(ii in seq(nrow(uCN))){
  # ii = 3
  idx = which(input$CN_1 == uCN$CN_1[ii] & input$CN_2 == uCN$CN_2[ii])
  smart_hist(input$VAF[idx],breaks = 20,xlim = tmp_range,
    main = sprintf("Loci with (CN_1,CN_2) = (%s,%s)",uCN$CN_1[ii],uCN$CN_2[ii]),
    xlab = "Observed VAF",cex.lab = 1.5)
  abline(v = unique(input$true_MAF[idx]),lty = 2,lwd = 2,col = "magenta")
}

Optimization

SMASH assumes the true or estimate tumor purity is provided as well as each somatic point mutation's pair of overlapping allelic copy numbers.

Known configuration

The code below will perform optimization for the true subclone configuration matrix.

Here we will set the mixture proportion for loci that do not fit any multiplicity and allocation combination to zero (forcing all loci to classify to a combination). In addition, we will initialize the subclone proportions to the truth.

# Calculate true_unc_q, the unconstrained subclone proportions in cancer
true_unc_q = log(sim$q[-length(sim$q)] / sim$q[length(sim$q)])
true_unc_q

# Optimize
ith_out = ITH_optim(my_data = input,
  my_purity = sim$purity,
  pi_eps0 = 0,
  my_unc_q = true_unc_q,
  init_eS = eS[[3]][[2]])

# Estimate of unclassified loci
ith_out$pi_eps

# Model fit BIC
ith_out$BIC

# Estimate of subclone proportions in cancer
nSC = length(ith_out$q)
tmp_df = smart_df(SC = as.character(rep(seq(nSC),2)),
  CLASS = c(rep("Truth",nSC),rep("Estimate",nSC)),
  VALUE = c(sim$q,ith_out$q))
tmp_df

ggplot(data = tmp_df,mapping = aes(x = SC,y = VALUE,fill = CLASS)) +
  geom_bar(width = 0.5,stat = "identity",position = position_dodge()) +
  xlab("Subclone") + ylab("Proportion in Cancer") +
  theme(legend.position = "bottom",
    text = element_text(size = 20))

# Compare truth vs estimated/inferred

## Variant Allele Frequency
smoothScatter(input$true_MAF,ith_out$infer$infer_MAF,
  main = "VAF",xlab = "Truth",ylab = "Inferred",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")

smoothScatter(input$VAF,ith_out$infer$infer_MAF,
  main = "VAF",xlab = "Observed",ylab = "Inferred",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")

## Cellular prevalence
smoothScatter(input$true_CP,
  ith_out$infer$infer_CP,main = "Cellular Prevalence",
  xlim = c(0,1),ylim = c(0,1),
  xlab = "Truth",ylab = "Estimate",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")

## Allocation
smoothScatter(input$true_A,ith_out$infer$infer_A,
  main = "Allocation",xlab = "Truth",ylab = "Inferred",
  cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")

if( any(is.na(ith_out$infer$infer_A)) )
  cat("Some loci couldn't classify\n")

## Multiplicity
smart_table(Truth = input$true_M,
  Inferred = ith_out$infer$infer_M)

Unknown configuration

We may or may not have achieved the global optimal solution and underlying truth in the above section. Luckily SMASH provides a hands-off thorough grid search without any prior knowledge of the number of subclones, subclone configuration, subclone proportions, proportion of unclassified point mutations, etc. in the following code.

grid_ith = grid_ITH_optim(
  my_data = input,
  my_purity = sim$purity,
  list_eS = eS,
  trials = 50,
  max_iter = 4e3)
names(grid_ith)

# Grid of solutions
grid_ith$GRID

# Order solution(s) based on BIC (larger BIC correspond to better fits)
gg = grid_ith$GRID
gg = gg[order(-gg$BIC),]
head(gg)

# true cancer proportions
sim$q

# true entropy
-sum(sim$q * log(sim$q))

From the grid of solutions (grid_ith$GRID), we can use AIC or BIC to score the model fit per solution. Sometimes the same configuration matrix can lead to multiple solutions. The column definitions are provided.

We can calculate a posterior probability to compare and visualize model fits with the following formula:

$$ p_s = \frac{\exp(0.5 IC_s)}{\sum_{t=1}^T exp(0.5 IC_t)}, $$

where $IC_s$ corresponds to the $s$-th model's information criterion. The function logSumExp from package smartr will aid us here.

pp = vis_GRID(GRID = grid_ith$GRID)
print(pp)

The above figure provides a landscape of which solutions appear more favorable to others.

Downstream Applications

Multiple metrics from SMASH output can be extracted. One could obtain ...

gg = grid_ith$GRID
idx = which(gg$BIC == max(gg$BIC))
gg[idx,]

gg$cc[idx][1]
opt_entropy = gg$entropy[idx]
names(opt_entropy) = sprintf("Solution %s",idx)
opt_entropy
props = list()
for(jj in seq(length(idx))){
  opt_prop = gg$q[idx[jj]]
  opt_prop = as.numeric(strsplit(opt_prop,",")[[1]])
  names(opt_prop) = sprintf("SC%s",seq(length(opt_prop)))
  props[[sprintf("Solution %s",idx[jj])]] = opt_prop
}
props
for(jj in seq(length(idx))){
  cat(sprintf("Solution %s\n",idx[jj]))
  print(head(grid_ith$INFER[[idx[jj]]]))
}
opt_cps = list()
for(jj in seq(length(idx))){
  # jj = 1
  tab = table(smart_digits(grid_ith$INFER[[idx[jj]]]$infer_CP,4),
    grid_ith$INFER[[idx[jj]]]$infer_A)
  rownames(tab) = sprintf("CP = %s",rownames(tab))
  colnames(tab) = sprintf("ALLOC = %s",colnames(tab))
  opt_cps[[sprintf("Solution %s",idx[jj])]] = tab
}
opt_cps
opt = list()
for(jj in seq(length(idx))){
  # jj = 1
  opt_cc = gg$cc[idx[jj]]
  opt_kk = gg$kk[idx[jj]]
  mat = eS[[opt_cc]][[opt_kk]]
  dimnames(mat) = list(sprintf("ALLOC = %s",seq(opt_cc)),sprintf("SC%s",seq(opt_cc)))
  opt[[sprintf("Solution %s",idx[jj])]] = mat
}
opt

Session Info

sessionInfo()

References



Sun-lab/SMASH documentation built on March 5, 2025, 8 p.m.