knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(pictograph)

1. Introduction

This tutorial walks through how to run PICTograph on a toy example. PICTograph infers the clonal evolution of tumors from multi-region sequencing data. It models uncertainty in assigning mutations to subclones using a Bayesian hierarchical model and reduces the space of possible evolutionary trees by using constraints based on principles of lineage precedence, sum condition, and optionally by sample-presence. The inputs to PICTograph are variant ("alt allele") read counts of small somatic mutations, sequencing depth at the mutation loci, DNA copy number of the tumor genome at the mutation loci, and, if known, the tumor purity and the number of major alleles. If major allele is not entered, PICTograph will estimate it. If multiple tumor samples are considered, an option is available to restrict the number of possible evolutionary trees by partitioning mutations according to sample presence. PICTograph summarizes the posterior distributions of the mutation cluster assignments and the mutation cell fractions (MCF) for each cluster by the mode. The estimates of cluster MCFs are then used to determine the most probable trees. Multiple trees that share the same score can be summarized as an ensemble tree, where edges are weighted by their concordance among constituent trees in the ensemble.

2. Input data

The required user input is a csv file that contains at least columns named "sample", "mutation", "total_reads", "alt_reads", "tumor_integer_copy_number", and "cncf".

head(read.csv(system.file('extdata/example1_snv.csv', package = 'pictograph')))

Users can also provide an optional column "major_integer_copy_number" that provides the information of the integer copy number of the major allele. If "major_integer_copy_number" is not provided, it will be estimated using an internal function built in the package.

head(read.csv(system.file('extdata/example2_snv.csv', package = 'pictograph')))

Another optional column is "purity" column that provides the information of normal contamination of a sample. Purity of 0.8 wil be used if not provided.

head(read.csv(system.file('extdata/example2_snv_with_purity.csv', package = 'pictograph')))

3. Run PICTograph in one function

Run mcmcMain to generate all the data in one function. This will store all the output files in the user-provided output directory.

mcmcMain(mutation_file=system.file('extdata/example1_snv.csv', package = 'pictograph'),
         outputDir=system.file('extdata/output', package = 'pictograph'),
         sample_presence=TRUE,
         score="silhouette", # either BIC or silhouette
         max_K = 10, 
         min_mutation_per_cluster=5, 
         cluster_diff_thresh=0.05,
         n.iter=5000, 
         n.burn=1000, 
         thin=10, 
         mc.cores=8, 
         inits=list(".RNG.name" = "base::Wichmann-Hill",".RNG.seed" = 123))

4. Step-through of the tool in a chain

If the user want to run each individual function in the tool, the following steps can be used.

4.1. Import data

The input file can be read using the importCSV function. Alt read count (y), total read count (n), tumor copy number (tcn), and multiplicity (m) are stored in matrices where the columns are samples, and rows are variants. Purity is supplied as a vector. I and S are integers representing the number of variants and number of samples, respectively.

data <- importFiles(mutation_file=system.file('extdata/example1_snv.csv', package = 'pictograph'), 
                    outputDir=system.file('extdata/output', package = 'pictograph'))
input_data <- list(y=data$y,
                     n=data$n,
                     tcn=data$tcn,
                     is_cn=data$is_cn,
                     mtp=data$mtp,
                     icn=data$icn,
                     cncf=data$cncf,
                     MutID=data$MutID,
                     purity=data$purity)

4.2. Clustering mutations and estimating MCFs

The first step of evolutionary analysis is clustering mutations and estimating their mutation cell fractions (MCFs). This is comprised of three steps:

a. By default, for individuals with multiple tumor samples, the mutation data is initially split into sets by sample-presence. Next, PICTograph estimates the joint posterior of MCFs and cluster assignments by Markov chain Monte Carlo (MCMC) across a range of possible values for the number of clusters, $K$ b. Selecting the best $K$ for each mutation set c. Merging the best chains of each mutation set

4.2a. Run clustering and MCF estimation separately for each mutation set

In toy example, we run a short MCMC chain with only 5000 iterations (n.iter), burn-in of 1000 (n.burn), and 10 thinning (thin). In practice, we recommend running the MCMC for longer e.g. 10,000 iterations, burn-in of 1000, and thinning by 10. By default, PICTograph separates mutations into sets by sample-presence patterns, and clusters mutations separately within each set. The maximum number of clusters can be set with the max_K option. To run the MCMC chains in parallel, set the mc.cores option to the number of desired cores.

# using sample presence
sep_list <- separateMutationsBySamplePresence(input_data)
all_set_results <- runMCMCForAllBoxes(sep_list, max_K = 5, min_mutation_per_cluster = 10, n.iter=5000, n.burn=1000, thin=10,
                                      cluster_diff_thresh = 0.1, mc.cores = 8)

# if not use sample presence, run the following
# all_set_results <- runMCMCForAllBoxes(input_data, sample_presence = FALSE, max_K = 5, n.iter=5000, n.burn=1000, thin=10,
#                                       min_mutation_per_cluster = 10, 
#                                       cluster_diff_thresh = 0.1, mc.cores = 8)

This gives us a list with results for each mutation set, which contains all_chains, BIC, best_chains, silhouette, and best_K.

all_chains is a list of MCMC chains for each value of $K$ tested. For each $K$, there are chains for cluster MCF (mcf_chain), mutation cluster assignments (z_chain), and simulated variant read counts (ystar_chain) for posterior predictive distributions.

As default, PICTograph chooses the $K$ with the highest silhouette coefficient. The MCMC chains for this chosen $K$ are under best_chains and the $K$ chosen is listed under best_K.

4.2b. Select the best number of clusters, $K$, for each mutation set

Use writeSetKTable to generate a table with the K at minimum, elbow, and knee points of the BIC plot for each mutation set.

set_k_choices <- writeSetKTable(all_set_results)
set_k_choices

We can extract the MCMC chains for the best $K$ of each mutation set using collectBestKChains. As default, PICTograph chooses the $K$ with the highest silhouette coefficient, and these chains will be automatically extracted. Users also have the option to specify the $K$ to choose for each mutation set by supplying a vector of integers to the parameter chosen_K in the same order as the listed sets in all_set_results.

best_set_chains <- collectBestKChains(all_set_results, chosen_K = set_k_choices$silhouette_K)
str(best_set_chains, give.attr = F, max.level = 2)

4.2c. Merge results for all mutation sets

Finally, we can merge best_set_chains to obtain chains with the final mutation cluster numbering and correct mutation indices (original order provided in input data).

chains <- mergeSetChains(best_set_chains, input_data)

4.2d Visualizing clustering and MCF estimation results

Traces for MCF chains can be visualized to check for convergence.

plotChainsMCF(chains$mcf_chain)

The posterior distribtuion of cluster MCFs can be visualized as violin plots. The number of mutations assigned to each cluster is listed in brackets after the cluster name.

plotMCFViolin(chains$mcf_chain, chains$z_chain, indata = input_data)

We can also visualize the posterior probabilities of mutation cluster assignments and determine the most probable cluster assignments. In this toy example, there is high concordance of the cluster assignments of mutations across the MCMC chain.

plotClusterAssignmentProbVertical(chains$z_chain, chains$mcf_chain)

We can write tables for estimated cluster MCFs and mutation cluster assignments.

writeClusterMCFsTable(chains$mcf_chain)
writeClusterAssignmentsTable(chains$z_chain, Mut_ID = input_data$MutID)

4.3. Tree inference

We can then use the mutation cluster MCF estimates for tree inference. We first estimate the possible edges by applying lineage precedence filters to order the clusters on a tree. The lineage_precedence_thresh option allows relaxation of lineage precedence constraints so that the child cluster MCF can exceed a parent cluster MCF by a small amounts. The default of lineage_precedence_thresh is 0.1. Filtered edges are stored in an object named graph_G.

Next, we enumerate this constrained tree space, and apply a filter based on the sum condition. The sum_filter_thresh option allows relaxation of sum condition constraints so that the sum of child cluster MCFs at a branch point can exceed the parent cluster MCF by a small amount. The default of sum_filter_thresh is 0.2.

generateAllTrees(chains$mcf_chain, data$purity, lineage_precedence_thresh = 0.1, sum_filter_thresh = 0.2)

All spanning trees given by the possible edges that pass the sum condition filter are stored in all_spanning_trees.

length(all_spanning_trees)
all_spanning_trees

We then calculate a fitness score for all the trees that have passed our filtering and identify the highest scoring tree.

# calculate SCHISM fitness score for all trees
scores <- calcTreeScores(chains$mcf_chain, all_spanning_trees, purity=data$purity)
scores
# highest scoring tree
best_tree <- all_spanning_trees[[which.max(scores)]]
# plot tree
plotTree(best_tree)

In this toy example, there is only one tree with the maximum score. In some cases, multiple trees will share the maximum score. We can plot an ensemble tree to visualize the evolutionary relationships (edges) that are shared among multiple trees. In the ensemble tree, edges are weighted by the number of trees in which they are represented. To illustrate this plotting function, here we plot an ensemble of the two trees that were enumerated. The solid black edges represent those supported in both trees.

plotEnsembleTree(all_spanning_trees)

4.4 Subclone proportions

For individuals with multiple available tumor samples, the proportion of each subclone in each sample can be calculated using calcSubcloneProportions. This is calculated using the estimated cluster MCFs and the ordering of clusters on the best scoring tree. Two available options for visualizing subclone proportions are pie charts using plotSubclonePie and stacked bar graphs using plotSubcloneBar.

subclone_props <- calcSubcloneProportions(mcf_mat, best_tree)
plotSubclonePie(subclone_props, sample_names = colnames(input_data$y))
plotSubcloneBar(subclone_props, sample_names = colnames(input_data$y))


KarchinLab/pictograph documentation built on Oct. 25, 2024, 10:10 a.m.