knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(pictograph)
This tutorial walks through how to run PICTograph on a toy example. PICTograph infers the clonal evolution of tumors from multi-region sequencing data. It models uncertainty in assigning mutations to subclones using a Bayesian hierarchical model and reduces the space of possible evolutionary trees by using constraints based on principles of lineage precedence, sum condition, and optionally by sample-presence. The inputs to PICTograph are variant ("alt allele") read counts of small somatic mutations, sequencing depth at the mutation loci, DNA copy number of the tumor genome at the mutation loci, and, if known, the tumor purity and the number of major alleles. If major allele is not entered, PICTograph will estimate it. If multiple tumor samples are considered, an option is available to restrict the number of possible evolutionary trees by partitioning mutations according to sample presence. PICTograph summarizes the posterior distributions of the mutation cluster assignments and the mutation cell fractions (MCF) for each cluster by the mode. The estimates of cluster MCFs are then used to determine the most probable trees. Multiple trees that share the same score can be summarized as an ensemble tree, where edges are weighted by their concordance among constituent trees in the ensemble.
The required user input is a csv file that contains at least columns named "sample", "mutation", "total_reads", "alt_reads", "tumor_integer_copy_number", and "cncf".
head(read.csv(system.file('extdata/example1_snv.csv', package = 'pictograph')))
Users can also provide an optional column "major_integer_copy_number" that provides the information of the integer copy number of the major allele. If "major_integer_copy_number" is not provided, it will be estimated using an internal function built in the package.
head(read.csv(system.file('extdata/example2_snv.csv', package = 'pictograph')))
Another optional column is "purity" column that provides the information of normal contamination of a sample. Purity of 0.8 wil be used if not provided.
head(read.csv(system.file('extdata/example2_snv_with_purity.csv', package = 'pictograph')))
Run mcmcMain to generate all the data in one function. This will store all the output files in the user-provided output directory.
mcmcMain(mutation_file=system.file('extdata/example1_snv.csv', package = 'pictograph'), outputDir=system.file('extdata/output', package = 'pictograph'), sample_presence=TRUE, score="silhouette", # either BIC or silhouette max_K = 10, min_mutation_per_cluster=5, cluster_diff_thresh=0.05, n.iter=5000, n.burn=1000, thin=10, mc.cores=8, inits=list(".RNG.name" = "base::Wichmann-Hill",".RNG.seed" = 123))
If the user want to run each individual function in the tool, the following steps can be used.
The input file can be read using the importCSV function. Alt read count (y), total read count (n), tumor copy number (tcn), and multiplicity (m) are stored in matrices where the columns are samples, and rows are variants. Purity is supplied as a vector. I and S are integers representing the number of variants and number of samples, respectively.
data <- importFiles(mutation_file=system.file('extdata/example1_snv.csv', package = 'pictograph'), outputDir=system.file('extdata/output', package = 'pictograph'))
input_data <- list(y=data$y, n=data$n, tcn=data$tcn, is_cn=data$is_cn, mtp=data$mtp, icn=data$icn, cncf=data$cncf, MutID=data$MutID, purity=data$purity)
The first step of evolutionary analysis is clustering mutations and estimating their mutation cell fractions (MCFs). This is comprised of three steps:
a. By default, for individuals with multiple tumor samples, the mutation data is initially split into sets by sample-presence. Next, PICTograph estimates the joint posterior of MCFs and cluster assignments by Markov chain Monte Carlo (MCMC) across a range of possible values for the number of clusters, $K$ b. Selecting the best $K$ for each mutation set c. Merging the best chains of each mutation set
In toy example, we run a short MCMC chain with only 5000 iterations (n.iter
), burn-in of 1000 (n.burn
), and 10 thinning (thin
). In practice, we recommend running the MCMC for longer e.g. 10,000 iterations, burn-in of 1000, and thinning by 10. By default, PICTograph separates mutations into sets by sample-presence patterns, and clusters mutations separately within each set. The maximum number of clusters can be set with the max_K
option. To run the MCMC chains in parallel, set the mc.cores
option to the number of desired cores.
# using sample presence sep_list <- separateMutationsBySamplePresence(input_data)
all_set_results <- runMCMCForAllBoxes(sep_list, max_K = 5, min_mutation_per_cluster = 10, n.iter=5000, n.burn=1000, thin=10, cluster_diff_thresh = 0.1, mc.cores = 8) # if not use sample presence, run the following # all_set_results <- runMCMCForAllBoxes(input_data, sample_presence = FALSE, max_K = 5, n.iter=5000, n.burn=1000, thin=10, # min_mutation_per_cluster = 10, # cluster_diff_thresh = 0.1, mc.cores = 8)
This gives us a list with results for each mutation set, which contains all_chains
, BIC
, best_chains
, silhouette
, and best_K
.
all_chains
is a list of MCMC chains for each value of $K$ tested. For each $K$, there are chains for cluster MCF (mcf_chain
), mutation cluster assignments (z_chain
), and simulated variant read counts (ystar_chain
) for posterior predictive distributions.
As default, PICTograph chooses the $K$ with the highest silhouette coefficient. The MCMC chains for this chosen $K$ are under best_chains
and the $K$ chosen is listed under best_K
.
Use writeSetKTable
to generate a table with the K at minimum, elbow, and knee points of the BIC plot for each mutation set.
set_k_choices <- writeSetKTable(all_set_results) set_k_choices
We can extract the MCMC chains for the best $K$ of each mutation set using collectBestKChains
. As default, PICTograph chooses the $K$ with the highest silhouette coefficient, and these chains will be automatically extracted. Users also have the option to specify the $K$ to choose for each mutation set by supplying a vector of integers to the parameter chosen_K
in the same order as the listed sets in all_set_results
.
best_set_chains <- collectBestKChains(all_set_results, chosen_K = set_k_choices$silhouette_K) str(best_set_chains, give.attr = F, max.level = 2)
Finally, we can merge best_set_chains
to obtain chains with the final mutation cluster numbering and correct mutation indices (original order provided in input data).
chains <- mergeSetChains(best_set_chains, input_data)
Traces for MCF chains can be visualized to check for convergence.
plotChainsMCF(chains$mcf_chain)
The posterior distribtuion of cluster MCFs can be visualized as violin plots. The number of mutations assigned to each cluster is listed in brackets after the cluster name.
plotMCFViolin(chains$mcf_chain, chains$z_chain, indata = input_data)
We can also visualize the posterior probabilities of mutation cluster assignments and determine the most probable cluster assignments. In this toy example, there is high concordance of the cluster assignments of mutations across the MCMC chain.
plotClusterAssignmentProbVertical(chains$z_chain, chains$mcf_chain)
We can write tables for estimated cluster MCFs and mutation cluster assignments.
writeClusterMCFsTable(chains$mcf_chain) writeClusterAssignmentsTable(chains$z_chain, Mut_ID = input_data$MutID)
We can then use the mutation cluster MCF estimates for tree inference. We first estimate the possible edges by applying lineage precedence filters to order the clusters on a tree. The lineage_precedence_thresh
option allows relaxation of lineage precedence constraints so that the child cluster MCF can exceed a parent cluster MCF by a small amounts. The default of lineage_precedence_thresh
is 0.1. Filtered edges are stored in an object named graph_G
.
Next, we enumerate this constrained tree space, and apply a filter based on the sum condition. The sum_filter_thresh
option allows relaxation of sum condition constraints so that the sum of child cluster MCFs at a branch point can exceed the parent cluster MCF by a small amount. The default of sum_filter_thresh
is 0.2.
generateAllTrees(chains$mcf_chain, data$purity, lineage_precedence_thresh = 0.1, sum_filter_thresh = 0.2)
All spanning trees given by the possible edges that pass the sum condition filter are stored in all_spanning_trees
.
length(all_spanning_trees)
all_spanning_trees
We then calculate a fitness score for all the trees that have passed our filtering and identify the highest scoring tree.
# calculate SCHISM fitness score for all trees scores <- calcTreeScores(chains$mcf_chain, all_spanning_trees, purity=data$purity) scores # highest scoring tree best_tree <- all_spanning_trees[[which.max(scores)]] # plot tree plotTree(best_tree)
In this toy example, there is only one tree with the maximum score. In some cases, multiple trees will share the maximum score. We can plot an ensemble tree to visualize the evolutionary relationships (edges) that are shared among multiple trees. In the ensemble tree, edges are weighted by the number of trees in which they are represented. To illustrate this plotting function, here we plot an ensemble of the two trees that were enumerated. The solid black edges represent those supported in both trees.
plotEnsembleTree(all_spanning_trees)
For individuals with multiple available tumor samples, the proportion of each subclone in each sample can be calculated using calcSubcloneProportions
. This is calculated using the estimated cluster MCFs and the ordering of clusters on the best scoring tree. Two available options for visualizing subclone proportions are pie charts using plotSubclonePie
and stacked bar graphs using plotSubcloneBar
.
subclone_props <- calcSubcloneProportions(mcf_mat, best_tree) plotSubclonePie(subclone_props, sample_names = colnames(input_data$y)) plotSubcloneBar(subclone_props, sample_names = colnames(input_data$y))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.