knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(bascule)
knitr::opts_chunk$set(warning = FALSE, message = FALSE) reticulate::install_python(version="3.9.16") reticulate::py_install(packages="pybascule", pip=TRUE, python_version="3.9.16") py = reticulate::import("pybascule")
We can load the data example_dataset
, a bascule object containing the true signatures and exposures used to generate the mutation counts matrix. In the object, data for SBS and DBS is reported.
data("synthetic_data")
We can extract the mutation count matrix from the object using the get_input
function. With reconstructed=FALSE
we are obtaining the observed counts, and not the reconstructed ones computed as the matrix multiplication of exposures and signatures.
counts = synthetic_data$counts head(counts[["SBS"]][, 1:10]) head(counts[["DBS"]][, 1:10])
We use as reference the COSMIC catalogue for SBS and DBS.
reference_cat = list("SBS"=COSMIC_sbs_filt, "DBS"=COSMIC_dbs)
In the example dataset, the true number of signatures (reference plus de novo) is 5 for both SBS and DBS, thus we can provide as list of K de novo signatures to test values from 0 to 7.
K_list = 0:7
Now, we can fit the model. Let's first fit the NMF to perform signatures deconvolution.
x = fit(counts=counts, k_list=K_list, n_steps=3000, reference_cat=reference_cat, keep_sigs=c("SBS1","SBS5"), # force fixed signatures store_fits=TRUE, py=py)
x = synthetic_data$x
You can inspect the model selection procedure. The plot shows for each tested value of K (i.e., number of de novo signatures), the value of the BIC and likelihood of the respective model. In our implementation, the model with lowest BIC is considered as the best model.
plot_scores(x)
Another variable of interest is the evolution over the iterations of the norms of the gradients for each inferred parameter. A good result, as shown below, is when the norms decreases with inference, reporting an increased stability.
plot_gradient_norms(x)
You can visualize the inferred signatures. Here, we notice that from r nrow(COSMIC_sbs_filt)
and r nrow(COSMIC_dbs)
SBS and DBS signatures, the model only selects r length(get_fixed_signames(x)$SBS)
and r length(get_fixed_signames(x)$DBS)
signatures from the catalogue. Moreover, it also infers a denovo signature from the SBS counts.
plot_signatures(x)
We can notice from the signatures plots a clear similarity between signature SBSD1 and SBS31. We can compute a linear combination on de novo signatures to remove those similar to reference ones. On the refined set of signatures we can run the model to perform clustering of samples based on exposures.
x_refined = refine_denovo_signatures(x) x_refined_cluster = fit_clustering(x_refined, cluster=3)
x_refined_cluster = synthetic_data$x_refined_cluster
The post-fit de novo signatures refinement discarded signature SBSD1, since it showed high similarity with reference signature SBS31.
plot_signatures(x_refined_cluster)
We can visualise the exposures for each patient divided by final group assignment. In this case, BASCULE retrieved 3 groups. Each cluster is characterised by a set of SBS and DBS. For instance, group G0 is the largest one and is characterised by signatures DBS1, DBS11 and DBS3, and SBS1, SBS5 and SBS31.
plot_exposures(x_refined_cluster)
We can also inspect the inferred clustering centroids, reporting for each cluster an average of the group-specific signatures exposures.
plot_centroids(x_refined_cluster)
We can finally visualise the posterior probabilities for each sample's assignment. Each row (samples) of this heatmap sums to 1 and reports the posterior probabilities for each sample to be assigned to each cluster (columns).
plot_posterior_probs(x_refined_cluster)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.