knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(mSigHdp)
library(ICAMS)
library(cosmicsig)

mSigHdp uses hierarchical Dirichlet process mixture modeling to discover mutational signatures. For real data this is very compute intensive, so here we will show only a toy example with tumors and the typical classification of single-base-substitution (SBS) mutations in 96 types. We generate the synthetic spectra based on two mutational signatures, SBS1 and SBS22. We get the signatures from CRAN package cosmicsig, which provides signatures from https://cancer.sanger.ac.uk/signatures/. We set a specific seed so that the synthetic spectra are the same every time we run the vignette.

sigs <- cosmicsig::COSMIC_v3.2$signature$GRCh37$SBS96[  , c("SBS1", "SBS22")]
set.seed(2022)

We generate 10 random exposures and 2 highly-enriched exposures for each signature (so we can discover them quickly), and make them into an exposure matrix.

n.tumor <- 10
exposures1 <- c(rnbinom(n.tumor, mu = 2000, size = 5), 10000, 100)
exposures2 <- c(rnbinom(n.tumor, mu = 10000, size = 1000), 100, 5000)
exposures <- rbind(exposures1, exposures2)

Then we generate spectra by multiplying the signatures times the exposures and rounding, because we need integer mutation counts. We convert the matrix into an ICAMS catalog so we can use custom plotting later.

toy_data <- round(sigs %*% exposures)
colnames(toy_data) <- paste("T", 1:(n.tumor + 2), sep="")
toy_data <- ICAMS::as.catalog(toy_data, catalog.type = "counts")

Here are the first few and last few rows of the 12 spectra. Each column is one of the 12 synthetic tumors. Each row is a mutation type, for example in in row 5 the row name CCAA indicates a mutation from CCA to CAA.

knitr::kable(toy_data[1:5,  ])
knitr::kable(toy_data[91:96, ])

Most people will want to use the RunHdxParallel function. Most of the parameters in the example here are not suitable for real data. Please see our paper to be posted on bioRxiv soon for suggestions for actual values to use.

results <- mSigHdp::RunHdpxParallel(
  input.catalog        = toy_data,
  out.dir              = "vignette_output",
  num.child.process    = 4, 
  CPU.cores            = 2,
  seedNumber           = 123,
  K.guess              = 5,
  burnin               = 1000,
  burnin.multiplier    = 2,
  post.n               = 5, 
  post.space           = 10, 
  multi.types          = FALSE, 
  overwrite            = TRUE,
  gamma.alpha          = 1,
  gamma.beta           = 20, 
  high.confidence.prop = 0.9,
  checkpoint           = TRUE,
  verbose              = FALSE) 

The results are in both the variable results and the directory vignette_output, which also contains plots of the output and diagnostic plots. It is convenient to automatically save the results to files since the program will likely have been running for days, and you will likely want to save the outputs for later examination in any case. The return value of RunHdpxParallel will be included in the directory in binary form as hdp.retval.Rdata.

The directory vignette_output contains a plot of the extracted (discovered) signatures in file extracted.signatures.pdf.

knitr::include_graphics("vignette_output/extracted.signatures.pdf")

The signatures in numerical form are in file extracted.signatures.csv.

extracted.sigs <- ICAMS::ReadCatalog("vignette_output/extracted.signatures.csv")
knitr::kable(extracted.sigs[1:10, ])


steverozen/mSigHdp documentation built on Feb. 6, 2023, 1:36 a.m.