SynSigGen: SynSigGen

SynSigGen R Documentation

SynSigGen

Description

Create catalogs of synthetic mutational spectra for assessing the performance of mutational-signature analysis programs.

Overview

The main focus is generating synthetic catalogs of mutational spectra (mutations in tumors) based on known mutational signature profiles and software-inferred exposures (software estimates of the number of mutations induced by each mutational signature in each tumor) in the PCAWG7 data. We broadly call this kind of synthetic data "reality-based" synthetic data. The package also has a set of functions that generate random mutational signature profiles and then create synthetic mutational spectra based on these random signature profiles. We call this kind of synthetic data "random" synthetic data, while pointing out that much depends on the distributions from which the random signature profiles and exposures are generated.

Workflow for generating "reality-based" synthetic mutational spectra

A typical workflow for generating synthetic mutational spectra is as follows.

  1. Input (based on SignatureAnalyzer or SigProfiler analysis of PCAWG tumors):

    • E, matrix of software-inferred exposures of mutational signatures (signatures x samples)

    • S, mutational signature profiles (mutation types x signatures)

  2. Obtain distribution parameters from software-inferred exposures

      P <- GetSynSigParamsFromExposures(E, ...)
    
  3. Generate exposures for synthetic mutational spectra based on P

      synthetic.exposures <- GenerateSyntheticExposures(P, ...)
    
  4. Generate synthetic mutational spectra by multiplying S and synthetic.exposures and rounding the product to the nearest integer:

      synthetic.spectra <- CreateAndWriteCatalog(S, synthetic.exposures, ...)
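
The four steps above can be sketched in base R on toy data. This is a minimal illustration, not the package's implementation: EstimateParams and DrawExposures are hypothetical helper names, and the choice to summarize each signature by its prevalence plus the mean and standard deviation of log10 non-zero exposures is an assumption made for the example. All numbers are made up.

```r
set.seed(1)

# Toy signature profiles S (mutation types x signatures); each column sums to 1.
S <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.3, 0.6),
            nrow = 3,
            dimnames = list(c("type1", "type2", "type3"), c("sigA", "sigB")))

# Toy software-inferred exposures E (signatures x samples); values are made up.
E <- matrix(c(100,   0, 250, 40,
               30, 500,   0, 80),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("sigA", "sigB"), paste0("tumor", 1:4)))

# Step 2 (assumed summary, hypothetical helper): for each signature, the
# fraction of samples with non-zero exposure and the mean and standard
# deviation of the log10 non-zero exposures.
EstimateParams <- function(exposures) {
  t(apply(exposures, 1, function(x) {
    nz <- log10(x[x > 0])
    c(prob = length(nz) / length(x), mean = mean(nz), stdev = sd(nz))
  }))
}
P <- EstimateParams(E)

# Step 3 (hypothetical helper): decide per sample whether the signature is
# present, then draw its mutation count on the log10 scale.
DrawExposures <- function(params, num.samples) {
  out <- t(apply(params, 1, function(p) {
    present <- rbinom(num.samples, size = 1, prob = p[["prob"]])
    counts  <- 10^rnorm(num.samples, mean = p[["mean"]], sd = p[["stdev"]])
    present * counts
  }))
  colnames(out) <- paste0("synthetic", seq_len(num.samples))
  out
}
synthetic.exposures <- DrawExposures(P, num.samples = 5)

# Step 4: multiply profiles by exposures and round to whole mutation counts.
synthetic.spectra <- round(S %*% synthetic.exposures)
```

The result has one row per mutation type and one column per synthetic tumor, which matches the orientation of S and E described in step 1.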
    

Workflow for generating "random" synthetic mutational spectra

The top-level function for generating "random" synthetic mutational spectra is CreateRandomSyn. It takes the following steps to generate catalogs of "random" synthetic mutational spectra.

  1. Create random mutational signature profiles:

      S <- CreateRandomMutSigProfiles(...)
    
  2. Generate distribution parameters for exposures of random signatures:

      P <- CreateMeanAndStdevForSigs(sig.names = colnames(S), ...)
    
  3. Create exposures for mutational signatures based on P and other parameters:

      synthetic.exposures <- CreateRandomExposures(sigs = S, per.sig.mean.and.sd = P)
    
  4. Generate synthetic mutational spectra by multiplying S and synthetic.exposures and rounding the product to the nearest integer:

      synthetic.spectra <- NewCreateAndWriteCatalog(S, synthetic.exposures, ...)
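
As a rough base R sketch of the same four steps, the example below builds random profiles, picks per-signature exposure parameters, draws exposures, and forms the spectra. It does not call the package functions, and every distribution here (exponential draws for profiles, uniform ranges for the means and standard deviations, log10-normal exposures) is an assumption chosen only for illustration. The 96 rows mirror the usual single-base-substitution classification in trinucleotide context.

```r
set.seed(2)
num.types   <- 96  # mutation types (rows of S)
num.sigs    <- 3   # random signatures (columns of S)
num.samples <- 4   # synthetic tumors

# Step 1: random non-negative profiles, each column rescaled to sum to 1.
raw <- matrix(rexp(num.types * num.sigs), nrow = num.types)
S <- sweep(raw, 2, colSums(raw), "/")
colnames(S) <- paste0("RandSig", seq_len(num.sigs))

# Step 2: per-signature mean and sd of log10 exposure (assumed ranges).
P <- data.frame(mean = runif(num.sigs, min = 2, max = 3),
                sd   = runif(num.sigs, min = 0.2, max = 0.5),
                row.names = colnames(S))

# Step 3: draw exposures (signatures x samples) on the log10 scale.
synthetic.exposures <- t(sapply(seq_len(num.sigs), function(i) {
  10^rnorm(num.samples, mean = P$mean[i], sd = P$sd[i])
}))
dimnames(synthetic.exposures) <-
  list(colnames(S), paste0("synthetic", seq_len(num.samples)))

# Step 4: multiply and round to whole mutation counts.
synthetic.spectra <- round(S %*% synthetic.exposures)
```

Changing the distributions in steps 1 to 3 changes the character of the resulting data, which is why the choice of distributions matters for "random" synthetic data.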
    

Function for generating "SBS1-SBS5-correlated" synthetic mutational spectra

CreateSBS1SBS5CorrelatedSyntheticData is the top-level function for generating 20 data sets, each of which has only 2 active signatures (SBS1 and SBS5) with positively correlated exposures.

This function was used to generate the synthetic mutational spectra in the paper "Performance of Mutational Signature Software on Correlated Signatures".
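
To make "positively correlated exposures" concrete, here is a small base R sketch that draws SBS1 counts on the log10 scale and derives SBS5 counts from them with added noise. It is only an illustration: the means, standard deviations, and the additive-noise construction are assumptions, not the parameters or the method used by CreateSBS1SBS5CorrelatedSyntheticData.

```r
set.seed(3)
num.samples <- 500

# log10 mutation counts for SBS1 (assumed mean and sd).
log.sbs1 <- rnorm(num.samples, mean = 2.5, sd = 0.3)

# log10 counts for SBS5: the SBS1 value plus independent noise, so the two
# are positively correlated. Less noise gives a stronger correlation.
log.sbs5 <- log.sbs1 + rnorm(num.samples, mean = 0.3, sd = 0.2)

# Exposure matrix (signatures x samples) with whole mutation counts.
exposures <- rbind(SBS1 = round(10^log.sbs1),
                   SBS5 = round(10^log.sbs5))

# Pearson correlation of the log10 exposures.
r <- cor(log.sbs1, log.sbs5)
```

Varying the noise standard deviation and the offset between the two signatures is one simple way to obtain data sets with different correlation strengths and different SBS1:SBS5 ratios.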

Functions for generating synthetic tumor spectra used in the paper "The repertoire of mutational signatures in human cancer"

The paper "The repertoire of mutational signatures in human cancer" (https://doi.org/10.1038/s41586-020-1943-3) evaluates the performance of two computational approaches (SigProfiler and SignatureAnalyzer) on 11 synthetic data sets (Synapse ID: syn18497223).

  1. Function PancAdenoCA1000 creates a data set of 1000 synthetic pancreatic adenocarcinoma spectra (syn18500212).

  2. Script

    creates 2,700 synthetic spectra (syn18500213). This data set consists of 9 cancer types each with 300 synthetic tumors:

    • bladder transitional cell carcinoma,

    • oesophageal adenocarcinoma,

    • breast adenocarcinoma,

    • lung squamous cell carcinoma,

    • renal cell carcinoma,

    • ovarian adenocarcinoma,

    • osteosarcoma,

    • cervical adenocarcinoma and

    • stomach adenocarcinoma.

  3. Function RCCOvary1000 creates a data set consisting of 500 synthetic kidney (RCC) spectra with high prevalence of and mutation load from SBS5 and SBS40, and 500 synthetic ovarian adenocarcinoma spectra with high prevalence of and mutation load from SBS3.

    Notes:

    • Mutations from other mutational signatures (besides SBS3, SBS5, and SBS40) are also present in the data set created by function RCCOvary1000;

    • SBS3, SBS5, and SBS40 are flat signatures. This data set tests whether the computational approaches can accurately separate these 3 mutational signatures, because a mixture of SBS5 and SBS40 can resemble SBS3.

  4. Function Create.3.5.40.Abstract creates 1000 synthetic spectra all constructed entirely from SBS3, SBS5, and SBS40, using mutational loads modelled on kidney-RCC (SBS5 and SBS40) and ovarian adenocarcinoma (SBS3). Most synthetic spectra have contributions from all three signatures.
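
The difficulty with flat signatures noted under item 3 can be illustrated with cosine similarity, a common measure of how alike two signature profiles are. The sketch below uses made-up profiles, not the real SBS3, SBS5, or SBS40: MakeFlatProfile is a hypothetical helper that returns a nearly uniform 96-channel profile, and the peaked profile is an invented contrast case.

```r
set.seed(4)

# Cosine similarity between two profiles (1 means identical direction).
CosineSim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical helper: a nearly uniform ("flat") profile that sums to 1.
MakeFlatProfile <- function(num.types = 96) {
  x <- 1 + runif(num.types, min = -0.5, max = 0.5)
  x / sum(x)
}

flat.a <- MakeFlatProfile()
flat.b <- MakeFlatProfile()
flat.c <- MakeFlatProfile()

# A 50:50 mixture of two flat profiles is itself flat.
mixture <- 0.5 * flat.a + 0.5 * flat.b

# Invented contrast case: most of the mass sits in 4 of the 96 channels.
peaked <- c(rep(0.24, 4), rep(0.04 / 92, 92))

sim.flat   <- CosineSim(mixture, flat.c)  # high: hard to tell apart
sim.peaked <- CosineSim(peaked, flat.c)   # low: easy to tell apart
```

In this toy setting the mixture of two flat profiles is much closer to a third, independent flat profile than the peaked profile is, which is the kind of ambiguity these data sets are designed to probe.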


steverozen/SynSigGen documentation built on April 1, 2022, 8:54 p.m.