In kgori/sigfit: Flexible Bayesian inference of mutational signatures

knitr::opts_chunk$set(echo = TRUE)
suppressPackageStartupMessages(library(sigfit))
par(mar = c(6, 4, 6, 4))

sigfit is used to estimate signatures of mutational processes and their degree of activity in a collection of cancer (or normal) samples. Starting from a set of single-nucleotide variants (SNVs), it allows both estimation of the exposure of samples to predefined mutational signatures (including whether the signatures are present at all), and identification of signatures de novo from the mutation counts. These two procedures are often called signature fitting and signature extraction, respectively. In addition, sigfit implements a novel methodology that combines signature fitting and extraction in a single inferential process. Moreover, the signature analysis methods in sigfit can be seamlessly applied to mutational profiles beyond SNV data, including insertion/deletion (indel) or rearrangement count data, or even to other types of biological data, such as DNA methylation profiles. The package also provides a range of functions to generate publication-quality graphics of the corresponding mutational catalogues, signatures and exposures.

Vignette contents

Installation

Example 1: Fitting mutational signatures to a single simulated sample

Fitting mutational signatures
Retrieving signature exposures
Visualisation

Example 2: Extracting signatures from multiple breast cancer samples

Building mutational catalogues
Extracting mutational signatures de novo
Retrieving, plotting and re-fitting extracted signatures

Other functionalities

Alternative signature models
Alternative mutation type definitions
Using mutational opportunities and signature/exposure priors
Converting between genome- and exome-relative representations of signatures
Using Fit-Ext models to discover rare or novel signatures
Signature extraction using multiple chains
Using the Dirichlet process prior to determine the number of signatures
List of available data sets

Installation

sigfit is an R package. As it is in early development it is not yet on CRAN, but can be installed from GitHub using the devtools package.

devtools::install_github("kgori/sigfit", build_opts = c("--no-resave-data", "--no-manual"))

Solutions to some of the problems that may arise during installation are listed in the README file for the package.

Example 1: Fitting mutational signatures to a single simulated sample

This example will use the mutational signatures from COSMIC to generate simulated mutation counts, and then use sigfit to fit signatures to these simulated data.

First of all, we need some mutational signatures to fit to our data. The line below loads the 30 mutational signatures published in COSMIC (v2).

library(sigfit)
data("cosmic_signatures_v2")

Let's use these signatures to simulate some mutation data. This code will generate 20,000 mutations from a 4:3:2:1 mixture of signatures 1, 3, 7 and 11.

set.seed(1)
probs <- c(0.4, 0.3, 0.2, 0.1) %*% as.matrix(cosmic_signatures_v2[c(1, 3, 7, 11), ])
mutations <- matrix(rmultinom(1, 20000, probs), nrow = 1)
colnames(mutations) <- colnames(cosmic_signatures_v2)

Here is what the mutational spectrum of our simulated counts looks like.

par(mar = c(4.5,5.5,6.5,1))
plot_spectrum(mutations, name = "Simulated counts")

Fitting mutational signatures

Next, we can use the fit_signatures function in sigfit to fit the COSMIC signatures, i.e. to estimate the exposure to each signature (pretending we ignore that it was generated from signatures 1, 3, 7 and 11). This function uses the Stan framework to produces Markov chain Monte Carlo (MCMC) samples from a Bayesian model of signatures.

Arguments to the rstan::sampling function, such as iter, warmup, chains, seed, etc., can be passed through the fit_signatures function.

The arguments warmup and iter specify the number of ‘warm-up’ (or ‘burn-in’) iterations and the total number of sampling iterations (including warm-up iterations), respectively. We recommend that the value of warmup be between one-third and half the value of iter. The behaviour of the MCMC sampler (which ultimately affects the quality of the analysis) depends on parameters set during the warm-up, so it is important to run enough warm-up iterations. By default, rstan::sampling uses iter = 2000 and warmup = floor(iter/2); going below these values is generally not advised.

In general, one should run as many MCMC iterations (iter) as one’s computer and patience allow, with time and memory being the major constraints.

The chains argument can be used to parallelise the sampling process by using multiple Markov chains, the number of which should never exceed the number of processors (the rstan::sampling function uses chains = 4 by default). To run multiple chains in parallel, the number of processors first needs to be set using options(mc.cores = parallel::detectCores()).
The seed argument can be used to make the MCMC samples reproducible over different runs.
For further sampling options, type ?rstan::sampling to read the documentation.

The fit_signatures function can be used to fit the COSMIC signatures to the simulated counts as follows.

mcmc_samples_fit <- fit_signatures(counts = mutations, 
                                   signatures = cosmic_signatures_v2,
                                   iter = 2000, 
                                   warmup = 1000, 
                                   chains = 1, 
                                   seed = 1756)

For details about further arguments of fit_signatures, including alternative signature models, mutational opportunities and exposure priors, see the Other functionalities section at the end of this vignette, or type ?fit_signatures to read the documentation.

Retrieving signature exposures

Once we have the result of the MCMC sampling in mcmc_samples_fit, we can retrieve the estimated exposures from it using the retrieve_pars function. This returns a named list with three matrices, one containing the mean exposures, and the others containing the values corresponding to the lower and upper limits of the highest posterior density (HPD) interval for each signature exposure in each sample (HPD intervals are the Bayesian alternative to classical confidence intervals). The prob argument can be used to indicate the target probability content of the HPD interval; by default, 95% HPD intervals are returned.

Since we are fitting known signatures instead of extracting them de novo, the exposures will be automatically labelled with the signature names. If the signatures have no names, or if they have been extracted instead of fitted, they will be labelled as ‘Signature A’, ‘Signature B’, etc.

exposures <- retrieve_pars(mcmc_samples_fit, 
                           par = "exposures", 
                           hpd_prob = 0.90)
names(exposures)
exposures$mean

In this case, because there is only one sample, the exposures are returned as a numeric vector. For cases with multiple samples, exposures$mean is a matrix with one row per sample and one column per signature.

The entire posterior distribution of the signature exposures and other model parameters in the mcmc_samples_fit object can be further explored by means of the functions provided by the rstan package. In addition, ShinyStan can be easily used in R for visual exploration of the MCMC samples.

Visualisation

sigfit provides several easy-to-use plotting functions. As seen in the previous section, the plot_spectrum function allows visualisation of both mutational catalogues and mutational signatures. These plots can be produced even for catalogues and signatures with arbitrary mutation types (e.g. indel or rearrangement signatures; see the Other functionalities section below).

Note that the plotting functions in sigfit are designed with a preference for plotting directly to a PDF file.

The path to an output PDF file can be specified using the pdf_path argument (however, we will avoid this option here, so that the plots are shown in the vignette). When plotting to a PDF, each function will automatically define appropriate graphical parameters, such as plot sizes and margins. However, some parameters can be customised via additional arguments; for instance, custom plot sizes can be specified using the pdf_width and pdf_height arguments. For more details, see the documentation for each function (e.g. ?plot_spectrum).

The plot_exposures function produces barplots of the estimated signature exposures across the sample set; because there is only one sample in this case, a single barplot will be produced. The function needs to be supplied with either the object resulting from MCMC sampling (mcmc_samples argument) or the exposures themselves (exposures argument), the latter being either a matrix, or a list like the one returned by the retrieve_pars function (see above). In the present case, since we have the stanfit object generated by fit_signatures, we will make use of the mcmc_samples argument.

par(mar=c(8,5,3.5,0))
plot_exposures(mcmc_samples = mcmc_samples_fit)

The bars in this plot are coloured blue if the estimated exposure value is ‘sufficiently non-zero’, and grey otherwise. It is difficult for the model to make hard assignments of which signatures are present or absent due to the non-negative constraint on the estimate, which means that the range of values in the sample will not normally include zero. In practice, ‘sufficiently non-zero’ means that the lower end of the Bayesian HPD interval (see the previous section) is above a threshold value close to zero (by default 0.01, and adjustable via the thresh argument). In this example, sigfit has identified the 4 signatures used to construct the sample.

Next, we would recommend re-running fit_signatures, this time to fit only those signatures (i.e. those rows of the signatures matrix) which have been highlighted as ‘sufficiently non-zero’ in the plot above, in order to obtain more-accurate estimates of exposures. This is done as follows.

selected_signatures <- c(1, 3, 7, 11)
mcmc_samples_fit_2 <- fit_signatures(mutations, 
                                     cosmic_signatures_v2[selected_signatures, ],
                                     iter = 2000, 
                                     warmup = 1000, 
                                     chains = 1, 
                                     seed = 1756)

We can also examine how effectively the estimated signatures and/or exposures can reconstruct the original mutational catalogues, by producing spectrum reconstruction plots with the plot_reconstruction function.

par(mar=c(5,6,6.5,1))
plot_reconstruction(mcmc_samples = mcmc_samples_fit)

Finally, the plot_all function provides a simpler way of simultaneously calling the plot_spectrum, plot_exposures and plot_reconstructions functions for a single set of results. This function plots exclusively to PDF files, and the out_path argument is used to indicate the path of the directory where the files should be produced (if the directory does not exist, it will be automatically created). The prefix argument applies to the output file names, and can be used to distinguish different sets of plots from each other. An example is shown below.

plot_all(mcmc_samples = mcmc_samples_fit, 
         out_path = "your/output/dir/here",
         prefix = "Fitting")

Example 2: Extracting signatures from multiple breast cancer samples

Building mutational catalogues

In this second example, we will use single-nucleotide variant (SNV) data from the set of 21 breast cancer samples originally analysed by Nik-Zainal et al. (2012). These data can be accessed using the data function as follows.

library(sigfit)
data("variants_21breast")
head(variants_21breast)

This variant table illustrates the structure of the SNV data that can be used as input for the package (unless you already have mutational catalogues for your samples). It is a matrix or data frame with one row per variant, and four (or five) columns:

Sample ID (character, e.g. "Sample 1").
Reference base (character: "A", "C", "G", or "T").
Mutated base (character: "A", "C", "G", or "T").
Trinucleotide context of the variant (character; reference sequence between the positions immediately before (-1) and after (+1) the variant, e.g. "TCA"). This can be obtained from the reference genome that was used to call the variants, using an R package like BSgenome or Biostrings; however, sequence context information is sometimes provided by variant callers within the ‘INFO’ field of the VCF file.
Optionally: for variants in protein-coding regions, if information is available about the transcriptional strand where each variant occurred, this can be incorporated as a fifth column taking character values "1" / "U" (for the untranscribed strand) or "-1" / "T" (for the transcribed strand). If the table is a data frame, integer values 1 and -1 are also admitted. If this fifth column is present in the table, all the estimation and plotting functions will automatically incorporate transcriptional strand information. The interpretation of 1 and -1 as the untranscribed and transcribed strands, respectively, allows direct use of the ‘Strand’ field that is normally provided by variant annotation tools, such as the VEP (Annotation tools use ‘Strand’ values of 1 and -1 to indicate that the corresponding gene is located on the forward or reverse strand, respectively. Since variants are always called on the forward strand, this value informs about the transcriptional strand of variants.)

Importantly, since a variant can only have a single sample ID, variants which are found in more than one sample need to be included multiple times in the table, using different sample IDs. The order in which the samples are found in this table is the order in which they will be displayed thereafter. In this case, the samples are already sorted alphabetically:

unique(variants_21breast[, 1])

The first step is to transform these variants into mutational catalogues, which is done by the build_catalogues function. (You can skip this step if you already have mutational catalogues for your samples.)

counts_21breast <- build_catalogues(variants_21breast)
dim(counts_21breast)
counts_21breast[1:5, 1:8]

The mutational catalogues are stored as a matrix of mutation counts, where each row refers to a sample and each column corresponds to a trinucleotide mutation type. (This particular set of 21 catalogues is also available via data("counts_21breast").)

Note that the build_catalogues function only admits mutational catalogues defined over the typical 96 trinucleotide mutation types (or 192 types if using transcriptional strand information). Mutational catalogues defined over other sets of mutation types need to be generated by the user.

We can plot the spectrum of each mutational catalogue using plot_spectrum, as in the previous example. For tables containing multiple catalogues, this function will produce one plot per catalogue, making the use of an output PDF file (pdf_path argument) more convenient. Here, however, we will plot all the catalogues on a grid.

par(mar = c(5,6,7,2))
par(mfrow = c(7, 3))
plot_spectrum(counts_21breast)

Extracting mutational signatures de novo

To extract signatures from this set of catalogues, we will use the extract_signatures function, specifying the number of signatures to extract via the nsignatures argument; this can be a single integer value or a range of integers (e.g. 3:6). The recommended approach is to first run the function for a small number of iterations and a reasonably wide range of numbers of signatures (e.g. nsignatures = 2:8). When nsignatures is a range of values, sigfit will automatically determine the most plausible number of signatures present in the data (which is done by assessing goodness-of-fit through the calculate_gof and plot_gof functions).

Note that, for the correct determination of the number of signatures, nsignatures should include at least four values, and should extend beyond the ‘reasonable’ range for the expected number of signatures (e.g. if the number of signatures is expected to be between 3 and 8, nsignatures could take values 2:10). Also, note that for ranges of nsignatures the output will be a list, in which element [[N]] corresponds to the extraction results for nsignatures = N.

The number of Markov chains in extract_signatures has a default value of chains = 1. We do not recommend the use of multiple chains for signature extraction, as this can result in a problem known as ‘label switching’ which can invalidate the inferences.

With the exception of chains, the extract_signatures function recognises the same MCMC sampling options as fit_signatures (iter, warmup, seed, etc.; see Example 1 above).

Extraction of signatures de novo from the 21 mutational catalogues is done for a range of 2–7 signatures as follows.

mcmc_samples_extr <- extract_signatures(counts = counts_21breast,
                                        nsignatures = 2:7,
                                        iter = 1000, 
                                        seed = 1756)
plot_gof(mcmc_samples_extr)

## Plot precalculated GOF in order to avoid running the model
data("sigfit_vignette_data", package = "sigfit")
plot(nS, gof, type = "o", lty = 3, pch = 16, col = "dodgerblue4",
     main = paste0("Goodness-of-fit (", stat, ")\nmodel: multinomial"),
     xlab = "Number of signatures", 
     ylab = paste0("Goodness-of-fit (", stat, ")"))
points(nS[best], gof[best], pch = 16, col = "orangered", cex = 1.1)
cat("Estimated best number of signatures:", nS[best], "\n")

The plot above shows that the most plausible number of signatures is four, based on the evolution of the goodness-of-fit (which by default is reconstruction accuracy measured via cosine similarity).

Next, it is recommended to re-run the extract_signatures function, this time with nsignatures = 4 and a much larger number of iterations, to obtain more-accurate estimates. We will skip this step in the present example.

For details about further arguments of extract_signatures, including alternative signature models, mutational opportunities and signature/exposure priors, see the Other functionalities section at the end of this vignette, or type ?extract_signatures to read the documentation.

Retrieving, plotting and re-fitting extracted signatures

As in the case of signature fitting (see above), the extracted signatures and exposures can be retrieved using the retrieve_pars function with par = "signatures" or par = "exposures". However, because the output is a list containing the results for each value in nsignatures, we need to specify which number of signatures we are interested in, using the [[ ]] operator. This is not necessary if a single value of nsignatures was used for extraction.

names(mcmc_samples_extr)

print(c("nsignatures=1", "nsignatures=2", "nsignatures=3", "nsignatures=4", "nsignatures=5", "nsignatures=6", "nsignatures=7", "best"))
signatures <- extr_signatures

## Note: mcmc_samples_extr[[N]] contains the extraction results for N signatures
signatures <- retrieve_pars(mcmc_samples_extr[[4]],
                            par = "signatures")

rownames(signatures$mean)

Plotting can be done through the functions seen in Example 1, with the difference that, when plotting directly from MCMC results (via the mcmc_samples argument), we need to select the relevant element of the results list using mcmc_samples_extr[[N]], as seen above. This is not necessary if a single value of nsignatures was used for extraction.

Below we plot the signatures extracted from these 21 catalogues.

par(mar = c(6,7,6,1))
par(mfrow = c(2, 2))
plot_spectrum(signatures)

The extracted signatures seem to be a combination of COSMIC signatures 1, 2, 3, 5 and 13. The signatures published in COSMIC were obtained using a collection of hundreds of catalogues across many cancer types, which offered much higher statistical power than the 21 breast cancer catalogues employed here. Furthermore, the signatures obtained by sigfit show high similarity to those originally reported by Nik-Zainal et al. (2012) (Fig. 2A). Note that signatures C and D in Nik-Zainal et al., which are very similar, have been identified by sigfit as a single signature (Signature B in the plot above).

The extracted signatures can be matched against a set of known signatures, such as the COSMIC signatures, to identify the best match for each signature, using the match_signatures function. This returns, for each signature in the first set, the index of the best match in the second signature set.

data("cosmic_signatures_v2")
match_signatures(signatures, cosmic_signatures_v2)

In this case, the closest matches of the extracted signatures A, B, C and D are, respectively, COSMIC signatures 2, 3, 13 and 1.

As a last remark, once that more-accurate signatures have been extracted by re-running extract_signatures for a single value of nsignatures (as recommended above), it might be useful to re-fit these signatures back to the original catalogues, as this sometimes results in exposure estimates that are more accurate than the ones obtained by signature extraction. Assuming that the MCMC samples from the definitive signature extraction are stored in an object named mcmc_samples_extr_2, re-fitting is done as follows.

signatures <- retrieve_pars(mcmc_samples_extr_2, "signatures")
mcmc_samples_refit <- fit_signatures(counts = counts_21breast,
                                     signatures = signatures,
                                     iter = 2000,
                                     warmup = 1000)
exposures <- retrieve_pars(mcmc_samples_refit, "exposures")

Other functionalities

Alternative signature models

By default, the functions fit_signatures and extract_signatures make use of a multinomial model of signatures, which is statistically equivalent to the non-negative matrix factorisation (NMF) approach pioneered by Alexandrov et al. (2013). However, sigfit provides three additional signature models which can be selected using the model argument.

The Poisson model (model = "poisson") is equivalent to the probabilistic model introduced by Fischer et al. (2013), and its utility is comparable to that of the multinomial model.
The negative binomial model (model = "negbin") is a more noise-robust version of the Poisson model, which makes use of the overdispersed negative binomial distribution to model additional variation in the observed mutation counts. Although this model provides increased precision for fitting signatures to sparse or noisy mutational catalogues, its use is not recommended for cleaner catalogues or for signature extraction, due to its low efficiency.
The normal model (model = "normal") makes use of a normal distribution in order to model catalogues that are composed of continuous numeric values, instead of discrete count values. This allows the analysis of patterns in catalogues describing continuous variables, such as DNA methylation levels. This model is intended for cases where it is not possible to round the values in the catalogues without losing information (e.g. if the values are real numbers in [0, 1]). If the values are large enough that the information loss caused by rounding would be negligible, then one of the other models should be specified, in which case the catalogues will be automatically rounded (with a warning message).

For further details on these signature models, see the sigfit paper or type ?extract_signatures to read the documentation.

Alternative mutation type definitions

sigfit allows the analysis of mutational patterns beyond the typical trinucleotide-context single-base substitution (SBS) spectrum. Additional supported mutation types include SBS with transcriptional strand information (strand-wise spectrum), small insertions and deletions (indel spectrum), doublet base substitutions (DBS spectrum) and arbitrary mutation types defined by the user (generic spectrum).

As explained in Example 2 (see above), the inclusion of transcriptional strand information for each variant results in transcriptional strand-wise representations of mutational catalogues and signatures. These also have their own graphical representation via strand-wise mutational spectra, as shown in the example below.

par(mar = c(4.5,5.5,6.5,1))
# Load and plot a strand-wise catalogue
data("counts_88liver_strand")
plot_spectrum(counts_88liver_strand[1, ],
              name = rownames(counts_88liver_strand)[1])

sigfit can also process and plot mutational catalogues and signatures for indels and doublet base substitutions, defined over the same sets of mutation categories as the COSMIC ID and DBS signatures (v3.2). However, note that the function build_catalogues is not currently capable of producing mutational catalogues for these mutation types. Catalogues of indels and DBS can be produced using the SigProfilerMatrixGeneratorR package (for human, mouse, rat and dog genomes). Indel catalogues can also be produced via the indel.spectrum function in the Indelwald tool (for any reference genome).

par(mar = c(4.5,5.5,6.5,1))
# Load and plot an indel signature
data("cosmic_signatures_indel_v3.2")
plot_spectrum(cosmic_signatures_indel_v3.2[4, ],
              name = rownames(cosmic_signatures_indel_v3.2)[4])

par(mar = c(4.5,5.5,6.5,1))
# Load and plot a doublet signature
data("cosmic_signatures_doublet_v3.2")
plot_spectrum(cosmic_signatures_doublet_v3.2[9, ],
              name = rownames(cosmic_signatures_doublet_v3.2)[9])

Furthermore, mutational catalogues and signatures can be defined using any set of arbitrary mutation categories. In this case, a ‘generic’ mutational spectrum is produced by sigfit, as shown in the example below. The colours of the spectrum bars can be customised using the colors argument in plot_spectrum.

par(mar = c(6.5,5.5,6.5,1)); set.seed(0xC0FFEE)
# Create and plot an arbitrary catalogue
counts <- as.numeric(rmultinom(1, 5000, runif(100)))
plot_spectrum(counts, name = "Arbitrary catalogue")

Using mutational opportunities and signature/exposure priors

The signature models in sigfit are able to account for variation in mutational opportunities, which are the opportunities for each mutation type to occur in each sample. For catalogues composed of 96 trinucleotide mutation types, these opportunities are given by the frequency of each trinucleotide in the target sequence (usually a genome or exome). Mutational opportunities can be specified using the opportunities argument in the functions fit_signatures and extract_signatures. Opportunities can be provided as a numeric matrix with one row per sample and one column per mutation type. Alternatively, the mutational opportunities of the reference human genome or exome (for SNV catalogues with 96 or 192 mutation types) can be selected using the options opportunities = "human-genome" or opportunities = "human-exome".

Importantly, using mutational opportunities during signature extraction will result in signatures which are already normalised by the given opportunities (that is, their mutation probabilities are not relative to the composition of the target sequence). In contrast, signatures extracted without using opportunities are relative to the sequence composition; this is not a problem unless the signatures are to be subsequently applied to other types of sequences. (See below for details on how to convert signatures between different sets of opportunities.)

In addition, fit_signatures and extract_signatures admit the specification of priors for signatures and exposures via the arguments sig_prior and exp_prior. These priors allow the incorporation of a priori knowledge about these parameters; for instance, using certain signatures from COSMIC as our signature priors reflects our belief that the signatures found in the data should be very similar to these. Given their potential for introducing subjective biases into the analysis, the use of priors should be strictly limited to situations where there is both strong evidence and good reason to use them. Priors must be numeric matrices with the same dimensions as the signature and exposure matrices to be inferred: signature priors must have one row per signature and one column per mutation type, while exposure priors must have one row per sample and one column per signature.

Converting between genome- and exome-relative representations of signatures

In some analyses, signatures that have been inferred from whole-genome mutation data need to be fitted to mutation data from whole-exome sequencing samples. For instance, if the cohort of samples includes both whole genomes and exomes, it is normally advisable to extract signatures from the whole-genome samples (provided that there are enough of them) and re-fit the resulting signatures to the whole-exome samples. (An alternative approach would be to provide a matrix of sample-specific mutational opportunities that reflects the origin of each catalogue, as explained in the previous section.)

However, it is important to note that, unless appropriate mutational opportunities were used during signature extraction, the probability values of signatures that have been extracted from whole genomes (including the standard COSMIC signatures) are relative to the specific mutational opportunities of that genome, and therefore should not be directly fitted to mutation counts obtained from exomes or from other genomes (such as genomes from different species). Instead, the signatures first need to be converted so that they are relative to the new sequence.

The convert_signatures function can be used to convert a matrix of (human) genome-derived mutational signatures into their exome-relative representations as follows.

# (Following from the signature extraction example presented above)
# Retrieve genome-derived signatures
genome_signatures <- retrieve_pars(mcmc_samples_extr[[4]],
                                   par = "signatures")

# Apply exome mutational opportunities
exome_signatures <- convert_signatures(genome_signatures, 
                                       opportunities_from = "human-genome",
                                       opportunities_to = "human-exome")

par(mfrow = c(2, 1))
plot_spectrum(genome_signatures$mean[4, ],
              name = "Signature D, Genome-relative probabilities")
plot_spectrum(exome_signatures[4, ],
              name = "Signature D, Exome-relative probabilities")

exome_signatures <- convert_signatures(signatures, opportunities_from = "human-genome", opportunities_to = "human-exome")
par(mfrow = c(2, 1), mar = c(5,5.5,6.5,1))
plot_spectrum(signatures$mean[4,], name="Signature D, Genome-relative probabilities")
plot_spectrum(exome_signatures[4,], name="Signature D, Exome-relative probabilities")

Signatures can be similarly converted from exome-relative to genome-relative representations, or between any two given vectors of opportunities (with one element per mutation type). In addition, if the opportunities_to argument is omitted, the signatures will only be normalised by the original opportunities (given by opportunities_from), such that they will not be relative to any particular reference sequence. (Another way of obtaining sequence-independent signatures is to specify the mutational opportunities when running extract_signatures.)

Conversely, if the opportunities_from argument is omitted, the signatures will be interpreted as being already normalised, and the desired opportunities (given by opportunities_to) will be directly imposed on the signatures. For further details on signature conversion, type ?convert_signatures to read the documentation.

Using Fit-Ext models to discover rare or novel signatures

One novelty in sigfit is the introduction of Fit-Ext models, which are able to extract mutational signatures de novo while fitting a set of predefined signatures that are already known to be present in the data. Such models are useful for the discovery of rare signatures for which there is some qualitative evidence, but insufficient support as to enable deconvolution via conventional methods, or in cases where only signature fitting is possible, yet the data clearly display a mutational pattern which is not captured by any of the available signatures.

The Fit-Ext models can be accessed via the fit_extract_signatures function. This is used similarly to extract_signatures, with the exception that a matrix of known signatures to be fitted needs to be provided via the signatures argument (as in fit_signatures), and that the number of additional signatures to extract is specified using the num_extra_sigs argument. Unlike the nsignatures argument in extract_signatures, num_extra_sigs currently admits only single integer values (e.g. num_extra_sigs = 2), and not ranges.

For further details on the Fit-Ext models, see the sigfit paper or type ?fit_extract_signatures to read the documentation.

Signature extraction using multiple chains

NOTE: This is an experimental feature and may not work properly.

As explained above, the use of multiple Markov chains for signature extraction can give rise to a problem known as ‘label switching’ which can invalidate the inferences (see Example 2). In many cases, this problem can be mitigated by initialising the sampler with parameter values obtained from a preliminary sampling run. To do this, we can run the extract_signatures functions on a single chain for a short number of iterations, and then use the get_initializer_list function to pass initial parameter values obtained from this short run to a longer, multi-chain extraction run, via the init argument. An example is given below.

# Load mutation counts
data("counts_21breast")

# Set number of cores for multi-chain sampling
ncores <- parallel::detectCores()
options(mc.cores = ncores)

# Run a short single chain to find initial parameter values
ext_init <- extract_signatures(counts_21breast,
                               nsignatures = 4,
                               iter = 1000,
                               chains = 1)

# Obtain a list of initial parameter values
inits <- get_initializer_list(ext_init, chains = ncores)

# Run multiple chains in parallel, initialised at the values found in the short run
# (make sure to use the same number of chains as in `get_initializer_list`)
ext_multichain <- extract_signatures(counts_21breast,
                                     nsignatures = 4,
                                     iter = 5000,
                                     chains = ncores,
                                     init = inits)

Using the Dirichlet process prior to determine the number of signatures

NOTE: This is an experimental feature and may not work properly.

The Dirichlet process prior (DPP) allows the automatic determination of the number of signatures present in the data, without the need to perform signature extraction for different numbers of signatures. We recommend using the standard goodness-of-fit procedure for the determination of the number of signatures, as shown in Example 2 (above).

The DPP can be activated in the functions extract_signatures and fit_extract_signatures using dpp = TRUE. The number of signatures to extract should be larger than the maximum number of signatures expected (e.g. nsignatures = 20). Those signatures whose presence is not supported by the data should collapse into random noise, such that the number of signatures is given by the signatures which did not collapse. The dpp_conc argument allows to modify the value of the DPP concentration parameter.

List of available data sets

sigfit includes the following data sets, which can be loaded using the data function.

Three collections of COSMIC mutational signatures for single base substitutions: original set of 30 signatures (v2, 2015) and updated sets of 67 signatures (v3.0, 2019) and 78 signatures (v3.2, 2021). The v3.0 signature set is also available in transcriptional strand-wise version (192 mutation types). These data can be loaded using:

    data("cosmic_signatures_v3.2")
    data("cosmic_signatures_v3")
    data("cosmic_signatures_strand_v3")
    data("cosmic_signatures_v2")

Collections of COSMIC mutational signatures for indels and doublet base substitutions: sets of 18 ID signatures and 11 DBS signatures (v3.2, 2021). These data can be loaded using:

    data("cosmic_signatures_indel_v3.2")
    data("cosmic_signatures_doublet_v3.2")

Mutation table and mutational catalogues (96 mutation types) for the set of 21 breast cancer genomes used in Nik-Zainal et al. (2012). These data can be loaded using:

    data("variants_21breast")
    data("counts_21breast")

Mutation tables and mutational catalogues (96 and 192 mutation types) for the set of 119 breast cancer genomes used in Alexandrov et al. (2013). These data can be loaded using:

    data("variants_119breast")
    data("variants_119breast_strand")
    data("counts_119breast")
    data("counts_119breast_strand")

Mutation tables and mutational catalogues (96 and 192 mutation types) for the set of 88 liver cancer genomes used in Alexandrov et al. (2013). These data can be loaded using:

    data("variants_88liver")
    data("variants_88liver_strand")
    data("counts_88liver")
    data("counts_88liver_strand")

Catalogues composed of methylation beta values for 50 breast cancers and 27 normal solid tissue samples, obtained from the TCGA repository (project BRCA). These catalogues contain 99 'mutation types', corresponding to beta values for 99 CpG sites in the promoters of genes associated with breast cancer (see the sigfit paper for details). These data can be loaded using:

    data("methylation_50breast")
    data("methylation_27normal")

sigfit is an R package developed by the Transmissible Cancer Group in the University of Cambridge Department of Veterinary Medicine.

kgori/sigfit documentation built on Feb. 3, 2022, 12:04 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kgori/sigfit
Flexible Bayesian inference of mutational signatures

In kgori/sigfit: Flexible Bayesian inference of mutational signatures

Installation

Example 1: Fitting mutational signatures to a single simulated sample

Fitting mutational signatures

Retrieving signature exposures

Visualisation

Example 2: Extracting signatures from multiple breast cancer samples

Building mutational catalogues

Extracting mutational signatures de novo

Retrieving, plotting and re-fitting extracted signatures

Other functionalities

Alternative signature models

Alternative mutation type definitions

Using mutational opportunities and signature/exposure priors

Converting between genome- and exome-relative representations of signatures

Using Fit-Ext models to discover rare or novel signatures

Signature extraction using multiple chains

Using the Dirichlet process prior to determine the number of signatures

List of available data sets

R Package Documentation

Browse R Packages

We want your feedback!

kgori/sigfit Flexible Bayesian inference of mutational signatures

In kgori/sigfit: Flexible Bayesian inference of mutational signatures

Installation

Example 1: Fitting mutational signatures to a single simulated sample

Fitting mutational signatures

Retrieving signature exposures

Visualisation

Example 2: Extracting signatures from multiple breast cancer samples

Building mutational catalogues

Extracting mutational signatures de novo

Retrieving, plotting and re-fitting extracted signatures

Other functionalities

Alternative signature models

Alternative mutation type definitions

Using mutational opportunities and signature/exposure priors

Converting between genome- and exome-relative representations of signatures

Using Fit-Ext models to discover rare or novel signatures

Signature extraction using multiple chains

Using the Dirichlet process prior to determine the number of signatures

List of available data sets

R Package Documentation

Browse R Packages

We want your feedback!

kgori/sigfit
Flexible Bayesian inference of mutational signatures