knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

Mutational signatures

The package tempoSig implements mutational signature inference functions and utilities based primarily on maximum likelihood algorithms. Input data are mutational catalogs in the form of matrices, which tabulate the counts of mutations found in multiple samples, grouped into pre-defined classes of mutation types. The most frequently used example is the catalog of single base substitutions (SBSs) grouped by the trinucleotide contexts in which they occur. Mutational signatures are recurrent patterns of these mutation classes found across tumor samples [@nikzainal2012; @alexandrov2013; @alexandrov2020]. About 30 (COSMIC v2) to roughly 60 (COSMIC v3) recurrent signatures have been established from studies of whole-genome and exome sequences of tumor samples [@alexandrov2020]. The discovery of these signatures is based on mathematical decomposition of the mutational catalog of a large sample set, represented by a matrix $\sf{X}$, into two matrices, $$ \sf{X} \sim \sf{W}\sf{H}, $$ where $\sf{W}$ and $\sf{H}$ are the signature and exposure matrices, containing, respectively, the repertoire of signatures associated with mutational processes operative in the samples and the numbers of mutations contributed by each signature to individual sample genomes. The symbol $\sim$ indicates that the actual data are to be regarded as (Poisson) realizations of the mean given by the product on the right.
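As a toy illustration of the dimensions involved (hypothetical values, not tempoSig code): for $c$ mutation types, $K$ signatures, and $n$ samples, $\sf{X}$ is $c\times n$, $\sf{W}$ is $c\times K$, and $\sf{H}$ is $K\times n$.

```{r eval=FALSE}
# Toy illustration of X ~ W H (hypothetical values, not part of tempoSig).
set.seed(1)
nc <- 96; K <- 3; n <- 10
W <- matrix(runif(nc * K), nc, K)
W <- sweep(W, 2, colSums(W), '/')               # each signature column sums to 1
H <- matrix(rpois(K * n, lambda = 50), K, n)    # mutations contributed per signature
X <- matrix(rpois(nc * n, lambda = W %*% H), nc, n)  # Poisson realization of the mean
dim(X)   # 96 x 10 catalog matrix
```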

Two classes of analysis tasks encountered in genetic studies are de novo discovery of signatures and refitting: in de novo discovery, both signature and exposure matrices are unknown, and a large set of tumor genomes or exomes is used as input (the catalog matrix $\sf{X}$ as a whole) to discover signatures. The primary algorithm used in de novo discovery of signatures is non-negative matrix factorization [@alexandrov2013]. In refitting, in contrast, existing signature lists are used (the signature matrix $\sf{W}$ given as input) along with input catalogs to estimate the exposure matrix $\sf{H}$, whose columns give the relative proportions of signatures present in each sample [@rosenthal2016].

Maximum likelihood

The refitting functions in tempoSig are based on a maximum likelihood inference algorithm adapted from mutation-signatures. In refitting, each column of the catalog $\sf{X}$ and exposure $\sf{H}$ matrices, corresponding to a sample to be analyzed, can be treated independently since $\sf{W}$ is assumed known. For a given sample genome, the mutational catalog data consist of the observed counts $m_i$ (a single column of matrix $\sf{X}$), where $i=1,\cdots,c$ indexes the mutation type categories under consideration ($c=96$ for SBS data classified into trinucleotide contexts). The inference task then reduces to the problem of estimating the multinomial probabilities $p_i$ of observing each category $i$ out of the 96, based on the actual count data $m_i$. The log likelihood per mutation, i.e., the logarithm of the probability of observing the data given parameters $\{p_i\}$ divided by the number of mutations, is $$ L = \frac{1}{M}\sum_{i=1}^c m_i \ln p_i = \sum_{i<c}{\hat m}_i \ln p_i + {\hat m}_c \ln \left(1-\sum_{i<c} p_i\right), $$ where $M=\sum_i m_i$ and ${\hat m}_i=m_i/M$ is the proportion of type $i$ in the data; in the second expression, the constraint $\sum_i p_i=1$ was used to eliminate $p_c$, making the independent variable subset $\{p_i,\; i=1,\cdots,c-1\}$ explicit.
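As a minimal sketch (not a tempoSig function; names are illustrative), the per-mutation log likelihood can be computed directly from the counts and probabilities:

```{r eval=FALSE}
# Sketch: per-mutation log likelihood L = (1/M) * sum_i m_i * ln(p_i).
loglik <- function(m, p) {
  M <- sum(m)
  sum(m[m > 0] * log(p[m > 0])) / M   # drop zero counts to avoid 0 * log(0) = NaN
}
```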

The multinomial probabilities are given by linear combinations of signature proportions via $$ p_i = \sum_{k=1}^K W_{ik} h_k = \frac{\sum_k W_{ik} x_k^2}{\sum_k x_k^2}, $$ where the summation is over all $K$ signatures, represented by the columns of matrix $\sf{W}$, and $h_k$ is the $k$-th component of the column of $\sf{H}$, giving the proportion of the $k$-th signature. We assume $\sf{W}$ is normalized column-wise, such that $\sum_i W_{ik}=1$. In the second expression, we parametrized $h_k$ via $h_k = x_k^2/\sum_{l} x_{l}^2$ so that the non-negativity constraint $h_k\ge 0$ is automatically satisfied for any $x_k$.

The two equations above define the log likelihood as a function of unconstrained parameters: $$ L = L(x_1,\cdots,x_{K}), $$ whose maximization problem can be solved by any multi-dimensional optimization algorithm. In tempoSig, the analytic form of $L$ is exploited by providing its first derivatives as input to the quasi-Newton BFGS2 (Broyden-Fletcher-Goldfarb-Shanno) routine gsl_multimin_fdfminimizer_vector_bfgs2 in GNU Scientific Library (GSL).
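A minimal R sketch of this refitting step for a single sample is shown below, using stats::optim with the plain BFGS method in place of the GSL BFGS2 routine that tempoSig calls internally; here m is one column of the catalog and W a column-normalized signature matrix with rows in the same mutation-type order (all names are illustrative).

```{r eval=FALSE}
# Sketch of refitting one sample by maximum likelihood (illustrative only;
# tempoSig itself uses the GSL BFGS2 minimizer in compiled code).
refit_one <- function(m, W) {
  negL <- function(x) {
    h <- x^2 / sum(x^2)                   # non-negative exposures summing to 1
    p <- as.vector(W %*% h)               # multinomial probabilities
    -sum(m[m > 0] * log(p[m > 0])) / sum(m)
  }
  fit <- optim(rep(1, ncol(W)), negL, method = 'BFGS')
  h <- fit$par^2 / sum(fit$par^2)
  setNames(h, colnames(W))
}
```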

Significance test

For real mutational data with limited sizes (mutational load $M \sim 100$ or less), it is important to gauge the statistical significance of exposure outcomes ($h_k$ for $k=1,\cdots,K$). In tempoSig, $p$-values for each exposure can be estimated with one of three algorithms:

  1. Permutation test of signatures (permutation)
  2. Likelihood ratio test (lrt)
  3. Permutation sampling of count data (x.permutation)

With signature permutation (the default), the 96 elements of the reference signature vector $W_{ik}$ for signature $k$ are randomly permuted $N_p$ times. Inference is repeated for each replicate to obtain exposure samples under the null hypothesis, $\{h_{k}^{(j)}\}$, where $j=1,\cdots, N_p$ indexes the permutation replicates. The $p$-value estimate for the $k$-th signature exposure is $$ P_{k} = \frac{1}{N_p} \sum_j I\left(h_{k}^{(j)} \ge h_{k}\right), $$ where $I(x)$ is the indicator function; i.e., the proportion of replicates in which the null exposure is at least as large as the observed value. Note that the resulting $p$-values are limited in range, with an apparent $P_k=0$ indicating that $P_k < 1/N_p$.
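Schematically (hypothetical variable names), the empirical $p$-value is just the fraction of null exposures at least as large as the observed one:

```{r eval=FALSE}
# Sketch: empirical p-value for signature k. h_obs is the observed exposure;
# h_null is a vector of N_p null exposures obtained by refitting with the
# k-th signature profile randomly permuted.
P_k <- mean(h_null >= h_obs)   # a value of 0 means P_k < 1/N_p
```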

In the likelihood ratio test, the deviance $$ D_k = 2M\left[ L - L(x_k=0)\right], $$ where $L(x_k=0)$ is the log likelihood of the restricted model in which the exposure of signature $k$ is zero, is computed. This deviance is distributed as $\chi^2$ with one degree of freedom in the asymptotic limit $M\rightarrow\infty$ [@degroot_schervish]. Empirical tests suggest that this asymptotic limit is reached when $M\sim 10^3$ or larger, and the computed $p$-value will generally be an over-estimate otherwise. We therefore advise caution in using the likelihood ratio test for smaller mutation loads. However, this test can give a quick upper-bound estimate of $p$-values, especially for cases where $P < 10^{-4}$ and the permutation test only yields zero $p$-values.
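In code (hypothetical variable names), the deviance and its asymptotic $p$-value would be:

```{r eval=FALSE}
# Sketch: likelihood ratio test p-value for signature k. L_full and L_restricted
# are the per-mutation log likelihoods with and without signature k; M is the
# total mutation count of the sample.
D_k <- 2 * M * (L_full - L_restricted)
P_k <- pchisq(D_k, df = 1, lower.tail = FALSE)   # asymptotic; requires large M
```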

Finally, in count permutation, the catalog data $m_i$ for $i=1,\cdots,c$ are randomly permuted while the signatures are kept fixed. This scheme is applied once to all signatures and requires less computation than signature permutation. We found that it generally yields results similar to signature permutation, but it does not adequately deal with "unspecific" signatures such as Signature 3 (homologous recombination deficiency).
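A one-line sketch of a single count-permutation replicate (m denotes one catalog column; illustrative only):

```{r eval=FALSE}
# Sketch: one count-permutation replicate; the observed counts are shuffled
# across the 96 mutation types while the signature matrix is kept fixed.
m_null <- sample(m)
```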

De novo inference

In de novo inference, both the signature matrix $\sf{W}$ and the exposure matrix $\sf{H}$ are inferred simultaneously in an unsupervised manner [@alexandrov2013]. tempoSig includes functions based on a Bayesian formulation of the non-negative matrix factorization algorithm [@lee_seung], in which the marginal likelihood of models with varying numbers of signatures is estimated by a variational algorithm [@bishop; @cemgil]. This feature enables the most likely number of signatures present in the samples to be determined via evaluation of the statistical support provided by the data.

Software usage

Installation

The main non-R dependency for tempoSig is GSL. It can be installed in Ubuntu Linux by

```{bash eval=FALSE}
$ sudo apt-get install libgsl-dev
```

or in CentOS,

```{bash eval=FALSE}
$ yum install gsl-devel
```

To install tempoSig along with its dependencies,

devtools::install_github('mskcc/tempoSig')

Data structure

Load the package via

library(tempoSig)

The main data structure used by tempoSig is of S4 class tempoSig with slots:

slotNames('tempoSig')

Except for tmb and logLik, which are vectors, all slots are matrices. We denote by $n$ the number of samples (genomes to be analyzed; the total number of columns in matrix $\sf{X}$).

The primary input data takes the form of catalog matrix $\sf{X}$. An example is provided by the file

catalog <- system.file('extdata', 'tcga-brca_catalog.txt', package = 'tempoSig')
x <- read.table(catalog)
dim(x)
x[1:5, 1:2]
rownames(x)

This file was generated by taking mutation count data of 10 tumor samples from the mutation annotation format (MAF) file of the TCGA BRCA project. The row names show the 96 trinucleotide mutation types, with the ref > alt substitution inside brackets (ref is either C or T; otherwise the complementary base is taken) flanked by its two adjacent nucleotides.
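To make the naming convention concrete, the 96 type labels can be enumerated as follows (a sketch; the ordering may differ from that of the catalog file):

```{r eval=FALSE}
# Sketch: enumerate the 96 trinucleotide mutation types in the "A[C>A]A" style.
bases <- c('A', 'C', 'G', 'T')
subs  <- c('C>A', 'C>G', 'C>T', 'T>A', 'T>C', 'T>G')
types <- unlist(lapply(subs, function(s)
  paste0(rep(bases, each = 4), '[', s, ']', bases)))
length(types)   # 96
```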

An object of class tempoSig is created with the catalog data frame as the main argument:

s <- tempoSig(data = x)
s

Object initialization can also include a specification of reference signatures via the optional second argument in tempoSig(data, signat = NULL); by default, they are read from the file

sbs <- read.table(system.file('extdata', 'cosmic_snv_signatures_v2.txt',
                              package = 'tempoSig'))
dim(sbs)
sbs[1:5, 1:5]
colnames(sbs)

This reference signature list is that of COSMIC version 2. Other files containing alternative reference signature sets are also included; they can be used as follows:

sig2 <- read.table(system.file('extdata', 'cosmic_SigAnalyzer_SBS_signatures.txt',
                               package = 'tempoSig'))
dim(sig2)
sig2[1:5, 1:5]
s2 <- tempoSig(data = x, signat = sig2)
s2

The signat argument can take the file name directly. For convenience, a character short-hand can also be used:

s_v2 <- tempoSig(data = x, signat = 'v2')  # COSMIC v2 (default)
s_v3 <- tempoSig(data = x, signat = 'SA')  # COSMIC v3 SigAnalyzer
s_v4 <- tempoSig(data = x, signat = 'SP')  # COSMIC v3 sigProfiler
s_v3

We found that v2 generally achieves higher power than others under limited mutation loads.

The slots of a tempoSig object can be accessed via accessor functions:

tmb(s)
catalog(s)[1:2,1:2]
expos(s)
pvalue(s)

Note that slots expos and pvalue are empty since inference has not been performed yet.

Catalog matrix generation

A common output of sequencing experiments is the MAF file, which collects annotated variant data from multiple samples. The package maftools can be used to derive catalog matrices from MAF files:

if(!require('BiocManager')) install.packages('BiocManager')
if(!require('maftools')) BiocManager::install('maftools')
if(!require('BSgenome.Hsapiens.UCSC.hg19'))
  BiocManager::install('BSgenome.Hsapiens.UCSC.hg19')
laml.maf <- system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml <- read.maf(maf = laml.maf)
laml.tnm <- maftools::trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE,
                                          ref_genome = 'BSgenome.Hsapiens.UCSC.hg19')
catalog <- t(laml.tnm$nmf_matrix)

Refitting

Refitting using maximum likelihood inference is performed by

s <- extractSig(s, progress.bar = TRUE)

The option progress.bar = TRUE displays a progress bar, useful for large samples. The inference fills the expos slot of the object:

expos(s)[1:5,1:2]
rowSums(expos(s))

Note that each row gives proportions of reference signatures, which add up to 1.
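If absolute mutation counts per signature are desired, the proportions can be scaled by the per-sample mutation load; a sketch, assuming tmb(s) holds total mutation counts aligned with the rows of expos(s):

```{r eval=FALSE}
# Sketch: convert exposure proportions into approximate mutation counts,
# assuming tmb(s) is the per-sample total mutation count in the same row order;
# each row of expos(s) is scaled by the corresponding element of tmb(s).
counts <- expos(s) * tmb(s)
rowSums(counts)   # should reproduce the per-sample mutation loads
```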

By default, $p$-value estimation is skipped:

pvalue(s)

The flag compute.pval = TRUE turns it on:

set.seed(135)
s <- extractSig(s, compute.pval = TRUE, pvtest = 'permutation', nperm = 1000,
                progress.bar = TRUE)
pvalue(s)[1:5, 1:5]

The above uses the default signature permutation algorithm. Note that with the number of permutations nperm = 1000 (the default), the lowest non-zero $p$-value is $1/1000$; a value of 0 therefore indicates $P < 1/1000$. The other two algorithm options are pvtest = 'lrt' (likelihood ratio test) and pvtest = 'x.permutation' (count permutation; see above). We try lrt for comparison:

sl <- extractSig(s, compute.pval = TRUE, pvtest = 'lrt', progress.bar = FALSE)
pvalue(sl)[1:5,1:5]

Note how $P=0$ in the permutation test comes out as nonzero from lrt, but the values from the latter are generally larger (an over-estimate arising from the use of the asymptotic $\chi^2$ distribution).
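Either set of $p$-values can be used to filter exposures downstream; a minimal sketch (the 0.05 threshold is arbitrary, and the two matrices are assumed to share sample and signature ordering):

```{r eval=FALSE}
# Sketch: zero out exposures that do not reach a chosen significance level.
pv <- pvalue(s)
ex <- expos(s)
ex_sig <- ex * (pv < 0.05)   # assumes identical sample/signature ordering
```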

Visualization and output

A number of utility functions for plotting and output are included. The function plotExposure shows a bar plot of exposure proportions of a specified sample:

sid <- 'TCGA.BH.A0EI.01A.11D.A10Y.09'
plotExposure(s, sample.id = sid, cutoff = 1e-3, cex.axis = 0.7, cex.names = 0.8)

The argument sample.id can also be an integer index. The cutoff argument filters out proportions below the supplied minimum value. The $p$-values

p <- pvalue(s)[sid,]
p[c('Signature.1','Signature.2','Signature.19','Signature.22')]

indicate that Signature.22 is insignificant and Signature.2 is borderline. We can get a non-zero upper bound for the Signature.1 $p$-value using lrt:

p2 <- pvalue(sl)[sid,]
p2[c('Signature.1','Signature.2','Signature.19','Signature.22')]

Again, note the over-estimation under lrt.

The function

writeExposure(s, output = 'exposure.tsv', sep = '\t', rm.na = TRUE)

will write the exposure, along with $p$-values if any, to a single file. Alternatively, one can choose to write two files, one for exposure and the other for $p$-values via

writeExposure(s, output = 'exposure.tsv', pv.out = 'pvalues.tsv')

De novo inference

We illustrate de novo inferences using a larger TCGA-BRCA catalog data:

brca <- read.table(system.file('extdata', 'tcga_brca_mutect_catalog.txt',
                               package = 'tempoSig'))
tcga <- tempoSig(data = brca)
tcga

The original NMF algorithm [@lee_seung] applied to signature extraction [@alexandrov2013] is available in tempoSig as a special case of the Bayesian algorithm (fixed prior with hyperparameter not updated):

set.seed(135)
tcga1 <- extractSig(tcga, method = 'nmf', K = 5, nrun = 10)
tcga1

The nmf method invokes the maximum likelihood (non-Bayesian) NMF algorithm, which solves the factorization problem for the count matrix iteratively, repeating the run nrun times, for a given size K of the signature set (K equals the number of columns of $\sf{W}$ and the number of rows of $\sf{H}$). Among the nrun solutions, the one with the highest likelihood is chosen and stored in the returned object. In production runs, an nrun value exceeding $\sim 50$ is recommended.

sd0 <- signat(tcga1)
cs0 <- cosineSimilarity(A = sd0, B = sbs)
head(cs0)
heatmap(t(cs0), Rowv = NA, Colv = NA, scale = 'none', revC = TRUE)

In the above, we computed the cosine similarity matrix of the 5 de novo signatures with the COSMIC v2 set. The heatmap indicates that S1-S5 correspond to Signatures 1 (Aging), 10 (POLE), 6 (MMR), 3 (HRD), and 2 (APOBEC), respectively. The following may be used to obtain an assignment based on such overlaps if visual inspection is ambiguous:

lsap <- clue::solve_LSAP(cs0, maximum = TRUE)
sassign <- colnames(cs0)[lsap]
names(sassign) <- rownames(cs0)
sassign
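For reference, the cosine similarity used above between two signature profiles $a$ and $b$ is $\sum_i a_i b_i / \sqrt{\sum_i a_i^2 \sum_i b_i^2}$; a one-line sketch (own helper, not the package source):

```{r eval=FALSE}
# Sketch: cosine similarity between two signature profiles a and b.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
```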

A function sigplot will plot signature profiles to aid comparison:

old.par <- par(mfrow = c(5, 2), mar = c(3,3,3,1), lwd = 0.2, cex = 1)
for(k in seq(5)){
  sigplot(sd0[, names(sassign)[k]])
  title(main = names(sassign)[k])
  sigplot(sbs[, sassign[k]])
  title(main = sassign[k])
}
par(old.par)

De novo inference with variable number of signatures

In the previous de novo inference example, the number of signatures was set to 5, whereas this number ($K$) is unknown in practice. Commonly, maximum likelihood NMF is augmented by quality measures (reproducibility of extraction results, etc.) to help decide the optimal $K$. A more direct way to determine $K$ is to use a Bayesian formulation and compute the marginal likelihood. The non-Bayesian version invoked by extractSig(..., method = 'nmf') above is in fact a special case of this Bayesian algorithm. We scan a range of $K$ values via

set.seed(152)
tscan <- extractSig(tcga, method = 'bnmf', Kmax = 10, nrun = 10, verbose = 1)
me <- misc(tscan)$measure
me
plot(me, type = 'b', las = 1, bty = 'n')

Note that individual runs scan $K$ values from 2 (the default) up to Kmax. Often, increasing the size of the signature set by one results in redundant signatures beyond a certain $K$ value. If this occurs, the scan stops and reports this upper bound as Kmax. In the example above, this upper bound is often smaller than the maximum of 10. A utility function kstar will find the optimal $K$ (blind usage is cautioned against; a visual review is often necessary):

kstar(me)$ropt

The object returned by extractSig(..., method = 'bnmf') scanning multiple $K$ will in general not contain the desired signature and exposure matrices in its slots. Therefore, one needs to run another inference with a single $K$ determined:

tcga2 <- extractSig(tcga, method = 'bnmf', Kmin = 7, Kmax = 7)
tcga2

It is worth noting that the output of this single-$K$ run uses a more general model (gamma priors with hyperparameters optimized) than the non-Bayesian version obtained with method = 'nmf'.

sd1 <- signat(tcga2)
cs2 <- cosineSimilarity(A = sd1, B = sbs)
heatmap(t(cs2), Rowv = NA, Colv = NA, scale = 'none', revC = TRUE)
lsap2 <- clue::solve_LSAP(cs2, maximum = TRUE)
sassign2 <- colnames(cs2)[lsap2]
names(sassign2) <- rownames(cs2)
sassign2

Two additional signatures found here resemble Signature 24 and Signature 11.

References


