knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The package tempoSig
implements mutational signature inference functions and
utilities primarily based on maximum likelihood algorithms.
Input data are mutational catalogs in the form of matrices, which catalog counts of
mutation types found in multiple samples into pre-defined classes. The most
often-used example is the catalog of single nucleotide substitutions (SBSs) grouped
into trinucleotide contexts in which they occur. Mutational signatures are the
patterns of the combination of these genetic features recurrent in tumor samples
[@nikzainal2012; @alexandrov2013; @alexandrov2020]. Up to 30 or ~60 recurrent
signatures
have been established from studies of whole genome and exome sequences of tumor
samples [@alexandrov2020]. The discovery of these signatures is based on mathematical
decompositions of mutational catalog, represented by a matrix $\sf{X}$, of a large sample set into two matrices,
$$
\sf{X} \sim \sf{W}\sf{H},
$$
where $\sf{W}$ and $\sf{H}$ are signature and exposure matrices, each containing
repertoires of signatures associated with mutational processes operative in samples
and the actual numbers of mutations contributed by each signature to individual sample genomes, respectively. The symbol $\sim$ indicates that the actual data are to be
regarded as (Poisson) realizations of the mean given by the product on the right.
Two classes of analysis tasks encountered in genetic studies are de novo discovery of signatures and refitting: in de novo discovery, both signature and exposure matrices are unknown, and large samples of tumor genomes or exomes are used as input (catalog matrix $\sf{X}$ as a whole) for discovery of signatures. The primary algorithm used in de novo discovery of signatures is non-negative matrix factorization [@alexandrov2013]. In refitting, in contrast, existing signature lists are used (signature matrix $\sf{W}$ given as input) along with input catalogs to estimate the exposure matrix $\sf{H}$, whose columns give relative proportions of signatures present in each sample [@rosenthal2016].
The refitting functions in
tempoSig
are based on the maximum likelihood inference algorithm adapted from
mutation-signatures.
In refitting, each column of catalog $\sf{X}$ and exposure $\sf{H}$ matrices,
corresponding to each sample to be analyzed,
can be treated independent since $\sf{W}$ is assumed known. For a given sample
genome, the mutational catalog data consist of the observed counts $m_i$
(a single column of matrix $\sf{X}$), where $i=1,\cdots, c$ is the index
denoting the mutation type categories under consideration ($c=96$
for SBS data classified into trinucleotide contexts). The inference task then
reduces to the problem of estimating multinomial probabilities $p_i$ of observing
each category $i$ out of 96 based on actual count data $m_i$. The log likelihood,
or the probability of observing data given parameters ${p_i}$, is
$$
L = \frac{1}{M}\sum_{i=1}^c m_i \ln p_i = \sum_{i<c}{\hat m}i \ln p_i +
{\hat m}_c \ln \left(1-\sum{i<c} p_i\right)
$$
where $M=\sum_i m_i$, $L$ is the log likelihood per mutation, ${\hat m}_i=m_i/M$
is the proportion of type $i$ in data, and in the second expression, the
constraint $\sum_i p_i=1$ was used to eliminate $p_c$, making independent variable
subset ${p_i,\: (i=1,\cdots,c-1)}$, explicit.
The multinomial probabilities are given by linear combinations of signature proportions via $$ p_i = \sum_{k=1}^K W_{ik} h_k = \frac{\sum_k W_{ik} x_k^2}{\sum_k x_k^2}, $$ where the summation is over all $K$ signatures represented by rows of matrix $\sf{W}$ and $h_k$ is the $k$-th component of the column of $\sf{H}$, giving the proportion of $k$-th signature. We assume $\sf{W}$ is normalized column-wise, such that $\sum_i W_{ik}=1$. In the second expression, we parametrized $h_k$ via $h_k = x_k^2/\sum_{l} x_{l}^2$ such that the non-negativity constraint for exposure $h_k\ge 0$ is automatically satisfied for any $x_k$.
The two equations above define the log likelihood as a function of unconstrained
parameters:
$$
L = L(x_1,\cdots,x_{K}),
$$
whose maximization problem can be solved by any multi-dimensional optimization
algorithm. In tempoSig
, the analytic form of $L$ is exploited by providing its
first derivatives as input to the quasi-Newton
BFGS2 (Broyden-Fletcher-Goldfarb-Shanno) routine
gsl_multimin_fdfminimizer_vector_bfgs2 in GNU Scientific Library (GSL).
For real mutational data with limited sizes (mutational load $M \sim 100$ or less),
it is important to gauge the statistical significance of exposure outcomes
($h_k$ for $k=1,\cdots,K$). In tempoSig
, $p$-values for each exposure can be estimated
with one of three algorithms:
permutation
)lrt
)x.permutation
)With signature permutation (default), the 96 elements of the reference signature vector $W_{ik}$ for signature $k$ is randomly permuted $N_p$ times. Inference is repeated for each replicate to get exposure samples under the null hypothesis, ${h_{k}^{(j)}}$, where $j=1,\cdots, N_p$ is the index for permutation replicates. The $p$-value estimate for the $k$-th signature exposure is $$ P_{k} = \frac{1}{N_p} \sum_j I\left(h_{k}^{(j)} \ge h_{k}\right), $$ where $I(x)$ is the indicator function; i.e., the proportion of replicates for which the observed exposure did not exceed the null values. Note that the resulting p-values are limited in range, with apparent $P_k=0$ indicating that $P_k < 1/N_p$.
In likelihood ratio test, the deviance
$$
D_k = 2M\left[ L - L(x_k=0)\right],
$$
where $L(x_k=0)$ is the likelihood of the restricted model of signature $k$ exposure zero,
is computed. This deviance is distributed by $\chi^2$ with ${\rm d.f.}=1$ in the
asymptotic limit $M\rightarrow\infty$ [@degroot_schervish]. Empirical tests suggest
this asymptotic limit is satisfied if $M\sim 10^3$ or larger and the p-value
computed will generally be an over-estimate. Therefore, we advise
caution in using the likelihood ratio test for smaller mutation load. However, this test
can give a quick upper bound estimate of p-values, especially for cases where $P < 10^{-4}$
and permutation test only gives zero
p-values.
Finally, in count permutation, the catalog data $m_i$ for $i=1,\cdots,c$ are randomly permuted while keeping signatures fixed. This scheme is applied once to all signatures and requires less computation than signature permutation. We found it to genereally yield results similar to signature permutation but does not adequately deal with ``unspecific'' signatures such as Signature 3 (homologous recombination defect).
In de novo inference, both the signature $\sf{H}$ and exposure matrices $\sf{W}$ are inferred simultaneously in an unsupervised manner [@alexandrov2013]. tempoSig
includes functions based on a Bayesian formulation of the
non-negative matrix factorization algorithm [@lee_seung], where the marginal likelihood of models with
varying number of signatures is estimated by a variational algorithm [@bishop; @cemgil]. This feature
enables the most likely number of signatures present in the samples via evaluation of the statistical support
provided by the data.
The main non-R dependency for tempoSig
is GSL.
It can be installed in Ubuntu Linux by
```{bash eval=FALSE}
$ sudo apt-get install libgsl-dev
or in CentOS, ```{bash eval=FALSE} $ yum install gsl-devel
To install tempoSig
along with its dependencies,
devtools::install_github('mskcc/tempoSig')
Load the package via
library(tempoSig)
The main data structure used by tempoSig
is of S4 class tempoSig
with slots:
slotNames('tempoSig')
Except tmb
and logLik
, which are vectors, all are matrices. We denote by $n$ the number of
samples (genomes to be analyzed; total number of columns in matrix $\sf{X}$). The slots
are
catalog
: catalog data matrix of dimension $c\times n$ with
mutation types in rows and samples in columns.signat
: matrix of reference signatures; dimension $c\times K$ with
mutation types in rows and signatures in columns.tmb
: vector of length $n$; total mutation burden (sum of all mutations) in each
sample.expos
: matrix of inferred exposure values; dimension $n \times K$ with
samples in rows and signatures in columns.pvalue
: matrix of estimated $p$-values: dimension $n \times K$ with
samples in rows and signatures in columns.logLik
: vector of log-likelihood values for each sample.misc
: List of objects associated with de novo infererence.The primary input data takes the form of catalog matrix $\sf{X}$. An example is provided by the file
catalog <- system.file('extdata', 'tcga-brca_catalog.txt', package = 'tempoSig') x <- read.table(catalog) dim(x) x[1:5, 1:2] rownames(x)
This file was generated by taking mutation count data of 10 tumor samples from the
mutation annotation format (MAF) file of
TCGA BRCA project.
The rownames show 96 trinucleotide mutation types with the ref > alt
inside
brackets (ref
either C
or T
; otherwise its complementary base taken) and its
two adjacent nucleotides.
An object of class tempoSig
is created with the catalog data frame as the main
argument:
s <- tempoSig(data = x) s
The object initiation also can include specification of reference signatures via
the optional second argument in tempoSig(data, signat = NULL)
, which by default
is read from the file
sbs <- read.table(system.file('extdata', 'cosmic_snv_signatures_v2.txt', package = 'tempoSig')) dim(sbs) sbs[1:5, 1:5] colnames(sbs)
This reference signature list is that of COSMIC version 2. There are other files for alternative references also included:
cosmic_sigProfiler_SBS_signatures.txt
: COSMIC version 3 from whole-genome data (sigProfiler)cosmic_sigProfiler_SBS_exome_signatures.txt
: COSMIC version 3 from exome datacosmic_SigAnalyzer_SBS_signatures.txt
: COSMIC version 3 from whole-genome data (SigAnalyzer)cosmic_sigProfiler_SBS_signatures_v3.1.txt
: COSMIC version 3.1They can be used as follows:
sig2 <- read.table(system.file('extdata', 'cosmic_SigAnalyzer_SBS_signatures.txt', package = 'tempoSig')) dim(sig2) sig2[1:5, 1:5] s2 <- tempoSig(data = x, signat = sig2) s2
The signat
argument can take the file name directly. For convenience, a character
short-hand can also be used:
s_v2 <- tempoSig(data = x, signat = 'v2') # COSMIC v2 (default) s_v3 <- tempoSig(data = x, signat = 'SA') # COSMIC v3 SigAnalyzer s_v4 <- tempoSig(data = x, signat = 'SP') # COSMIC v3 sigProfiler s_v3
We found that v2
generally achieves higher power than others under limited mutation loads.
The slots of a tempoSig
object can be accessed via accessor functions:
tmb(s) catalog(s)[1:2,1:2] expos(s) pvalue(s)
Note that slots expos
and pvalue
are empty since inference has not been performed yet.
A common outcome of sequencing experiments takes the form of MAF files, which collect variant data of multiple samples with annotations. The package maftools can be used to derive catalog matrices from MAF files:
if(!require('BiocManager')) install.packages('BiocManager') if(!require('maftools')) BiocManager::install('maftools') if(!require('BSgenome.Hsapiens.UCSC.hg19')) BiocManager::install('BSgenome.Hsapiens.UCSC.hg19') laml.maf <- system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools') laml <- read.maf(maf = laml.maf) laml.tnm <- maftools::trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = 'BSgenome.Hsapiens.UCSC.hg19') catalog <- t(laml.tnm$nmf_matrix)
Refitting using maximum likelihood inference is performed by
s <- extractSig(s, progress.bar = FALSE)
s <- extractSig(s, progress.bar = TRUE)
The option progress.bar = TRUE
displays a progress bar, useful for
large samples. The inference fills the expos
slot of the object:
expos(s)[1:5,1:2] rowSums(expos(s))
Note that each row gives proportions of reference signatures, which add up to 1.
By default, $p$-value estimation is skipped:
pvalue(s)
The flag compute.pval = TRUE
turns it on:
set.seed(135) s <- extractSig(s, compute.pval = TRUE, pvtest = 'permutation', nperm = 1000, progress.bar = TRUE)
set.seed(135) s <- extractSig(s, compute.pval = TRUE, pvtest = 'permutation', nperm = 1000, progress.bar = FALSE)
pvalue(s)[1:5, 1:5]
The above uses the default signature permutation algorithm.
Note that with the number of permutations nperm = 1000
(default), the
lowest non-zero pvalue is $1/1000$. Therefore, the value of 0 indicates
$P < 1/1000$.
The other two options for algorithm are pvtest = 'lrt'
(likelihood ratio test) and
pvtest = 'x.permutation'
(count permutation; see above). We try the lrt
for comparison:
sl <- extractSig(s, compute.pval = TRUE, pvtest = 'lrt', progress.bar = TRUE)
sl <- extractSig(s, compute.pval = TRUE, pvtest = 'lrt', progress.bar = FALSE)
pvalue(sl)[1:5,1:5]
Note how $P=0$ in permutation
comes out as nonzero from lrt
but generally the values
from the latter are larger in magnitude (over-estimation arising from the use
of asymptotic $\chi^2$ distrubtion).
A number of utility functions for plotting and output are included.
The function plotExposure
shows a bar plot of exposure proportions of a
specified sample:
sid <- 'TCGA.BH.A0EI.01A.11D.A10Y.09' plotExposure(s, sample.id = sid, cutoff = 1e-3, cex.axis = 0.7, cex.names = 0.8)
The argument sample.id
can also be an integer index. The cutoff
argument filters
proportions with the minimum value supplied.
The $p$-values
p <- pvalue(s)[sid,] p[c('Signature.1','Signature.2','Signature.19','Signature.22')]
indicate Signature.22
is insignificant and 'Signature.2' is borderline.
We can get a non-zero upper bound for the
Signature.1
p-value using lrt
:
p2 <- pvalue(sl)[sid,] p2[c('Signature.1','Signature.2','Signature.19','Signature.22')]
Again, note the over-estimation under lrt
.
The function
writeExposure(s, output = 'exposure.tsv', sep = '\t', rm.na = TRUE)
will write the exposure, along with $p$-values if any, to a single file. Alternatively, one can choose to write two files, one for exposure and the other for $p$-values via
writeExposure(s, output = 'exposure.tsv', pv.out = 'pvalues.tsv')
We illustrate de novo inferences using a larger TCGA-BRCA catalog data:
brca <- read.table(system.file('extdata', 'tcga_brca_mutect_catalog.txt', package = 'tempoSig')) tcga <- tempoSig(data = brca) tcga
The original NMF algorithm [@lee_seung] applied to signature extraction [@alexandrov2013] is available
in tempoSig
as a special case of the Bayesian algorithm (fixed prior with hyperparameter not updated):
set.seed(135) tcga1 <- extractSig(tcga, method = 'nmf', K = 5, nrun = 10) tcga1
The nmf
method will invoke the maximum likelihood (non-Bayesian) NMF algorithm, which solves the
factorization problem for the count matrix iteratively nrun
times under a given size K
of signature set
(row and column sizes of matrices $\sf{W}$ $\sf{H}, respectively). Among nrun
solutions, the best one with
the highest likelihood value is chosen and stored in the object returned. In production runs, an nrun
value
exceeding $\sim 50$ is recommended.
sd0 <- signat(tcga1) cs0 <- cosineSimilarity(A = sd0, B = sbs) head(cs0) heatmap(t(cs0), Rowv = NA, Colv = NA, scale = 'none', revC = TRUE)
In the above, we computed the cosine similarity matrix of the 5 de novo signatures with the COSMIC v2 set.
The heatmap indicates that S1-S5
correspond to Signatures 1 (Aging), 10 (POLE), 6 (MMR), 3 (HRD), and
2 (APOBEC), respectively.
The following may be used to obtain an assignment based on such overlaps if visual inspection is ambiguous:
lsap <- clue::solve_LSAP(cs0, maximum = TRUE) sassign <- colnames(cs0)[lsap] names(sassign) <- rownames(cs0) sassign
A function sigplot
will plot signature profiles to aid comparison:
old.par <- par(mfrow = c(5, 2), mar = c(3,3,3,1), lwd = 0.2, cex = 1) for(k in seq(5)){ sigplot(sd0[, names(sassign)[k]]) title(main = names(sassign)[k]) sigplot(sbs[, sassign[k]]) title(main = sassign[k]) } par(old.par)
In the previous de novo inference example, the number of signatures was set as 5, while this number ($K$) is
unknown in practice. Commonly, the maximum likelihood NMF is augmented by quality measures (reproducibility of
extraction results, etc) to help decide the optimal $K$. A more direct way to determine $K$ is to use
a Bayesian formulation and compute marginal likelihood. The non-Bayesian version invoked by
extractSig(..., method = nmf)
above is in fact a special case of this Bayesian algorithm.
We scan a range of $K$ values via
set.seed(152) tscan <- extractSig(tcga, method = 'bnmf', Kmax = 10, nrun = 10, verbose = 1) me <- misc(tscan)$measure me plot(me, type = 'b', las = 1, bty = 'n')
Note that individual runs consist of $K$ values scanned from 2 (default) to Kmax
. Often, increasing
the size of signature set by one will result in redundant signatures beyond a certain $K$ value. If
this occurrs, the scan will stop and print this upper bound Kmax
. In the example above, this upper bound
is often smaller than the maximum 10.
A utility function kstar
will find the optimal $K$ (cautioned against blind-usages; a visual review is
often necessary):
kstar(me)$ropt
The object returned by extractSig(..., method = 'bnmf')
scanning multiple $K$ will in general not
contain the desired signature and exposure matrices in its slots. Therefore, one needs to run another
inference with a single $K$ determined:
tcga2 <- extractSig(tcga, method = 'bnmf', Kmin = 7, Kmax = 7) tcga2
It is worth noting that this single-$K$ run output uses a model more general (gamma priors with
hyper-parameter optmized) than the non-Bayesian version obtained with method = 'nmf'
.
sd1 <- signat(tcga2) cs2 <- cosineSimilarity(A = sd1, B = sbs) heatmap(t(cs2), Rowv = NA, Colv = NA, scale = 'none', revC = TRUE) lsap2 <- clue::solve_LSAP(cs2, maximum = TRUE) sassign2 <- colnames(cs2)[lsap2] names(sassign2) <- rownames(cs2) sassign2
Two additional signatures found here resemble Signature 24 and Signature 11.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.