bp | R Documentation |
These functions are combined to provide a best practice for optimally identifying mutational signatures and attributing their activities (exposures) in tumor samples. They are listed in order to use.
bp_extract_signatures()
for extracting signatures.
bp_show_survey()
for showing measures change under different
signature numbers to help user select optimal signature number.
At default, an aggregated score (named score) is generated to
suggest the best solution.
bp_show_survey2()
for showing simplified signature number survey like
show_sig_number_survey()
.
bp_get_sig_obj()
for get a (list of) Signature
object which is common
used in sigminer for analysis and visualization.
bp_attribute_activity()
for optimizing signature activities (exposures).
NOTE: the activities from extraction step may be better!
You can also use sig_extract to get optimal NMF result from multiple NMF runs.
Besides, you can use sig_fit to quantify exposures based on signatures extracted
from bp_extract_signatures()
.
bp_extract_signatures_iter()
for extracting signature in a iteration way.
bp_cluster_iter_list()
for clustering (hclust
with average linkage)
iterated signatures to help collapse
multiple signatures into one. The result cluster can be visualized by
plot()
or factoextra::fviz_dend()
.
bp_get_clustered_sigs()
for getting clustered (grouped) mean signatures from signature clusters.
Extra: bp_get_stats
() for obtaining stats for signatures and samples of a solution.
These stats are aggregated (averaged) as the stats for a solution
(specific signature number).
Extra: bp_get_rank_score()
for obtaining rank score for all signature numbers.
bp_extract_signatures(
nmf_matrix,
range = 2:5,
n_bootstrap = 20L,
n_nmf_run = 50,
RTOL = 0.001,
min_contribution = 0,
cores = min(4L, future::availableCores()),
cores_solution = min(cores, length(range)),
seed = 123456L,
handle_hyper_mutation = TRUE,
report_integer_exposure = FALSE,
only_core_stats = nrow(nmf_matrix) > 100,
cache_dir = file.path(tempdir(), "sigminer_bp"),
keep_cache = FALSE,
pynmf = FALSE,
use_conda = TRUE,
py_path = "/Users/wsx/anaconda3/bin/python"
)
bp_extract_signatures_iter(
nmf_matrix,
range = 2:5,
sim_threshold = 0.95,
max_iter = 10L,
n_bootstrap = 20L,
n_nmf_run = 50,
RTOL = 0.001,
min_contribution = 0,
cores = min(4L, future::availableCores()),
cores_solution = min(cores, length(range)),
seed = 123456L,
handle_hyper_mutation = TRUE,
report_integer_exposure = FALSE,
only_core_stats = nrow(nmf_matrix) > 100,
cache_dir = file.path(tempdir(), "sigminer_bp"),
keep_cache = FALSE,
pynmf = FALSE,
use_conda = FALSE,
py_path = "/Users/wsx/anaconda3/bin/python"
)
bp_cluster_iter_list(x, k = NULL, include_final_iteration = TRUE)
bp_get_clustered_sigs(SigClusters, cluster_label)
bp_get_sig_obj(obj, signum = NULL)
bp_get_stats(obj)
bp_get_rank_score(obj)
bp_show_survey2(
obj,
x = "signature_number",
left_y = "silhouette",
right_y = "L2_error",
left_name = left_y,
right_name = right_y,
left_color = "black",
right_color = "red",
left_shape = 16,
right_shape = 18,
shape_size = 4,
highlight = NULL
)
bp_show_survey(
obj,
add_score = FALSE,
scales = c("free_y", "free"),
fixed_ratio = TRUE
)
bp_attribute_activity(
input,
sample_class = NULL,
nmf_matrix = NULL,
method = c("bt", "stepwise"),
bt_use_prop = FALSE,
return_class = c("matrix", "data.table"),
use_parallel = FALSE,
cache_dir = file.path(tempdir(), "sigminer_attribute_activity"),
keep_cache = FALSE
)
nmf_matrix |
a |
range |
a |
n_bootstrap |
number of bootstrapped (resampling) catalogs used.
When it is |
n_nmf_run |
number of NMF runs for each bootstrapped or original catalog. At default, in total n_bootstrap x n_nmf_run (i.e. 1000) NMF runs are used for the task. |
RTOL |
a threshold proposed by Nature Cancer paper to control how to
filter solutions of NMF. Default is |
min_contribution |
a component contribution threshold to filer out small contributed components. |
cores |
number of cpu cores to run NMF. |
cores_solution |
cores for processing solutions, default is equal to argument |
seed |
a random seed to make reproducible result. |
handle_hyper_mutation |
default is |
report_integer_exposure |
if |
only_core_stats |
if |
cache_dir |
a directory for keep temp result files. |
keep_cache |
if |
pynmf |
if |
use_conda |
if |
py_path |
path to Python executable file, e.g. '/Users/wsx/anaconda3/bin/python'. In my
test, it is more stable than |
sim_threshold |
a similarity threshold for selecting samples to auto-rerun
the extraction procedure (i.e. |
max_iter |
the maximum iteration size, default is 10, i.e., at most run the extraction procedure 10 times. |
x |
result from |
k |
an integer sequence specifying the cluster number to get silhouette. |
include_final_iteration |
if |
SigClusters |
result from |
cluster_label |
cluster labels for a specified cluster number, obtain it
from |
obj |
a |
signum |
a integer vector to extract the corresponding |
left_y |
column name for left y axis. |
right_y |
column name for right y axis. |
left_name |
label name for left y axis. |
right_name |
label name for right y axis. |
left_color |
color for left axis. |
right_color |
color for right axis. |
left_shape , right_shape , shape_size |
shape setting. |
highlight |
a |
add_score |
if |
scales |
one of "free_y" (default) and "free" to control the scales of plot facet. |
fixed_ratio |
if |
input |
result from |
sample_class |
a named string vector whose names are sample names
and values are class labels (i.e. cancer subtype). If it is |
method |
one of 'bt' (use bootstrap exposure median, from reference #2, the most recommended way in my personal view) or stepwise' (stepwise reduce and update signatures then do signature fitting with last signature sets, from reference #2, the result tends to assign the contribution of removed signatures to the remaining signatures, maybe I misunderstand the paper method? PAY ATTENTION). |
bt_use_prop |
this parameter is only used for |
return_class |
string, 'matrix' or 'data.table'. |
use_parallel |
if |
The signature extraction approach is adopted from reference #1, #2, and the whole best practice is adopted from the pipeline used by reference #3. I implement the whole procedure with R code based on the method description of papers. The code is well organized, tested and documented so user will find it pretty simple and useful. Besides, the structure of the results is very clear to see and also visualize like other approaches provided by sigminer.
It depends on the called function.
The survey plot provides a pretty good way to facilitate the signature number
selection. A score
measure is calculated as the weighted mean of selected
measures and visualized as the first sub-plot. The optimal number is highlighted
with red color dot and the best values for each measures are also
highlighted with orange color dots. The detail of 6 measures shown in plot are
explained as below.
score
- an aggregated score based on rank scores from selected measures below.
The higher, the better. When two signature numbers have the same score,
the larger signature number is preferred (this is a rare situation, you
have to double check other measures).
silhouette
- the average silhouette width for signatures, also named as ASW in reference #2.
The signature number with silhouette decreases sharply is preferred.
distance
- the average sample reconstructed cosine distance, the lower value is better.
error
- the average sample reconstructed error calculated with L2 formula
(i.e. L2 error). This lower value is better. This measure represents a
similar concept like distance
above, they are all used to quantify how well
sample mutation profiles can be reconstructed from signatures, but distance
cares the whole mutation profile similarity while error
here cares value difference.
pos cor
- the average positive signature exposure correlation coefficient.
The lower value is better. This measure is constructed based on my understanding
about signatures: mutational signatures are typically treated as independent
recurrent patterns, so their activities are less correlated.
similarity
- the average similarity within in a signature cluster.
Like silhouette
, the point decreases sharply is preferred.
In the practice, results from multiple NMF runs are clustered
with "clustering with match" algorithm proposed by reference #2. This value
indicates if the signature profiles extracted from different NMF runs are similar.
Shixiang Wang w_shixiang@163.com
Alexandrov, Ludmil B., et al. "Deciphering signatures of mutational processes operative in human cancer." Cell reports 3.1 (2013): 246-259.
Degasperi, Andrea, et al. "A practical framework and online tool for mutational signature analyses show intertissue variation and driver dependencies." Nature cancer 1.2 (2020): 249-263.
Alexandrov, Ludmil B., et al. “The repertoire of mutational signatures in human cancer.” Nature 578.7793 (2020): 94-101.
See sig_estimate, sig_extract, sig_auto_extract, sigprofiler_extract for other approaches.
data("simulated_catalogs")
# Here I reduce the values for n_bootstrap and n_nmf_run
# for reducing the run time.
# In practice, you should keep default or increase the values
# for better estimation.
#
# The input data here is simulated from 10 mutational signatures
# e1 <- bp_extract_signatures(
# t(simulated_catalogs$set1),
# range = 8:12,
# n_bootstrap = 5,
# n_nmf_run = 10
# )
#
# To avoid computation in examples,
# Here just load the result
# (e1$signature and e1$exposure set to NA to reduce package size)
load(system.file("extdata", "e1.RData", package = "sigminer"))
# See the survey for different signature numbers
# The suggested solution is marked as red dot
# with highest integrated score.
p1 <- bp_show_survey(e1)
p1
# You can also exclude plotting and highlighting the score
p2 <- bp_show_survey(e1, add_score = FALSE)
p2
# You can also plot a simplified version
p3 <- bp_show_survey2(e1, highlight = 10)
p3
# Obtain the suggested solution from extraction result
obj_suggested <- bp_get_sig_obj(e1, e1$suggested)
obj_suggested
# If you think the suggested signature number is not right
# Just pick up the solution you want
obj_s8 <- bp_get_sig_obj(e1, 8)
# Track the reconstructed profile similarity
rec_sim <- get_sig_rec_similarity(obj_s8, t(simulated_catalogs$set1))
rec_sim
# After extraction, you can assign the signatures
# to reference COSMIC signatures
# More see ?get_sig_similarity
sim <- get_sig_similarity(obj_suggested)
# Visualize the match result
if (require(pheatmap)) {
pheatmap::pheatmap(sim$similarity)
}
# You already got the activities of signatures
# in obj_suggested, however, you can still
# try to optimize the result.
# NOTE: the optimization step may not truly optimize the result!
expo <- bp_attribute_activity(e1, return_class = "data.table")
expo$abs_activity
## Not run:
# Iterative extraction:
# This procedure will rerun extraction step
# for those samples with reconstructed catalog similarity
# lower than a threshold (default is 0.95)
e2 <- bp_extract_signatures_iter(
t(simulated_catalogs$set1),
range = 9:11,
n_bootstrap = 5,
n_nmf_run = 5,
sim_threshold = 0.99
)
e2
# When the procedure run multiple rounds
# you can cluster the signatures from different rounds by
# the following command
# bp_cluster_iter_list(e2)
## Extra utilities
rank_score <- bp_get_rank_score(e1)
rank_score
stats <- bp_get_stats(e2$iter1)
# Get the mean reconstructed similarity
1 - stats$stats_sample$cosine_distance_mean
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.