# A Best Practice for Signature Extraction and Exposure (Activity) Attribution
# Fitting known signatures first and then extracting extra unknown signatures is also a feasible approach.
# TODO: handle the variation introduced by mixing data from different platforms!
#' A Best Practice for Signature Extraction and Exposure (Activity) Attribution
#'
#' These functions are combined to provide a best practice for optimally
#' identifying mutational signatures and attributing their activities (exposures)
#' in tumor samples. They are listed in the order of use.
#' - `bp_extract_signatures()` for extracting signatures.
#' - `bp_show_survey()` for showing how measures change under different
#' signature numbers to help users select the optimal signature number.
#' By default, an aggregated score (named `score`) is generated to
#' suggest the best solution.
#' - `bp_show_survey2()` for showing simplified signature number survey like
#' [show_sig_number_survey()].
#' - `bp_get_sig_obj()` for getting a (list of) `Signature` object(s), which is commonly
#' used in **sigminer** for analysis and visualization.
#' - `bp_attribute_activity()` for optimizing signature activities (exposures).
#' NOTE: the activities from the extraction step may be better!
#' You can also use [sig_extract] to get the optimal NMF result from multiple NMF runs.
#' Besides, you can use [sig_fit] to quantify exposures based on signatures extracted
#' from `bp_extract_signatures()`.
#' - `bp_extract_signatures_iter()` for extracting signatures in an iterative way.
#' - `bp_cluster_iter_list()` for clustering (`hclust` with average linkage)
#' iterated signatures to help collapse
#' multiple signatures into one. The resulting cluster object can be visualized
#' by `plot()` or `factoextra::fviz_dend()`.
#' - `bp_get_clustered_sigs()` for getting clustered (grouped) mean signatures from signature clusters.
#' - Extra: `bp_get_stats()` for obtaining stats for signatures and samples of a solution.
#' These stats are aggregated (averaged) as the stats for a solution
#' (specific signature number).
#' - Extra: `bp_get_rank_score()` for obtaining rank score for all signature numbers.
#'
#' @details
#' The signature extraction approach is adopted from references #1 and #2, and
#' the whole best practice is adopted from the pipeline used in reference #3.
#' I implemented the whole procedure in R based on the method descriptions
#' in the papers. The code is well organized, tested and documented, so users will
#' find it pretty simple and useful. Besides, the structure of the results is
#' clear to inspect and visualize, like other approaches provided by **sigminer**.
#'
#' @section Measure Explanation in Survey Plot:
#' The survey plot provides a pretty good way to facilitate the signature number
#' selection. A `score` measure is calculated as the weighted mean of selected
#' measures and visualized as the first sub-plot. The optimal number is highlighted
#' with a red dot, and the best value for each measure is also
#' highlighted with an orange dot. The details of the 6 measures shown in the plot
#' are explained below.
#' - `score` - an aggregated score based on rank scores from selected measures below.
#' The higher, the better. When two signature numbers have the same score,
#' the larger signature number is preferred (this is a rare situation; you
#' have to double-check the other measures).
#' - `silhouette` - the average silhouette width for signatures, also named ASW in reference #2.
#' The signature number just before a sharp decrease in silhouette width is preferred.
#' - `distance` - the average reconstructed cosine distance across samples; the lower, the better.
#' - `error` - the average sample reconstruction error calculated with the L2 formula
#' (i.e. L2 error); the lower, the better. This measure represents a
#' similar concept to `distance` above: both quantify how well
#' sample mutation profiles can be reconstructed from signatures, but `distance`
#' focuses on the overall mutation profile similarity while `error` focuses on value differences.
#' - `pos cor` - the average positive signature exposure correlation coefficient.
#' The lower, the better. This measure is constructed based on my understanding
#' of signatures: mutational signatures are typically treated as independent
#' recurrent patterns, so their activities should be less correlated.
#' - `similarity` - the average similarity within a signature cluster.
#' Like `silhouette`, the signature number just before a sharp decrease is preferred.
#' In practice, results from multiple NMF runs are clustered
#' with the "clustering with match" algorithm proposed by reference #2. This value
#' indicates whether the signature profiles extracted from different NMF runs are similar.
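#'
#' For illustration only, below is a minimal sketch of how per-measure rank
#' scores can be aggregated into such a `score` (the measure names and weights
#' here are hypothetical, not the exact internal implementation):
#'
#' ```r
#' rank_measures <- function(stats, weights = c(
#'                             silhouette = 0.25, distance = 0.25,
#'                             error = 0.25, pos_cor = 0.25
#'                           )) {
#'   rank_score <- sapply(names(weights), function(m) {
#'     x <- stats[[m]]
#'     # silhouette: higher is better; the other measures: lower is better
#'     if (m == "silhouette") rank(x) else rank(-x)
#'   })
#'   # weighted mean of rank scores, one value per signature number
#'   drop(rank_score %*% weights) / sum(weights)
#' }
#' ```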
#' @inheritParams sig_estimate
#' @param n_bootstrap number of bootstrapped (resampling) catalogs used.
#' When it is `0`, the original (input) mutation catalog is used for NMF decomposition;
#' this is only for testing, and users should not set it to `0` in practice.
#' @param n_nmf_run number of NMF runs for each bootstrapped or original catalog.
#' By default, in total `n_bootstrap x n_nmf_run` (i.e. 1000) NMF runs are used
#' for the task.
#' @param RTOL a threshold proposed by the Nature Cancer paper (reference #2) to control
#' how NMF solutions are filtered. Default is `1e-3` (i.e. `0.1%`):
#' only NMF solutions with KLD (KL deviance) <= `(1 + RTOL)` times the minimal KLD
#' (i.e. within `100.1%` of the minimal KLD) are kept.
#' @param min_contribution a component contribution threshold to filter out small
#' contributed components.
#' @param cores_solution cores for processing solutions, default is equal to argument `cores`.
#' @param seed a random seed to make reproducible result.
#' @param handle_hyper_mutation default is `TRUE`, handle hyper-mutant samples.
#' @param report_integer_exposure if `TRUE`, report integer signature
#' exposure by bootstrapping technique.
#' @param only_core_stats if `TRUE`, only calculate the core stats for signatures and samples.
#' @param cache_dir a directory for storing intermediate results; it also avoids
#' repeated computation for the same data.
#' @param keep_cache default is `FALSE`, if `TRUE`, keep cache data.
#' For small input data, it is not necessary.
#' @param pynmf if `TRUE`, use Python NMF driver [Nimfa](http://nimfa.biolab.si/index.html).
#' The seed is currently not used by this implementation, so the only way to reproduce
#' your result is to set `keep_cache = TRUE`.
#' @param use_conda if `TRUE`, create an independent conda environment to run NMF.
#' @param py_path path to Python executable file, e.g. '/Users/wsx/anaconda3/bin/python'. In my
#' test, it is more stable than `use_conda=TRUE`. You can install the Nimfa package by yourself
#' or set `use_conda` to `TRUE` to install the required Python environment, and then set this option.
#' @return It depends on the called function.
#' @name bp
#' @author Shixiang Wang <w_shixiang@163.com>
#' @references
#' Alexandrov, Ludmil B., et al. "Deciphering signatures of mutational processes operative in human cancer." Cell Reports 3.1 (2013): 246-259.
#'
#' Degasperi, Andrea, et al. "A practical framework and online tool for mutational signature analyses show intertissue variation and driver dependencies." Nature Cancer 1.2 (2020): 249-263.
#'
#' Alexandrov, Ludmil B., et al. "The repertoire of mutational signatures in human cancer." Nature 578.7793 (2020): 94-101.
#' @seealso See [sig_estimate], [sig_extract], [sig_auto_extract],
#' [sigprofiler_extract] for other approaches.
#' @examples
#' data("simulated_catalogs")
#' \donttest{
#' # Here I reduce the values of n_bootstrap and n_nmf_run
#' # to cut the run time.
#' # In practice, you should keep the defaults or increase the values
#' # for better estimation.
#' #
#' # The input data here is simulated from 10 mutational signatures
#'
#' # e1 <- bp_extract_signatures(
#' # t(simulated_catalogs$set1),
#' # range = 8:12,
#' # n_bootstrap = 5,
#' # n_nmf_run = 10
#' # )
#' #
#' # To avoid heavy computation in examples,
#' # here we just load the result
#' # (e1$signature and e1$exposure set to NA to reduce package size)
#' load(system.file("extdata", "e1.RData", package = "sigminer"))
#'
#'
#' # See the survey for different signature numbers
#' # The suggested solution is marked as a red dot
#' # with the highest integrated score.
#' p1 <- bp_show_survey(e1)
#' p1
#' # You can also skip plotting and highlighting the score
#' p2 <- bp_show_survey(e1, add_score = FALSE)
#' p2
#'
#' # You can also plot a simplified version
#' p3 <- bp_show_survey2(e1, highlight = 10)
#' p3
#'
#' # Obtain the suggested solution from extraction result
#' obj_suggested <- bp_get_sig_obj(e1, e1$suggested)
#' obj_suggested
#' # If you think the suggested signature number is not right
#' # Just pick up the solution you want
#' obj_s8 <- bp_get_sig_obj(e1, 8)
#'
#' # Track the reconstructed profile similarity
#' rec_sim <- get_sig_rec_similarity(obj_s8, t(simulated_catalogs$set1))
#' rec_sim
#'
#' # After extraction, you can assign the signatures
#' # to reference COSMIC signatures
#' # More see ?get_sig_similarity
#' sim <- get_sig_similarity(obj_suggested)
#' # Visualize the match result
#' if (require(pheatmap)) {
#' pheatmap::pheatmap(sim$similarity)
#' }
#'
#' # You already got the activities of signatures
#' # in obj_suggested, however, you can still
#' # try to optimize the result.
#' # NOTE: the optimization step may not truly optimize the result!
#' expo <- bp_attribute_activity(e1, return_class = "data.table")
#' expo$abs_activity
#' }
#'
#' \dontrun{
#' # Iterative extraction:
#' # This procedure will rerun extraction step
#' # for those samples with reconstructed catalog similarity
#' # lower than a threshold (default is 0.95)
#' e2 <- bp_extract_signatures_iter(
#' t(simulated_catalogs$set1),
#' range = 9:11,
#' n_bootstrap = 5,
#' n_nmf_run = 5,
#' sim_threshold = 0.99
#' )
#' e2
#' # When the procedure runs multiple rounds,
#' # you can cluster the signatures from different rounds with
#' # the following command
#' # bp_cluster_iter_list(e2)
#'
#' ## Extra utilities
#' rank_score <- bp_get_rank_score(e1)
#' rank_score
#' stats <- bp_get_stats(e2$iter1)
#' # Get the mean reconstructed similarity
#' 1 - stats$stats_sample$cosine_distance_mean
#' }
#' @testexamples
#' expect_is(e1, "ExtractionResult")
#' expect_is(p1, "ggplot")
#' expect_is(p2, "ggplot")
#' expect_is(p3, "ggplot")
#' expect_is(obj_suggested, "Signature")
#' expect_is(obj_s8, "Signature")
#' expect_is(sim, "list")
#' expect_is(rec_sim, "data.table")
#' expect_is(expo, "list")
NULL
#' @rdname bp
#' @export
bp_extract_signatures <- function(nmf_matrix,
range = 2:5,
n_bootstrap = 20L,
n_nmf_run = 50,
RTOL = 1e-3, min_contribution = 0,
cores = min(4L, future::availableCores()),
cores_solution = min(cores, length(range)),
seed = 123456L,
handle_hyper_mutation = TRUE,
report_integer_exposure = FALSE,
only_core_stats = nrow(nmf_matrix) > 100,
cache_dir = file.path(tempdir(), "sigminer_bp"),
keep_cache = FALSE,
pynmf = FALSE,
use_conda = TRUE,
py_path = "/Users/wsx/anaconda3/bin/python") {
stopifnot(
is.matrix(nmf_matrix),
!is.null(rownames(nmf_matrix)), !is.null(colnames(nmf_matrix)),
is.numeric(range),
is.numeric(n_bootstrap), is.numeric(n_nmf_run),
n_bootstrap >= 0, n_nmf_run > 0,
is.numeric(RTOL), RTOL > 0,
is.numeric(min_contribution),
min_contribution >= 0, min_contribution <= 0.1,
is.numeric(seed),
is.logical(handle_hyper_mutation),
is.numeric(cores), is.numeric(cores_solution)
)
seed <- as.integer(seed)
cores <- as.integer(cores)
cores_solution <- as.integer(cores_solution)
range <- sort(unique(range))
ii <- rowSums(nmf_matrix) < 0.01
if (any(ii)) {
message(
"The follow samples dropped due to null catalogue:\n\t",
paste0(rownames(nmf_matrix)[ii], collapse = ", ")
)
nmf_matrix <- nmf_matrix[!ii, , drop = FALSE]
}
if (!requireNamespace("synchronicity")) {
install.packages("synchronicity")
}
if (!requireNamespace("lpSolve")) {
install.packages("lpSolve")
}
timer <- Sys.time()
send_info("Best practice for signature extraction started.")
send_info("NOTE: the input should be a sample-by-component matrix.")
on.exit(send_elapsed_time(timer))
on.exit(invisible(gc()), add = TRUE)
# Input: a matrix used for NMF decomposition with rows indicate samples and columns indicate components.
raw_catalogue_matrix <- t(nmf_matrix)
send_success("Input matrix kept.")
# Dimension reduction: drop components with small total contribution
if (min_contribution != 0) {
send_info("Checking contribution of components.")
contris <- colSums(nmf_matrix) / sum(nmf_matrix)
contris_index <- contris <= min_contribution
if (any(contris_index)) {
send_info(
"Dropping the components with very few contribution: ",
paste(names(contris[contris_index]), collapse = ", ")
)
nmf_matrix <- nmf_matrix[, !contris_index, drop = FALSE]
}
if (ncol(nmf_matrix) < 3) {
send_stop("Too few components (<3) left!")
}
send_success("Checked.")
}
# Hyper-mutation handling
if (handle_hyper_mutation) {
send_info("Spliting samples if it is hyper-mutant.")
nmf_matrix <- handle_hyper_mutation(nmf_matrix)
send_success("Done.")
}
send_info("Generating data for inputing NMF.")
if (n_bootstrap == 0) {
send_success("Resampling is disabled.")
bt_catalog_list <- list(nmf_matrix)
n_bootstrap <- 1L
bt_flag <- FALSE
} else {
send_success("Starting resampling (get bootstrapped catalog matrix).")
seeds_bt <- seq(seed, length = n_bootstrap)
# Generate bootstrapped catalogs based on n_bootstrap
bt_catalog_list <- purrr::map(seeds_bt, function(x) {
set.seed(x)
simulate_catalogue_matrix(nmf_matrix)
})
bt_flag <- TRUE
}
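# For intuition: catalog bootstrapping here resamples each sample's mutation
# counts. A minimal sketch of the idea (not necessarily the exact internals
# of simulate_catalogue_matrix()) for one sample catalog vector x:
#   n <- sum(x)
#   as.numeric(rmultinom(1, size = n, prob = x / n))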
# If necessary, add a tiny value to work around potential errors from the NMF package:
# the sum of a component cannot be 0, and there are other possible exceptions.
if (pynmf) {
bt_catalog_list <- purrr::map(bt_catalog_list, t)
} else {
bt_catalog_list <- purrr::map(bt_catalog_list, ~ t(check_nmf_matrix(.)))
}
catalogue_matrix <- t(nmf_matrix)
send_success("Done.")
# Construct cache file names
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
if (pynmf) {
cache_files <- file.path(
cache_dir,
paste0(
digest::digest(list(catalogue_matrix, seed)),
"_pyNMF_K",
range,
".rds"
)
)
} else {
cache_files <- file.path(
cache_dir,
paste0(
digest::digest(list(catalogue_matrix, seed)),
"_NMF_K",
rep(range, each = length(bt_catalog_list)),
if (bt_flag) "_bt_catalog_" else "_observed_catalog_",
seq_len(n_bootstrap),
".rds"
)
)
}
extract_nmf <- function(solution) {
W <- NMF::basis(solution)
H <- NMF::coef(solution)
K <- ncol(W)
KLD <- NMF::deviance(solution)
return(list(
W = W,
H = H,
K = K,
KLD = KLD
))
}
filter_nmf <- function(s, bt_flag, RTOL) {
KLD_list <- purrr::map_dbl(s, "KLD")
if (bt_flag) {
ki <- KLD_list <= min(KLD_list) * (1 + RTOL)
s <- s[ki]
if (length(s) > 10) {
# Limit to the 10 best runs
KLD_list <- KLD_list[ki]
s <- s[order(KLD_list)[1:10]]
}
} else if (length(s) > 100 & !bt_flag) {
s <- s[order(KLD_list)[1:100]]
}
s
}
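# Numeric illustration of the RTOL filter above: with RTOL = 1e-3 and a
# minimal KLD of 1000, only runs with KLD <= 1000 * (1 + 1e-3) = 1001 are
# kept (and at most the 10 best of them for bootstrapped catalogs).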
solutions <- list()
cache_list <- chunk2(cache_files, n = length(range))
for (k in seq_along(range)) {
if (bt_flag) {
send_info(
"Extracting ", range[k], " signatures on ",
n_bootstrap, " bootstrapped catalogs with ",
n_nmf_run, " NMF runs for each."
)
} else {
send_info(
"Extracting ", range[k], " signatures on ",
"the original catalog with ",
n_nmf_run, " NMF runs in a parallel chunk."
)
}
fl <- cache_list[[k]]
if (pynmf) {
# Use Python nimfa package
env_install(use_conda, py_path, pkg = "nimfa", pkg_version = "1.4.0")
send_info("Calling Python as backend to run NMF.")
if (!file.exists(fl[1])) {
slist <- call_pynmf(bt_catalog_list, range[k], n_nmf_run, cores)
# Filter solutions with RTOL threshold
if (bt_flag) {
send_info("Keeping at most 10 NMF solutions with KLD within (1+RTOL) range of the best.")
} else {
send_info("Keeping at most 100 best NMF solutions.")
}
slist <- purrr::map(chunk2(slist, length(bt_catalog_list)),
filter_nmf,
bt_flag = bt_flag, RTOL = RTOL
) %>%
purrr::flatten()
solutions[[paste0("K", range[k])]] <- slist
if (keep_cache) {
send_info("Saving cache file to ", fl[1])
saveRDS(slist, file = fl[1])
}
rm(slist)
invisible(gc())
} else {
send_info("Cache run file ", fl[1], " already exists, skip calling.")
send_info("Reading cache file...")
solutions[[paste0("K", range[k])]] <- readRDS(fl[1])
}
} else {
# Use R NMF package
eval(parse(text = "suppressMessages(library('NMF'))"))
call_nmf <- function(i, cores = 1L) {
if (!file.exists(fl[i])) {
send_info(sprintf(
"Running signature number %s for %s catalog %d.",
range[k],
if (bt_flag) "bootstrapped" else "observed",
i
))
r <- NMF::nmf(
bt_catalog_list[[i]],
rank = range[k],
method = "brunet",
seed = seed,
nrun = n_nmf_run,
.options = paste0("v", if (ncol(bt_catalog_list[[i]]) > 100) 4 else 1, "mkp", cores)
)
if (inherits(r, "NMFfit")) {
r <- list(r)
} else {
r <- r@.Data
}
r <- purrr::map(r, extract_nmf)
# Filter solutions with RTOL threshold
if (bt_flag) {
send_info("Keeping at most 10 NMF solutions with KLD within (1+RTOL) range of the best.")
} else {
send_info("Keeping at most 100 best NMF solutions.")
}
r <- filter_nmf(r, bt_flag, RTOL)
saveRDS(r, file = fl[i])
rm(r)
invisible(gc())
} else {
send_info("Cache run file ", fl[i], " already exists, skip.")
}
}
if (n_bootstrap >= 10 * n_nmf_run) {
oplan <- future::plan()
future::plan(set_future_strategy(), workers = cores, gc = TRUE)
on.exit(future::plan(oplan), add = TRUE, after = FALSE)
furrr::future_map(seq_len(n_bootstrap), call_nmf,
cores = 1L,
.progress = TRUE,
.options = furrr::furrr_options(
seed = TRUE,
stdout = FALSE
)
)
} else {
purrr::map(seq_len(n_bootstrap), call_nmf, cores = cores)
}
send_info("Reading NMF run files...")
solutions[[paste0("K", range[k])]] <- purrr::map(cache_list[[k]], readRDS) %>%
purrr::flatten()
send_success("Read successfully.")
if (!keep_cache) {
signal <- file.remove(cache_list[[k]])
if (all(signal)) {
send_success("NMF run files deleted.")
} else {
send_warning("Delete some or all NMF run files failed.")
}
}
}
send_success(
"Done for this signature number. Current memory size used: ",
round(mem_used() / 2^20), "MB"
)
invisible(gc(verbose = FALSE))
}
send_info("Starting process the solution list.")
# Collect solutions
# First normalize all solutions to obtain signatures and activities,
# then cluster the signatures with the "clustering with match" algorithm.
# Clustering uses 1 - cosine similarity as the distance metric.
if ((sum(sapply(solutions, length)) < 200L & length(solutions) < 4) | cores_solution == 1) {
solutions <- purrr::map(
solutions,
.f = process_solution,
catalogue_matrix = catalogue_matrix,
report_integer_exposure = report_integer_exposure,
only_core_stats = only_core_stats
)
} else {
cores <- cores_solution
cores <- min(cores, length(solutions))
send_info(cores, " cores set for processing solutions.")
oplan <- future::plan()
future::plan(set_future_strategy(), workers = cores, gc = TRUE, .skip = TRUE)
on.exit(future::plan(oplan), add = TRUE, after = FALSE)
solutions <- furrr::future_map(
solutions,
.f = process_solution,
catalogue_matrix = catalogue_matrix,
report_integer_exposure = report_integer_exposure,
only_core_stats = only_core_stats,
.progress = TRUE,
.options = furrr::furrr_options(
seed = seed,
stdout = FALSE
)
)
}
send_success(
"Solution list processed. Current memory size used: ",
round(mem_used() / 2^20), "MB"
)
send_info("Merging and checking the solution data.")
# Merge solutions
solutions <- purrr::transpose(solutions)
solutions$stats <- purrr::reduce(solutions$stats, rbind)
solutions$stats_signature <- purrr::reduce(solutions$stats_signature, rbind)
solutions$stats_sample <- purrr::reduce(solutions$stats_sample, rbind)
# Append attributes
solutions$object <- purrr::map(solutions$object, .f = function(obj) {
attr(obj, "nrun") <- n_bootstrap * n_nmf_run
attr(obj, "seed") <- seed
obj
})
# If some components are missing, pad them back with zeros
kept_comps <- rownames(solutions$signature[[1]]$signature_mean)
raw_comps <- rownames(raw_catalogue_matrix)
if (length(kept_comps) < length(raw_comps)) {
to_add <- setdiff(raw_comps, kept_comps)
solutions$object <- purrr::map(solutions$object, .f = function(obj, to_add) {
n <- ncol(obj$Signature)
m <- length(to_add)
mat_add <- matrix(rep(0, n * m),
nrow = m,
dimnames = list(to_add)
)
obj$Signature <- rbind(obj$Signature, mat_add)
obj$Signature.norm <- rbind(obj$Signature.norm, mat_add)
obj
}, to_add = to_add)
}
send_success(
"Merged. Current memory size used: ",
round(mem_used() / 2^20), "MB"
)
if (nrow(solutions$stats) > 1) {
send_info("Ranking signature extraction solutions.")
rank_score <- rank_solutions(solutions$stats)
suggested <- rank_score[order(
rank_score$aggregated_score,
rank_score$signature_number,
decreasing = TRUE
), ][1, ]
send_success(
"The solution of ",
suggested$signature_number,
" signatures (aggregated score = ",
suggested$aggregated_score,
") is the suggested result."
)
solutions$rank_score <- rank_score
solutions$suggested <- as.integer(suggested$signature_number)
} else {
solutions$rank_score <- NA
solutions$suggested <- NA_integer_
}
solutions$catalog_matrix <- raw_catalogue_matrix
class(solutions) <- "ExtractionResult"
send_success("Extraction procedure run successfully.")
solutions
}
#' @param sim_threshold a similarity threshold for selecting samples to auto-rerun
#' the extraction procedure (i.e. `bp_extract_signatures()`), default is `0.95`.
#' @param max_iter the maximum iteration size, default is 10, i.e., at most run
#' the extraction procedure 10 times.
#' @rdname bp
#' @export
bp_extract_signatures_iter <- function(nmf_matrix,
range = 2:5,
sim_threshold = 0.95,
max_iter = 10L,
n_bootstrap = 20L,
n_nmf_run = 50,
RTOL = 1e-3, min_contribution = 0,
cores = min(4L, future::availableCores()),
cores_solution = min(cores, length(range)),
seed = 123456L,
handle_hyper_mutation = TRUE,
report_integer_exposure = FALSE,
only_core_stats = nrow(nmf_matrix) > 100,
cache_dir = file.path(tempdir(), "sigminer_bp"),
keep_cache = FALSE,
pynmf = FALSE,
use_conda = FALSE,
py_path = "/Users/wsx/anaconda3/bin/python") {
force(only_core_stats)
iter_list <- list()
cache_file_list <- c()
for (i in seq_len(max_iter)) {
message("Round #", i)
message("===============================")
bp <- bp_extract_signatures(
nmf_matrix = nmf_matrix,
range = range,
n_bootstrap = n_bootstrap,
n_nmf_run = n_nmf_run,
RTOL = RTOL,
min_contribution = min_contribution,
cores = cores,
cores_solution = cores_solution,
seed = seed,
handle_hyper_mutation = handle_hyper_mutation,
report_integer_exposure = report_integer_exposure,
only_core_stats = only_core_stats,
cache_dir = cache_dir,
keep_cache = keep_cache,
pynmf = pynmf,
use_conda = use_conda,
py_path = py_path
)
# Find samples that need to be rerun and update nmf_matrix
iter_list[[paste0("iter", i)]] <- bp
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
cache_file <- file.path(
cache_dir,
paste0(digest::digest(nmf_matrix), "_round_", i, ".rds")
)
message("Save round #", i, " result to ", cache_file)
saveRDS(bp, file = cache_file)
cache_file_list <- c(cache_file_list, cache_file)
samp2rerun <- bp$stats_sample %>%
dplyr::filter(.data$signature_number == bp$suggested) %>%
dplyr::filter(.data$cosine_distance_mean > 1 - sim_threshold) %>%
dplyr::pull("sample")
nsamp <- dim(bp$catalog_matrix)[2]
if (length(samp2rerun) < 2L) {
if (length(samp2rerun) == 0) {
message("All samples passed the rerun threshold in round #", i)
} else {
message("Only one sample did not pass the rerun threshold in round #", i, ". Stop here.")
}
message("Return.")
break()
} else if (length(samp2rerun) == nsamp) {
message("All samples cannot pass rerun threshold in round #", i)
if (i == 1L) message("It is the first round, maybe your should lower the value.")
message("Return.")
break()
} else {
# Rerun
nmf_matrix <- t(bp$catalog_matrix)
nmf_matrix <- nmf_matrix[samp2rerun, ]
}
}
if (!keep_cache) {
on.exit(unlink(cache_file_list, force = TRUE), add = TRUE)
}
message("Done.")
class(iter_list) <- "ExtractionResultList"
iter_list
}
#' @param x result from [bp_extract_signatures_iter()] or a list of
#' `Signature` objects.
#' @param k an integer sequence specifying the cluster number to get silhouette.
#' @param include_final_iteration if `FALSE`, exclude the final iteration result
#' from clustering for input from [bp_extract_signatures_iter()]; not applied
#' when the input is a list of `Signature` objects.
#' @rdname bp
#' @export
bp_cluster_iter_list <- function(x, k = NULL, include_final_iteration = TRUE) {
if (length(x) < 2) {
stop("No need to cluster length-1 result list.")
}
if (inherits(x, "ExtractionResultList")) {
x <- purrr::map(x, ~ bp_get_sig_obj(., .$suggested))
if (isFALSE(include_final_iteration)) {
x <- x[-length(x)]
}
}
if (!inherits(x[[1]], "Signature")) {
stop("The list element should be a Signature object.")
}
sig_list <- purrr::map(x, "Signature.norm")
sigmat <- purrr::imap(sig_list, function(x, y) {
colnames(x) <- paste(y, colnames(x), sep = ":")
x
}) %>% purrr::reduce(cbind)
cosdist <- 1 - cosineMatrix(sigmat, sigmat)
rownames(cosdist) <- colnames(cosdist) <- colnames(sigmat)
# Do clustering
distobj <- stats::as.dist(cosdist)
cls <- stats::hclust(distobj, method = "average")
if (is.null(k)) {
# Set default k sequence
k <- seq(2, max(sapply(sig_list, ncol)))
}
sil_df <- purrr::map_df(
k,
function(x) {
cls <- cluster::silhouette(cutree(cls, k = x), distobj)
cbind(
data.frame(
signame = rownames(cosdist),
k = x
),
data.frame(cls[, 1:3])
)
}
)
sil_summary <- sil_df %>%
dplyr::group_by(.data$k) %>%
dplyr::summarise(
min = min(.data$sil_width, na.rm = TRUE),
mean = mean(.data$sil_width, na.rm = TRUE),
max = max(.data$sil_width, na.rm = TRUE),
sd = sd(.data$sil_width, na.rm = TRUE)
) %>%
as.data.frame()
r <- list(
cluster = cls,
distance = cosdist,
sil_df = sil_df,
sil_summary = sil_summary,
sigmat = sigmat
)
class(r) <- "SignatureListClusters"
r
}
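# Example usage (hypothetical; assumes 'e2' comes from
# bp_extract_signatures_iter() with multiple rounds, as in the examples):
#   cl <- bp_cluster_iter_list(e2)
#   plot(cl$cluster)
#   factoextra::fviz_dend(cl$cluster, k = 2)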
bp_get_cluster_index_list <- function(x) {
rg <- range(x)
sq <- seq(rg[1], rg[2])
y <- purrr::map(sq, ~ which(x == .))
names(y) <- as.character(sq)
y
}
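# Quick illustration of the helper above:
#   bp_get_cluster_index_list(c(1, 2, 1, 3))
#   #> list(`1` = c(1, 3), `2` = 2, `3` = 4)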
#' @param SigClusters result from [bp_cluster_iter_list()].
#' @param cluster_label cluster labels for a specified cluster number, obtain it
#' from `SigClusters$sil_df`.
#' @rdname bp
#' @export
bp_get_clustered_sigs <- function(SigClusters, cluster_label) {
sig_idx <- bp_get_cluster_index_list(cluster_label)
sig_map <- purrr::map(sig_idx, ~ colnames(SigClusters$sigmat)[.])
names(sig_map) <- paste0("Sig", names(sig_map))
grp_sigs <- purrr::reduce(
purrr::map(sig_idx, ~ t(t(rowMeans(SigClusters$sigmat[, ., drop = FALSE])))),
cbind
)
colnames(grp_sigs) <- names(sig_map)
sim_sig_to_grp_mean <- purrr::map(
sig_idx,
~ as.numeric(
cosine(
SigClusters$sigmat[, ., drop = FALSE],
t(t(rowMeans(SigClusters$sigmat[, ., drop = FALSE])))
)
)
)
sim_sig_to_grp_mean <- purrr::map2_df(
sim_sig_to_grp_mean,
sig_idx,
~ data.frame(sig = colnames(SigClusters$sigmat)[.y], sim = .x),
.id = "grp_sig"
)
sim_sig_to_grp_mean$grp_sig <- paste0("Sig", sim_sig_to_grp_mean$grp_sig)
return(
list(
grp_sigs = grp_sigs,
sim_sig_to_grp_mean = sim_sig_to_grp_mean,
sig_map = sig_map
)
)
}
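# Example usage (hypothetical, following bp_cluster_iter_list() above;
# pick the cluster number from the silhouette summary, here k = 2):
#   labels <- subset(cl$sil_df, k == 2)$cluster
#   grouped <- bp_get_clustered_sigs(cl, labels)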
#' @param obj a `ExtractionResult` object from [bp_extract_signatures()].
#' @param signum an integer vector to extract the corresponding `Signature` object(s).
#' If it is `NULL` (default), all will be returned.
#' @rdname bp
#' @export
bp_get_sig_obj <- function(obj, signum = NULL) {
assert_class(obj, "ExtractionResult")
if (is.null(signum)) {
message("When signum is NULL, output all signature objects as a list.")
obj$object
} else {
signum <- paste0("K", signum)
if (length(signum) > 1) {
obj$object[signum]
} else {
obj$object[[signum]]
}
}
}
#' @rdname bp
#' @export
bp_get_stats <- function(obj) {
assert_class(obj, "ExtractionResult")
obj[c("stats", "stats_signature", "stats_sample")]
}
#' @rdname bp
#' @export
bp_get_rank_score <- function(obj) {
assert_class(obj, "ExtractionResult")
obj[["rank_score"]]
}
#' @rdname bp
#' @inheritParams show_sig_number_survey
#' @export
bp_show_survey2 <- function(obj,
x = "signature_number",
left_y = "silhouette", right_y = "L2_error",
left_name = left_y, right_name = right_y,
left_color = "black", right_color = "red",
left_shape = 16, right_shape = 18,
shape_size = 4, highlight = NULL) {
stopifnot(inherits(obj, "ExtractionResult") | is.data.frame(obj))
if (inherits(obj, "ExtractionResult")) {
obj <- obj$stats
}
message("Variables can be used: ", paste(colnames(obj), collapse = ", "))
show_sig_number_survey(
object = obj,
x = x,
left_y = left_y, right_y = right_y,
left_name = left_name, right_name = right_name,
left_color = left_color, right_color = right_color,
left_shape = left_shape, right_shape = right_shape,
shape_size = shape_size, highlight = highlight
)
}
#' @rdname bp
#' @param add_score if `FALSE`, don't show the aggregated score or label
#' optimal points by rank score.
#' @param scales one of "free_y" (default) and "free" to control the scales
#' of plot facet.
#' @param fixed_ratio if `TRUE` (default), make the x/y axis ratio fixed.
#' @export
bp_show_survey <- function(obj,
add_score = FALSE,
scales = c("free_y", "free"),
fixed_ratio = TRUE) {
assert_class(obj, "ExtractionResult")
scales <- match.arg(scales)
if (!is.data.frame(obj$rank_score)) {
message("Show survey cannot work when only one signature extracted.")
return(invisible())
}
cols <- c(
"signature_number",
"aggregated_score",
"silhouette",
"sample_cosine_distance",
"L2_error",
"signature_similarity_within_cluster",
"exposure_positive_correlation"
)
rs <- obj$rank_score[, cols]
nsig <- nrow(rs)
plot_df <- merge(
obj$stats,
rs[, c("signature_number", "aggregated_score")],
by = "signature_number"
) %>%
dplyr::select(cols)
colnames(plot_df) <- colnames(rs) <- c("sn", "as", "sil", "cos", "err", "sim", "corr")
cn <- c("score", "silhouette", "distance", "error", "similarity", "pos cor")
names(cn) <- c("as", "sil", "cos", "err", "sim", "corr")
plot_df <- plot_df %>%
tidyr::pivot_longer(
cols = -"sn",
names_to = "type",
values_to = "measure"
)
rs <- rs %>%
tidyr::pivot_longer(
cols = -"sn",
names_to = "type",
values_to = "measure"
)
rs <- rs %>%
dplyr::group_by(.data$type) %>%
dplyr::mutate(rk = rank(.data$measure, na.last = FALSE, ties.method = "first")) %>%
dplyr::select(-"measure")
plot_df <- dplyr::left_join(plot_df, rs, by = c("sn", "type")) %>%
dplyr::mutate(
type = cn[.data$type],
type = factor(.data$type, levels = cn)
)
if (!add_score) {
plot_df <- plot_df %>%
dplyr::filter(!type %in% c("silhouette", "error"))
}
if (add_score) {
p <- ggplot(plot_df, aes_string(x = "sn", y = "measure")) +
geom_line() +
geom_point() +
geom_point(
data = dplyr::filter(
plot_df,
.data$rk == nsig & .data$type != "score"
),
color = "orange"
) +
geom_point(
data = dplyr::filter(
plot_df,
.data$rk == nsig & .data$type == "score"
),
color = "red"
)
} else {
p <- ggplot(
plot_df %>%
dplyr::filter(.data$type != "score"),
aes_string(x = "sn", y = "measure")
) +
geom_line() +
geom_point()
}
p <- p +
facet_wrap(~type, nrow = if (length(unique(plot_df$type)) > 3) 2 else 1, scales = scales) +
cowplot::theme_cowplot() +
labs(x = NULL, y = NULL)
if (fixed_ratio) {
p <- p + theme(aspect.ratio = 1)
}
p
}
#' @param input result from [bp_extract_signatures()] or a Signature object.
#' @param sample_class a named string vector whose names are sample names
#' and values are class labels (i.e. cancer subtype). If it is `NULL` (the default),
#' treat all samples as one group.
#' @param method one of 'bt' (use bootstrap exposure median, from reference #2,
#' **the most recommended way in my personal view**) or 'stepwise'
#' (stepwise reduce and update signatures, then do signature fitting
#' with the last signature set, from reference #2; the result tends to assign
#' the contribution of removed signatures to the remaining signatures,
#' **maybe I misunderstand the paper's method? PAY ATTENTION**).
#' @param bt_use_prop this parameter is only used for the `bt` method to reset
#' low-contribution signature activity (relative activity `<0.01`). If `TRUE`,
#' use the empirical P value calculation (i.e. proportion, used by reference #2);
#' otherwise a `t.test` is applied.
#' @param cache_dir a directory for keeping temporary result files.
#' @param keep_cache if `TRUE`, keep cache results.
#' @inheritParams sig_fit
#' @inheritParams sig_fit_bootstrap_batch
#' @rdname bp
#' @export
bp_attribute_activity <- function(input,
sample_class = NULL,
nmf_matrix = NULL,
method = c("bt", "stepwise"),
bt_use_prop = FALSE,
return_class = c("matrix", "data.table"),
use_parallel = FALSE,
cache_dir = file.path(tempdir(), "sigminer_attribute_activity"),
keep_cache = FALSE) {
# Logic for 'bt': get the median exposure value from signature fitting;
# we set the exposure of a signature to zero if
# (1) the bootstrapped exposure distribution is significantly less
# than the threshold of 1% of total mutations by t.test.
# If a t.test is impractical, then (2) is applied
# (in practice, the program should rarely need to run (2)):
# (2) more than 5% of
# the bootstrapped exposure values (empirical P = 0.05) are
# below the threshold of 1% of total mutations in the sample.
# Logic for 'stepwise': exclude a class-specific signature if it
# contributes <0.01 similarity, and include a global signature if it adds >0.05 similarity.
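# A minimal sketch of rules (1) and (2) for one signature in one sample
# (objects here are hypothetical: bt_expo is a vector of bootstrapped
# absolute exposures, catalog is the sample's mutation catalog):
#   thr <- 0.01 * sum(catalog)
#   # (1) one-sided t-test: exposure significantly below the threshold?
#   t.test(bt_expo, mu = thr, alternative = "less")$p.value < 0.05
#   # (2) empirical proportion fallback:
#   mean(bt_expo < thr) > 0.05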
set.seed(123456, kind = "L'Ecuyer-CMRG")
method <- match.arg(method)
return_class <- match.arg(return_class)
if (inherits(input, "ExtractionResult")) {
if (is.null(nmf_matrix)) {
nmf_matrix <- t(input$catalog_matrix)
}
if (is.na(input$suggested)) {
# Only one Signature object
input <- input$object[[1]]
} else {
input <- bp_get_sig_obj(input, signum = input$suggested)
}
}
if (inherits(input, "Signature")) {
sig <- input$Signature.norm
expo <- input$Exposure.norm
} else {
# Other input cases: to be determined
stop("Invalid input!")
}
if (is.null(nmf_matrix)) {
stop("nmf_matrix cannot be NULL!")
}
if (!is.null(sample_class)) {
samps <- names(sample_class)
if (any(duplicated(samps))) {
stop("A sample cannot be assigned to two groups!")
}
idx <- colnames(expo) %in% samps
if (length(samps) != ncol(expo) | !all(idx)) {
n_total <- ncol(expo)
samps2 <- colnames(expo)[idx]
if (n_total - length(samps2) > 0) {
warning(
n_total - length(samps2),
" samples cannot be found in 'sample_class', they will be removed!",
immediate. = TRUE
)
expo <- expo[, samps2, drop = FALSE]
nmf_matrix <- nmf_matrix[samps2, ]
}
sample_class <- sample_class[samps2]
}
}
# The component order of the catalog matrix must match the signature profile matrix.
# The signature order in the existence matrix must match the signature profile matrix.
# The sample order in the existence matrix must also match the catalog matrix.
sig_order <- colnames(sig)
samp_order <- colnames(expo)
catalog_df <- as.data.frame(t(nmf_matrix))
catalog_samp <- colnames(catalog_df)
ix <- samp_order %in% catalog_samp
if (!all(ix)) {
message("The following samples have no original catalogs, they will be removed.")
message("\t", paste(samp_order[!ix], collapse = ", "))
samp_order <- samp_order[ix]
}
catalog_df <- catalog_df[rownames(sig), samp_order, drop = FALSE]
exist_df <- as.data.frame(construct_sig_exist_matrix(expo, sample_class))
exist_df <- exist_df[sig_order, samp_order, drop = FALSE]
act_tmp_dir <- cache_dir
if (!dir.exists(act_tmp_dir)) {
dir.create(act_tmp_dir, recursive = TRUE)
}
message("Set cache dir: ", act_tmp_dir)
if (use_parallel) {
oplan <- future::plan()
future::plan(set_future_strategy(),
workers = if (is.logical(use_parallel)) {
future::availableCores()
} else {
use_parallel
}
)
on.exit(future::plan(oplan), add = TRUE)
}
if (method == "bt") {
if (use_parallel) {
suppressMessages(
furrr::future_pmap(
.l = list(
catalog = catalog_df,
flag_vector = exist_df,
sample = samp_order
),
.f = optimize_exposure_in_one_sample_bt,
sig_matrix = sig,
tmp_dir = act_tmp_dir,
bt_use_prop = bt_use_prop,
.progress = TRUE,
.options = furrr::furrr_options(
seed = TRUE,
stdout = FALSE
)
)
)
} else {
purrr::pmap(
.l = list(
catalog = catalog_df,
flag_vector = exist_df,
sample = samp_order
),
.f = optimize_exposure_in_one_sample_bt,
sig_matrix = sig,
tmp_dir = act_tmp_dir,
bt_use_prop = bt_use_prop
)
}
} else {
# Handle samples one by one (by columns)
if (use_parallel) {
suppressMessages(
furrr::future_pmap(
.l = list(
catalog = catalog_df,
flag_vector = exist_df,
sample = samp_order
),
.f = optimize_exposure_in_one_sample,
sig_matrix = sig,
tmp_dir = act_tmp_dir,
.progress = TRUE,
.options = furrr::furrr_options(
seed = TRUE,
stdout = FALSE
)
)
)
} else {
purrr::pmap(
.l = list(
catalog = catalog_df,
flag_vector = exist_df,
sample = samp_order
),
.f = optimize_exposure_in_one_sample,
sig_matrix = sig,
tmp_dir = act_tmp_dir
)
}
}
message()
message("Reading result files...")
expo_files <- file.path(
act_tmp_dir,
paste0(samp_order, "_expo.rds")
)
sim_files <- file.path(
act_tmp_dir,
paste0(samp_order, "_sim.rds")
)
expo <- purrr::map(expo_files, readRDS)
sim <- purrr::map_dbl(sim_files, readRDS)
expo <- purrr::reduce(expo, c)
expo <- matrix(expo, ncol = length(samp_order))
rownames(expo) <- sig_order
colnames(expo) <- samp_order
rel_expo <- apply(expo, 2, function(x) x / sum(x, na.rm = TRUE))
if (!keep_cache) {
message("Clean cache files when exits.")
on.exit(unlink(act_tmp_dir, recursive = TRUE, force = TRUE), add = TRUE)
}
if (return_class == "data.table") {
expo <- .mat2dt(expo)
rel_expo <- .mat2dt(rel_expo)
}
message("Done.")
list(
abs_activity = expo,
rel_activity = rel_expo,
similarity = sim
)
}
.mat2dt <- function(x) {
x %>%
t() %>%
data.table::as.data.table(keep.rownames = "sample")
}
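# Example (hypothetical input):
#   .mat2dt(matrix(1:4, 2, dimnames = list(c("Sig1", "Sig2"), c("A", "B"))))
# returns a data.table with a 'sample' column (A, B) plus Sig1/Sig2 columns.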
# Python NMF ------------------------------------------------------------------
call_pynmf <- function(V_list, rank, nrun = 1L, cores = 1L) {
on.exit(invisible(gc()))
rank <- as.integer(rank)
nrun <- as.integer(nrun)
cores <- as.integer(cores)
reticulate::source_python(system.file("py", "nmf.py", package = "sigminer", mustWork = TRUE))
res <- MultiVNMF(V_list, rank, nrun, cores) %>%
purrr::flatten() %>%
purrr::flatten()
res <- purrr::map(res, function(x) {
rownames(x$W) <- rownames(V_list[[1]])
colnames(x$H) <- colnames(V_list[[1]])
x
})
res
}
# Install python environment
env_install <- function(use_conda, py_path, pkg, pkg_version) {
if (!requireNamespace("reticulate")) {
stop("Package 'reticulate' is required, please install it firstly!")
}
if (use_conda) {
tryCatch(reticulate::conda_binary(),
error = function(e) {
message("Cannot find conda binary, installing miniconda...")
reticulate::install_miniconda()
}
)
## Prepare conda environment and packages
tryCatch(
reticulate::use_condaenv("sigminer_sigprofiler", required = TRUE),
error = function(e) {
if (grepl("failed to initialize requested version of Python", e$message)) {
stop(e$message, ". Please restart R and try again.")
}
message("Conda environment not detected, creat it and install required packages.")
message("======")
reticulate::conda_create("sigminer_sigprofiler", packages = "python=3.7")
message("Installing packages, be patient...")
message("======")
# reticulate::conda_install("sigminer_sigprofiler",
# packages = "torch==1.5.1",
# pip = TRUE,
# pip_options = "-f https://download.pytorch.org/whl/torch_stable.html"
# )
reticulate::conda_install("sigminer_sigprofiler",
packages = paste0(pkg, "==", pkg_version),
pip = TRUE
)
reticulate::use_condaenv("sigminer_sigprofiler", required = TRUE)
}
)
message("Python and conda environment configuration.")
message("====================")
print(reticulate::py_config())
} else {
if (is.null(py_path)) {
if (!reticulate::py_available(initialize = TRUE)) {
stop("Python is not available!")
}
config <- reticulate::py_config()
reticulate::use_python(config$python, required = TRUE)
} else {
reticulate::use_python(py_path, required = TRUE)
}
message("Python environment configuration.")
message("====================")
print(reticulate::py_config())
}
if (pkg == "SigProfilerExtractor" & !reticulate::py_module_available("torch")) {
message("torch not found, try installing it...")
tryCatch(
reticulate::py_install("torch==1.5.1", pip = TRUE),
error = function(e) {
message("Cannot install torch just with pip, try again with source from https://download.pytorch.org/whl/torch_stable.html")
reticulate::py_install("torch==1.5.1", pip = TRUE, pip_options = "-f https://download.pytorch.org/whl/torch_stable.html")
}
)
}
if (!reticulate::py_module_available(pkg)) {
message("Python module ", pkg, " not found, try installing it...")
reticulate::py_install(paste0(pkg, "==", pkg_version), pip = TRUE)
}
if (pkg != "SigProfilerExtractor") {
if (!reticulate::py_module_available("multiprocess")) {
message("Python module multiprocess not found, try installing it...")
reticulate::py_install("multiprocess", pip = TRUE)
}
}
}