#' QC by peak-detection algorithms.
#'
#' @description
#'
#' CNAqc uses peak-detection algorithms to QC data; all leverage the idea that
#' VAFs peaks are known for mutations mapped to a segment with given minor/ major
#' allele copies. CNAqc therefore computes expected peaks, and compares them to
#' peaks detected from data. The theory works with minor modifications for both
#' clonal and subclonal segments. Three distinct algorithms are available, each
#' one working with a different type of copy number segment; all analyses are
#' called by this function, which takes care of running all the suitable algorithms
#' based on the input data.
#'
#' * Simple clonal segments (1:0, 2:0, 1:1, 2:1, 2:2). This QC measures an error for
#' the precision of the current purity estimate, failing a whole sample or a subset
#' of segments the value is over a desired maximum value. The error is determined as
#' a linear combination from the distance between VAF peaks and their theoretical
#' expectation. For this analysis, all mutations mapping across any segment with the
#' same major/minor alleles are pooled. Note that this score can be used to select
#' among alternative copy number solutions, i.e., favouring a solution with lower score.
#' The peaks are determined i) via peak-detection algorithms from the peakPick package,
#' applied to a Gaussian kernel density estimate (gKDE) smooth of the VAF distribution,
#' and ii) via the Bmix Binomial mixture model. Peak-matching (i.e., determining what
#' data peak is closest to the expected peak) has two possible implementations: one
#' matching the closest peaks by euclidean distance, the other ranking peaks from
#' higher to lowr VAFs, and prioritising the former.
#'
#' * Complex clonal segments. The QC procedure for these “general” segments uses only
#' the KDE and, as for simple segments, pools all mutations mapping across any segment
#' with the same major/minor alleles. In this case,t no segment-level or sample-level
#' scores are produced, and complex segment with many matched peaks is likely to be correct.
#'
#' * Subclonal simple segments. The QC procedure for these segments uses the KDE and
#' considers 2 subclones with distinct mixing proportions. Differently from clonal CNAs,
#' however, here the analysis is carried out at the level of each segment, i.e., without
#' pooling segments with the same karyotypes. This makes it possible to use subclonal calls
#' from callers that report segment-specific CCF values, e.g., Battenberg. The model in
#' CNAqc ranks the proposed evolutionary alternatives (linear versus branching) based on
#' the number of matched peaks. A subclonal segment with many matched peaks is likely to be
#' correct.
#'
#' Results from peak-based QC are available via \code{plot_peaks_analysis}, and stored
#' inside the input object.
#'
#' @param x A CNAqc object.
#' @param karyotypes For clonal simple CNAs, the list of segments to test; by default LOH regions (A, AA),
#' diploid regions (AB), and amplification regions (AAB, AABB) are tested, corresponding to
#' \code{'1:0', '1:1', '2:1', '2:0', '2:2'} in "Major:minor" notation.
#' @param min_karyotype_size For clonal simple CNAs, a filter for the segments
#' to test. The segment size is defined based on the number of mutations mapped, this cut is on the proportion
#' relative to the whole set of segments one wishes to analyse (defined by `karyotypes`). For example, by setting
#' `min_karyotype_size = 0.2` one would QC clonal simple CNAs that contain at least 20% of the total mutations.
#' The default of this parameter is `0` (all QCed).
#' @param min_absolute_karyotype_mutations For clonal simple CNAs, as \code{min_karyotype_size} but with a cut
#' measured on absolute mutation counts. For example, by setting `min_absolute_karyotype_mutations = 150` one
#' would QC clonal simple CNAs that contain at least `150` mutations. The default of this parameter is `100`.
#' @param p_binsize_peaks For clonal simple CNAs, peaks detected will be filtered if, in a peak, we map
#' less than \code{p_binsize_peaks * N} mutations. The value \code{N} is obtained couting all mutations that map
#' in all peaks. By default this parameters is `0.005`.
#' @param matching_epsilon Deprecated parameter.
#' @param purity_error For clonal simple CNAs, the purity error tolerance to determine QC pass or fail. This can be
#' set automatically using function \code{auto_tolerance} to optimise the analysis based on a desired rate of false
#' positives matches, as a function of the data coverage and (putative) purity.
#' @param VAF_tolerance For clonal simple CNAs, a tolerance in comparing bands overlaps which is applied
#' to the raw VAF values.
#' @param n_bootstrap For clonal simple CNAs, the number of times peak detection is bootstrapped (by default 1).
#' This helps sometimes finding peaks that might be visually observable but fail to be detected by the underlying
#' peak-detection heuristics.
#' @param matching_strategy For clonal simple CNAs, if \code{"closest"} the closest peak will be used to
#' match the expected peak. If \code{"rightmost"} peaks are matched prioritizing
#' right to left peaks (the higher-VAF gets matched first); this strategy is more correct
#' in principle but works only if there are no spurious peaks in the estimated
#' density. By default the \code{"closest"} strategy is used.
#' @param kernel_adjust For KDE-based matches the adjust density parameter; see \code{density}. Note that a
#' Gaussian kernel is used by setting (\code{kernel = 'gaussian'}).
#' @param KDE Deprecated parameter.
#' @param starting_state_subclonal_evolution For subclonal simple CNAs, the starting state to determine linear versus
#' branching evolutionary models. By default this is an heterozygous diploid `1:1` state.
#' @param cluster_subclonal_CCF For subclonal segments, should the tool try to merge segments with similar CCF and the same copy number alteration?
#' @param min_VAF Only mutations with VAF higher than the supplied cut-off will be used for the QC, but they will not be removed from the final object.
#' If the data are multisample (multi-region or longitudinal data) it is strongly advised to set as 0 (default) to avoid including private mutations of samples
#' in the peak detection.
#' @return An object of class \code{cnaqc}, modified to hold the results from this analysis. For every type
#' of segment analyzed tables with summary peaks are available in \code{x$peaks_analysis}. The most helpful table
#' is usually the one for simple clonal CNAs `x$peaks_analysis$matches`, which reports several information:
#'
#' - `mutation_multiplicity`, the number of copies of the mutation (i.e., a phasing information);
#' - `peak`, `x`, `y` the expected peak, and the matched peak (`x` and `y`);
#' - `offset`, `weight` and `score`, the factors of the final score;
#' - `QC`, a pass/fail status for the peak.
#'
#' The overall sample-level QC result is available in `x$peaks_analysis$QC`.
#'
#' @seealso \code{auto_tolerance}, \code{plot_peak_analysis} and \code{plot_QC}.
#'
#' @export
#'
#' @examples
#' data('example_dataset_CNAqc', package = 'CNAqc')
#' x = init(mutations = example_dataset_CNAqc$mutations, cna = example_dataset_CNAqc$cna, purity = example_dataset_CNAqc$purity)
#'
#' # Note the run outputs
#' x = analyze_peaks(x)
#'
#' # More precise messages
#' print(x)
#'
#' # The tabulars with summary results per peak and segment
#' print(x$peaks_analysis)
#'
#' # Analysis where simple clonal segments are matched with an alternative algorithm.
#' x = analyze_peaks(x, matching_strategy = "rightmost")
#'
#' print(x)
#'
analyze_peaks = function(x,
karyotypes = c('1:0', '1:1', '2:0', '2:1', '2:2'),
min_karyotype_size = 0,
min_absolute_karyotype_mutations = 100,
p_binsize_peaks = 0.005,
matching_epsilon = NULL,
purity_error = 0.05,
VAF_tolerance = 0.015,
n_bootstrap = 1,
kernel_adjust = 1,
matching_strategy = "closest",
KDE = TRUE,
starting_state_subclonal_evolution = "1:1",
cluster_subclonal_CCF = FALSE,
min_VAF = 0)
{
if (!is.null(matching_epsilon)) {
stop("matching_epsilon is deprecated - using purity_error = ",
purity_error)
matching_epsilon = purity_error
}
# Check inputs
stopifnot(inherits(x, "cnaqc"))
stopifnot(min_karyotype_size >= 0 & min_karyotype_size < 1)
stopifnot(min_absolute_karyotype_mutations >= 0)
stopifnot(p_binsize_peaks > 0 & p_binsize_peaks < 1)
stopifnot(purity_error > 0 & purity_error < 1)
stopifnot(is.numeric(kernel_adjust))
stopifnot(matching_strategy %in% c("closest", "rightmost"))
# Common peaks analysis - they must be in the sample
cli::cli_h1("Peak analysis: simple CNAs")
cat("\n")
x = x %>% analyze_peaks_common(
karyotypes = karyotypes,
min_karyotype_size = min_karyotype_size,
min_absolute_karyotype_mutations = min_absolute_karyotype_mutations,
p_binsize_peaks = p_binsize_peaks,
purity_error = purity_error,
VAF_tolerance = VAF_tolerance,
n_bootstrap = n_bootstrap,
kernel_adjust = kernel_adjust,
min_VAF = min_VAF
)
x$cna <- dplyr::left_join(x$cna,CNAqc:::compute_QC_table(x)$QC_table %>% filter(type == "Peaks") %>%
dplyr::mutate(QC_PASS = dplyr::if_else(QC == "PASS", TRUE, FALSE)) %>%
tidyr::separate(karyotype, into = c("Major", "minor"), sep = ":") %>%
mutate(Major = as.numeric(Major), minor = as.numeric(minor)) %>%
dplyr::select(Major, minor, QC_PASS))
x$mutations <- dplyr::left_join(x$mutations %>% ungroup(), CNAqc:::compute_QC_table(x)$QC_table %>%
filter(type == "Peaks") %>%
dplyr::mutate(QC_PASS = dplyr::if_else(QC == "PASS", TRUE, FALSE)) %>%
dplyr::select(karyotype, QC_PASS))
# Generalised peak analysis
cli::cli_h1("Peak analysis: complex CNAs")
cat("\n")
w = x$n_karyotype[!(x$n_karyotype %>% names() %in% karyotypes)]
w = w[w > min_absolute_karyotype_mutations]
if(length(w) > 0)
{
cli::cli_alert_info(
"Karyotypes {.field {names(w)}} with >{.field {min_absolute_karyotype_mutations}} mutation(s). Using epsilon = {.field {purity_error}}."
)
x = x %>% analyze_peaks_general(
n_min = min_absolute_karyotype_mutations,
epsilon = purity_error,
kernel_adjust = kernel_adjust,
n_bootstrap = n_bootstrap,
min_VAF = min_VAF
)
x$peaks_analysis$general$summary %>%
print()
x$cna <- dplyr::left_join(x$cna, x$peaks_analysis$general$summary %>%
dplyr::mutate(QC_PASS = dplyr::if_else(prop >= 0.5, TRUE, FALSE)) %>%
tidyr::separate(karyotype, into = c("Major", "minor"), sep = ":") %>%
mutate(Major = as.numeric(Major), minor = as.numeric(minor)) %>%
dplyr::select(Major, minor, QC_PASS))
x$mutations <- dplyr::left_join(x$mutations, x$peaks_analysis$general$summary %>%
dplyr::mutate(QC_PASS = dplyr::if_else(prop >= 0.5, TRUE, FALSE)) %>%
dplyr::select(karyotype, QC_PASS))
}
else
cli::cli_alert_info(
"No karyotypes with >{.field {min_absolute_karyotype_mutations}} mutation(s). "
)
cli::cli_h1("Peak analysis: subclonal CNAs")
cat("\n")
# Subclonal CNAs peak analysis
if(x$n_cna_subclonal > 0)
{
x = x %>% analyze_peaks_subclonal(
n_min = min_absolute_karyotype_mutations,
epsilon = purity_error,
kernel_adjust = kernel_adjust,
n_bootstrap = n_bootstrap,
starting_state = starting_state_subclonal_evolution,
cluster_subclonal_CCF = cluster_subclonal_CCF
)
if(!is.null(x$peaks_analysis$subclonal))
x$peaks_analysis$subclonal$summary %>% print()
else
cli::cli_alert_info("Subclonal CNAs not analysed with the current parameters.")
}
else
cli::cli_alert_info("No subclonal CNAs in this sample.")
return(x)
}
# analyze_peaks = function(x,
# karyotypes = c('1:0', '1:1', '2:0', '2:1', '2:2'),
# min_karyotype_size = 0,
# min_absolute_karyotype_mutations = 100,
# p_binsize_peaks = 0.005,
# matching_epsilon = NULL,
# purity_error = 0.05,
# VAF_tolerance = 0.015,
# n_bootstrap = 1,
# kernel_adjust = 1,
# matching_strategy = "closest",
# KDE = TRUE,
# starting_state_subclonal_evolution = "1:1")
# {
#
# if (!is.null(matching_epsilon)) {
# stop("matching_epsilon is deprecated - using purity_error = ",
# purity_error)
# matching_epsilon = purity_error
# }
#
# # Check input
# stopifnot(inherits(x, "cnaqc"))
# stopifnot(min_karyotype_size >= 0 & min_karyotype_size < 1)
# stopifnot(min_absolute_karyotype_mutations >= 0)
# stopifnot(p_binsize_peaks > 0 & p_binsize_peaks < 1)
# stopifnot(purity_error > 0 & purity_error < 1)
# stopifnot(is.numeric(kernel_adjust))
# stopifnot(matching_strategy %in% c("closest", "rightmost"))
#
# # Generalised peak analysis
# cli::cli_h1("Peak analysis: sample-level QC with common karyotypes")
# cat("\n")
#
#
# # Karyotypes of interest, and filter for karyotype size
# qc_mutations = x$mutations %>%
# dplyr::filter(karyotype %in% karyotypes)
#
# filtered_qc_mutations = qc_mutations %>%
# dplyr::group_by(karyotype) %>%
# dplyr::summarise(
# n = n(),
# n_proportion = n() / x$n_mutations,
# .groups = 'drop'
# ) %>%
# dplyr::arrange(desc(n)) %>%
# dplyr::mutate(QC = (n_proportion > min_karyotype_size) &
# n > min_absolute_karyotype_mutations)
#
# # Re-normalize karyotype size for the ones with QC = true
# N_total = sum(filtered_qc_mutations %>% filter(QC) %>% pull(n_proportion))
# filtered_qc_mutations = filtered_qc_mutations %>%
# dplyr::mutate(norm_prop = ifelse(QC, n_proportion / N_total, NA))
#
# n_k = sum(filtered_qc_mutations %>% filter(QC) %>% pull(n))
#
# cli::cli_alert_info(
# "Karyotypes {.field {karyotypes}}. Matching strategy {.field {matching_strategy}}. KDE = {.field {KDE}}."
# )
#
# cli::cli_alert_info(
# paste0(
# "Found n = ",
# n_k,
# ' mutations in ',
# paste(
# filtered_qc_mutations %>% dplyr::filter(QC) %>% dplyr::pull(karyotype),
# collapse = ', '
# ),
# ' (skipping those with n < ',
# round(min_karyotype_size * x$n_mutations),
# ' mutations).'
# )
# )
#
# # Special case, nothing to compute
# if (n_k == 0) {
# warning("There are non mutations to detect peaks, returning input object.")
# return(x)
# }
#
# # Actual data and analysis
# qc_karyotypes = filtered_qc_mutations %>% dplyr::filter(QC) %>% dplyr::pull(karyotype)
# qc_mutations = qc_mutations %>% dplyr::filter(karyotype %in% qc_karyotypes)
#
# tumour_purity = x$purity
#
# detections = NULL
# for (k in karyotypes)
# {
# # For karyotypes that cannot be checked we create a dummy entry
# if (!(k %in% qc_karyotypes))
# {
# detection = list(matching = NULL)
# }
# else
# {
# cli::cli_alert("Fitting {.field {k}} from {.field {x$n_karyotype[k]}} mutations - {.field {n_bootstrap}} bootstrap(s).")
#
# # The other ones are actually checked
# AB = as.numeric(strsplit(k, ':')[[1]])
# expectation = CNAqc:::expected_vaf_peak(AB[1], AB[2], tumour_purity)
#
# band_matching = delta_vaf_karyo(epsilon_error = purity_error,
# purity = x$purity) %>%
# dplyr::filter(karyotype == k) %>%
# pull(delta_vaf)
#
# # Bootstrap function - w is a fake parameter, the N total
# # is used meaning that if N==1 no bootstrap is computed
# pd = function(w, n_boot)
# {
# w = qc_mutations %>%
# dplyr::filter(karyotype == k)
#
# # Bootstrap only if multiple samples are required
# data_input = w
# if (n_boot > 1)
# data_input = w %>% dplyr::sample_n(size = nrow(w), replace = TRUE)
#
# run_results = NULL
#
# if (matching_strategy == "rightmost")
# run_results = peak_detector(
# mutations = data_input,
# expectation = expectation,
# tumour_purity = tumour_purity,
# filtered_qc_mutations = filtered_qc_mutations,
# p = p_binsize_peaks,
# kernel_adjust = kernel_adjust,
# matching_epsilon = band_matching,
# VAF_tolerance = VAF_tolerance,
# KDE = KDE
# )
#
# if (matching_strategy == "closest")
# run_results = peak_detector_closest_hit_match(
# mutations = data_input,
# expectation = expectation,
# tumour_purity = tumour_purity,
# filtered_qc_mutations = filtered_qc_mutations,
# p = p_binsize_peaks,
# kernel_adjust = kernel_adjust,
# matching_epsilon = band_matching,
# VAF_tolerance = VAF_tolerance,
# KDE = KDE
# )
#
# run_results
# }
#
# # Sample "n_bootstrap" times via the bootstrap
# if (n_bootstrap < 0)
# n_bootstrap = 1
#
# # For error handling, we switch to easypar
# # best_peaks = lapply(1:n_bootstrap, pd, n_boot = n_bootstrap)
# best_peaks = easypar::run(
# FUN = pd,
# PARAMS = lapply(1:n_bootstrap, function(w)
# data.frame(w = w, n_boot = n_bootstrap)),
# parallel = FALSE,
# silent = TRUE,
# filter_errors = TRUE
# )
#
# # Runs for this giving ALL errors are intercepted and an error is raised
# # https://github.com/caravagnalab/CNAqc/issues/10
# # Without a better example failure a more intelligent handling is difficult to achieve
# if (length(best_peaks) == 0)
# stop(
# "Cannot QC karyotype ",
# k,
# " - consider removing that from the input \"karyotypes\" vector."
# )
#
# best_score = which.min(sapply(best_peaks, function(w)
# w$matching$weight %*% w$matching$offset))
# best_score = best_score[1]
#
# # Best is returned
# detection = best_peaks[[best_score]]
# }
#
# detections = append(detections, list(detection))
# }
# names(detections) = karyotypes
#
# # =-==-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# # Compute a linear combination for a GOF of each karyotype
# # =-==-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# assembled_corrections = Reduce(bind_rows,
# lapply(detections, function(w)
# w$matching)) %>%
# mutate(score = weight * offset)
# assembled_corrections$purity_error = purity_error
#
# overall_score = sum(assembled_corrections$score)
#
# # ###### ###### ###### ###### ###### ###### ######
# # # QC of the analysis, each karyotype independently up to the overall score
# # ###### ###### ###### ###### ###### ###### ######
# #
# # # Function to a border to a plot
# # qc_plot = function(x, QC)
# # {
# # if (is.na(QC))
# # return(x)
# # qc = ifelse(QC == "FAIL", "indianred3", 'forestgreen')
# #
# # x +
# # theme(title = element_text(color = qc),
# # panel.border = element_rect(colour = qc,
# # fill = NA))
# # }
#
# # QC per karyotype ~ kscore is karyotype score (sum, partila of the overall score)
# score_per_karyotype = assembled_corrections %>%
# dplyr::group_by(karyotype) %>%
# dplyr::arrange(desc(counts_per_bin)) %>%
# dplyr::filter(row_number() == 1) %>%
# dplyr::select(karyotype, matched) %>%
# dplyr::mutate(QC = ifelse(matched, "PASS", 'FAIL')) %>%
# dplyr::select(-matched)
#
# assembled_corrections = assembled_corrections %>% dplyr::left_join(score_per_karyotype, by = 'karyotype')
#
# print(assembled_corrections)
#
# # QC overall score
# # QC = ifelse(abs(overall_score) < matching_epsilon, "PASS", "FAIL")
#
# QC = assembled_corrections %>%
# dplyr::group_by(QC) %>%
# dplyr::summarise(prop = sum(weight)) %>%
# dplyr::arrange(desc(prop)) %>%
# dplyr::filter(dplyr::row_number() == 1) %>%
# dplyr::pull(QC)
#
# if (QC == "FAIL")
# cli::cli_alert_danger(
# "Peak detection {red('FAIL')} with {.value {red(paste0('r = ', overall_score))}} and tolerance e = {.value {2 * purity_error}}"
# )
#
# if (QC == "PASS")
# cli::cli_alert_success(
# "Peak detection {green('PASS')} with {.value {green(paste0('r = ', overall_score))}} and tolerance e = {.value {2 * purity_error}}"
# )
#
# # Plots assembly
# # qc_each_plot = pio:::nmfy(score_per_karyotype$karyotype, score_per_karyotype$QC)
# # na_plots = setdiff(names(detections), score_per_karyotype$karyotype)
# # qc_each_plot[na_plots] = NA
# #
# # plots = lapply(names(detections), function(w)
# # qc_plot(detections[[w]]$plot, QC = qc_each_plot[w]))
# # names(plots) = karyotypes
#
# x$peaks_analysis = list(
# score = overall_score,
# fits = detections,
# matches = assembled_corrections,
# matching_strategy = matching_strategy,
# purity_error = purity_error,
# # plots = plots,
# min_karyotype_size = min_karyotype_size,
# p_binsize_peaks = p_binsize_peaks,
# # matching_epsilon = matching_epsilon,
# QC = QC,
# KDE = KDE
# )
#
# # Generalised peak analysis
# cli::cli_h1("Peak analysis: QC with general karyotypes")
# cat("\n")
#
# w = x$n_karyotype[!(x$n_karyotype %>% names() %in% karyotypes)]
# w = w[w > min_absolute_karyotype_mutations]
#
# if(length(w) > 0)
# {
# cli::cli_alert_info(
# "Karyotypes {.field {names(w)}} with >{.field {min_absolute_karyotype_mutations}} mutation(s). Using epsilon = {.field {purity_error}}."
# )
#
# x = x %>% analyze_peaks_general(
# n_min = min_absolute_karyotype_mutations,
# epsilon = purity_error,
# kernel_adjust = kernel_adjust,
# n_bootstrap = n_bootstrap
# )
#
# x$peaks_analysis$general$summary %>%
# print()
# }
# else
# cli::cli_alert_info(
# "No karyotypes with >{.field {min_absolute_karyotype_mutations}} mutation(s). "
# )
#
#
# cli::cli_h1("Peak analysis: subclonal CNAs")
# cat("\n")
#
# # Subclonal CNAs peak analysis
# if(x$n_cna_subclonal > 0)
# {
# x = x %>% analyze_peaks_subclonal(
# n_min = min_absolute_karyotype_mutations,
# epsilon = purity_error,
# kernel_adjust = kernel_adjust,
# n_bootstrap = n_bootstrap,
# starting_state = starting_state_subclonal_evolution
# )
#
# if(!is.null(x$peaks_analysis$subclonal))
# x$peaks_analysis$subclonal$summary %>% print()
# else
# cli::cli_alert_info("Subclonal CNAs not analysed with the current parameters.")
# }
# else
# cli::cli_alert_info("No subclonal CNAs in this sample.")
#
# return(x)
# }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.