R/delineate_with_similarity.R
In maldipickr: Dereplicate and Cherry-Pick Mass Spectrometry Spectra

# WARNING - Generated by {fusen} from dev/dereplicate-spectra.Rmd: do not edit by hand

#' Delineate clusters from a similarity matrix
#'
#' From a matrix of spectra similarity (e.g., with the cosine metric,
#' or Pearson product moment), infer the species clusters based on a
#' threshold **above** (or **equal to**) which spectra are considered alike.
#'
#' @param sim_matrix An \eqn{n \times n} similarity matrix, with \eqn{n} the number of spectra. Column names must be identical to the row names.
#' @param threshold A numeric value indicating the minimal similarity between two spectra. Adjust according to the similarity metric used.
#' @param method The method of hierarchical clustering to use. The default and recommended method is "complete", but any method from [stats::hclust] is valid.
#'
#' @return A tibble of \eqn{n} rows, one per spectrum, and 3 columns:
#' * `name`: the rownames of the similarity matrix indicating the spectra names
#' * `membership`: integers stating the cluster number to which the spectra belong. It ranges from 1 to \eqn{c}, the total number of clusters.
#' * `cluster_size`: integers indicating the total number of spectra in the corresponding cluster.
#'
#' @details The similarity matrix is converted to a distance matrix by subtracting each similarity from one (\eqn{d = 1 - s}). This approach works for cosine similarity and positive correlations, which have an upper bound of 1. Clusters are then delineated using hierarchical clustering. The default method is complete linkage (also known as farthest neighbor clustering), which ensures that the minimum within-cluster similarity respects the threshold. See the Details section of [stats::hclust] for other valid methods.
#' 
#' @seealso For similarity metrics: [`coop::tcosine`](https://rdrr.io/cran/coop/man/cosine.html), [`stats::cor`](https://rdrr.io/r/stats/cor.html), [`Hmisc::rcorr`](https://rdrr.io/cran/Hmisc/man/rcorr.html). For using taxonomic identifications for clusters: [delineate_with_identification]. For further analyses: [set_reference_spectra].
#' @export
#' @examples
#' # Toy similarity matrix between the six example spectra of
#' #  three species. The cosine metric is used and a value of
#' #  zero indicates dissimilar spectra and a value of one
#' #  indicates identical spectra.
#' cosine_similarity <- matrix(
#'   c(
#'     1, 0.79, 0.77, 0.99, 0.98, 0.98,
#'     0.79, 1, 0.98, 0.79, 0.8, 0.8,
#'     0.77, 0.98, 1, 0.77, 0.77, 0.77,
#'     0.99, 0.79, 0.77, 1, 1, 0.99,
#'     0.98, 0.8, 0.77, 1, 1, 1,
#'     0.98, 0.8, 0.77, 0.99, 1, 1
#'   ),
#'   nrow = 6,
#'   dimnames = list(
#'     c(
#'       "species1_G2", "species2_E11", "species2_E12",
#'       "species3_F7", "species3_F8", "species3_F9"
#'     ),
#'     c(
#'       "species1_G2", "species2_E11", "species2_E12",
#'       "species3_F7", "species3_F8", "species3_F9"
#'     )
#'   )
#' )
#' # Delineate clusters based on a 0.92 threshold applied
#' #  to the similarity matrix
#' delineate_with_similarity(cosine_similarity, threshold = 0.92)
delineate_with_similarity <- function(sim_matrix, threshold, method = "complete") {
  if (!is.matrix(sim_matrix)) {
    stop("The similarity matrix is not a matrix.")
  }
  if (nrow(sim_matrix) != ncol(sim_matrix)) {
    stop("The similarity matrix is not square: nrow != ncol.")
  }
  if (is.null(rownames(sim_matrix)) || is.null(colnames(sim_matrix))) {
    stop("The similarity matrix has no rownames or colnames.")
  }
  if (any(rownames(sim_matrix) != colnames(sim_matrix))) {
    stop("The similarity matrix has no identical names.")
  }
  if (!is.numeric(threshold)) {
    stop("The threshold provided is not a numeric.")
  }
  if (threshold < 0 || threshold > 1) {
    stop("The threshold provided is not in the range [0-1].")
  }
  
  # Clustering with default complete-linkage after
  #  conversion of similarity matrix to distance matrix
  #
  # WARNING: although hclust works with any distance, we expect distances in the range [0, 1]
  dist_matrix <- stats::as.dist(1 - sim_matrix)
  hierarchical_clustering <- stats::hclust(dist_matrix, method = method)
  
  # Spectra belong to the same cluster only if their similarity is above or equal to the threshold
  #
  # The threshold is converted to a distance threshold
  dist_threshold <- 1 - threshold
  memberships <- stats::cutree(hierarchical_clustering, h = dist_threshold)


  memberships %>%
    tibble::enframe(value = "membership") %>%
    dplyr::group_by(.data$membership) %>%
    dplyr::mutate(
      "membership" = base::as.integer(.data$membership),
      "cluster_size" = dplyr::n()
    ) %>%
    dplyr::ungroup() %>%
    return()
}
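
# Illustration only (not part of the package API): a minimal base R sketch of
#  the steps described in the documentation above, namely converting
#  similarities to distances, clustering hierarchically with complete linkage,
#  and cutting the tree at `1 - threshold`.
# All `toy_*` names and values are hypothetical and chosen for readability;
#  the tibble/dplyr formatting of the real function is replaced here by a
#  plain data.frame.
toy_names <- c("a", "b", "c", "d")
toy_similarity <- matrix(
  c(
    1.00, 0.95, 0.50, 0.50,
    0.95, 1.00, 0.50, 0.50,
    0.50, 0.50, 1.00, 0.90,
    0.50, 0.50, 0.90, 1.00
  ),
  nrow = 4,
  dimnames = list(toy_names, toy_names)
)
toy_threshold <- 0.92

# Distance is one minus similarity, so the similarity threshold becomes
#  a cut height of `1 - toy_threshold` on the dendrogram
toy_tree <- stats::hclust(
  stats::as.dist(1 - toy_similarity),
  method = "complete"
)
toy_membership <- stats::cutree(toy_tree, h = 1 - toy_threshold)

# One row per spectrum, with the size of the cluster it belongs to
toy_clusters <- data.frame(
  name = names(toy_membership),
  membership = unname(toy_membership),
  cluster_size = tabulate(toy_membership)[toy_membership]
)

# "a" and "b" (similarity 0.95) share a cluster, whereas "c" and "d"
#  (similarity 0.90, below the 0.92 threshold) remain singletons
print(toy_clusters)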