R/define_rb.R
In ulrb: Unsupervised Learning Based Definition of Microbial Rare Biosphere

Documented in define_rb

#' Define Rare Biosphere
#'
#' Classifies taxa in each sample into either "rare", "undetermined" or "abundant". Other classifications are allowed.
#'
#' @details
#' **Overview**
#'
#' Function to cluster taxa abundance with partition around medoids algorithm (Kaufman and Rousseuw. 1991).
#' By default, we propose the division into three clusters (k = 3),
#' which can be translated into convenient classifications: "rare", "undetermined" and "abundant".
#' Taxa from the cluster with lowest median abundance corresponds to the "rare biosphere".
#'
#' @details
#' **The classification vector**
#'
#' The classification vector (argument classification_vector) represents the
#' different clusters to be used, by ascending order of median abundance.
#' To change the number of clusters, change the number of elements in the
#' classification vector, **but order matters!** Depending on the number of clusters used,
#' you can change the meaning that best applies to your research.
#'
#' For example, you can use a classification vector with the designations: "very rare", "rare",
#' "abundant" and "very abundant"; which would apply a k = 4 underneath.
#' It is possible to use any number of clusters, as long as they are within 2 and
#' the maximum possible k.
#'
#' The maximum possible k is the number of different abundance scores observed in a single sample. Note, however,
#'  that we do not recommend any clustering for k > 10
#' and we also don't recommend k = 2 (we explain in more detail in Pascoal et al., 2025;
#' and in the vignette `vignette("explore-classifications")`.
#'
#' @details
#' **Automatic selection of the number of clusters**
#'
#' To automatically decide the number of clusters (i.e., the value of k), it is possible to do so with the argument **automatic=TRUE**. For details on
#' complete automation of [define_rb()], please see the documentation for [suggest_k()]. Briefly, the k with best average Silhouette score
#' is selected from a range of k values between 3 and 10. It is possible to decide k based on other indices ("Davies-Bouldin" or "Calinsky-Harabasz").
#'
#' If you want a more fine grained analysis of k values, we provide several functions:
#'  - [evaluate_k()];
#'  - [check_avgSil()];
#'  - [check_DB()];
#'  - [check_CH()].
#'
#' @details
#' **Verify clustering results**
#'
#' If half of the taxa of any cluster got a Silhouette score below 0.5 in any sample, then a warning is provided.
#' The warning provides the number of times this issue occurred.
#' You can inspect other alternatives to reduce the occurrences of a bad clustering,
#' but it is possible that, in some situations, you just can't find an optimal clustering.
#'
#' The detailed output gives you access to all of the clustering results:
#'  - `pam_object` is a list with the original results from the [cluster::pam()],
#'  see [cluster::pam()] documentation for more details.
#'  - `Level` is an integer indicating the specific cluster attributed by the [cluster::pam()]
#'  function for each observation. Its order is random.
#'  - `Silhouette_scores` provides the Silhouette score obtained for each
#'  observation, i.e. a score for each taxa.
#'  - `Cluster_median_abundance` provides the median taxa abundance of each cluster.
#'  - `median_Silhouette` provides the median Silhouette score obtained for each cluster.
#'  - `Evaluation` indicates if the silhouette score obtained for a given observation
#'   is below the median Silhouette of its cluster and sample.
#'
#' You can make your own plots and analysis, but we also provide another function,
#'  [plot_ulrb()], which illustrates the results obtained.
#'
#' @details
#' **Partition around medoids (pam)**
#'
#' To calculate k-medoids, we used the partition around medoids (pam)
#' algorithm, which was described in Chapter 2 of "Finding Groups in Data: An Introduction to Cluster Analysis."
#' (Kaufman and Rousseeuw, 1991) and implemented by the cluster package with the [cluster::pam()] function.
#'
#' Briefly, the pam algorithm is divided into two main phases: **build** and **swap**.
#'
#' The first phase (**build**) selects k observations as cluster representatives. The first
#' observation selected as representative is the one that minimizes the sum of the dissimilarities to the
#' remaining observations. The second, third and so on repeat the same process, until k clusters have
#' been formed.
#'
#' The **build** steps are:
#'
#' 1 - Propose a centroid with observation, \eqn{i}, which has not been selected as a centroid yet
#'
#' 2 - Calculate the distance between another observation, \eqn{j}, and its most similar
#' observation, \eqn{D_j}; and calculate the difference with the proposed centroid,
#'  \eqn{i}, i.e., \eqn{d(j,i)}
#'
#' 3 - If \eqn{d(j,i) > 0}, then calculate its contribution to the centroid:
#' \deqn{max(D_j - d(j,i),0)}
#'
#' 4 - Calculate the total gain obtained by \eqn{i}, \deqn{\sum_{j}C_{ji}}
#'
#' 5 - From all possible centroids, select the one that maximizes the previous total gain obtained,
#' \deqn{max_i \sum_jC_{ji}}
#'
#' 6 - Repeat until k observations have been selected as cluster representatives.
#'
#' The purpose of the next phase, **swap**, is to improve the representatives
#' for the clusters. The principle is to swap the cluster representative between
#' all possibilities and calculate the value sum of dissimilarities between each
#' observation and the closest centroid. The swapping continues until no more improvement is
#' possible, i.e., when the minimum sum of dissimilarities of the clusters is reached.
#'
#'
#' @details
#' **Notes**:
#'
#' Understand that **ulrb** package considers each sample as an independent
#' community of taxa, which means clustering is also independent across
#' different samples.
#' Thus, be aware that you will have clustering results and metrics for each
#' single sample, which is why we also provide some functions to analyze results across
#' any number of samples (see: [plot_ulrb()] for clustering results
#' and [evaluate_k()] for k selection).
#'
#'
#' @param data A data.frame with, at least, a column for Abundance and Sample. Additional columns are allowed.
#' @param classification_vector A vector of strings with the names for each cluster, from lower to higher abundance. Default is c("Rare", "Undetermined", "Abundance").
#' @param samples_col String with name of column with sample names.
#' @param abundance_col String with name of column with abundance values.
#' @param simplified Can be TRUE/FALSE. Default (FALSE) provides an additional column with detailed [cluster::pam()] results
#'  and Silhouette scores. If TRUE, only the Classification result is added to the original input data.
#' @param automatic By default (FALSE), will assume a classification into "Rare", "Undetermined" or "Abundant". If TRUE, then it will automatically select the number of classifications (or k),
#' based on the index argument.
#' @inheritParams suggest_k
#'
#' @returns The input data.frame with extra columns containing the classification and additional metrics (if detailed = TRUE).
#' @seealso [suggest_k()], [evaluate_k()], [plot_ulrb()], [cluster::pam()]
#' @export
#'
#' @references
#' Kaufman, L., & Rousseuw, P. J. (1991). Chapter 2 in book Finding Groups in Data: An Introduction to Cluster Analysis. Biometrics, 47(2), 788.
#' Pascoal et al. (2025). Definition of the microbial rare biosphere through unsupervised machine learning. Communications Biology.
#'
#' @examples
#' \donttest{
#' library(dplyr)
#' # Sample ID's
#' sample_names <- c("ERR2044662", "ERR2044663", "ERR2044664",
#'                    "ERR2044665", "ERR2044666", "ERR2044667",
#'                    "ERR2044668", "ERR2044669", "ERR2044670")
#'
#' # If data is in wide format, with samples in cols
#' nice_tidy <- prepare_tidy_data(nice,
#'                                sample_names = sample_names,
#'                                samples_in = "cols")
#'
#' # Straightforward with tidy format
#' define_rb(nice_tidy)
#'
#' #Closer look
#' classified_table <- define_rb(nice_tidy)
#' classified_table %>%
#' select(Sample, Abundance, Classification) %>%
#' head()
#'
#'
#' # Automatic decision, instead of a predefined definition
#' define_rb(nice_tidy, automatic = TRUE) %>% select(Sample, Abundance, Classification)
#'
#' # Automatic decision, using Davies-Bouldin index,
#' # instead of average Silhouette score (default)
#' define_rb(nice_tidy, automatic = TRUE, index = "Davies-Bouldin") %>%
#' select(Sample, Abundance, Classification)
#'
#' # User defined classifications
#' user_classifications <- c("very rare",
#'                           "rare",
#'                           "undetermined",
#'                           "abundant",
#'                           "very abundant")
#'
#' define_rb(nice_tidy, classification_vector = user_classifications) %>%
#' select(Sample, Abundance, Classification)
#'
#' # Easy to incorporate in big pipes
#' # Remove Archaea
#' # Remove taxa below 10 reads
#' # Classify according to a different set of classifications
#' nice_tidy %>%
#'  filter(Domain != "sk__Archaea") %>%
#'  filter(Abundance > 10) %>%
#'  define_rb(classification_vector = c("very rare",
#'                                      "rare",
#'                                      "abundant",
#'                                      "very abundant")) %>%
#'  select(Sample, Abundance, Classification)
#'
#'  # An example that summarises results
#' nice_tidy %>%
#'  filter(Domain != "sk__Archaea") %>%
#'  filter(Abundance > 10) %>%
#'  define_rb(classification_vector = c("very rare",
#'                                      "rare",
#'                                      "abundant",
#'                                      "very abundant")) %>%
#'  select(Sample, Abundance, Classification) %>%
#'  group_by(Sample, Classification) %>%
#'  summarise(totalAbundance = sum(Abundance))
#'}
#' @import dplyr
#' @importFrom rlang .data
#'
define_rb <- function(data,
                      classification_vector = c("Rare","Undetermined","Abundant"),
                      samples_col = "Sample",
                      abundance_col = "Abundance",
                      simplified = FALSE,
                      automatic = FALSE,
                      index = "Average Silhouette Score",
                      ...) {


  #If automatic, use suggest_k()
  if(isTRUE(automatic)){
    message("Automatic option set to TRUE, so classification vector was overwritten")
    automatic_k <- suggest_k(data, index = index, ...)
    classification_vector <- seq_along(1:automatic_k)
    message(paste0("K= ", automatic_k, " based on ", index,"."))
  }

  # Define number of cluster based on possible classifications
  k <- length(classification_vector)

  # Match samples_col and abundance_col with Samples and Abundance, respectively
  data <-
    data %>%
    rename(Sample = all_of(samples_col),
           Abundance = all_of(abundance_col))

  # Calculate k-medoids
    ## Apply cluster algorithm
    clustered_data <-
      data %>%
      filter(.data$Abundance > 0, !is.na(.data$Abundance)) %>%
      group_by(.data$Sample, .add = TRUE) %>%
      tidyr::nest() %>%
      mutate(pam_object = purrr::map(.x = data,
                                     .f = ~cluster::pam(.x$Abundance,
                                                        k = k,
                                                        diss = FALSE))) %>%
      mutate(Level = purrr::map(.x = .data$pam_object, .f = ~.x[[3]]), # obtain clusters
             Silhouette_scores = purrr::map(.x = .data$pam_object, .f = ~.x[[7]][[1]][,3])) %>%  ## obtain silhouette plots
      tidyr::unnest(cols = c("data", "Level", "Silhouette_scores"))

  # Make classification table
  classification_table <- clustered_data %>%
    group_by(.data$Sample, Level = as.factor(.data$Level), .add = TRUE) %>%
    summarize(Cluster_median_abundance = stats::median(.data$Abundance)) %>%
    arrange(.data$Sample, .data$Cluster_median_abundance) %>%
    mutate(Classification = factor(classification_vector, levels = classification_vector))

  # Apply classification_table to classify clusters
  classified_clusters <- clustered_data %>%
    mutate(Level = as.factor(.data$Level)) %>%
    left_join(classification_table)

  # Evaluate
  classified_clusters <-
    classified_clusters %>%
    group_by(.data$Sample, .data$Classification) %>%
    tidyr::nest() %>%
    mutate(median_Silhouette = purrr::map(.x = data, .f = ~median(.x$Silhouette_scores))) %>%
    mutate(Evaluation = purrr::map(.x = .data$median_Silhouette, .f = ~case_when(median_Silhouette > 0.9 ~ "Very good",
                                                                                 median_Silhouette > 0.75 ~ "Good",
                                                                                 median_Silhouette > 0.5 ~ "Sufficient",
                                                                                 median_Silhouette <= 0.5 ~ "Bad"))) %>%
    tidyr::unnest(cols = c(data, median_Silhouette, Evaluation))

  ## the nest steps only check if a warning is necessary, the output is classified clusters
  clusters_report <- classified_clusters %>%
    select(Sample, Classification, median_Silhouette, Evaluation) %>% ### break from here
    distinct() %>%
    group_by(.data$Classification) %>%
    count(.data$Evaluation)

  # Count number of samples with bad scores
  bad_samples <- clusters_report %>%
    filter(.data$Evaluation == "Bad") %>%
    pull(.data$n) %>%
    sum()

  #
  if(bad_samples > 0){
    warning(paste(bad_samples, "samples got a bad Silhouette score. Consider changing the number of classifications."))
    message("If half the observations within a classification are below 0.5 Silhouette score, we consider that the clustering was 'Bad'.")
    message("Check 'Evaluation' collumn for more details.")
  }


  # Option to simplify
  if(simplified == TRUE){
    # Remove unnecessary columns
    classified_clusters %>%
      select(-"Level", -"pam_object",
             -"Evaluation", -"median_Silhouette",
             -"Silhouette_scores",
             -"Cluster_median_abundance")
  }

  return(classified_clusters)
}