R/hclu_hierarclust.R

Defines functions hclu_hierarclust

Documented in hclu_hierarclust

#' Hierarchical clustering based on dissimilarity or beta-diversity
#'
#' This function generates a hierarchical tree from a dissimilarity
#' (beta-diversity) `data.frame`, calculates the cophenetic correlation
#' coefficient, and optionally retrieves clusters from the tree upon user 
#' request. The function includes a randomization process for the dissimilarity 
#' matrix to generate the tree, with two methods available for constructing the 
#' final tree. Typically, the dissimilarity `data.frame` is a
#' `bioregion.pairwise` object obtained by running `similarity`,
#' or by running `similarity` followed by `similarity_to_dissimilarity`.
#'
#' @param dissimilarity The output object from [dissimilarity()] or
#'  [similarity_to_dissimilarity()], or a `dist` object. 
#'  If a `data.frame` is used, the first two columns represent pairs of sites 
#'  (or any pair of nodes), and the subsequent column(s) contain the 
#'  dissimilarity indices.
#' 
#' @param index The name or number of the dissimilarity column to use. By 
#' default, the third column name of `dissimilarity` is used.
#' 
#' @param method The name of the hierarchical classification method, as in
#' [hclust][fastcluster::hclust]. Should be one of `"ward.D"`,
#' `"ward.D2"`, `"single"`, `"complete"`, `"average"`
#' (= UPGMA), `"mcquitty"` (= WPGMA), `"median"` (= WPGMC), or
#' `"centroid"` (= UPGMC).
#' 
#' @param randomize A `boolean` indicating whether the dissimilarity matrix 
#' should be randomized to account for the order of sites in the dissimilarity
#'  matrix.
#'  
#' @param seed A value for the random number generator (`NULL` for random by 
#' default).
#' 
#' @param n_runs The number of trials for randomizing the dissimilarity matrix.
#' 
#' @param keep_trials A `character` string indicating whether random trial results 
#' (including the randomized matrix, the associated tree and metrics for that tree) 
#' should be stored in the output object. Possible values are `"no"` (default), 
#' `"all"` or `"metrics"`. Note that this parameter is automatically set to 
#' `"no"` if `optimal_tree_method = "iterative_consensus_tree"`.
#' 
#' @param optimal_tree_method A `character` string indicating how the final tree
#' should be obtained from all trials. Possible values are 
#' `"iterative_consensus_tree"` (default), `"best"` or `"consensus"`. 
#' **We recommend `"iterative_consensus_tree"`. See Details.** 
#' 
#' @param n_clust An `integer` vector or a single `integer` indicating the 
#' number of clusters to be obtained from the hierarchical tree, or the output 
#' from [bioregionalization_metrics]. This parameter should not be used 
#' simultaneously with `cut_height`.
#' 
#' @param cut_height A `numeric` vector indicating the height(s) at which the
#' tree should be cut. This parameter should not be used simultaneously with 
#' `n_clust`.
#' 
#' @param find_h A `boolean` indicating whether the height of the cut should be 
#' found for the requested `n_clust`.
#' 
#' @param h_max A `numeric` value indicating the maximum possible tree height 
#' for the chosen `index`.
#' 
#' @param h_min A `numeric` value indicating the minimum possible height in the
#'  tree for the chosen `index`.
#' 
#' @param consensus_p A `numeric` value (applicable only if 
#' `optimal_tree_method = "consensus"`) indicating the threshold proportion of 
#' trees that must support a region/cluster for it to be included in the final 
#' consensus tree.
#' 
#' @param show_hierarchy A `boolean` specifying if the hierarchy of clusters
#' should be identifiable in the outputs (`FALSE` by default). This argument is 
#' only used if the tree is cut (i.e., `n_clust` or `cut_height` is provided).
#'  
#' @param verbose A `boolean` indicating whether to 
#' display progress messages. Set to `FALSE` to suppress these messages.
#' 
#' @return
#' A `list` of class `bioregion.clusters` with five slots:
#' \enumerate{
#' \item{**name**: A `character` string containing the name of the algorithm.}
#' \item{**args**: A `list` of input arguments as provided by the user.}
#' \item{**inputs**: A `list` describing the characteristics of the clustering process.}
#' \item{**algorithm**: A `list` containing all objects associated with the
#'  clustering procedure, such as the original cluster objects.}
#' \item{**clusters**: A `data.frame` containing the clustering results.}}
#'
#' In the `algorithm` slot, users can find the following elements:
#'
#' \itemize{
#' \item{`trials`: A list containing all randomization trials. Each trial
#' includes the dissimilarity matrix with randomized site order, the
#' associated tree, and the cophenetic correlation coefficient for
#' that tree.}
#' \item{`final.tree`: An `hclust` object representing the final
#' hierarchical tree to be used.}
#' \item{`final.tree.coph.cor`: The cophenetic correlation coefficient
#' between the initial dissimilarity matrix and the `final.tree`.}
#' }
#'  
#' @details
#' The function is based on [hclust][fastcluster::hclust].
#' The default method for the hierarchical tree is `average`, i.e.
#' UPGMA as it has been recommended as the best method to generate a tree
#' from beta diversity dissimilarity (Kreft & Jetz, 2010).
#'
#' Clusters can be obtained by two methods:
#' \itemize{
#' \item{Specifying a desired number of clusters in `n_clust`}
#' \item{Specifying one or several heights of cut in `cut_height`}}
#'
#' To find an optimal number of clusters, see [bioregionalization_metrics()]
#' 
#' It is important to pay attention to the fact that the order of rows
#' in the input distance matrix influences the tree topology as explained in 
#' Dapporto (2013). To address this, the function generates multiple trees by 
#' randomizing the distance matrix. 
#' 
#' Two methods are available to obtain the final tree:
#' \itemize{
#' 
#' \item{`optimal_tree_method = "iterative_consensus_tree"`: The Iterative 
#' Hierarchical Consensus Tree (IHCT) method reconstructs a consensus tree by 
#' iteratively splitting the dataset into two subclusters based on the pairwise 
#' dissimilarity of sites across `n_runs` trees based on `n_runs` randomizations
#' of the distance matrix. At each iteration, it 
#' identifies the majority membership of sites into two stable groups across
#' all trees,
#' calculates the height based on the selected linkage method (`method`),
#' and enforces monotonic constraints on 
#' node heights to produce a coherent tree structure. 
#' This approach provides a robust, hierarchical representation of site 
#' relationships, balancing 
#' cluster stability and hierarchical constraints.}
#' 
#' \item{`optimal_tree_method = "best"`: This method selects one tree among with 
#' the highest cophenetic correlation coefficient, representing the best fit 
#' between the hierarchical structure and the original distance matrix. }
#' 
#' \item{`optimal_tree_method = "consensus"`: This method constructs a consensus 
#' tree using phylogenetic methods with the function 
#' [consensus][ape::consensus].
#' When using this option, you must set the `consensus_p` parameter, which 
#' indicates 
#' the proportion of trees that must contain a region/cluster for it to be 
#' included 
#' in the final consensus tree. 
#' Consensus trees lack an inherent height because they represent a majority 
#' structure rather than an actual hierarchical clustering. To assign heights, 
#' we use a non-negative least squares method ([nnls.tree][phangorn::nnls.tree]) 
#' based on the initial distance matrix, ensuring that the consensus 
#' tree preserves 
#' approximate distances among clusters.}
#' }
#' 
#' We recommend using the `"iterative_consensus_tree"` as all the branches of
#' this tree will always reflect the majority decision among many randomized 
#' versions of the distance matrix. This method is inspired by 
#' Dapporto et al. (2015), which also used the majority decision
#' among many randomized versions of the distance matrix, but it expands it 
#' to reconstruct the entire topology of the tree iteratively. 
#' 
#' We do not recommend using the basic `consensus` method because in many 
#' contexts it provides inconsistent results, with a meaningless tree topology
#' and a very low cophenetic correlation coefficient. 
#' 
#' For a fast exploration of the tree, we recommend using the `best` method
#' which will only select the tree with the highest cophenetic correlation
#' coefficient among all randomized versions of the distance matrix. 
#'
#' @references
#' Kreft H & Jetz W (2010) A framework for delineating biogeographical regions
#' based on species distributions. \emph{Journal of Biogeography} 37, 2029-2053.
#' 
#' Dapporto L, Ramazzotti M, Fattorini S, Talavera G, Vila R & Dennis, RLH 
#' (2013) Recluster: an unbiased clustering procedure for beta-diversity 
#' turnover. \emph{Ecography} 36, 1070--1075.
#' 
#' Dapporto L, Ciolli G, Dennis RLH, Fox R & Shreeve TG (2015) A new procedure 
#' for extrapolating turnover regionalization at mid-small spatial scales, 
#' tested on British butterflies. \emph{Methods in Ecology and Evolution} 6
#' , 1287--1297. 
#' 
#' @seealso
#' For more details illustrated with a practical example, 
#' see the vignette: 
#' \url{https://biorgeo.github.io/bioregion/articles/a4_1_hierarchical_clustering.html}.
#' 
#' Associated functions: 
#' [cut_tree]
#' 
#' @author
#' Boris Leroy (\email{leroy.boris@gmail.com}) \cr
#' Pierre Denelle (\email{pierre.denelle@gmail.com}) \cr
#' Maxime Lenormand (\email{maxime.lenormand@inrae.fr}) 
#' 
#' @examples
#' comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
#' 20, 25)
#' rownames(comat) <- paste0("Site",1:20)
#' colnames(comat) <- paste0("Species",1:25)
#'
#' dissim <- dissimilarity(comat, metric = "Simpson")
#'
#' # User-defined number of clusters
#' tree1 <- hclu_hierarclust(dissim, 
#'                           n_clust = 5)
#' tree1
#' plot(tree1)
#' str(tree1)
#' tree1$clusters
#' 
#' # User-defined height cut
#' # Only one height
#' tree2 <- hclu_hierarclust(dissim, 
#'                           cut_height = .05)
#' tree2
#' tree2$clusters
#' 
#' # Multiple heights
#' tree3 <- hclu_hierarclust(dissim, 
#'                           cut_height = c(.05, .15, .25))
#' 
#' tree3$clusters # Mind the order of height cuts: from deep to shallow cuts
#' # Info on each partition can be found in table cluster_info
#' tree3$cluster_info
#' plot(tree3)
#' 
#' @export
hclu_hierarclust <- function(dissimilarity,
                             index = names(dissimilarity)[3],
                             method = "average",
                             randomize = TRUE,
                             seed = NULL,
                             n_runs = 100,
                             keep_trials = "no",
                             optimal_tree_method = "iterative_consensus_tree", 
                             n_clust = NULL,
                             cut_height = NULL,
                             find_h = TRUE,
                             h_max = 1,
                             h_min = 0,
                             consensus_p = 0.5,
                             show_hierarchy = FALSE,
                             verbose = TRUE){
  # 1. Controls ---------------------------------------------------------------
  controls(args = NULL, data = dissimilarity, type = "input_nhandhclu")
  if(!inherits(dissimilarity, "dist")){
    controls(args = NULL, data = dissimilarity, type = "input_dissimilarity")
    controls(args = NULL, data = dissimilarity, 
             type = "input_data_frame_nhandhclu")
    controls(args = index, data = dissimilarity, type = "input_net_index")
    net <- dissimilarity
    # Convert tibble into dataframe
    if(inherits(net, "tbl_df")){
      net <- as.data.frame(net)
    }
    colnameindex <- index
    if(is.numeric(colnameindex)){
      colnameindex <- colnames(net)[index]
      if(is.null(colnameindex)){
        colnameindex <- NA
      }
    }
    net[, 3] <- net[, index]
    net <- net[, 1:3]
    controls(args = NULL, data = net, type = "input_net_index_value")
    dist_mat <- net_to_mat(net,
                           weight = TRUE, 
                           squared = TRUE, 
                           symmetrical = TRUE)
  } else {
    controls(args = NULL, data = dissimilarity, type = "input_dist")
    dist_mat <- as.matrix(dissimilarity)
    if(is.null(names(dist_mat))){
      rownames(dist_mat) <- paste0(1:dim(dist_mat)[1])
      colnames(dist_mat) <- rownames(dist_mat)
      if(verbose){
        message("No labels detected, they have been assigned automatically.")
      }
    }
    dissimilarity <- mat_to_net(dist_mat, weight = TRUE)
    if(is.null(index)) {
      colnames(dissimilarity)[3] <- "Dissimilarity"
    } else {
      colnames(dissimilarity)[3] <- index
    }
    colnameindex <- NA
  }
  nsites <- dim(dist_mat)[1]

  
  controls(args = method, data = NULL, type = "character")
  if(!(method %in% c("ward.D", "ward.D2", "single", "complete", "average",
                          "mcquitty", "median", "centroid" ))){
    stop(paste0("Please choose method from the following:\n",
                "ward.D, ward.D2, single, complete, average, mcquitty, median ", 
                "or centroid"), 
         call. = FALSE)
  }
  controls(args = randomize, data = NULL, type = "boolean")
  if(randomize & !is.null(seed)){
    controls(args = seed, data = NULL, type = "strict_positive_integer")
  }
  controls(args = n_runs, data = NULL, type = "strict_positive_integer")
  controls(args = keep_trials, data = NULL, type = "character")
  if(!(keep_trials %in% c("no", "all", "metrics"))){
    stop(paste0("Please choose keep_trials from the following:\n",
                "no, all or metrics"), 
         call. = FALSE)
  }
  controls(args = optimal_tree_method, data = NULL, type = "character")
  if(!(optimal_tree_method %in% c("iterative_consensus_tree",
                                  "best", "consensus"))){
    stop(paste0("Please choose optimal_tree_method from the following:\n",
                "iterative_consensus_tree, best or consensus"), 
    call. = FALSE)
  }

  if(!is.null(n_clust)) {
    if(is.numeric(n_clust)) {
        controls(args = n_clust, 
                 data = NULL, 
                 type = "strict_positive_integer_vector")
    } else if(inherits(n_clust, "bioregion.bioregionalization.metrics")){
      if(!is.null(n_clust$algorithm$optimal_nb_clusters)) {
        n_clust <- n_clust$algorithm$optimal_nb_clusters
      } else {
        stop(paste0("n_clust does not have an optimal number of clusters. ",
                    "Did you specify partition_optimisation = TRUE in ",
                    "bioregionalization_metrics()?"), 
             call. = FALSE)
      }
    } else{
      stop("n_clust must be one of those:
        * an integer determining the number of clusters
        * a vector of integers determining the numbers of clusters for each cut
        * the output from bioregionalization_metrics()", 
           call. = FALSE)
    }
    if(!is.null(cut_height)){
      stop(paste0("Please provide either n_clust or cut_height, ",
                  "but not both at the same time."), 
           call. = FALSE)
    }
  }
  if(!is.null(cut_height)){
    controls(args = cut_height, data = NULL, type = "positive_numeric_vector")
  }
  controls(args = find_h, data = NULL, type = "boolean")
  if(find_h){
    controls(args = h_min, data = NULL, type = "positive_numeric")
    controls(args = h_max, data = NULL, type = "positive_numeric")
    if(h_min > h_max){
      stop("h_min must be inferior to h_max.",
           call. = FALSE)
    }
  }
  controls(args = consensus_p, data = NULL, type = "positive_numeric")
  if(consensus_p < 0.5 | consensus_p > 1) {
    stop("consensus_p must be between 0.5 and 1.",
         call. = FALSE)
  }
  controls(args = show_hierarchy, data = NULL, type = "boolean")
  controls(args = verbose, data = NULL, type = "boolean")
  
  # 2. Function ---------------------------------------------------------------
  outputs <- list(name = "hclu_hierarclust")
  
  # Adding dynamic_tree_cut = FALSE for compatibility with generic functions
  dynamic_tree_cut <- FALSE
  
  # Outputs args
  outputs$args <- list(index = index,
                       method = method,
                       randomize = randomize,
                       seed = seed,
                       n_runs = n_runs,
                       optimal_tree_method = optimal_tree_method,
                       keep_trials = keep_trials,
                       n_clust = n_clust,
                       cut_height = cut_height,
                       find_h = find_h,
                       h_max = h_max,
                       h_min = h_min,
                       consensus_p = consensus_p,
                       show_hierarchy = show_hierarchy,
                       verbose = verbose,
                       dynamic_tree_cut = dynamic_tree_cut)
  
  # Determine pairwise_metric and data_type
  pairwise_metric <- ifelse(!inherits(dissimilarity, "dist"), 
                            colnameindex, 
                            NA)
  data_type <- detect_data_type_from_metric(pairwise_metric)
  
  # Outputs inputs
  outputs$inputs <- list(bipartite = FALSE,
                         weight = TRUE,
                         pairwise = TRUE,
                         pairwise_metric = pairwise_metric,
                         dissimilarity = TRUE,
                         nb_sites = nsites,
                         data_type = data_type,
                         node_type = "site")
  
  if(randomize) {
    if(optimal_tree_method == "iterative_consensus_tree") {
      if(verbose){
        message(paste0("Building the iterative hierarchical consensus tree...",
                       " Note that this",
                       " process can take time especially if you have a lot of",
                       " sites."))
      }

      if(method == "mcquitty") {
        warning("mcquitty (WPGMA) method may not be properly implemented",
                " in Iterative Hierarchical Tree Construction (IHCT), because of the ",
                "hybrid divise-agglomerative nature of IHCT. ",
                "In WPGMA, heights are updated iteratively from bottom to top as ",
                "the tree is constructed. ",
                "In IHCT, divisions are created from top to bottom, based on a ",
                "majority decision among many trees. ", 
                "Hence, it is not possible to exactly compute WPGMA calculations ",
                "with IHCT - final height calculations are approximated into UPGMA.")
      }

      if (!is.null(seed)) set.seed(seed) # generate seed
      
      # consensus_tree <- iterative_consensus_tree(dissimilarity, 
      #                                            sites = unique(c(dissimilarity[, 1],
      #                                                             dissimilarity[, 2])), 
      #                                            index = index,
      #                                            method = method,
      #                                            depth = 1, 
      #                                            previous_height = Inf, 
      #                                            verbose = verbose,
      #                                            n_runs = n_runs,
      #                                            monotonicity_direction = "bottom-up")
      
      consensus_tree <- IHCT(dist_mat, 
                             method = method,
                             n_runs = n_runs,
                             n_splits = 2,
                             monotonicity_direction = "bottom-up",
                             verbose = verbose)
      
      if (!is.null(seed)) rm(.Random.seed, envir = globalenv()) # remove seed
      
      consensus_tree <- reconstruct_hclust_bis(consensus_tree)

      # Compute hierarchical tree
      outputs$algorithm$final.tree <- consensus_tree
      
      evals <- tree_eval(consensus_tree,
                         dist_mat)
      
      outputs$algorithm$final.tree.coph.cor <- evals$cophcor
      outputs$algorithm$final.tree.msd <- evals$msd
      
    } else {
      if(verbose){
        message(paste0("Randomizing the dissimilarity matrix with ", 
                       n_runs,
                       " trials"))
      }

      if (!is.null(seed)) set.seed(seed)
      results <- vector("list", n_runs)
      for (run in 1:n_runs) {
        trial <- list()

        trial$dist_mat <- randomize_dist(dist_mat)

        trial$hierartree <- fastcluster::hclust(stats::as.dist(trial$dist_mat), 
                                                method = method)

        evals <- tree_eval(trial$hierartree, trial$dist_mat)
        trial$cophcor <- evals$cophcor
        trial$msd <- evals$msd
        
        if(keep_trials != "all"){
          trial$dist_mat <- 0
        }

        results[[run]] <- trial
      }
      if (!is.null(seed)) rm(.Random.seed, envir = globalenv())
      
      if (optimal_tree_method == "best") {
        coph.coeffs <- sapply(results, function(x) x$cophcor)
        
        if(verbose){
          message(paste0(" -- range of cophenetic correlation coefficients ",
                         "among trials: ",
                         round(min(coph.coeffs), 4), 
                         " - ", 
                         round(max(coph.coeffs), 4)))
        }


        best.run <- which.max(coph.coeffs)
        final.tree <- results[[best.run]]$hierartree
        final.tree.metrics <- results[[best.run]]
        
      } else if (optimal_tree_method == "consensus") {
        if (n_runs < 2) {
          stop("At least two trees are required to calculate a consensus.",
               call. = FALSE)
        }

        trees <- lapply(results, function(trial) ape::as.phylo(trial$hierartree))

        consensus_tree <- ape::consensus(trees, p = 0.5)
        consensus_tree <- phangorn::nnls.tree(stats::as.dist(dist_mat), 
                                              consensus_tree, 
                                              method = "ultrametric", 
                                              trace = 0)
        consensus_tree <- ape::multi2di(consensus_tree)

        tree_ape_for_coph <- consensus_tree
        consensus_tree <- ape::as.hclust.phylo(consensus_tree)
        
        final.tree <- consensus_tree
        evals <- tree_eval(tree_ape_for_coph, dist_mat)
        final.tree.metrics <- list(cophcor = evals$cophcor, 
                                   msd = evals$msd)
      }
      
      # keep_trials
      if(keep_trials == "metrics"){
        for (run in 1:n_runs) {
          results[[run]] <- results[[run]][-c(1,2)]
        }
      }

      outputs$algorithm$final.tree <- final.tree
      outputs$algorithm$final.tree.coph.cor <- final.tree.metrics$cophcor
      outputs$algorithm$final.tree.msd <- final.tree.metrics$msd

    }
    if(verbose){
      message(paste0("\nFinal tree has a ",
                     round(outputs$algorithm$final.tree.coph.cor, 4),
                     " cophenetic correlation coefficient with the initial ",
                     "dissimilarity matrix\n"))
    }

  } else  {
    outputs$algorithm$final.tree <- fastcluster::hclust(stats::as.dist(dist_mat),
                                                        method = method)
    
    
    evals <- tree_eval(outputs$algorithm$final.tree,
                       dist_mat)
    
    outputs$algorithm$final.tree.coph.cor <- evals$cophcor
    outputs$algorithm$final.tree.msd <- evals$msd
    
    if(verbose){
      message(paste0("Output tree has a ",
                     round(outputs$algorithm$final.tree.coph.cor, 4),
                     " cophenetic correlation coefficient with the initial
                   dissimilarity matrix\n"))
    }

  }
  
  class(outputs) <- append("bioregion.clusters", class(outputs))
  
  if(any(!is.null(n_clust) | !is.null(cut_height))){
    outputs <- cut_tree(outputs,
                        n_clust = n_clust,
                        cut_height = cut_height,
                        find_h = find_h,
                        h_max = h_max,
                        h_min = h_min,
                        dynamic_tree_cut = dynamic_tree_cut,
                        show_hierarchy = show_hierarchy,
                        verbose = verbose)
    
    # Add hierarchical attribute
    outputs$inputs$hierarchical <- ifelse(ncol(outputs$clusters) > 2,
                                          TRUE,
                                          FALSE)
    
    # Add node_type attribute
    attr(outputs$clusters, "node_type") <- rep("site", dim(outputs$clusters)[1])
    
  } else {
    outputs$clusters <- NA
    outputs$cluster_info <- NA
    outputs$inputs$hierarchical <- FALSE
  }
  
  # Keep trials 
  if(randomize){
    if(keep_trials == "no"){
      outputs$algorithm$trials <- "Trials not stored in output"
    }else{
      if(optimal_tree_method == "iterative_consensus_tree"){
        outputs$algorithm$trials <- "Trials not stored in output"
      }else{
        outputs$algorithm$trials <- results
      }
    }
  }
  
  return(outputs)
}

Try the bioregion package in your browser

Any scripts or data that you put into this service are public.

bioregion documentation built on March 29, 2026, 5:07 p.m.