#' @title Icluster
#'
#' @description This function clusters the columns (variables) of a dataset via agglomerative hierarchical variable clustering using estimated multivariate similarities
#' (dependence coefficients) between random vectors.
#'
#' @param data The dataset (\eqn{n \times q} matrix with observations in rows, variables in columns) whose columns need to be clustered.
#' @param est_method The method for estimating the similarity between two clusters of variables.
#' @param max_dim The maximum total dimension \eqn{d_{1} + d_{2}} of two clusters for which the similarity is still estimated directly; when this threshold is exceeded, the link function is used instead (default = Inf).
#' @param norm A possible normalization function applied to the dependence measure (default = NULL, meaning no normalization).
#' @param link The link function to be used when max_dim is exceeded (default = "average").
#' @param trace Controls the verbosity of printed progress messages (default = 1, printing progress; put to 0 for no output).
#'
#' @details
#' Suppose that the \eqn{q} variables (of which we have \eqn{n} observations in data) are \eqn{\mathcal{S} = \{X_{1}, \dots, X_{q}\}}.
#' The key step in hierarchical variable clustering is computing the similarity
#' \deqn{\mathcal{D}(\mathbb{X},\mathbb{Y})} between two disjoint subsets of variables \eqn{\mathbb{X}, \mathbb{Y} \subset \mathcal{S}}.
#' In particular, the main algorithm is as follows:
#' \itemize{
#' \item Each object \eqn{\{X_{i}\}} forms a separate cluster, i.e., \eqn{\aleph_{1} = \{\{X_{1}\},\dots,\{X_{q}\}\}} is the initial feature partition.}
#' For \eqn{i = 1,2,\dots,q-1}:
#' \itemize{
#' \item For each pair of disjoint clusters \eqn{\mathbb{X},\mathbb{Y} \in \aleph_{i}}, compute the similarity \eqn{\mathcal{D}(\mathbb{X},\mathbb{Y})}.
#' \item Define \eqn{\aleph_{i+1} = (\aleph_{i} \setminus \{\widetilde{\mathbb{X}},\widetilde{\mathbb{Y}}\}) \cup \{\widetilde{\mathbb{X}} \cup \widetilde{\mathbb{Y}} \}}, where \eqn{\widetilde{\mathbb{X}},\widetilde{\mathbb{Y}}} are the clusters having maximal similarity according to the previous step.
#' \item The algorithm stops with \eqn{\aleph_{q} = \{\{X_{1},\dots,X_{q}\}\}}.}
#' We call \eqn{\{\aleph_{1}, \dots, \aleph_{q}\}} the hierarchy constructed throughout the algorithm, and define, for \eqn{i \in \{1, \dots, q\}},
#' \eqn{\text{Adiam}(\aleph_{i}) = |\aleph_{i}|^{-1} \sum_{\mathbb{X} \in \aleph_{i}} \text{diam}(\mathbb{X})}, with
#' \deqn{\text{diam}(\mathbb{X}) = \begin{cases} \underset{\{X,Y\} \subset \mathbb{X} }{\min} \mathcal{D}(X,Y) & \mbox{if } |\mathbb{X}| > 1 \\ 1 & \mbox{if } |\mathbb{X}| = 1, \end{cases}}
#' and \eqn{\text{Msplit}(\aleph_{i}) = \max_{\mathbb{X} \in \aleph_{i}} \text{split}(\mathbb{X})}, with
#' \deqn{\text{split}(\mathbb{X}) = \underset{Y \in \aleph_{i} \setminus \mathbb{X}}{\underset{X \in \mathbb{X}}{\max}} \mathcal{D}(X,Y) \hspace{0.2cm} \text{for} \hspace{0.2cm} \{\mathbb{X}\} \neq \aleph_{i}.}
#' Adiam stands for the average diameter of a partition (measuring the homogeneity, which we want to be large), while
#' Msplit stands for the maximum split of a partition (measuring the non-separation, which we want to be small).
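#'
#' For example, for \eqn{q = 3} and the partition \eqn{\aleph = \{\{X_{1},X_{2}\},\{X_{3}\}\}}, one gets
#' \eqn{\text{Adiam}(\aleph) = (\mathcal{D}(X_{1},X_{2}) + 1)/2} and
#' \eqn{\text{Msplit}(\aleph) = \max \left (\mathcal{D}(X_{1},X_{3}),\mathcal{D}(X_{2},X_{3}) \right )}.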
#'
#' For measuring the similarity \eqn{\mathcal{D}(\mathbb{X},\mathbb{Y})}, we regard \eqn{\mathbb{X}} and \eqn{\mathbb{Y}} as two random vectors
#' (of dimensions \eqn{d_{1}} and \eqn{d_{2}}, respectively). For \eqn{\mathcal{D}}, we take an estimated dependence measure between (two) random vectors. The following options are possible:
#' \itemize{
#' \item list("phi", "mi", "Gauss", omegas = omegas) for the estimated Gaussian copula mutual information. Use omegas = 1 for no penalty, or a sequence of omegas for a ridge penalty tuned via 5-fold cross-validation,
#' see also the functions \code{\link{minormal}}, \code{\link{estR}}, and \code{\link{cvomega}},
#' \item list("phi", "Hel", "Gauss", omegas = omegas) for the estimated Gaussian copula Hellinger distance. Use omegas = 1 for no penalty, or a sequence of omegas for a ridge penalty tuned via 5-fold cross-validation,
#' see also the functions \code{\link{Helnormal}}, \code{\link{estR}}, and \code{\link{cvomega}},
#' \item list("phi", phi(t), "hac", type = type, M = M) for general \eqn{\Phi}-dependence with specified function phi(t) = \eqn{\Phi(t)},
#' estimated by fitting (via pseudo maximum likelihood estimation) a hierarchical Archimedean copula of given type = type,
#' and computed based on a Monte Carlo sample of size \eqn{M} in order to approximate the expectation, see also the functions \code{\link{mlehac}}, \code{\link{phihac}} and \code{\link{estphi}},
#' \item list("phi", phi(t), "nphac", estimator = estimator, type = type) for general
#' \eqn{\Phi}-dependence with specified function phi(t) = \eqn{\Phi(t)},
#' estimated via non-parametric beta kernel estimation or Gaussian transformation kernel estimation,
#' and local bandwidth selection, by using a fitted (via pseudo maximum likelihood) hierarchical Archimedean copula
#' as reference copula, see also the functions \code{\link{phinp}} and \code{\link{estphi}},
#' \item list("phi", phi(t), "np", estimator = estimator, bw_method = bw_method) for general \cr
#' \eqn{\Phi}-dependence with specified function phi(t) = \eqn{\Phi(t)},
#' estimated via non-parametric beta kernel estimation or Gaussian transformation kernel estimation,
#' and local bandwidth selection, either by using a non-parametric kernel estimator as reference copula if bw_method = 1,
#' or by using a big O bandwidth rule if bw_method = 2, see also the functions \code{\link{phinp}} and \code{\link{estphi}},
#' \item list("phi", phi(t), "ellip", grid = grid) for general \eqn{\Phi}-dependence with specified function phi(t) = \eqn{\Phi(t)},
#' estimated via the improved MECIP procedure on the specified grid, and parameter selection done via
#' the function \code{\link{elliptselect}} using the Gaussian generator as reference generator, see also the functions \code{\link{phiellip}} and \code{\link{estphi}},
#' \item list("ot", coef = coef, omegas = omegas) for Gaussian copula Bures-Wasserstein dependence measures, either coefficient \eqn{\mathcal{D}_{1}} (coef = 1) or coefficient \eqn{\mathcal{D}_{2}} (coef = 2).
#' Use omegas = 1 for no penalty, or a sequence of omegas for a ridge penalty tuned via 5-fold cross-validation,
#' see also the functions \code{\link{bwd1}}, \code{\link{bwd2}}, \code{\link{estR}}, and \code{\link{cvomega}}.}
#'
#' When \eqn{d_{1} + d_{2} >} max_dim, the specified link function (say \eqn{L}) is used
#' for computing the similarity between \eqn{\mathbb{X}} and \eqn{\mathbb{Y}}, i.e.,
#' \deqn{\mathcal{D} \left ( \mathbb{X}, \mathbb{Y} \right ) = L \left (\left \{\mathcal{D}(X,Y) : X \in \mathbb{X}, Y \in \mathbb{Y} \right \} \right ),}
#' which by default ("average") is the average of all pairwise similarities between a variable in \eqn{\mathbb{X}} and a variable in \eqn{\mathbb{Y}}. Other options are "single" for the minimum, and "complete" for the maximum of these pairwise similarities.
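#' For example, with link = "average" and clusters \eqn{\mathbb{X} = \{X_{1}\}} and \eqn{\mathbb{Y} = \{X_{2},X_{3}\}}, this gives
#' \eqn{\mathcal{D}(\mathbb{X},\mathbb{Y}) = (\mathcal{D}(X_{1},X_{2}) + \mathcal{D}(X_{1},X_{3}))/2}.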
#'
#' The function norm (say \eqn{N}) is a possible normalization applied to the similarity measure, i.e., instead of
#' computing \eqn{\mathcal{D}} (using the method specified by est_method), the similarity becomes \eqn{N \circ \mathcal{D}}.
#' The default is \eqn{N(t) = t}, meaning that no normalization is applied.
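#' For example, for a dependence measure taking values in \eqn{[0,\infty)}, one could take \eqn{N(t) = t/(1+t)},
#' i.e., norm = \code{function(t) t/(1+t)}, to rescale the similarity to \eqn{[0,1)} (a purely illustrative choice).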
#'
#' @return A list with elements "hierarchy" containing the hierarchy constructed throughout the algorithm (a hash object), "all" containing all similarities that were computed throughout the algorithm (a hash object),
#' "diam" containing the average diameters of all partitions created throughout the algorithm (a vector), and "split" containing the maximum splits of all partitions created throughout the algorithm (a vector).
#'
#' @references
#' De Keyser, S. & Gijbels, I. (2024).
#' Hierarchical variable clustering via copula-based divergence measures between random vectors.
#' International Journal of Approximate Reasoning 165:109090.
#' doi: https://doi.org/10.1016/j.ijar.2023.109090.
#'
#' De Keyser, S. & Gijbels, I. (2024).
#' Parametric dependence between random vectors via copula-based divergence measures.
#' Journal of Multivariate Analysis 203:105336. \cr
#' doi: https://doi.org/10.1016/j.jmva.2024.105336.
#'
#' De Keyser, S. & Gijbels, I. (2024).
#' High-dimensional copula-based Wasserstein dependence.
#' doi: https://doi.org/10.48550/arXiv.2404.07141.
#'
#' @seealso \code{\link{minormal}} for the computation of the Gaussian copula mutual information,
#' \code{\link{Helnormal}} for the computation of the Gaussian copula Hellinger distance,
#' \code{\link{estphi}} for several approaches to estimating the \eqn{\Phi}-dependence between \eqn{k} random vectors,
#' \code{\link{bwd1}} for the computation of the first Bures-Wasserstein dependence coefficient \eqn{\mathcal{D}_{1}},
#' \code{\link{bwd2}} for the computation of the second Bures-Wasserstein dependence coefficient \eqn{\mathcal{D}_{2}}.
#'
#' @examples
#' \donttest{
#' q = 20
#'
#' # We will impose a clustering
#' # {{X1,X2},{X3,X4,X5},{X6,X7,X8},{X9,X10,X11,X12,X13},{X14,X15,X16,X17,X18,X19,X20}}
#' dim = c(2,3,3,5,7)
#'
#' # Sample size
#' n = 200
#'
#' # Twenty dimensional hierarchical Gumbel copula with parameters
#' # (theta_0,theta_1,theta_2,theta_3,theta_4,theta_5) = (2,3,4,5,6,7)
#' hac = gethac(dim,c(2,3,4,5,6,7),type = 1)
#'
#' # So, strongest cluster is {X14,X15,X16,X17,X18,X19,X20}, then {X9,X10,X11,X12,X13},
#' # then {X6,X7,X8}, then {X3,X4,X5}, and finally {X1,X2}
#'
#' # Sample
#' sample = suppressWarnings(HAC::rHAC(n,hac))
#'
#' # Cluster using different methods
#'
#' # Gaussian copula based methods
#'
#' Clustering1 = Icluster(data = sample,
#' est_method = list("phi", "mi", "Gauss", omegas = 1))
#'
#' # 5-cluster partition
#' Clustering1$hierarchy$Aleph_16
#'
#' Clustering2 = Icluster(data = sample,
#' est_method = list("phi", "mi", "Gauss",
#' omegas = seq(0.01,0.999,len = 50)))
#'
#' # 5-cluster partition
#' Clustering2$hierarchy$Aleph_16
#'
#' Clustering3 = Icluster(data = sample,
#' est_method = list("phi", "mi", "Gauss", omegas = 1),
#' max_dim = 2)
#'
#' # 5-cluster partition
#' Clustering3$hierarchy$Aleph_16
#'
#' Clustering4 = Icluster(data = sample,
#' est_method = list("phi", "Hel", "Gauss", omegas = 1))
#'
#' # 5-cluster partition
#' Clustering4$hierarchy$Aleph_16
#'
#' Clustering5 = Icluster(data = sample,
#' est_method = list("ot", coef = 1, omegas = 1))
#'
#' # 5-cluster partition
#' Clustering5$hierarchy$Aleph_16
#'
#' Clustering6 = Icluster(data = sample,
#' est_method = list("ot", coef = 2, omegas = 1))
#'
#' # 5-cluster partition
#' Clustering6$hierarchy$Aleph_16
#'
#' Clustering7 = Icluster(data = sample,
#' est_method = list("ot", coef = 2, omegas = 1), max_dim = 4)
#'
#' # 5-cluster partition
#' Clustering7$hierarchy$Aleph_16
#'
#' # Parametric hierarchical Archimedean copula approach
#'
#' Clustering8 = Icluster(data = sample,
#' est_method = list("phi", function(t){t * log(t)}, "hac",
#' type = 1, M = 1000), max_dim = 4)
#'
#' # 5-cluster partition
#' Clustering8$hierarchy$Aleph_16
#'
#' Clustering9 = Icluster(data = sample,
#' est_method = list("phi", function(t){(sqrt(t)-1)^2}, "hac",
#' type = 1, M = 1000), max_dim = 2)
#'
#' # 5-cluster partition
#' Clustering9$hierarchy$Aleph_16
#'
#' # Non-parametric approaches
#'
#' Clustering10 = Icluster(data = sample,
#' est_method = list("phi", function(t){t * log(t)}, "nphac",
#' estimator = "beta", type = 1), max_dim = 3)
#'
#' # 5-cluster partition
#' Clustering10$hierarchy$Aleph_16
#'
#' Clustering11 = Icluster(data = sample,
#' est_method = list("phi", function(t){t * log(t)}, "nphac",
#' estimator = "trans", type = 1), max_dim = 3)
#'
#' # 5-cluster partition
#' Clustering11$hierarchy$Aleph_16
#'
#' Clustering12 = Icluster(data = sample,
#' est_method = list("phi", function(t){t * log(t)}, "np",
#' estimator = "beta", bw_method = 1), max_dim = 3)
#'
#' # 5-cluster partition
#' Clustering12$hierarchy$Aleph_16
#'
#' Clustering13 = Icluster(data = sample,
#' est_method = list("phi", function(t){t * log(t)}, "np",
#' estimator = "trans", bw_method = 2), max_dim = 3)
#'
#' # 5-cluster partition
#' Clustering13$hierarchy$Aleph_16
#'
#' Clustering14 = Icluster(data = sample,
#' est_method = list("phi", function(t){(sqrt(t)-1)^2}, "np",
#' estimator = "trans", bw_method = 1), max_dim = 2)
#'
#' # 5-cluster partition
#' Clustering14$hierarchy$Aleph_16
#'
#' # Semi-parametric meta-elliptical copula approach
#' # Uncomment to run (takes a long time)
#'
#' # Clustering15 = Icluster(data = sample,
#' # est_method = list("phi", function(t){t * log(t)}, "ellip",
#' # grid = seq(0.005,100,by = 0.005)), max_dim = 2)
#'
#' # 5-cluster partition
#' # Clustering15$hierarchy$Aleph_16
#'
#'}
#'
#' @export
Icluster = function(data, est_method, max_dim = Inf, norm = NULL, link = "average", trace = 1){
start_time = Sys.time() # Time the algorithm
if(is.null(norm)){norm = function(t){t}} # Identity normalization if none is given
# Function sim is the similarity measure used throughout the algorithm
sim = function(sample, dim, est_method){
# Computes the similarity between the columns of sample, grouped into two clusters according to the dimensions in dim
# This function is used whenever the similarity is estimated directly (i.e., no link function is needed)
# est_method is as specified in Icluster, and determines which estimator (e.g., estphi) is used
if(est_method[[1]] == "phi"){ # If (normalized) Phi-dependence measures are used
if(is.character(est_method[[2]]) && est_method[[2]] == "mi"){ # If Gaussian copula mutual information is used
if(length(est_method[[4]]) == 1){ # If omegas = 1 (no penalization)
return(norm(minormal(estR(sample,est_method[[4]]),dim))) # Mutual information
} else{
omega = cvomega(sample,est_method[[4]],5) # Select omega via 5-fold cross-validation
return(norm(minormal(estR(sample,omega),dim))) # Mutual information
}
}
else if(is.character(est_method[[2]]) && est_method[[2]] == "Hel"){ # If Gaussian copula Hellinger distance is used
if(length(est_method[[4]]) == 1){ # If omegas = 1 (no penalization)
return(norm(Helnormal(estR(sample,est_method[[4]]),dim))) # Hellinger distance
} else{
omega = cvomega(sample,est_method[[4]],5) # Select omega via 5-fold cross-validation
return(norm(Helnormal(estR(sample,omega),dim))) # Hellinger distance
}
} else{ # If another method is used, it is passed on to the estphi function
return(norm(estphi(sample,dim,est_method[-c(1,2)],est_method[[2]]))) # Phi-dependence
}
}
else if(est_method[[1]] == "ot"){ # If optimal transport measures are used
if(length(est_method[[3]]) == 1){ # If omegas = 1 (no penalization)
if(est_method[[2]] == 1){ # If D1 is to be used
return(norm(bwd1(estR(sample,est_method[[3]]),dim))) # D1
} else{
return(norm(bwd2(estR(sample,est_method[[3]]),dim))) # D2
}
} else{
omega = cvomega(sample,est_method[[3]],5) # Select omega via 5-fold cross-validation
if(est_method[[2]] == 1){ # If D1 is to be used
return(norm(bwd1(estR(sample,omega),dim))) # D1
} else{
return(norm(bwd2(estR(sample,omega),dim))) # D2
}
}
}
}
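# Illustration (comments only): with est_method = list("phi", "mi", "Gauss", omegas = 1) and dim = c(1,2),
# sim(sample, dim, est_method) reduces to norm(minormal(estR(sample,1), c(1,2))), i.e., the (normalized)
# Gaussian copula mutual information between the first column and the last two columns of sample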
# Initialization
q = ncol(data)
n = nrow(data)
colnames(data) = integer(q)
for(j in 1:q){ # Name the columns of data X1,X2,...,Xq
name = paste("X",toString(j),sep = "")
colnames(data)[j] = name
}
hierarchy = sets::set(sets::as.set(colnames(data))) # Initial hierarchy contains only the first partition (each variable forms its own cluster)
all = hash::hash() # Empty hash object to store all similarities that are computed
cluster = sets::as.set(colnames(data)) # Initial clustering (partition) of all variables {X1,...,Xq}
STC_computed = sets::set() # Empty set for storing similarities that are already computed
finish = 0 # Index indicating when algorithm terminates
index = 1 # Index indicating which partition has been found
stepp = 1 # Index keeping track of the amount of bivariate similarities that are computed
# Main algorithm
while(finish != 1){
names_cluster = c() # Vector to store all names of the current partition
# If for example cluster = sets::set(c("X34"),c("X2"),c("X3","X4"),c("X25","X1","X22")) = {X34,X2,c(X3,X4),c(X25,X1,X22)}
# then names_cluster = c("X34","X2","(X3 X4)","(X1 X22 X25)"), not necessarily in this order,
# but within a cluster, variables are sorted according to index
for(i in cluster){names_cluster = c(names_cluster,set_to_string(sets::set(i)))}
# Next, we order the names according to the first appearing index in each cluster, i.e. for the example
# names_cluster = c("(X1 X22 X25)","X2","(X3,X4)","X34")
names_cluster = unlist(strsplit(set_to_string(sets::as.set(names_cluster)), "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)| ", perl = TRUE))
# Names for all pairwise combinations of which similarities need to be computed are contained in
# names_combn, e.g. for the example, names_combn = c("(X1 X22 X25) X2","(X1 X22 X25) (X3 X4)","(X1 X22 X25) X34","X2 (X3 X4)","X2 X34","(X3 X4) X34")
names_combn = apply(combn(names_cluster,2), 2, paste, collapse = " ")
# All possible combinations are also contained in the set combn, e.g. for the example
# combn = {{X2,X34},{X2,c(X3,X4)},{X2,c(X25,X1,X22)},{X34,c(X3,X4)},{X34,c(X25,X1,X22)},{c(X3,X4),c(X25,X1,X22)}}
# Recall that set elements have no order
combn = sets::set_combn(cluster,2)
# Combinations that have not yet been considered are stored in STC_new, with a total of length(STC_new)
STC_new = sets::set_complement(STC_computed,combn)
length_STC_new = length(STC_new)
count = 1 # Index keeping track of how many similarities of length(STC_new) are already computed
for(set in STC_new){ # Take a pair of clusters for which the similarity is to be computed, e.g. {X2,c(X3,X4)}
str_set = set_to_string(set) # Unique string representation, e.g. "X2 (X3 X4)"
dim = c() # Vector for dimensions of two clusters contained in set
sample = c() # Vector (matrix) for sample data of two clusters contained in set
for(i in set){ # For each of the two clusters, e.g. X2 and c(X3,X4)
dim = c(dim,length(i)) # Concatenate dimensions, e.g. c(1,2)
sample = cbind(sample,data[,i,drop = F]) # Concatenate sample data (recall that colnames(data) = X1,...,Xq)
}
names_to_keep = colnames(sample) # Keep the names of the current sample in consideration
colnames(sample) = NULL # Then remove colnames of sample
# In case ncol(sample) == 2, we have two univariate random variables
if(ncol(sample) == 2){
S = sim(sample,dim,est_method) # Compute bivariate similarity
if(trace > 0){
cat(paste("Bivariate similarity ", stepp, " out of ", choose(q,2), " is found"), fill = T)
}
stepp = stepp + 1 # Update stepp index
}
# In case ncol(sample) > max_dim, use link function to compute similarity
else if(ncol(sample) > max_dim){
S = link(names_to_keep,all,dim,link = link) # Compute link similarity
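# Note: although link is a character string in this scope, R's function-call lookup skips
# non-function objects, so link(...) resolves to the auxiliary link() function defined below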
} else{ # Or compute multivariate similarity
S = sim(sample,dim,est_method)
if(trace > 0){
cat(paste(ncol(sample), "dimensional similarity is found"), fill = T)
}
}
all[[str_set]] = S # Update hash object, where key = unique string representation of the two clusters
# values = similarity between the two clusters
if(choose(q,2) != length_STC_new){ # In case we are not in the first iteration of the algorithm
# where only bivariate similarities are computed
if(trace > 0){
cat(paste("Similarity", count, " out of ", length_STC_new, " is found"), fill = T)
}
count = count + 1 # Update count index
}
}
# Val_all is a named numerical vector with names the unique string representations and
# values the corresponding similarity
val_all = unlist(as.list(all))
val_all = val_all[names(val_all) %in% names_combn] # Only consider those computed in the current iteration
# Take the ones with largest similarity, e.g. "X2 (X3 X4)" and convert back to a set, e.g. {X2, c(X3,X4)}
best_set = string_to_set(names(val_all[val_all == max(val_all)])[1])
# Put all variables included in best_set in a vector ordered according to index
best_vct = gtools::mixedsort(set_to_vct(best_set))
# A new partition is obtained by removing best_set and adding the merged two clusters
# e.g. remove X2 and c(X3,X4) and add c(X2,X3,X4)
cluster = sets::set_union(sets::set_symdiff(cluster,best_set),sets::set(best_vct))
# Add the new partition to the hierarchy
hierarchy = sets::set_union(hierarchy,sets::set(cluster))
if(trace > 0){
cat(paste("Partition of ", q - index , " elements is found"), fill = T)
}
STC_computed = sets::set_union(STC_new,STC_computed) # Update already computed similarities
index = index + 1 # Update index
if(length(cluster) == 2){ # Stop merging when only two clusters remain (can be altered)
finish = 1
}
}
end_time = Sys.time()
if(trace > 0){
cat(paste("Elapsed time (in minutes): ", difftime(end_time, start_time, units='mins')), fill = T) # Print total running time
cat("Finalizing the algorithm", fill = T)
}
# Finalization
# Complete the hierarchy by adding the final partition (consisting of all variables in one single cluster)
hierarchy = sets::set_union(hierarchy,sets::set(sets::set(colnames(data))))
# We summarize all partitions in a hash object with keys Aleph_1 (first partition), ..., Aleph_q (q'th partition)
# and values the unique string representation of the respective partition
new_hierarchy = hash::hash()
pos = 0
for(set in hierarchy){
set = sort_set(set)
name = paste("Aleph_",length(hierarchy) - pos, sep = "")
new_hierarchy[[name]] = set_to_string(set)
pos = pos + 1
}
if(trace > 0){
cat("Computing average diameter and maximum split", fill = T)
}
# We compute the average diameters of all partitions
avg_diameters = integer(length(hierarchy))
pos = 1
for(set in hierarchy){
avg_diameters[pos] = avg_diam(set,all)
pos = pos + 1
}
# We compute the maximum splits of all partitions
max_splits = integer(length(hierarchy)-1)
pos = 1
for(set in hierarchy){
if(length(set) != 1){
max_splits[pos] = max_split(set,all)
pos = pos + 1
}
}
end_time = Sys.time()
if(trace > 0){
cat(paste("Elapsed time (in minutes): ", difftime(end_time, start_time, units='mins')), fill = T) # Print total running time
}
# Return a list containing the "hierarchy" (hash object), "all" similarities computed in the algorithm (hash)
# the average diameters "diam" (vector) and the maximum splits "split" (vector)
return(list("hierarchy" = new_hierarchy, "all" = all, "diam" = avg_diameters, "split" = max_splits))
}
# Auxiliary functions
sort_set = function(set){
# This function sorts all vectors in set
# Each vector of the form c("Xi1",...,"Xik") gets sorted according to i1,...,ik
# and converted to a string of the form "(Xj1 ... Xjl)"
for(i in set){
if(length(i) > 1){
set = sets::set_union(sets::set_symdiff(set,sets::set(i)),sets::set(paste("(",paste(i[order(as.numeric(gsub('X', '', i)))],collapse=" "),")",sep = "")))
}
}
return(set)
}
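# Example: sort_set(sets::set(c("X3","X1"),"X2")) returns sets::set("(X1 X3)","X2")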
set_to_vct = function(set){
# Puts set elements in a vector
vct = c()
for(i in set){
vct = c(vct,i)
}
return(vct)
}
set_to_string = function(set){
# Gives unique string representation of a partition by first sorting within each group,
# and converting to string vector notation (using sort_set) and then sorting the groups
# according to their first X index
# Returns one string representing the clustering
# Example: sets::set(c("X34"),c("X2"),c("X3","X4"),c("X25","X1","X22"))
# gives "(X1 X22 X25) X2 (X3 X4) X34"
vect = set_to_vct(sort_set(set))
for(i in 1:length(vect)){
if(substr(vect[i],1,1) == "("){
vect[i] = substring(vect[i], 2)
}
}
firsts = c()
for(i in 1:length(vect)){
firsts = c(firsts, sub(" .*", "", vect[i]))
}
order = order(as.numeric(gsub('X', '', firsts)))
vect = vect[order]
for(i in 1:length(vect)){
if(gsub(" ", "",vect[i], fixed = TRUE) != vect[i]){
vect[i] = paste("(",vect[i], sep = "")
}
}
return(paste(vect, collapse = ' '))
}
string_to_set = function(string){
# Reverse operation of set_to_string
string = unlist(strsplit(string, "(\\((?:[^()]++|(?1))*\\))(*SKIP)(*F)| ", perl=TRUE))
set = sets::as.set(NULL)
for(i in 1:length(string)){
set = sets::set_union(set,sets::set(unlist(strsplit(gsub("\\(|\\)", "", string[i]), "\\s+"))))
}
return(set)
}
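# Example: string_to_set("(X1 X22 X25) X2 (X3 X4) X34")
# returns sets::set(c("X1","X22","X25"),"X2",c("X3","X4"),"X34")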
avg_diam = function(partition, all){
# Average diameter of a partition
diameters = integer(length(partition))
index = 1
for(i in partition){
if(length(i) == 1){
min_s = 1
} else{
STC = sets::set_combn(i,2)
S_values = integer(length(STC))
pos = 1
for(set in STC){
str_set = set_to_string(set)
S_values[pos] = all[[str_set]]
pos = pos + 1
}
min_s = min(S_values)
}
diameters[index] = min_s
index = index + 1
}
return(mean(diameters))
}
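# Example: for partition = sets::set(c("X1","X2"),"X3"), the cluster diameters are
# all[["X1 X2"]] (only pair within the first cluster) and 1 (singleton cluster),
# so avg_diam returns (all[["X1 X2"]] + 1)/2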
max_split = function(partition, all){
# Maximum split of a partition
splits = integer(length(partition))
index = 1
for(i in partition){
rest = sets::set_symdiff(partition,sets::set(i))
S_values = c()
for(j in i){
for(k in rest){
for(l in k){
str_set = set_to_string(sets::set(j,l))
S_values = c(S_values,all[[str_set]])
}
}
}
splits[index] = max(S_values)
index = index + 1
}
return(max(splits))
}
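# Example: for partition = sets::set(c("X1","X2"),"X3"), both clusters have
# split equal to max(all[["X1 X3"]], all[["X2 X3"]]), which is also the value returned by max_split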
## Link function used in Icluster algorithm
link = function(names, all, dim, link){
# This function computes the similarity between two groups by taking the average, single
# or complete linkage of all pairwise inter-cluster similarities that were already computed and stored in all
# names contains the variable names of the two groups (first group first)
d1 = ifelse(dim[1] == 1,1,dim[1])
d2 = ifelse(dim[2] == 1,1,dim[2])
dep = matrix(0,d1,d2)
for(i in 1:d1){
for(j in 1:d2){
if(d1 == 1){
name1 = names[1]
} else{
name1 = names[i]
}
if(d2 == 1){
name2 = names[length(names)]
} else{
name2 = names[dim[1] + j]
}
dep[i,j] = all[[set_to_string(sets::set(name1,name2))]]
}
}
if(link == "average"){
return(mean(dep))
}
if(link == "single"){
return(min(dep))
}
if(link == "complete"){
return(max(dep))
}
}
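# Example (comments only): for clusters ("X1") and ("X2","X3"), i.e. names = c("X1","X2","X3") and
# dim = c(1,2), link = "average" returns mean(c(all[["X1 X2"]], all[["X1 X3"]])),
# the average of the pairwise inter-cluster similarities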