# R/trainHVT.R

#' @name trainHVT
#' @title Constructing Hierarchical Voronoi Tessellations
#' @description This is the main function to construct hierarchical Voronoi tessellations.
#' This is done using hierarchical vector quantization (HVQ). The data is represented in 2D
#' coordinates and the tessellations are plotted using these coordinates as
#' centroids. For subsequent levels, a transformation is performed on the 2D
#' coordinates to place all the points within their parent tile. Tessellations are
#' plotted using these transformed points as centroids.
#' @param dataset Data frame. A data frame with numeric columns (features) used for training the model.
#' @param min_compression_perc Numeric. An integer, indicating the minimum compression percentage to be achieved for the dataset. 
#' It indicates the desired level of reduction in dataset size compared to its original size.
#' @param n_cells Numeric. An integer, indicating the number of cells per hierarchy (level).
#' @param depth Numeric. An integer, indicating the number of levels. A depth of 1 means no hierarchy (single level), 
#' while higher values indicate multiple levels (hierarchy).
#' @param quant.err Numeric. A number indicating the quantization error threshold.
#' A cell will only break down into further cells if its quantization error is
#' above the defined threshold.
#' @param normalize Logical. A logical value indicating if the dataset should be normalized. When set to TRUE, 
#' scales the values of all features to have a mean of 0 and a standard deviation of 1 (Z-score).
#' @param distance_metric Character. The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default.
#' The distance metric is used to calculate the distance between an n-dimensional point and a centroid.
#' @param error_metric Character. The error metric can be mean or max. max is selected by default.
#' max returns the maximum of the m distances and mean returns their mean, where
#' each value is the distance between a point and the centroid of its cell.
#' @param quant_method Character. The quantization method can be kmeans or kmedoids. kmeans uses means (centroids) as cluster centers
#' while kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default.
#' @param scale_summary List. A list with user-defined mean and standard deviation values for all the features in the dataset. 
#' Pass the scale summary when normalize is set to FALSE.
#' @param diagnose Logical. A logical value indicating whether the user wants to perform diagnostics on the model.
#' Default value is FALSE.
#' @param hvt_validation Logical. A logical value indicating whether the user wants to hold out a validation set and find
#' the mean absolute deviation of the validation points from the centroids. Default value is FALSE.
#' @param train_validation_split_ratio  Numeric. A numeric value indicating train validation split ratio. 
#' This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8.
#' @param dim_reduction_method Character. The dimensionality reduction method, one of "tsne", "umap" or "sammon". Default is "sammon".
#' @param tsne_perplexity Numeric. Used only when dim_reduction_method is set to "tsne".
#' Default value is 30; common values are between 30 and 50.
#' @param tsne_theta Numeric. Used only when dim_reduction_method is set to "tsne".
#' Default value is 0.2; common values are between 0.2 and 0.5.
#' @param tsne_eta Numeric. Used only when dim_reduction_method is set to "tsne".
#' Default value is 200.
#' @param tsne_verbose Logical. A logical value indicating whether the t-SNE algorithm
#' should print detailed progress information to the console.
#' @param tsne_max_iter Numeric. Used only when dim_reduction_method is set to "tsne".
#' Default value is 500. More iterations can improve results but increase computation time.
#' @param umap_n_neighbors Integer. Used only when dim_reduction_method is set to "umap".
#' Default value is 60. Controls the balance between local and global structure in the data.
#' @param umap_n_components Integer. Used only when dim_reduction_method is set to "umap".
#' Default value is 2. Indicates the number of dimensions of the embedding.
#' @param umap_min_dist Numeric. Used only when dim_reduction_method is set to "umap".
#' Default value is 0.1. Controls how tightly UMAP packs points together.
#' @return A nested list that contains the hierarchical tessellation information. This
#' list has to be given as an input argument to plot the tessellations.
#' \item{[[1]] }{A list containing information related to plotting tessellations.
#' This information includes coordinates, boundaries, and other details necessary for visualizing the tessellations.}
#' \item{[[2]] }{A list containing the coordinates of the data points in the
#' reduced-dimensional space (Sammon's projection by default).}
#' \item{[[3]] }{A list containing detailed information about the hierarchical vector quantized data along with
#' a summary section containing the number of points, the quantization error and the centroid for each cell.}
#' \item{[[4]] }{A list that contains all the diagnostics information of the model when diagnose is set to TRUE. 
#' Otherwise NA.}
#' \item{[[5]] }{A list that contains all the information required to generate a Mean Absolute Deviation (MAD) plot,
#' if hvt_validation is set to TRUE. Otherwise NA.}
#' \item{[[6]]}{A list containing detailed information about the hierarchical vector quantized data along with a
#' summary section containing the number of points, the quantization error and the centroid for each cell; this is the raw output of `hvq`.}
#' \item{[[7]]}{model info: A list that contains the model-generated timestamp, the input parameters passed to the model,
#' the validation results and the dimensionality reduction evaluation metrics table.}
#' @author Shubhra Prakash <shubhra.prakash@@mu-sigma.com>, Sangeet Moy Das <sangeet.das@@mu-sigma.com>, Shantanu Vaidya <shantanu.vaidya@@mu-sigma.com>,Bidesh Ghosh <bidesh.gosh@@mu-sigma.com>,Alimpan Dey <alimpan.dey@@mu-sigma.com>
#' @seealso \code{\link{plotHVT}}
#' @keywords Training_or_Compression
#' @importFrom magrittr %>%
#' @importFrom stats dist kmeans
#' @examples
#' data("EuStockMarkets")
#' hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1, 
#'                        distance_metric = "L1_Norm", error_metric = "max",
#'                        normalize = TRUE,quant_method="kmeans")
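#'
#' # A hedged sketch (not run): the same data projected with UMAP instead of the
#' # default Sammon's projection; the parameter values here are illustrative only.
#' \dontrun{
#' hvt.umap <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1,
#'                      normalize = TRUE, dim_reduction_method = "umap",
#'                      umap_n_neighbors = 30, umap_min_dist = 0.1)
#' hvt.umap[["model_info"]]$distance_measures  # dimensionality reduction metrics
#' }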
#' @include hvq.R getCellId.R madPlot.R displayTable.R
#' @include Add_boundary_points.R  Corrected_Tessellations.R  DelaunayInfo.R  Delete_Outpoints.R diagPlot.R
#' @export trainHVT


trainHVT <-
  function(dataset,
           min_compression_perc = NA,
           n_cells = NA,
           depth = 1,
           quant.err = 0.2,
           normalize = FALSE,
           distance_metric = "L1_Norm",
           error_metric =  "max",
           quant_method="kmeans",
           scale_summary = NA,
           diagnose=FALSE,
           hvt_validation=FALSE,
           train_validation_split_ratio=0.8,
           dim_reduction_method = "sammon",
           tsne_theta = 0.2,
           tsne_eta = 200,
           tsne_perplexity = 30,
           tsne_verbose = TRUE,
           tsne_max_iter = 500,
           umap_n_neighbors = 60,
           umap_n_components = 2, 
           umap_min_dist = 0.1
  ) {
    
    set.seed(279)
    requireNamespace("deldir")       #deldir function
    requireNamespace("Hmisc")        #ceil function
    requireNamespace("grDevices")    #chull function
    requireNamespace("splancs")      #csr function
    requireNamespace("sp")           #point.in.polygon function
    requireNamespace("conf.design")  #factorize functionScreenshot from 2022-10-26 16-15-44
    # requireNamespace("uwot")
    # requireNamespace("dplyr")
    # requireNamespace("scales")
    # requireNamespace("Matrix")
    # requireNamespace("stats")
    # requireNamespace("cluster")
    # requireNamespace("magrittr")
    # requireNamespace("kableExtra")
    
    if(quant_method=="kmedoids"){message(' K-Medoids: Run time for vector quantization using K-Medoids is very high for large number of clusters.')}
    dataset <- as.data.frame(dataset)
    # dataset <- as.data.frame(sapply(dataset[,1:length(dataset[1,])], as.numeric))
    dataset <- data.frame(
      lapply(dataset, as.numeric),
      check.names = FALSE,
      row.names = rownames(dataset)
    )
    data_structure <- dim(dataset)
    
    ## Diagnose Function
    if(hvt_validation){
      if(train_validation_split_ratio > 1 | train_validation_split_ratio < 0){stop("The train_validation_split_ratio should be in the range 0-1")}
      message(paste0("hvt_validation has been set to TRUE; the data will be randomly split into train:validation in the ratio ",train_validation_split_ratio*100,":",(1-train_validation_split_ratio)*100))
      
      set.seed(273)
      smp_size <- floor(train_validation_split_ratio * nrow(dataset))
      train_ind <- sample(seq_len(nrow(dataset)), size = smp_size)
      
      validation_data <- dataset[-train_ind, ]
      dataset <- dataset[train_ind, ]
      
    }
    
    
    if (normalize) {
      scaledata <- scale(dataset, scale = TRUE, center = TRUE)
      rownames(scaledata) <- rownames(dataset)
      
      mean_data <- attr(scaledata, "scaled:center")
      std_data <- attr(scaledata, "scaled:scale")
      
      scale_summary <- list(mean_data = mean_data, std_data = std_data)
      
      # flog.info("scaling is done")
    } else {
      scaledata <- as.matrix(dataset)
      rownames(scaledata) <- rownames(dataset)
      # flog.info("The data is not scaled as per the user requirement")
      # keep the user-supplied scale_summary if provided; otherwise leave it as NA
      if(all(is.na(scale_summary))){
        scale_summary <- NA
      }
      
    }
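
    ## Note: scale() stores the centring and scaling vectors as attributes
    ## ("scaled:center" / "scaled:scale"); they are captured in scale_summary
    ## above so that new data can later be normalised with the same statistics.
    ## A minimal sketch of the same idea, for reference (not executed):
    #   s <- scale(df)
    #   list(mean_data = attr(s, "scaled:center"), std_data = attr(s, "scaled:scale"))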
    
    polinfo <- hvqdata <- list()
    hvq_k <- hvq(
      scaledata,
      min_compression_perc = min_compression_perc,
      n_cells = n_cells,
      depth = depth,
      quant.err = quant.err,
      distance_metric = distance_metric,
      error_metric = error_metric,
      quant_method=quant_method
    )
    
    if(!is.na(min_compression_perc)){
      n_cells <- hvq_k$compression_summary$noOfCells
      depth <- 1
    }
    
    # flog.info("HVQ output is ready")
    hvqoutput <- hvq_k$summary
    
    gdata <- hvqoutput  #assign the output of hvq file to gdata
    #cleaning the data by deleting the rows containing NA's
    #gdata <- gdata[-which(is.na(gdata[, 5])), ]
    gdata <- hvqoutput[stats::complete.cases(hvqoutput), ]
    hvqdata <- gdata
    # flog.info("NA's are removed from the HVQ output")
    
    #to store hvqdata according to each level
    tessdata <- input.tessdata <- list()
    #stores sammon co-ordinates and segregated according to each level
    points2d <- list()
    #sammon datapoints to be stored in hierarchy
    rawdeldata <- list()
    #rawdeldata is transformed to be within its parent tile and stored in new_rawdeldata
    new_rawdeldata <- list()
    #contains the vertices of the parent polygon
    pol_info <- polygon_info <- list()
    #number of levels
    nlevel <- length(unique(gdata[, "Segment.Level"]))
    #verify if the transformed points are correct
    transpoints <- list()
    
    # New columns added to the dataset by the hvq function
    newcols <-  ncol(hvqoutput) - ncol(dataset)
    # Variable to store information on mapping
    par_map <- list()
    
    
    projection.scale <-  10
    for (i in 1:nlevel) {
      #hvqdata segregated according to different levels
      tessdata[[i]] <- gdata[which(gdata[, "Segment.Level"]==i),]
      # tessdata[[i]] <- tessdata[[i]] %>% mutate(across(where(is.double), round, 8))
      # tessdata[[i]] <- tessdata[[i]] %>% mutate_if(is.double, round, 8)
    }
    
    counter <- c(1:nlevel)
    
    
    dim_reduction <- function(x){
      requireNamespace("Rtsne")
      requireNamespace("umap")
      requireNamespace("FNN")
      requireNamespace("MASS")
      data <- unique(tessdata[[x]][,(newcols +1):ncol(hvqoutput)])
      
      if (dim_reduction_method == "sammon"){
        return( projection.scale * MASS::sammon(stats::dist(data),niter = 10^5)$points)
      }else if (dim_reduction_method == "tsne"){
        set.seed(232)
        tsne_data <- (projection.scale * Rtsne::Rtsne(data,
                                                      theta = tsne_theta,
                                                      eta = tsne_eta,
                                                      perplexity = tsne_perplexity,
                                                      verbose = tsne_verbose,
                                                      max_iter = tsne_max_iter)$Y)
        rownames(tsne_data) <- rownames(tessdata[[x]])
        return(tsne_data)
      }else if (dim_reduction_method == "umap"){
        set.seed(232)
        return(projection.scale * umap::umap(data,
                                             n_neighbors = umap_n_neighbors,
                                             n_components = umap_n_components, 
                                             min_dist = umap_min_dist)$layout)
      }
    }
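    ## Hedged illustration of what dim_reduction() returns (not executed): for
    ## the default method it is the scaled output of MASS::sammon(), an m x 2
    ## matrix of 2D coordinates, one row per unique centroid at that level.
    #   head(dim_reduction(1))   # two columns of projected coordinates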
    start_time <- Sys.time()
    points2d <- lapply(counter,dim_reduction)
    end_time <-Sys.time()
    
    time_for_projection <- as.numeric(end_time - start_time)
    
    ## Function for Trustworthiness
    calculate_trustworthiness <- function(X, X_embedded, k = 5) {
      n <- nrow(X)
      # Compute the pairwise distance matrices
      dist_original <- as.matrix(dist(X))
      dist_embedded <- as.matrix(dist(X_embedded))
      # Compute ranks
      rank_original <- t(apply(dist_original, 1, rank))
      rank_embedded <- t(apply(dist_embedded, 1, rank))
      # Determine which points are in the top k in the embedded space but not in the original space
      U_k_matrix <- (rank_embedded <= k) & !(rank_original <= k)
      # Calculate the rank differences
      rank_differences <- (rank_original - k) * U_k_matrix
      # Sum all rank differences
      trustworthiness_sum <- sum(rank_differences[rank_differences > 0])
      # Normalize the trustworthiness score
      trustworthiness <- 1 - (2 / (n * k * (2 * n - 3 * k - 1)) * trustworthiness_sum)
      return(trustworthiness)
    }
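    ## The helper above implements the usual rank-based trustworthiness score,
    ##   T(k) = 1 - 2/(n*k*(2n - 3k - 1)) * sum over points i of (r(i, j) - k),
    ## summed over neighbours j that are among the k nearest neighbours of i in
    ## the embedding but not in the original space, with r(i, j) the original-space
    ## rank. A hedged sanity check (illustrative, not executed here):
    #   X <- matrix(rnorm(200), ncol = 4)
    #   calculate_trustworthiness(X, X, k = 5)   # identity embedding -> 1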
    
    ## Function For Continuity
    calculate_continuity <- function(X, X_embedded, k = 5) {
      n <- nrow(X)
      # Compute the pairwise distance matrices
      dist_original <- as.matrix(dist(X))
      dist_embedded <- as.matrix(dist(X_embedded))
      # Compute ranks
      rank_original <- t(apply(dist_original, 1, rank))
      rank_embedded <- t(apply(dist_embedded, 1, rank))
      # Determine which points are in the top k in the original space but not in the embedded space
      U_k_matrix <- (rank_original <= k) & !(rank_embedded <= k)
      # Calculate the rank differences
      rank_differences <- (rank_embedded - k) * U_k_matrix
      # Sum all rank differences
      continuity_sum <- sum(rank_differences[rank_differences > 0])
      # Normalize the continuity score
      continuity <- 1 - (2 / (n * k * (2 * n - 3 * k - 1)) * continuity_sum)
      return(continuity)
    }
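    ## Continuity is the mirror image of trustworthiness: it penalises
    ## original-space neighbours that are lost in the embedding, using the same
    ## normalisation. Hedged sanity check (not executed; X as in the sketch above):
    #   calculate_continuity(X, X, k = 5)        # identity embedding -> 1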
    
    ## RMSE
    calculate_normalized_rmse <- function(embeddingss) {
      
      # Ensure the original data is in matrix form
      if (is.list(embeddingss[["original"]])) {
        original_data <- do.call(cbind, embeddingss[["original"]])  # Convert list to matrix
      } else {
        original_data <- as.matrix(embeddingss[["original"]])  # Convert data frame to matrix if needed
      }
      
      # Replace the 'original' entry with the matrix
      embeddingss[["original"]] <- original_data
      
      # Check the number of points in the original and reduced data
      if (nrow(embeddingss[["original"]]) != nrow(embeddingss[["points_df"]])) {
        stop("The number of points in the original data and the reduced 2D data do not match.")
      }
      
      # Compute pairwise distances in the original (high-dimensional) space
      original_distances <- as.matrix(dist(embeddingss[["original"]], method = "euclidean"))
      
      # Compute pairwise distances in the reduced space (2D)
      reduced_distances <- as.matrix(dist(embeddingss[["points_df"]], method = "euclidean"))
      
      # Check if the distance matrices are conformable
      if (!all(dim(original_distances) == dim(reduced_distances))) {
        stop("The distance matrices are not conformable for RMSE calculation.")
      }
      
      # RMSE of the differences between the two distance matrices
      difference_matrix <- original_distances - reduced_distances
      
      # Square the differences
      squared_differences <- difference_matrix^2
      
      # Calculate the mean of the squared differences
      mean_squared_error <- mean(squared_differences)
      
      # Calculate the RMSE by taking the square root of the mean squared error
      rmse <- sqrt(mean_squared_error)
      
      return(rmse)
    }
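    ## Despite its name, the helper above returns the plain RMSE between the two
    ## pairwise-distance matrices: sqrt(mean((D_original - D_reduced)^2)).
    ## Hedged usage sketch (not executed), with a hypothetical matrix X:
    #   calculate_normalized_rmse(list(original = X, points_df = X[, 1:2]))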
    
    ###### Silhouette Score
    calculate_average_silhouette <- function(data, centers = 2, seed = 123) {
      
      set.seed(seed)
      
      # Apply k-means clustering
      kmeans_result <- kmeans(data, centers = centers)
      
      # Vectorized function to compute Silhouette score manually
      calculate_silhouette_vectorized <- function(data, clusters) {
        dist_matrix <- as.matrix(dist(data))
        
        # Compute the intra-cluster distance (a_i)
        a_i <- sapply(1:length(clusters), function(i) {
          cluster_points <- which(clusters == clusters[i])
          if (length(cluster_points) > 1) {
            mean(dist_matrix[i, cluster_points])
          } else {
            0
          }
        })
        
        # Compute the nearest-cluster distance (b_i)
        b_i <- sapply(1:length(clusters), function(i) {
          other_clusters <- unique(clusters)[unique(clusters) != clusters[i]]
          min(sapply(other_clusters, function(cluster) {
            cluster_points <- which(clusters == cluster)
            mean(dist_matrix[i, cluster_points])
          }))
        })
        
        # Calculate Silhouette scores
        sil_scores <- (b_i - a_i) / pmax(a_i, b_i)
        return(sil_scores)
      }
      
      # Calculate Silhouette scores
      silhouette_scores <- calculate_silhouette_vectorized(data, kmeans_result$cluster)
      
      # Calculate the average Silhouette score
      average_silhouette_score <- mean(silhouette_scores)
      
      return(average_silhouette_score)
    }
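    ## Per-point silhouette as computed above: s(i) = (b_i - a_i) / max(a_i, b_i),
    ## with a_i the mean distance from i to points in its own cluster and b_i the
    ## mean distance to the nearest other cluster; the helper averages s(i) over a
    ## k-means (centers = 2) clustering of the 2D projection. Hedged sketch (not executed):
    #   calculate_average_silhouette(points2d_df, centers = 2)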
    
    ## Sammon's Stress
    sammon_stress <- function(original_distances, projected_embeddings) {
      # Calculate pairwise distances for projected embeddings
      projected_distances <- dist(projected_embeddings)
      # Filter out zero distances to avoid division by zero
      mask <- original_distances > 0
      original_distances <- original_distances[mask]
      projected_distances <- projected_distances[mask]
      # Compute the numerator and denominator
      numerator <- sum(((original_distances - projected_distances) / original_distances) ^2)
      denominator <- sum(original_distances ^ 2)  # Usually just the sum of squared original distances
      stress <-  numerator/denominator
      return(stress)
    }
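    ## As implemented, the stress is sum(((d - d_hat) / d)^2) / sum(d^2) over all
    ## pairwise distances d (original) and d_hat (projected). Note this is a scaled
    ## variant; classical Sammon stress is (1 / sum(d)) * sum((d - d_hat)^2 / d).
    ## Hedged sketch (not executed; X hypothetical):
    #   sammon_stress(dist(X), X)   # exactly 0: all distances are preserved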

    ## KNN Retention function
    knn_retention <- function(high_dim_embeddings, low_dim_embeddings, k = 8) {
      high_dim_embeddings <- as.matrix(high_dim_embeddings)
      low_dim_embeddings <- as.matrix(low_dim_embeddings)
      
      n <- nrow(high_dim_embeddings)  # Total data points
      k <- min(k, n - 1)  # Ensure k is not greater than available points
      
      # Find k-nearest neighbors in high-dimensional space
      high_dim_nn <- FNN::knn.index(high_dim_embeddings, k = k)
      
      # Find k-nearest neighbors in low-dimensional space
      low_dim_nn <- FNN::knn.index(low_dim_embeddings, k = k)
      
      # Calculate total overlap
      total_overlap <- sum(sapply(1:n, function(i) {
        length(intersect(high_dim_nn[i, ], low_dim_nn[i, ]))
      }))
      
      # Calculate average retention
      avg_retention <- total_overlap / (n * k)
      return(avg_retention)
    }
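    ## knn_retention() measures the average fraction of each point's k nearest
    ## neighbours that survive the projection (1 = all neighbourhoods kept).
    ## Hedged sanity check (not executed; X as above):
    #   knn_retention(X, X, k = 8)   # identical spaces -> 1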
    
    
    if(depth == 1){
      points2d_df = as.data.frame(points2d)
      new_tessdata = unique(tessdata[[1]][,(newcols +1):ncol(hvqoutput)])
      
      ## Trustworthiness
      trustworthiness <- calculate_trustworthiness(new_tessdata, points2d_df, k = 5)
      
      ## Continuity Score
      continuity_score <- calculate_continuity(new_tessdata, points2d_df, k = 5)
      
      ## RMSE Score
      embeddingss <- list(
        original = new_tessdata,  
        points_df = points2d_df
      )
      rmse_value<- calculate_normalized_rmse(embeddingss)

      ## Silhouette Score
      average_silhouette_score <- calculate_average_silhouette(points2d_df)
      
      ## Sammon's Stress
      # Calculate pairwise distances in the original space
      original_distances <- dist(embeddingss$original)
      # Calculate Sammon's Stress 
      stress_score <- sammon_stress(original_distances, embeddingss$points_df)
      
      ## KNN Retention
      avg_retention_score <- knn_retention(embeddingss[["original"]], embeddingss[["points_df"]], k = 8)
      
      ## Creating the Metrics df with the values.
      metrics_df<- data.frame(
        `L1_Metrics` = c("Structure Preservation Metrics", "Structure Preservation Metrics", "Structure Preservation Metrics", "Distance Preservation Metrics", "Interpretive Quality Metrics", "Interpretive Quality Metrics","Computational Efficiency Metrics"),
        `L2_Metrics` = c("Trustworthiness", "Continuity", "Sammon's Stress", "RMSE", "Silhouette Score", "KNN Retention Score","Execution Duration(sec)"),
        Value = c(trustworthiness, continuity_score, stress_score, rmse_value, average_silhouette_score, avg_retention_score, round(time_for_projection,4))
        # stringsAsFactors = FALSE
      )
    }
    
   
    
    for (i in 1:nlevel) {
     
      # #sammon datapoints grouped according to the hierarchy
      intermediate.rawdata <- list()
      rn = row.names(points2d[[i]])
      vec = as.integer(rn) - sum(n_cells ^ (0:(i - 1))) + 1
      
      index = 1
      for (j in unique(tessdata[[i]][, "Segment.Parent"])) {
        if (any(dplyr::between(j - vec / n_cells, 0, 0.9999))) {
          intermediate.rawdata[[index]] <-
            cbind(points2d[[i]][rn[dplyr::between(j - vec / n_cells, 0, 0.9999)], 1],
                  points2d[[i]][rn[dplyr::between(j - vec / n_cells, 0, 0.9999)], 2])
          index = index + 1
        }
      }
      rawdeldata[[i]] <- intermediate.rawdata
      
    }
    rm(intermediate.rawdata)
    # flog.info("Sammon points for all the levels have been calculated")
    
    #new_rawdeldata contains the transformed points of rawdeldata
    new_rawdeldata[[1]] <- rawdeldata[[1]]
    
    #deldat1 is the output of the deldir::deldir function and contains the tessellation information
    deldat1 <- deldat2 <- list()
    #deldir function of deldir package gives the tessellations output
    deldat2[[1]] <- deldir::deldir(new_rawdeldata[[1]][[1]][, 1],
                                   new_rawdeldata[[1]][[1]][, 2])
    deldat1[[1]] <- deldat2
    rm(deldat2)
    # flog.info("Tessellations are calculated for the first level")
    #plotting the tessellation
    #plot(deldat1[[1]][[1]], wlines = "tess", lty = 1, lwd = 4)
    
    #polygon_info stores parent tile vertices information
    #polygon_info is the modified tile.list output except for first level.
    pol_info[[1]] <- deldir::tile.list(deldat1[[1]][[1]])
    
    #loop through the polygons and add the row name as an id variable
    # row names in rawdeldata are preserved from hvqoutput; this row name ties
    # rawdeldata back to the hierarchy in the hvqoutput dataset
    pol_info[[1]] <- mapply(function(info, id) {
      info$id = as.numeric(id)
      info <-
        append(info, as.list(gdata[row.names(gdata) == id, c('Segment.Level', 'Segment.Parent', 'Segment.Child')]))
      return(info)
    }, pol_info[[1]], as.list(row.names(new_rawdeldata[[1]][[1]])), SIMPLIFY = FALSE)
    
    polygon_info[[1]] <- pol_info
    rm(pol_info)
    par_tile_indices <- n_par_tile <- list()
    par_tile_indices[[1]] <-
      unique(tessdata[[1]][, "Segment.Parent"])
    n_par_tile[[1]] <-
      length(unique(tessdata[[1]][, "Segment.Parent"]))
    if (nlevel < 2) {
      polinfo <- polygon_info
      
      fin_out <- list()
      
      fin_out[[1]] <- deldat1
      fin_out[[2]] <- polinfo
      fin_out[[3]] <- hvq_k
      fin_out[[6]] <- hvq_k
      fin_out[[3]][['scale_summary']] <- scale_summary
      level = 1
      level_names <- list()
      level_names[[level]] <- fin_out[[2]][[level]] %>% purrr::map(~ {
        this <- .
        element <- this[[1]]
        return(element$Segment.Parent)
      }) %>% unlist()
      
      names(fin_out[[2]][[level]]) <- level_names[[level]]
      
      ####### Adding Code for Diagnosis Plot
      diag_Suggestion=NA
      fin_out[[4]] = NA
      fin_out[[5]] <- NA
      if(diagnose){
        # fin_out[[4]] = NA
        diag_list =  diagPlot(hvt.results = fin_out,
                              data = dataset,
                              level = depth,
                              quant.err = quant.err,
                              distance_metric=distance_metric,
                              error_metric=error_metric
        )
        diag_Suggestion =  diagSuggestion(hvt.results = fin_out,
                                          data = dataset,
                                          level = depth,
                                          quant.err = quant.err,
                                          distance_metric=distance_metric,
                                          error_metric=error_metric
        )
        fin_out[[4]] = diag_list
        
      }
      
      ####### Adding Code for MAD Plot  
      if(hvt_validation){
        ####### MAD Plot ################
        
        predictions_validation = list()
        predictions_validation <- scoreHVT(
          dataset = validation_data,
          hvt.results.model=fin_out,
          child.level = depth,
          line.width = c(0.6, 0.4, 0.2),
          color.vec = c("#141B41", "#6369D1", "#D8D2E1"),
          mad.threshold = quant.err,
          distance_metric = distance_metric,
          error_metric = error_metric
        )
        
        mad_plot=madPlot(predictions_validation)
        fin_out[[5]] <- list()
        fin_out[[5]][["mad_plot"]] <- mad_plot
      }
      ### Model Info
      Dataset <- paste0(data_structure[1] ," Rows & ", data_structure[2], " Columns")
      
      model_train_date=Sys.time()
      input_parameters = list(
        input_dataset = Dataset,
        n_cells = n_cells,
        depth = depth,
        quant.err = quant.err,
        normalize = normalize,
        distance_metric = distance_metric[1],
        error_metric = error_metric[1],
        quant_method = quant_method[1],
        diagnose = diagnose,
        hvt_validation = hvt_validation,
        train_validation_split_ratio = train_validation_split_ratio
      )
      
  
      if(depth == 1){
        
        validation_result=diag_Suggestion
        fin_out[["model_info"]]=list(model_train_date=model_train_date,
                                     type="hvt_model",
                                     input_parameters = input_parameters,
                                     validation_result = validation_result,
                                     distance_measures = metrics_df)
      }else{
        validation_result=diag_Suggestion
        fin_out[["model_info"]]=list(model_train_date=model_train_date,
                                     type="hvt_model",
                                     input_parameters = input_parameters,
                                     validation_result = validation_result)
      }
      

      
      # generating cell ID using getCellId
      fin_out[[3]]$summary <- getCellId(hvt.results=fin_out)
      fin_out[[3]]$summary <- fin_out[[3]]$summary %>% dplyr::select(Segment.Level, Segment.Parent, Segment.Child, n, Cell.ID, Quant.Error, colnames(dataset))
      
      class(fin_out) <- "hvt.object"
      return(fin_out)
      
    } else { 
      for (i in 2:nlevel) {
        new_rawdeldata[[i]] <- list()
        par_tile_indices[[i]] <-
          unique(tessdata[[i]][, "Segment.Parent"])
        n_par_tile[[i]] <-
          length(unique(tessdata[[i]][, "Segment.Parent"]))
        par_map[[i - 1]] <- list()
        
        for (tileIndex in 1:n_par_tile[[(i - 1)]]) {
          # print(tileIndex)
          #a chunk of hvqdata which contains the rows corresponding to a particular parent tile
          gi_par_tiles <-
            par_tile_indices[[i]][intersect(which((par_tile_indices[[i]] / n_cells) <= par_tile_indices[[(i - 1)]][tileIndex]),
                                            which((par_tile_indices[[(i - 1)]][tileIndex] - 1) < (par_tile_indices[[i]] / n_cells)))]
          gidata <-
            tessdata[[i]][which(tessdata[[i]][, "Segment.Parent"] %in% gi_par_tiles),]
          par_map[[i - 1]][[par_tile_indices[[(i - 1)]][tileIndex]]] <-
            gi_par_tiles
          
          #datapoints corresponding to a particular parent tile
          rawdeldati <-
            rawdeldata[[i]][intersect(which((par_tile_indices[[i]] / n_cells) <= par_tile_indices[[(i - 1)]][tileIndex]),
                                      which((par_tile_indices[[(i - 1)]][tileIndex] - 1) < (par_tile_indices[[i]] / n_cells)))]
          if (nrow(gidata) != 0) {
            #transformation of rawdeldata such that they are inside the parent tile and output is stored in new_rawdeldata
            transpoints <-
              DelaunayInfo(gidata, polygon_info[[(i - 1)]][[tileIndex]], rawdeldati, n_cells)
            if (all(transpoints != "-1")) {
              new_rawdeldata[[i]] <- append(new_rawdeldata[[i]], transpoints)
            } else{
              deldat1[[i]] <- 0
              polinfo <<- polygon_info
              #flog.warn("Projection is not scaled enough and the polygon is very small to perform transformation")
              return(deldat1)
              #return("Projection is not scaled enough and the polygon is very small to perform transformation")
            }
            
          }
        }
        # flog.info("Sammon points of level %s are transformed", i)
        
        deldat2 <- list()
        par_tile_polygon <- list()
        #?
        for (tileNo in 1:n_par_tile[[i]]) {
          # print(tileNo)
          #modulus operation for the last index in polygon_info
          if (((par_tile_indices[[i]][tileNo]) %% n_cells)) {
            last_index <- (par_tile_indices[[i]][tileNo] %% n_cells)
          } else{
            last_index <- n_cells
          }
          
          #divide to get the parent tile
          sec_index <-
            Hmisc::ceil(par_tile_indices[[i]][tileNo] / n_cells)
          
          deldat2[[tileNo]] <-
            deldir::deldir(new_rawdeldata[[i]][[tileNo]][, 1],
                           new_rawdeldata[[i]][[tileNo]][, 2],
                           rw = c(
                             range(polygon_info[[(i - 1)]][[which(par_tile_indices[[(i - 1)]] == sec_index)]][[last_index]]$x) - c(0.5,-0.5),
                             range(polygon_info[[(i - 1)]][[which(par_tile_indices[[(i - 1)]] == sec_index)]][[last_index]]$y) - c(0.5,-0.5)
                           ))
          
          #CORRECT ROW NAMES
          deldat2[[tileNo]]$rownames.orig <-
            as.numeric(row.names(new_rawdeldata[[i]][[tileNo]]))
          
          #constructing parent polygons
          par_tile_polygon[[tileNo]] <-
            matrix(
              c(polygon_info[[(i - 1)]][[which(par_tile_indices[[(i - 1)]] == sec_index)]][[last_index]]$x,
                polygon_info[[(i - 1)]][[which(par_tile_indices[[(i - 1)]] == sec_index)]][[last_index]]$y),
              ncol = 2,
              byrow = FALSE
            )
          #correct the tessellations
          cur_dirsgs <- deldat2[[tileNo]]$dirsgs
          cur_tile <-
            polygon_info[[(i - 1)]][[which(par_tile_indices[[(i - 1)]] == sec_index)]][[last_index]]
          cur_polygon <- par_tile_polygon[[tileNo]]
          verify_dirsgs <- Corrected_Tessellations(cur_dirsgs,
                                                   cur_tile,
                                                   cur_polygon)
          
          #polygon is sufficient to calculate next level tessellations inside this
          if (all(verify_dirsgs[, 1:8] != "-1")) {
            deldat2[[tileNo]]$dirsgs <- verify_dirsgs
            # flog.info("Tessellations for level %s is calculated", i)
          } else{
            deldat1[[i]] <- 0
            #flog.warn("Projection is not scaled enough and the polygon is very small to perform transformation")
            polinfo <<- polygon_info
            return(deldat1)
            #return("Projection is not scaled enough to perform tessellations for the next level")
          }
          
          #delete lines with both the points identical
          new_dirsgs <- deldat2[[tileNo]]$dirsgs
          del_rows <- 0
          for (j in 1:nrow(new_dirsgs)) {
            if (round(new_dirsgs[j, "x1"], 6) == round(new_dirsgs[j, "x2"], 6) &&
                round(new_dirsgs[j, "y1"], 6) == round(new_dirsgs[j, "y2"], 6)) {
              del_rows <- c(del_rows, j)
            }
          }
          if (length(del_rows) > 1) {
            new_dirsgs <- new_dirsgs[-del_rows, ]
          }
          deldat2[[tileNo]]$dirsgs <- new_dirsgs
          # flog.info("Line with same endpoints are deleted")
        }
        
        deldat1[[i]] <- deldat2
        
        polygon_info[[i]] <- list()
        #polygon information to correct the polygons
        for (parentIndex in 1:n_par_tile[[i]]) {
          polygon_info[[i]][[parentIndex]] <-
            suppressMessages(deldir::tile.list(deldat1[[i]][[parentIndex]]))
          # Add id corresponding to each segment from rownames.orig information
          polygon_info[[i]][[parentIndex]] <-
            mapply(function(info, id) {
              info$id = as.numeric(id)
              info <-
                append(info, as.list(gdata[row.names(gdata) == id, c('Segment.Level', 'Segment.Parent', 'Segment.Child')]))
              return(info)
            },
            polygon_info[[i]][[parentIndex]],
            as.list(deldat1[[i]][[parentIndex]]$rownames.orig),
            SIMPLIFY = FALSE)
        }
        
        #to delete the points which are outside the parent tile
        
        for (parentIndex in 1:length(polygon_info[[i]])) {
          for (tileIndex in 1:length(polygon_info[[i]][[parentIndex]])) {
            tile <- polygon_info[[i]][[parentIndex]][[tileIndex]]
            check_dirsgs <- deldat1[[i]][[parentIndex]]$dirsgs
            if (nrow(check_dirsgs) != 0 && length(tile$x) != 0) {
              polygon_info[[i]][[parentIndex]][[tileIndex]] <-
                Delete_Outpoints (tile, check_dirsgs)
            }
          }
        }
        # flog.info("Endpoints of tessellation lines which are outside parent polygon are deleted")
        
        for (parentIndex in 1:n_par_tile[[i]]) {
          #modulo operation to obtain the last index
          if (((par_tile_indices[[i]][parentIndex]) %% n_cells)) {
            last_index <- (par_tile_indices[[i]][parentIndex] %% n_cells)
          } else{
            last_index <- n_cells
          }
          #divide to get second index
          sec_index <-
            Hmisc::ceil(par_tile_indices[[i]] / n_cells)[parentIndex]
          
          polygon_info[[i]][[parentIndex]] <-
            Add_boundary_points(polygon_info[[i]][[parentIndex]],
                                polygon_info[[(i - 1)]][[which(par_tile_indices[[(i - 1)]] == sec_index)]][[last_index]],
                                new_rawdeldata[[i]][[parentIndex]])
        }
        # flog.info("Vertices of parent tile are added to appropriate child tile")
      }
      
      polinfo <- polygon_info
      
      # par_map <- reshape2::melt(par_map)
      # colnames(par_map) <- c('Child','Parent','Level')
      # par_map <- par_map %>% mutate(ChildNo = (Parent-1)*n_cells+Child)
      
      fin_out <- list()
      
      fin_out[[1]] <- deldat1
      fin_out[[2]] <- polinfo
      fin_out[[3]] <- hvq_k
      fin_out[[6]] <- hvq_k
      fin_out[[3]][['scale_summary']] <- scale_summary
      # fin_out[[4]] <- par_map
      level_names <- list()
      
      for (level in 1:nlevel) {
        level_names[[level]] <- fin_out[[2]][[level]] %>% purrr::map(~ {
          this <- .
          element <- this[[1]]
          return(element$Segment.Parent)
        }) %>% unlist()
        
        names(fin_out[[2]][[level]]) <- level_names[[level]]
      }
      
      # fin_out <- correctedBoundaries(fin_out,nlevel)
      
      emptyParents <- depth - length(fin_out[[2]])
      for (i in seq_len(emptyParents)) {
        fin_out[[2]][[length(fin_out[[2]]) + 1]] <- list()
      }

      ####### Adding Code for Diagnosis Plot
      diag_Suggestion = NA
      fin_out[[4]] = NA
      fin_out[[5]] <- NA
      
      if(diagnose){
        fin_out[[4]] = NA
        diag_list = diagPlot(hvt.results = fin_out,
                             data = dataset,
                             level = depth,
                             quant.err = quant.err,
                             distance_metric=distance_metric,
                             error_metric=error_metric
        )
        
        diag_Suggestion = diagSuggestion(hvt.results = fin_out,
                                         data = dataset,
                                         level = depth,
                                         quant.err = quant.err,
                                         distance_metric=distance_metric,
                                         error_metric=error_metric
        )
        fin_out[[4]] = diag_list
        
      }
      
      
      ####### Adding Code for MAD Plot
      if(hvt_validation){
        # browser()
        
        ####### MAD Plot ################
        predictions_validation = list()
        predictions_validation <- scoreHVT(
          dataset = validation_data,
          hvt.results.model=fin_out,
          child.level = depth,
          line.width = c(0.6, 0.4, 0.2),
          color.vec = c("#141B41", "#6369D1", "#D8D2E1"),
          mad.threshold = quant.err,
          distance_metric = distance_metric,
          error_metric = error_metric
        )
        
        mad_plot=madPlot(predictions_validation)
        fin_out[[5]] <- list()
        fin_out[[5]][["mad_plot"]] <- mad_plot
      }
      

      ### Model Info
      Dataset <- paste0(data_structure[1] ," Rows & ", data_structure[2], " Columns")
      
      model_train_date=Sys.time()
      input_parameters = list(
        input_dataset = Dataset,
        n_cells = n_cells,
        depth = depth,
        quant.err = quant.err,
        normalize = normalize,
        distance_metric = distance_metric[1],
        error_metric = error_metric[1],
        quant_method = quant_method[1],
        diagnose = diagnose,
        hvt_validation = hvt_validation,
        train_validation_split_ratio = train_validation_split_ratio,
        dim_reduction_method = dim_reduction_method[1],
        tsne_theta = tsne_theta[1],
        tsne_eta = tsne_eta[1],
        tsne_perplexity = tsne_perplexity[1],
        tsne_verbose = tsne_verbose[1],
        tsne_max_iter = tsne_max_iter[1],
        umap_n_neighbors = umap_n_neighbors[1],
        umap_n_components = umap_n_components[1],
        umap_min_dist = umap_min_dist[1]
        
      )
      
      validation_result=diag_Suggestion
      fin_out[["model_info"]]=list(model_train_date=model_train_date,
                                   type="hvt_model",
                                   input_parameters=input_parameters,
                                   validation_result=validation_result
      )
      
      # generating cell ID using getCellId
      fin_out[[3]]$summary <- getCellId(hvt.results=fin_out)
      fin_out[[3]]$summary <- fin_out[[3]]$summary %>% dplyr::select(Segment.Level, Segment.Parent, Segment.Child, n, Cell.ID, Quant.Error, colnames(dataset))
 
      
      class(fin_out) <- "hvt.object"
      return(fin_out)
    }
  }
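
## Hedged usage sketch (illustrative; follows the return structure documented above):
# hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1)
# hvt.results[[3]]$summary                        # per-cell summary with Cell.ID
# hvt.results[["model_info"]]$input_parameters    # parameters the model was trained with
# plotHVT(hvt.results, ...)                       # see ?plotHVT for plotting options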
