trainHVT | R Documentation |
This is the main function for constructing hierarchical Voronoi tessellations, using hierarchical vector quantization (hvq). The data is projected to 2D coordinates, and the first-level tessellations are plotted using these coordinates as centroids. For each subsequent level, the 2D coordinates are transformed so that all points fall within their parent tile, and the tessellations are plotted using these transformed points as centroids.
trainHVT(
dataset,
min_compression_perc = NA,
n_cells = NA,
depth = 1,
quant.err = 0.2,
normalize = FALSE,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
scale_summary = NA,
diagnose = FALSE,
hvt_validation = FALSE,
train_validation_split_ratio = 0.8,
dim_reduction_method = "sammon",
tsne_theta = 0.2,
tsne_eta = 200,
tsne_perplexity = 30,
tsne_verbose = TRUE,
tsne_max_iter = 500,
umap_n_neighbors = 60,
umap_n_components = 2,
umap_min_dist = 0.1
)
dataset |
Data frame. A data frame whose numeric columns (features) are used to train the model. |
min_compression_perc |
Numeric. An integer, indicating the minimum compression percentage to be achieved for the dataset. It indicates the desired level of reduction in dataset size compared to its original size. |
n_cells |
Numeric. An integer, indicating the number of cells per hierarchy (level). |
depth |
Numeric. An integer, indicating the number of levels. A depth of 1 means no hierarchy (single level), while higher values indicate multiple levels (hierarchy). |
quant.err |
Numeric. A number indicating the quantization error threshold. A cell is broken down into further cells only if its quantization error is above this threshold. |
normalize |
Logical. A logical value indicating if the dataset should be normalized. When set to TRUE, scales the values of all features to have a mean of 0 and a standard deviation of 1 (Z-score). |
distance_metric |
Character. The distance metric can be L1_Norm (Manhattan) or L2_Norm (Euclidean). L1_Norm is selected by default. The distance metric is used to calculate the distance between an n-dimensional point and a centroid. |
error_metric |
Character. The error metric can be mean or max. max is selected by default. For a cell with m points, each point has a distance to the cell's centroid; max returns the largest of these m distances and mean returns their average. |
quant_method |
Character. The quantization method can be kmeans or kmedoids. Kmeans uses means (centroids) as cluster centers while Kmedoids uses actual data points (medoids) as cluster centers. kmeans is selected by default. |
scale_summary |
List. A list with user-defined mean and standard deviation values for all the features in the dataset. Pass the scale summary when normalize is set to FALSE. |
diagnose |
Logical. A logical value indicating whether to perform diagnostics on the model. Default value is FALSE. |
hvt_validation |
Logical. A logical value indicating whether to hold out a validation set and compute the mean absolute deviation of the validation points from the centroid. Default value is FALSE. |
train_validation_split_ratio |
Numeric. A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8. |
dim_reduction_method |
Character. The dimensionality reduction method; one of "tsne", "umap", or "sammon". "sammon" is the default in the usage above. |
tsne_theta |
Numeric. Used only when dim_reduction_method is set to "tsne". Default value is 0.2 (per the usage above); common values are between 0.2 and 0.5. |
tsne_eta |
Numeric. Used only when dim_reduction_method is set to "tsne". Default value is 200. |
tsne_perplexity |
Numeric. Used only when dim_reduction_method is set to "tsne". Default value is 30; common values are between 30 and 50. |
tsne_verbose |
Logical. A logical value indicating whether the t-SNE algorithm should print detailed progress information to the console. Used only when dim_reduction_method is set to "tsne". |
tsne_max_iter |
Numeric. Used only when dim_reduction_method is set to "tsne". Default value is 500 (per the usage above). More iterations can improve results but increase computation time. |
umap_n_neighbors |
Integer. Used only when dim_reduction_method is set to "umap". Default value is 60 (per the usage above). Controls the balance between local and global structure in the data. |
umap_n_components |
Integer. Used only when dim_reduction_method is set to "umap". Default value is 2. Indicates the number of dimensions for the embedding. |
umap_min_dist |
Numeric. The umap_min_dist is used only when dim_reduction_method is set to "umap". Default value is 0.1. Controls how tightly UMAP packs points together. |
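The interaction between distance_metric and error_metric can be illustrated with a small base R sketch. The helper names `cell_distances` and `cell_quant_error` are hypothetical and not part of the package, and the distances here are the plain textbook L1 and L2 norms; the package may scale its distances differently, so treat this as an illustration of the definitions above rather than its exact computation.

```r
# Distance from every row of `points` to `centroid` (hypothetical helper).
cell_distances <- function(points, centroid,
                           distance_metric = c("L1_Norm", "L2_Norm")) {
  distance_metric <- match.arg(distance_metric)
  diffs <- sweep(as.matrix(points), 2, centroid, "-")
  if (distance_metric == "L1_Norm") {
    rowSums(abs(diffs))      # Manhattan distance
  } else {
    sqrt(rowSums(diffs^2))   # Euclidean distance
  }
}

# Collapse the m per-point distances of one cell into a single error value.
cell_quant_error <- function(points, centroid,
                             distance_metric = "L1_Norm",
                             error_metric = c("max", "mean")) {
  error_metric <- match.arg(error_metric)
  d <- cell_distances(points, centroid, distance_metric)
  if (error_metric == "max") max(d) else mean(d)
}

# Toy cell with three 2-dimensional points; its centroid is (3, 4).
pts <- rbind(c(0, 0), c(3, 4), c(6, 8))
ctr <- colMeans(pts)

cell_quant_error(pts, ctr, "L1_Norm", "max")   # 7
cell_quant_error(pts, ctr, "L2_Norm", "max")   # 5
cell_quant_error(pts, ctr, "L2_Norm", "mean")  # 10 / 3
```

Under the rule described for quant.err, a cell whose error exceeds the threshold would be a candidate for further subdivision at the next level.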
A nested list that contains the hierarchical tessellation information. This list must be passed as the input argument when plotting the tessellations.
[[1]] |
A list containing the information needed to plot the tessellations, including coordinates, boundaries, and other details necessary for visualization. |
[[2]] |
A list containing information related to Sammon’s projection coordinates of the data points in the reduced-dimensional space. |
[[3]] |
A list containing detailed information about the hierarchical vector quantized data, along with a summary section containing the number of points, the quantization error, and the centroid for each cell. |
[[4]] |
A list that contains all the diagnostics information of the model when diagnose is set to TRUE. Otherwise NA. |
[[5]] |
A list that contains all the information required to generate a Mean Absolute Deviation (MAD) plot when hvt_validation is set to TRUE. Otherwise NA. |
[[6]] |
A list containing detailed information about the hierarchical vector quantized data, along with a summary section containing the number of points, the quantization error, and the centroid for each cell; this is the output of 'hvq'. |
[[7]] |
model info: A list that contains the model-generated timestamp, the input parameters passed to the model, the validation results, and the dimensionality reduction evaluation metrics table. |
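Because the returned list is addressed by position, a small lookup can make the seven elements easier to keep track of. The sketch below is an assumption-laden convenience, not package functionality: `hvt_element_labels` and `describe_hvt_result` are hypothetical names, the labels merely paraphrase the descriptions above, and a mock list stands in for a real return value.

```r
# Labels paraphrasing the element descriptions above (not package names).
hvt_element_labels <- c(
  "tessellation plotting information",
  "projection coordinates",
  "quantized data with cell summary",
  "diagnostics (NA unless diagnose = TRUE)",
  "MAD plot data (NA unless hvt_validation = TRUE)",
  "output of 'hvq'",
  "model info"
)

# Hypothetical helper: tabulate position, label, and top-level class
# of each element in a seven-element result list.
describe_hvt_result <- function(res) {
  stopifnot(is.list(res), length(res) == length(hvt_element_labels))
  data.frame(
    index = seq_along(res),
    label = hvt_element_labels,
    class = vapply(res, function(x) class(x)[1], character(1)),
    stringsAsFactors = FALSE
  )
}

# Stand-in with the documented shape; in practice, pass the value
# returned by trainHVT() instead of this mock.
mock_result <- list(list(), list(), list(), NA, NA, list(), list())
describe_hvt_result(mock_result)
```

With diagnose and hvt_validation left at FALSE, the fourth and fifth rows would report a non-list class, reflecting the NA placeholders described above.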
Shubhra Prakash <shubhra.prakash@mu-sigma.com>, Sangeet Moy Das <sangeet.das@mu-sigma.com>, Shantanu Vaidya <shantanu.vaidya@mu-sigma.com>, Bidesh Ghosh <bidesh.gosh@mu-sigma.com>, Alimpan Dey <alimpan.dey@mu-sigma.com>
plotHVT
data("EuStockMarkets")
hvt.results <- trainHVT(EuStockMarkets, n_cells = 60, depth = 1, quant.err = 0.1,
                        distance_metric = "L1_Norm", error_metric = "max",
                        normalize = TRUE, quant_method = "kmeans")
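For intuition about what a single-level call with these settings involves, the following base R sketch walks through comparable steps on the same dataset: Z-score normalization, k-means with 60 centers, and a per-cell maximum L1 distance compared against a 0.1 threshold. It uses `stats::kmeans` as a stand-in and plain L1 distances, both of which are assumptions; the package's internal quantization and error scaling may differ, so the counts need not match the output of trainHVT.

```r
data("EuStockMarkets")
set.seed(42)  # k-means starts from random centers

# Z-score each feature, as described for normalize = TRUE.
x <- scale(as.data.frame(EuStockMarkets))

# Stand-in for quant_method = "kmeans" with n_cells = 60.
km <- kmeans(x, centers = 60, iter.max = 100)

# L1 distance of every point to the centroid of its assigned cell.
l1 <- rowSums(abs(x - km$centers[km$cluster, ]))

# error_metric = "max": the largest distance within each cell.
cell_err <- tapply(l1, km$cluster, max)

# Cells above the threshold would be split further if depth > 1.
n_above <- sum(cell_err > 0.1)
n_above
```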