Implementation of the hdbscan algorithm.
edges: An edge matrix of the type returned by buildEdgeMatrix.
neighbors: An adjacency matrix of the type returned by randomProjectionTreeSearch.
minPts: The minimum number of points in a cluster.
K: The number of points in the core neighborhood. (See details.)
threads: Maximum number of threads. Determined automatically if NULL.
verbose: Verbosity.
The hyperparameter K controls the size of core neighborhoods. The algorithm measures the density around a point as 1 / the distance between that point and its Kth nearest neighbor. A low value of K is similar to clustering nearest neighbors rather than clustering based on density. A high value of K may cause the algorithm to miss some (usually contrived) clustering patterns, such as clusters made up of points arranged in lines to form shapes.
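As an illustration of this density estimate, the following standalone sketch (base R only, not part of largeVis) computes 1 / the Kth-nearest-neighbor distance for a tight blob and a scattered blob of points:

```r
# Sketch: the core-distance density estimate described above.
# For each point, density is 1 / distance to its Kth nearest neighbor.
core_density <- function(x, K) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  apply(d, 1, function(row) {
    kth <- sort(row[row > 0])[K]     # Kth-nearest-neighbor distance
    1 / kth
  })
}

set.seed(1)
dense  <- matrix(rnorm(40, sd = 0.1), ncol = 2)  # 20 tightly packed points
sparse <- matrix(rnorm(40, sd = 2),   ncol = 2)  # 20 widely scattered points
dens <- core_density(rbind(dense, sparse), K = 4)
# Points in the tight blob receive substantially higher density estimates
mean(dens[1:20]) > mean(dens[21:40])
```

This is only a didactic brute-force version; largeVis itself derives core distances from its approximate nearest-neighbor structures rather than a full distance matrix.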
The function must be provided sufficient nearest-neighbor data for whatever is specified for K. If K = 5, for example, the edge matrix (and neighbor matrix, if specified) must contain data on at least 5 neighbors for each point. This should not be problematic in typical use in connection with largeVis, which is ordinarily run with a far higher K-value than hdbscan.
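The check described above can be sketched as follows. This assumes the neighbor matrix stores one column per point and one row per neighbor index, as the randomProjectionTreeSearch call in the Examples produces:

```r
# Sketch: verifying the neighbor matrix carries enough neighbors for K.
# Assumption: one column per point, one row per neighbor index.
K_for_hdbscan <- 5
# Stand-in for real randomProjectionTreeSearch output: 10 neighbors
# for each of 100 points
neighbors <- matrix(sample.int(100, 10 * 100, replace = TRUE), nrow = 10)
stopifnot(nrow(neighbors) >= K_for_hdbscan)  # enough data for K = 5
```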
An object of type hdbscan with the following fields:

- A vector of the cluster membership for each vertex. Outliers are given NA.
- A vector of the degree of each vertex's membership. This is calculated by standardizing each vertex's lambda_p against the lambda_birth and lambda_death of its cluster.
- A vector of GLOSH outlier scores for each node assigned to a cluster; NA for nodes not in a cluster.
- The minimum spanning tree used to generate the clustering.
- A representation of the condensed cluster hierarchy.
- The call.
The hierarchy describes the complete post-condensation structure of the tree:

- The cluster ID of each vertex's immediate parent, after condensation.
- λ_p for each vertex.
- The cluster ID of each cluster's parent.
- Each cluster's stability, taking into account child-node stabilities.
- Whether each cluster was selected.
- The core distance determined for each vertex.
- λ_b for each cluster.
- λ_d for each cluster.
This is not precisely the HDBSCAN algorithm, because it relies on the nearest-neighbor data generated by the LargeVis algorithm. In particular, HDBSCAN assumes that all points can be connected in a single minimum spanning tree. This implementation uses a minimum spanning forest, because the graph may not be fully connected, depending on the amount of nearest-neighbor data provided. If, for example, the data has distinct clusters where no member of one cluster is a nearest neighbor of a point in any other cluster, which can easily happen, the algorithm will generate a distinct tree for each disconnected set of points. In testing, this improved the performance of the algorithm.
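The disconnection scenario above can be demonstrated with a standalone sketch (base R only, not largeVis code): two well-separated clusters whose nearest-neighbor relations never cross, so edges drawn only from nearest neighbors cannot join them into a single spanning tree.

```r
# Sketch: nearest-neighbor edges of two well-separated clusters never
# cross between clusters, yielding a spanning *forest* of two trees.
set.seed(2)
a <- matrix(rnorm(20), ncol = 2)         # 10 points near the origin
b <- matrix(rnorm(20) + 100, ncol = 2)   # 10 points near (100, 100)
d <- as.matrix(dist(rbind(a, b)))
diag(d) <- Inf                           # exclude self-distances
nn <- apply(d, 1, which.min)             # each point's nearest neighbor
# Every nearest neighbor stays within its own cluster, so any spanning
# structure built from these edges splits into two disconnected trees.
all(nn[1:10] <= 10) && all(nn[11:20] > 10)
```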
Do not rely on the content of the probabilities field for outliers. A future version will hopefully provide some measure of outlier factor.
R. Campello, D. Moulavi, and J. Sander, "Density-Based Clustering Based on Hierarchical Density Estimates." In: Advances in Knowledge Discovery and Data Mining, Springer, pp. 160-172, 2013.
https://github.com/lmcinnes/hdbscan
## Not run:
library(largeVis)
library(clusteringdatasets) # See https://github.com/elbamos/clusteringdatasets
data(spiral)
dat <- as.matrix(spiral[, 1:2])
neighbors <- randomProjectionTreeSearch(t(dat), K = 10, tree_threshold = 100,
max_iter = 5, threads = 1)
edges <- buildEdgeMatrix(t(dat), neighbors)
clusters <- hdbscan(edges, neighbors = neighbors, verbose = FALSE, threads = 1)
# Calling largeVis while setting sgd_batches to 1 is
# the simplest way to generate the data structures needed for hdbscan
spiralVis <- largeVis(t(dat), K = 10, tree_threshold = 100, max_iter = 5,
sgd_batches = 1, threads = 1)
clusters <- hdbscan(spiralVis, verbose = FALSE, threads = 1)
# The gplot function helps to visualize the clustering
largeHighDimensionalDataset <- matrix(rnorm(50000), ncol = 50)
vis <- largeVis(largeHighDimensionalDataset)
clustering <- hdbscan(vis)
gplot(clustering, t(vis$coords))
## End(Not run)