Implementation of the hdbscan algorithm.
edges: An edge matrix of the type returned by buildEdgeMatrix.
neighbors: An adjacency matrix of the type returned by randomProjectionTreeSearch.
minPts: The minimum number of points in a cluster.
K: The number of points in the core neighborhood. (See details.)
threads: Maximum number of threads. Determined automatically if NULL.
verbose: Verbosity.
The hyperparameter K controls the size of core neighborhoods. The algorithm measures the density around a point as 1 / the distance between that point and its Kth nearest neighbor. A low value of K is similar to clustering nearest neighbors rather than clustering based on density. A high value of K may cause the algorithm to miss some (usually contrived) clustering patterns, such as clusters made up of points arranged in lines to form shapes.
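As an illustration of this density estimate, the following standalone sketch (base R only, not part of largeVis) computes 1 / the Kth-nearest-neighbor distance for a tight blob and a scattered blob of points:

```r
# Sketch: the core-distance density estimate described above.
# For each point, density is 1 / distance to its Kth nearest neighbor.
core_density <- function(x, K) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  apply(d, 1, function(row) {
    kth <- sort(row[row > 0])[K]     # Kth-nearest-neighbor distance
    1 / kth
  })
}

set.seed(1)
dense  <- matrix(rnorm(40, sd = 0.1), ncol = 2)  # 20 tightly packed points
sparse <- matrix(rnorm(40, sd = 2),   ncol = 2)  # 20 widely scattered points
dens <- core_density(rbind(dense, sparse), K = 4)
# Points in the tight blob receive substantially higher density estimates
mean(dens[1:20]) > mean(dens[21:40])
```

This is only a didactic brute-force version; largeVis itself derives core distances from its approximate nearest-neighbor structures rather than a full distance matrix.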
The function must be provided sufficient nearest-neighbor data for whatever is specified for K. If K = 5, for example, the edge matrix (and neighbor matrix, if specified) must contain data on at least 5 neighbors for each point. This should not be problematic in typical use in connection with largeVis, which is ordinarily run with a far higher K-value than hdbscan.
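The check described above can be sketched as follows. This assumes the neighbor matrix stores one column per point and one row per neighbor index, as the randomProjectionTreeSearch call in the Examples produces:

```r
# Sketch: verifying the neighbor matrix carries enough neighbors for K.
# Assumption: one column per point, one row per neighbor index.
K_for_hdbscan <- 5
# Stand-in for real randomProjectionTreeSearch output: 10 neighbors
# for each of 100 points
neighbors <- matrix(sample.int(100, 10 * 100, replace = TRUE), nrow = 10)
stopifnot(nrow(neighbors) >= K_for_hdbscan)  # enough data for K = 5
```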
An object of type hdbscan with the following fields:

- A vector of the cluster membership for each vertex. Outliers are given NA.
- A vector of the degree of each vertex's membership. This is calculated by standardizing each vertex's lambda_p against the lambda_birth and lambda_death of its cluster.
- A vector of GLOSH outlier scores for each node assigned to a cluster; NA for nodes not in a cluster.
- The minimum spanning tree used to generate the clustering.
- A representation of the condensed cluster hierarchy.
- The call.
The hierarchy describes the complete post-condensation structure of the tree:

- The cluster ID of each vertex's immediate parent, after condensation.
- λ_p for each vertex.
- The cluster ID of each cluster's parent.
- Each cluster's stability, taking into account child-node stabilities.
- Whether each cluster was selected.
- The core distance determined for each vertex.
- λ_b for each cluster.
- λ_d for each cluster.
This is not precisely the HDBSCAN algorithm, because it relies on the nearest-neighbor data generated by the LargeVis algorithm. In particular, HDBSCAN assumes that all points can be connected in a single minimum spanning tree. This implementation uses a minimum spanning forest, because the graph may not be fully connected, depending on the amount of nearest-neighbor data provided. If, for example, the data has distinct clusters where no member of one cluster is a nearest neighbor of a point in any other cluster, which can easily happen, the algorithm will generate a distinct tree for each disconnected set of points. In testing, this improved the performance of the algorithm.
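The disconnection scenario above can be demonstrated with a standalone sketch (base R only, not largeVis code): two well-separated clusters whose nearest-neighbor relations never cross, so edges drawn only from nearest neighbors cannot join them into a single spanning tree.

```r
# Sketch: nearest-neighbor edges of two well-separated clusters never
# cross between clusters, yielding a spanning *forest* of two trees.
set.seed(2)
a <- matrix(rnorm(20), ncol = 2)         # 10 points near the origin
b <- matrix(rnorm(20) + 100, ncol = 2)   # 10 points near (100, 100)
d <- as.matrix(dist(rbind(a, b)))
diag(d) <- Inf                           # exclude self-distances
nn <- apply(d, 1, which.min)             # each point's nearest neighbor
# Every nearest neighbor stays within its own cluster, so any spanning
# structure built from these edges splits into two disconnected trees.
all(nn[1:10] <= 10) && all(nn[11:20] > 10)
```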
Do not rely on the content of the probabilities field for outliers. A future version will hopefully provide some measure of outlier factor.
R. Campello, D. Moulavi, and J. Sander, "Density-Based Clustering Based on Hierarchical Density Estimates." In: Advances in Knowledge Discovery and Data Mining, Springer, pp. 160-172, 2013.
https://github.com/lmcinnes/hdbscan
## Not run:
library(largeVis)
library(clusteringdatasets) # See https://github.com/elbamos/clusteringdatasets
data(spiral)
dat <- as.matrix(spiral[, 1:2])
neighbors <- randomProjectionTreeSearch(t(dat), K = 10, tree_threshold = 100,
max_iter = 5, threads = 1)
edges <- buildEdgeMatrix(t(dat), neighbors)
clusters <- hdbscan(edges, neighbors = neighbors, verbose = FALSE, threads = 1)
# Calling largeVis while setting sgd_batches to 1 is
# the simplest way to generate the data structures needed for hdbscan
spiralVis <- largeVis(t(dat), K = 10, tree_threshold = 100, max_iter = 5,
sgd_batches = 1, threads = 1)
clusters <- hdbscan(spiralVis, verbose = FALSE, threads = 1)
# The gplot function helps to visualize the clustering
largeHighDimensionalDataset <- matrix(rnorm(50000), ncol = 50)
vis <- largeVis(largeHighDimensionalDataset)
clustering <- hdbscan(vis)
gplot(clustering, t(vis$coords))
## End(Not run)