hdbscan: hdbscan


Description

Implementation of the HDBSCAN algorithm.

Usage

hdbscan(edges, neighbors = NULL, minPts = 20, K = 5, threads = NULL,
  verbose = getOption("verbose", TRUE))

Arguments

edges

An edge matrix of the type returned by buildEdgeMatrix or, alternatively, a largeVis object.

neighbors

An adjacency matrix of the type returned by randomProjectionTreeSearch. Must be specified unless edges is a largeVis object.

minPts

The minimum number of points in a cluster.

K

The number of points in the core neighborhood. (See details.)

threads

Maximum number of threads. Determined automatically if NULL (the default). This parameter rarely needs adjustment; it exists only to make it possible to abide by the CRAN limitation that no package use more than two cores.

verbose

Verbosity.

Details

The hyperparameter K controls the size of core neighborhoods. The algorithm measures the density around a point as 1 / the distance between that point and its Kth nearest neighbor. A low value of K makes the result resemble clustering by nearest neighbors rather than by density. A high value of K may cause the algorithm to miss some (usually contrived) clustering patterns, such as clusters made up of points arranged in lines to form shapes.

The function must be provided sufficient nearest-neighbor data for the specified K. If K = 5, for example, the edge matrix (and neighbor matrix, if specified) must contain data on at least 5 neighbors for each point. This should not be problematic in typical use with largeVis, which is ordinarily run with a far higher K than hdbscan.
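
For instance, the following sketch (not part of the package's shipped examples; the data and parameter values are hypothetical) illustrates the relationship between the number of neighbors computed by randomProjectionTreeSearch and the K passed to hdbscan:

library(largeVis)
# 1,000 simulated points in 2 dimensions, stored one point per column
dat <- matrix(rnorm(2000), nrow = 2)
# Request 20 neighbors per point; hdbscan's K must not exceed this
neighbors <- randomProjectionTreeSearch(dat, K = 20, threads = 1)
edges <- buildEdgeMatrix(dat, neighbors, threads = 1)
# K = 5 is safe here because 20 neighbors are available for every point;
# a K greater than 20 would leave hdbscan without enough neighbor data
clusters <- hdbscan(edges, neighbors = neighbors, K = 5, minPts = 20,
                    threads = 1, verbose = FALSE)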

Value

An object of type hdbscan with the following fields:

'clusters'

A vector of the cluster membership for each vertex. Outliers are given NA.

'probabilities'

A vector of the degree of each vertex's membership. This is calculated by standardizing each vertex's λ_p against the λ_birth and λ_death of its cluster.

'glosh'

A vector of GLOSH outlier scores for each node assigned to a cluster. NA for nodes not in a cluster.

'tree'

The minimum spanning tree used to generate the clustering.

'hierarchy'

A representation of the condensed cluster hierarchy.

'call'

The call.

The hierarchy describes the complete post-condensation structure of the tree:

'nodemembership'

The cluster ID of the vertex's immediate parent, after condensation.

'lambda'

λ_p for each vertex.

'parent'

The cluster ID of each cluster's parent.

'stability'

The cluster's stability, taking into account child-node stabilities.

'selected'

Whether the cluster was selected.

'coredistances'

The core distance determined for each vertex.

'lambda_birth'

λ_b for each cluster.

'lambda_death'

λ_d for each cluster.
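
As an illustration only (not taken from the package documentation), the fields above might be inspected as follows; the standardization shown for the probabilities is an assumption based on the description of that field, not necessarily the package's exact formula:

# Assuming `clusters` is an hdbscan object returned as in the Examples below
table(clusters$clusters, useNA = "ifany")   # cluster sizes; NA counts outliers
summary(clusters$glosh)                     # distribution of GLOSH outlier scores

# Hypothetical helper reproducing the standardization described for 'probabilities';
# the exact formula used internally may differ
membership_probability <- function(lambda_p, lambda_birth, lambda_death) {
  p <- (lambda_p - lambda_birth) / (lambda_death - lambda_birth)
  pmin(pmax(p, 0), 1)                       # clamp to [0, 1]
}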

Note

This is not precisely the HDBSCAN algorithm, because it relies on the nearest-neighbor data generated by the LargeVis algorithm. In particular, HDBSCAN assumes that all points can be connected in a single minimum spanning tree. This implementation uses a minimum spanning forest, because the graph may not be fully connected, depending on the amount of nearest-neighbor data provided. If, for example, the data contains distinct clusters in which no member of one cluster is a nearest neighbor of any point in another cluster (which can easily happen), the algorithm will generate a separate tree for each disconnected set of points. In testing, this improved the performance of the algorithm.

Do not rely on the content of the probabilities field for outliers. A future version will hopefully provide some measure of outlier factor.

References

R. Campello, D. Moulavi, and J. Sander, "Density-Based Clustering Based on Hierarchical Density Estimates." In: Advances in Knowledge Discovery and Data Mining, Springer, pp. 160-172, 2013.

See Also

https://github.com/lmcinnes/hdbscan

Examples

## Not run: 
library(largeVis)
library(clusteringdatasets)  # See https://github.com/elbamos/clusteringdatasets
data(spiral)
dat <- as.matrix(spiral[, 1:2])
neighbors <- randomProjectionTreeSearch(t(dat), K = 10, tree_threshold = 100,
                                        max_iter = 5, threads = 1)
edges <- buildEdgeMatrix(t(dat), neighbors)
clusters <- hdbscan(edges, neighbors = neighbors, verbose = FALSE, threads = 1)

# Calling largeVis while setting sgd_batches to 1 is
# the simplest way to generate the data structures needed for hdbscan
spiralVis <- largeVis(t(dat), K = 10, tree_threshold = 100, max_iter = 5,
                      sgd_batches = 1, threads = 1)
clusters <- hdbscan(spiralVis, verbose = FALSE, threads = 1)
# The gplot function helps to visualize the clustering
largeHighDimensionalDataset <- matrix(rnorm(50000), ncol = 50)
vis <- largeVis(largeHighDimensionalDataset)
clustering <- hdbscan(vis)
gplot(clustering, t(vis$coords))

## End(Not run)
