clusterTree: Density clustering: the cluster tree

View source: R/clusterTree.R

clusterTree    R Documentation

Density clustering: the cluster tree

Description

Given a point cloud, or a matrix of distances, the function clusterTree computes a density estimator and returns the corresponding cluster tree of superlevel sets (lambda tree and kappa tree; see references).

Usage

clusterTree(
    X, k, h = NULL, density = "knn", dist = "euclidean", d = NULL,
    Nlambda = 100, printProgress = FALSE)

Arguments

X

If dist="euclidean", then X is an n by d matrix of coordinates, where n is the number of points stored in X and d is the dimension of the space. If dist="arbitrary", then X is an n by n matrix of distances.

k

an integer value specifying the parameter of the underlying k-nearest neighbor similarity graph, used to determine connected components. If density="knn", then k is also used to compute the k-nearest neighbor density estimator.

h

real value: if density = "kde", then h is used to compute the kernel density estimator with bandwidth h. The default value is NULL.

density

string: if "knn" then the k-nearest neighbor density estimator is used to compute the cluster tree; if "kde" then the kernel density estimator is used to compute the cluster tree. The default value is "knn".

dist

string: can be "euclidean", when X is a point cloud, or "arbitrary", when X is a matrix of distances. The default value is "euclidean".

d

integer: if dist="arbitrary", then d is the dimension of the underlying space (see the sketch after this argument list). The default value is NULL.

Nlambda

integer: size of the grid of values of the density estimator, used to compute the cluster tree. A higher Nlambda (i.e. a finer grid) yields a more accurate cluster tree. The default value is 100.

printProgress

logical: if TRUE, a progress bar is printed. The default value is FALSE.
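When only pairwise distances are available, X can be passed as a distance matrix together with dist = "arbitrary" and the dimension d. A minimal sketch (the toy point cloud, k = 10, and d = 2 are purely illustrative; the TDA package is assumed to be attached, as in the Examples below):

set.seed(1)
Y <- cbind(rnorm(60), rnorm(60))     # 60 points in R^2, used only to build distances
D <- as.matrix(dist(Y))              # 60 by 60 matrix of pairwise distances
TreeD <- clusterTree(D, k = 10, dist = "arbitrary", d = 2)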

Details

The function clusterTree is an implementation of Algorithm 1 in the first reference.
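For intuition only, the superlevel-set construction behind the algorithm can be sketched in plain R: estimate the density at each point, then, for each level lambda on a grid, keep the points whose estimated density is at least lambda and record the connected components of the k-nearest-neighbor graph restricted to those points. The sketch below is a rough illustration of this idea, not the package's implementation; the toy data, k = 8, and the unnormalized kNN density estimate are assumptions made for the example.

set.seed(2)
Y <- cbind(c(rnorm(40, 0), rnorm(40, 5)), c(rnorm(40, 0), rnorm(40, 5)))
n <- nrow(Y)
k <- 8
D <- as.matrix(dist(Y))
rk <- apply(D, 1, function(row) sort(row)[k + 1])   # distance to the k-th neighbor (excluding self)
dens <- 1 / rk                                      # kNN density estimate, up to constants
nbrs <- t(apply(D, 1, function(row) order(row)[2:(k + 1)]))   # k nearest neighbors of each point
A <- matrix(FALSE, n, n)
for (i in 1:n) A[i, nbrs[i, ]] <- TRUE
A <- A | t(A)                                       # symmetric kNN adjacency

componentsAtLevel <- function(lambda) {
  keep <- which(dens >= lambda)                     # superlevel set at level lambda
  if (length(keep) == 0) return(0)
  sub <- A[keep, keep, drop = FALSE]
  seen <- rep(FALSE, length(keep))
  ncomp <- 0
  for (s in seq_along(keep)) {                      # breadth-first search over components
    if (seen[s]) next
    ncomp <- ncomp + 1
    queue <- s
    while (length(queue) > 0) {
      v <- queue[1]; queue <- queue[-1]
      if (seen[v]) next
      seen[v] <- TRUE
      queue <- c(queue, which(sub[v, ] & !seen))
    }
  }
  ncomp
}

lambdaGrid <- seq(min(dens), max(dens), length.out = 20)
sapply(lambdaGrid, componentsAtLevel)               # number of branches alive at each level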

Value

The function clusterTree returns an object of class clusterTree, a list with the following components

density

Vector of length n: the values of the density estimator evaluated at each of the points stored in X

DataPoints

A list whose elements are the points of X corresponding to each branch, in the same order of id

n

The number of points stored in the input matrix X

id

Vector: the IDs associated to the branches of the cluster tree

children

A list whose elements are the IDs of the children of each branch, in the same order of id

parent

Vector: the IDs of the parents of each branch, in the same order of id

silo

A list whose elements are the horizontal coordinates of the silo of each branch, in the same order of id

Xbase

Vector: the horizontal coordinates of the branches of the cluster tree, in the same order of id

lambdaBottom

Vector: the vertical bottom coordinates of the branches of the lambda tree, in the same order of id

lambdaTop

Vector: the vertical top coordinates of the branches of the lambda tree, in the same order of id

rBottom

(only if density="knn") Vector: the vertical bottom coordinates of the branches of the r tree, in the same order of id

rTop

(only if density="knn") Vector: the vertical top coordinates of the branches of the r tree, in the same order of id

alphaBottom

Vector: the vertical bottom coordinates of the branches of the alpha tree, in the same order of id

alphaTop

Vector: the vertical top coordinates of the branches of the alpha tree, in the same order of id

Kbottom

Vector: the vertical bottom coordinates of the branches of the kappa tree, in the same order of id

Ktop

Vector: the vertical top coordinates of the branches of the kappa tree, in the same order of id
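For example, the branch hierarchy and the coordinates of the lambda tree can be read directly off the returned list (using the Tree object constructed in the Examples below; the exact values depend on the simulated data):

Tree[["id"]]                                        # IDs of the branches
Tree[["parent"]]                                    # parent of each branch
Tree[["children"]]                                  # children of each branch
cbind(Tree[["lambdaBottom"]], Tree[["lambdaTop"]])  # vertical extent of each branch in the lambda tree
sapply(Tree[["DataPoints"]], length)                # number of points in each branch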

Author(s)

Fabrizio Lecci

References

Kent BP, Rinaldo A, Verstynen T (2013). "DeBaCl: A Python Package for Interactive DEnsity-BAsed CLustering." arXiv:1307.8136

Lecci F, Rinaldo A, Wasserman L (2014). "Metric Embeddings for Cluster Trees."

See Also

plot.clusterTree

Examples

## Generate data: 3 clusters
n <- 1200    # used to set the cluster sizes
Neach <- floor(n / 4)    # points per cluster (3 clusters of Neach points each)
X1 <- cbind(rnorm(Neach, 1, .8), rnorm(Neach, 5, 0.8))
X2 <- cbind(rnorm(Neach, 3.5, .8), rnorm(Neach, 5, 0.8))
X3 <- cbind(rnorm(Neach, 6, 1), rnorm(Neach, 1, 1))
X <- rbind(X1, X2, X3)

k <- 100     #parameter of knn

## Density clustering using knn and kde
Tree <- clusterTree(X, k, density = "knn")
TreeKDE <- clusterTree(X, k, h = 0.3, density = "kde")

par(mfrow = c(2, 3))
plot(X, pch = 19, cex = 0.6)
# plot lambda trees
plot(Tree, type = "lambda", main = "lambda Tree (knn)")
plot(TreeKDE, type = "lambda", main = "lambda Tree (kde)")
# plot clusters
plot(X, pch = 19, cex = 0.6, main = "cluster labels")
for (i in Tree[["id"]]){
  points(matrix(X[Tree[["DataPoints"]][[i]], ], ncol = 2), col = i, pch = 19,
         cex = 0.6)
}
# plot kappa trees
plot(Tree, type = "kappa", main = "kappa Tree (knn)")
plot(TreeKDE, type = "kappa", main = "kappa Tree (kde)")
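## The resolution of the tree and a progress bar are controlled by Nlambda and
## printProgress; the values below are purely illustrative.
TreeCoarse <- clusterTree(X, k, density = "knn", Nlambda = 50, printProgress = TRUE)
plot(TreeCoarse, type = "lambda", main = "lambda Tree (knn, Nlambda = 50)")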
