dbcv: Density-Based Clustering Validation Index (DBCV)
In dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms


dbcv	R Documentation

Density-Based Clustering Validation Index (DBCV)

Description

Calculate the Density-Based Clustering Validation Index (DBCV) for a clustering.

Usage

dbcv(x, cl, d, metric = "euclidean", sample = NULL)

Arguments

`x`	a data matrix or a dist object.
`cl`	a clustering (e.g., an integer vector of cluster labels or a clustering object such as the result of `dbscan()`).
`d`	dimensionality of the original data if a dist object is provided.
`metric`	distance metric used. The available metrics are the methods implemented by `dist()` plus `"sqeuclidean"` for the squared Euclidean distance used in the original DBCV implementation.
`sample`	sample size used for large datasets.

Details

DBCV (Moulavi et al., 2014) computes a score based on the density sparseness of each cluster and the density separation of each pair of clusters.

The density sparseness of a cluster (DSC) is defined as the maximum edge weight of a minimum spanning tree for the internal points of the cluster using the mutual reachability distance based on the all-points-core-distance. Internal points are connected to more than one other point in the cluster. Since clusters of size less than 3 cannot have internal points, they are ignored (considered noise) in this implementation.
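The DSC computation can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used by `dbcv()`: the core-distance formula follows the definition in Moulavi et al. (2014), the helper names (`apts_core_dist`, `mutual_reachability`, `mst_edges`, `density_sparseness`) are hypothetical, the input is assumed to hold the points of a single cluster with no duplicates, and returning `NA` when the tree has no internal edge is a choice made only for this sketch.

```r
# All-points core distance of every point within one cluster
# (definition from Moulavi et al., 2014; d is the data dimensionality).
apts_core_dist <- function(pts) {
  pts <- as.matrix(pts)
  d <- ncol(pts)
  n <- nrow(pts)
  D <- as.matrix(dist(pts))
  diag(D) <- NA                       # skip the distance of a point to itself
  (rowSums((1 / D)^d, na.rm = TRUE) / (n - 1))^(-1 / d)
}

# Mutual reachability distance: max(core(a), core(b), dist(a, b)).
mutual_reachability <- function(pts) {
  D <- as.matrix(dist(pts))
  core <- apts_core_dist(pts)
  pmax(D, outer(core, core, pmax))
}

# Minimum spanning tree of a symmetric weight matrix (Prim's algorithm).
# Returns one row per tree edge: from, to, weight.
mst_edges <- function(W) {
  n <- nrow(W)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  best <- W[1, ]                      # cheapest known link into the tree
  parent <- rep(1L, n)
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "weight")))
  for (k in seq_len(n - 1)) {
    best[in_tree] <- Inf              # never re-select a tree node
    j <- which.min(best)
    edges[k, ] <- c(parent[j], j, best[j])
    in_tree[j] <- TRUE
    closer <- !in_tree & W[j, ] < best
    best[closer] <- W[j, closer]
    parent[closer] <- j
  }
  edges
}

# DSC: largest weight among tree edges that join two internal nodes
# (nodes with more than one neighbor in the tree).
density_sparseness <- function(W) {
  e <- mst_edges(W)
  degree <- tabulate(c(e[, "from"], e[, "to"]), nbins = nrow(W))
  internal <- which(degree > 1)
  keep <- e[, "from"] %in% internal & e[, "to"] %in% internal
  if (!any(keep)) return(NA_real_)    # assumption of this sketch only
  max(e[keep, "weight"])
}
```

For the points of one cluster stored in a numeric matrix `pts`, `density_sparseness(mutual_reachability(pts))` then gives the sketch's DSC value.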

The density separation of a pair of clusters (DSPC) is defined as the minimum reachability distance between the internal nodes of the spanning trees of the two clusters.

The validity index for a cluster is calculated using these measures and aggregated to a validity index for the whole clustering using a weighted average.

The index is in the range [-1, 1]. If the cluster density compactness is better than the density separation, a positive value is returned. The actual value depends on the separability of the data. In general, greater values indicate a better density-based clustering solution.

Noise points are included in the calculation only through the weighted average, so a clustering with more noise points will get a lower index.
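The aggregation described above can be illustrated with a short base R sketch. It follows the formulas in Moulavi et al. (2014): the validity of a cluster is its smallest separation from any other cluster minus its sparseness, divided by the larger of the two, and the overall index weights each cluster by its share of all points, noise included. The function name `dbcv_from_parts` and its arguments are hypothetical, and the code is not taken from the `dbcv()` implementation.

```r
# dsc:     vector of density sparseness values, one per cluster
# dspc:    symmetric matrix of density separation between cluster pairs
# sizes:   number of points in each cluster
# n_total: number of all points, including noise points
dbcv_from_parts <- function(dsc, dspc, sizes, n_total) {
  diag(dspc) <- Inf                   # a cluster is not compared to itself
  min_sep <- apply(dspc, 1, min)      # separation from the closest cluster
  v_c <- (min_sep - dsc) / pmax(min_sep, dsc)
  list(score = sum(sizes / n_total * v_c), c_c = v_c)
}

dsc <- c(1, 2)
dspc <- matrix(c(0, 4, 4, 0), nrow = 2)

# Same clusters, without noise and with 10 noise points
dbcv_from_parts(dsc, dspc, sizes = c(40, 50), n_total = 90)$score
dbcv_from_parts(dsc, dspc, sizes = c(40, 50), n_total = 100)$score
```

In this toy input the cluster validities are 0.75 and 0.5 in both calls. Only the weights change, so the second call, which counts 10 noise points in the total, yields the smaller score.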

Performance note: This implementation calculates a distance matrix and thus can only be used for small or sampled datasets.
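A brief sketch of how the `sample` and `d` arguments might be used to work around this limitation. The data are synthetic and the sizes are arbitrary choices for illustration; the element name `score` is taken from the examples on this page.

```r
library(dbscan)

set.seed(42)
# Two well-separated Gaussian blobs (synthetic data, for illustration only)
x <- rbind(
  matrix(rnorm(400, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(400, mean = 3, sd = 0.3), ncol = 2)
)
cl <- rep(1:2, each = 200)

# Index computed on all 400 points
full <- dbcv(x, cl)

# Index estimated from a random subset of 100 points
sub <- dbcv(x, cl, sample = 100)

# A precomputed dist object additionally needs the dimensionality
# of the original data in d
from_dist <- dbcv(dist(x), cl, d = ncol(x))

c(full = full$score, sampled = sub$score, dist = from_dist$score)
```

Because the subset is drawn at random, the sampled score can vary between runs unless a seed is set.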

Value

A list with the DBCV score for the clustering (score), the density sparseness values of the clusters (dsc), the density separation distances between pairs of clusters (dspc), and the validity indices of the clusters (c_c).

Author(s)

Matt Piekenbrock and Michael Hahsler

References

Davoud Moulavi and Pablo A. Jaskowiak and Ricardo J. G. B. Campello and Arthur Zimek and Jörg Sander (2014). Density-Based Clustering Validation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 839-847. doi:10.1137/1.9781611973440.96

Pablo A. Jaskowiak (2022). MATLAB implementation of DBCV. https://github.com/pajaskowiak/dbcv

Examples

# Load the package and a test dataset
library(dbscan)
data(Dataset_1)
x <- Dataset_1[, c("x", "y")]
class <- Dataset_1$class

clplot(x, class)

# We use MinPts 3 and use the knee at eps = .1 for dbscan
kNNdistplot(x, minPts = 3)

cl <- dbscan(x, eps = .1, minPts = 3)
clplot(x, cl)

dbcv(x, cl)

# compare to the DBCV index on the original class labels and
# with a random partitioning
dbcv(x, class)
dbcv(x, sample(1:4, replace = TRUE, size = nrow(x)))

# find the best eps using dbcv
eps_grid <- seq(.05, .2, by = .01)
cls <- lapply(eps_grid, FUN = function(e) dbscan(x, eps = e, minPts = 3))
dbcvs <- sapply(cls, FUN = function(cl) dbcv(x, cl)$score)

plot(eps_grid, dbcvs, type = "l")

eps_opt <- eps_grid[which.max(dbcvs)]
eps_opt

cl <- dbscan(x, eps = eps_opt, minPts = 3)
clplot(x, cl)

dbscan documentation built on April 3, 2025, 7:04 p.m.