DbscanParam-class: Density-based clustering with DBSCAN

DbscanParam-class R Documentation

Density-based clustering with DBSCAN

Description

Perform density-based clustering with a fast re-implementation of the DBSCAN algorithm.

Usage

DbscanParam(
  eps = NULL,
  min.pts = 5,
  core.prop = 0.5,
  chunk.size = 1000,
  BNPARAM = KmknnParam(),
  num.threads = 1,
  BPPARAM = NULL
)

## S4 method for signature 'ANY,DbscanParam'
clusterRows(x, BLUSPARAM, full = FALSE)

Arguments

eps

Numeric scalar specifying the distance to use to define neighborhoods. If NULL, this is determined from min.pts and core.prop.

min.pts

Integer scalar specifying the minimum number of neighboring observations required for an observation to be a core point.

core.prop

Numeric scalar specifying the proportion of observations to treat as core points. This is only used when eps=NULL, see Details.

chunk.size

Integer scalar specifying the number of points to process per chunk.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use for the neighbor searches. This should be able to support both nearest-neighbor and range queries.

num.threads

Integer scalar specifying the number of threads to use.

BPPARAM

Deprecated and ignored, use num.threads instead.

x

A numeric matrix-like object where rows represent observations and columns represent variables.

BLUSPARAM

A BlusterParam object specifying the algorithm to use.

full

Logical scalar indicating whether additional statistics should be returned.

Details

DBSCAN operates by identifying core points, i.e., observations with at least min.pts neighbors within a distance of eps. It then determines which core points are neighbors of each other, processing chunk.size points at a time, to form components of connected core points. Each non-core point is then connected to its closest core point within eps, if one exists. All points that are connected in this manner are considered to be part of the same cluster. Any unconnected non-core points are treated as noise and reported as NA.
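The steps above can be sketched in base R. This is an illustration only, not the package's implementation: the function name toy_dbscan is hypothetical, it computes all pairwise distances at once instead of working in chunks, and it assumes that a point's neighbor count excludes the point itself.

```r
# Toy DBSCAN following the description above (base R only).
# Assumption: a point is not counted as its own neighbor.
toy_dbscan <- function(x, eps, min.pts = 5) {
    d <- as.matrix(dist(x))   # all pairwise Euclidean distances
    diag(d) <- Inf            # exclude self-matches
    n <- nrow(d)
    is.core <- rowSums(d <= eps) >= min.pts

    # Grow components of core points that lie within eps of each other.
    cluster <- rep(NA_integer_, n)
    current <- 0L
    for (i in which(is.core)) {
        if (!is.na(cluster[i])) next
        current <- current + 1L
        cluster[i] <- current
        queue <- i
        while (length(queue) > 0L) {
            j <- queue[1]
            queue <- queue[-1]
            nb <- which(is.core & d[j, ] <= eps & is.na(cluster))
            cluster[nb] <- current
            queue <- c(queue, nb)
        }
    }

    # Attach each non-core point to its closest core point within eps.
    core.idx <- which(is.core)
    for (i in which(!is.core)) {
        if (length(core.idx) == 0L) break
        closest <- core.idx[which.min(d[i, core.idx])]
        if (d[i, closest] <= eps) cluster[i] <- cluster[closest]
    }

    factor(cluster)  # unconnected non-core points remain NA (noise)
}

# Two tight groups of five points plus one isolated point.
x <- rbind(
    cbind(c(0, 0.1, 0, 0.1, 0.05), c(0, 0, 0.1, 0.1, 0.05)),
    cbind(c(5, 5.1, 5, 5.1, 5.05), c(5, 5, 5.1, 5.1, 5.05)),
    c(20, 20)
)
res <- toy_dbscan(x, eps = 0.5, min.pts = 3)
```

In this toy input, every point in each tight group has four neighbors within eps, so both groups consist entirely of core points, while the isolated point has no core point within eps and is left as NA.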

As a suitable value of eps may not be known beforehand, we can automatically determine it from the data. For all observations, we compute the distance to the kth neighbor where k is defined as round(min.pts * core.prop). We then define eps as the core.prop quantile of the distances across all observations. The default of core.prop=0.5 means that around half of the observations will be treated as core points.
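A minimal base R sketch of this procedure is shown below. The function name choose_eps is hypothetical, the exhaustive distance matrix stands in for whatever neighbor search BNPARAM would select, and the guard that keeps k at 1 or more is an addition for this sketch and is not taken from the package.

```r
# Sketch of the automatic eps choice described above (base R only).
choose_eps <- function(x, min.pts = 5, core.prop = 0.5) {
    # Guard added for this sketch so that k is never zero.
    k <- max(1, round(min.pts * core.prop))
    d <- as.matrix(dist(x))
    diag(d) <- Inf  # a point is not its own neighbor
    # Distance from each observation to its k-th nearest neighbor.
    kth <- apply(d, 1, function(row) sort(row)[k])
    # eps is the core.prop quantile of those distances.
    unname(quantile(kth, core.prop))
}

eps <- choose_eps(iris[, 1:4])
```

Note that round() in R rounds halves to the nearest even number, so the defaults of min.pts=5 and core.prop=0.5 give k=2 in this sketch.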

Larger values of eps will generally result in fewer observations classified as noise, as they are more likely to connect to a core point. It may also promote agglomeration of existing clusters into larger entities if they are connected by regions of (relatively) low density. Conversely, larger values of min.pts will generally increase the number of noise points and may fragment larger clusters into subclusters.
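The effect of both parameters on the number of noise points can be checked with a small base R sketch. The helper count_noise is hypothetical and only counts noise points under the definition given above (non-core points with no core point within eps); it does not form clusters, and it assumes that a point is not counted as its own neighbor.

```r
# Count noise points for a given eps and min.pts (base R only).
count_noise <- function(x, eps, min.pts) {
    d <- as.matrix(dist(x))
    diag(d) <- Inf  # assumption: self is not a neighbor
    is.core <- rowSums(d <= eps) >= min.pts
    if (!any(is.core)) return(nrow(d))
    # Does each point have at least one core point within eps?
    near.core <- rowSums(d[, is.core, drop = FALSE] <= eps) > 0
    sum(!is.core & !near.core)
}

x <- iris[, 1:4]

# Increasing eps with min.pts fixed.
noise.by.eps <- sapply(c(0.2, 0.4, 0.8),
    function(e) count_noise(x, eps = e, min.pts = 5))

# Increasing min.pts with eps fixed.
noise.by.minpts <- sapply(c(3, 5, 10),
    function(m) count_noise(x, eps = 0.4, min.pts = m))
```

Under this definition, noise.by.eps cannot increase along its length and noise.by.minpts cannot decrease, which matches the trends described above.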

To inspect or modify an existing DbscanParam object x, users can call x[[i]] to get a parameter or x[[i]] <- value to set it, where i is the name of any argument used in the constructor.
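As a brief illustration of this accessor syntax, the snippet below reads and then replaces core.prop. It assumes that the bluster package is installed; the variable name param is arbitrary.

```r
library(bluster)

param <- DbscanParam()

# Read a parameter by the name of its constructor argument.
param[["core.prop"]]

# Replace it in the existing object.
param[["core.prop"]] <- 0.8
```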

Value

The DbscanParam constructor will return a DbscanParam object with the specified parameters.

The clusterRows method will return a factor of length equal to nrow(x) containing the cluster assignments. Note that this may contain NA values corresponding to noise points. If full=TRUE, a list is returned with clusters (the factor, as above) and objects (a list containing the eps and min.pts used in the analysis).
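A short usage sketch of the full=TRUE output, assuming that the bluster package is installed. The element names clusters, objects, eps and min.pts follow the description above; the variable name out is arbitrary.

```r
library(bluster)

out <- clusterRows(iris[, 1:4], DbscanParam(), full = TRUE)

# Cluster sizes, with noise points counted under NA.
table(out$clusters, useNA = "ifany")

# Parameters used in the analysis; eps was chosen automatically here.
out$objects$eps
out$objects$min.pts
```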

Author(s)

Aaron Lun

References

Ester M et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 226-231.

Examples

library(bluster)
clusterRows(iris[,1:4], DbscanParam())
clusterRows(iris[,1:4], DbscanParam(core.prop=0.8))


LTLA/bluster documentation built on Sept. 8, 2024, 4:37 a.m.