In mhahsler/dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

pkg <- 'dbscan'

source("https://raw.githubusercontent.com/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg, anaconda = "r-dbscan", stackoverflow = "dbscan%2br")

Introduction

This R package [@hahsler2019dbscan] provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

DBSCAN: Density-based spatial clustering of applications with noise [@ester1996density].
Jarvis-Patrick Clustering: Clustering using a similarity measure based on shared near neighbors [@jarvis1973].
SNN Clustering: Shared nearest neighbor clustering [@erdoz2003].
HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction [@campello2015hierarchical].
FOSC: Framework for optimal selection of clusters for unsupervised and semisupervised clustering of hierarchical cluster tree [@campello2013density].
OPTICS/OPTICSXi: Ordering points to identify the clustering structure and cluster extraction methods [@ankerst1999optics].

Outlier Detection

LOF: Local outlier factor algorithm [@breunig2000lof].
GLOSH: Global-Local Outlier Score from Hierarchies algorithm [@campello2015hierarchical].

Cluster Evaluation

DBCV: Density-based clustering validation [@moulavi2014].

Fast Nearest-Neighbor Search (using kd-trees)

kNN search
Fixed-radius NN search

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are for Euclidean distance typically faster than the native R implementations (e.g., dbscan in package fpc), or the implementations in WEKA, ELKI and Python's scikit-learn.

pkg_usage(pkg)
pkg_citation(pkg, 2)
pkg_install(pkg)

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

DBSCAN

db <- dbscan(x, eps = .42, minPts = 5)
db

Visualize the resulting clustering (noise points are shown in black).

pairs(x, col = db$cluster + 1L)

OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt

Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = .4)
plot(opt)

HDBSCAN

hdb <- hdbscan(x, minPts = 4)
hdb

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

plot(hdb, show_flat = TRUE)

Using dbscan with tidyverse

dbscan provides for all clustering algorithms tidy(), augment(), and glance() so they can be easily used with tidyverse, ggplot2 and tidymodels.

library(tidyverse)
db <- x %>% dbscan(eps = .42, minPts = 5)

Get cluster statistics as a tibble

tidy(db)

Visualize the clustering with ggplot2 (use an x for noise points)

augment(db, x) %>% 
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
    geom_point(aes(color = .cluster, shape = noise)) +
    scale_shape_manual(values=c(19, 4))

Using dbscan from Python

R, the R package dbscan, and the Python package rpy2 need to be installed.

```{python, eval = FALSE} import pandas as pd import numpy as np

prepare data

iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']) iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

get R dbscan package

from rpy2.robjects import packages dbscan = packages.importr('dbscan')

enable automatic conversion of pandas dataframes to R dataframes

from rpy2.robjects import pandas2ri pandas2ri.activate()

db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5) print(db)

DBSCAN clustering for 150 objects.

Parameters: eps = 0.5, minPts = 5

Using euclidean distances and borderpoints = TRUE

The clustering contains 2 cluster(s) and 17 noise points.

0 1 2

17 49 84

Available fields: cluster, eps, minPts, dist, borderPoints

```{python, eval = FALSE}
# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels

## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
##       dtype=int32)

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework's Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.

Changes

List of changes from NEWS.md

References

mhahsler/dbscan documentation built on June 15, 2025, 9:42 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

mhahsler/dbscan
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

In mhahsler/dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

Introduction

Usage

Using dbscan with tidyverse

Using dbscan from Python

prepare data

get R dbscan package

enable automatic conversion of pandas dataframes to R dataframes

DBSCAN clustering for 150 objects.

Parameters: eps = 0.5, minPts = 5

Using euclidean distances and borderpoints = TRUE

The clustering contains 2 cluster(s) and 17 noise points.

0 1 2

17 49 84

Available fields: cluster, eps, minPts, dist, borderPoints

License

Changes

References

R Package Documentation

Browse R Packages

We want your feedback!

mhahsler/dbscan Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

In mhahsler/dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

Introduction

Usage

Using dbscan with tidyverse

Using dbscan from Python

prepare data

get R dbscan package

enable automatic conversion of pandas dataframes to R dataframes

DBSCAN clustering for 150 objects.

Parameters: eps = 0.5, minPts = 5

Using euclidean distances and borderpoints = TRUE

The clustering contains 2 cluster(s) and 17 noise points.

0 1 2

17 49 84

Available fields: cluster, eps, minPts, dist, borderPoints

License

Changes

References

R Package Documentation

Browse R Packages

We want your feedback!

mhahsler/dbscan
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms