DSC_DBSTREAM: DBSTREAM Clustering Algorithm

View source: R/DSC_DBSTREAM.R

DSC_DBSTREAMR Documentation

DBSTREAM Clustering Algorithm

Description

Micro Clusterer with reclustering. Implements a simple density-based stream clustering algorithm that assigns data points to micro-clusters with a given radius and implements shared-density-based reclustering.

Usage

DSC_DBSTREAM(
  formula = NULL,
  r,
  lambda = 0.001,
  gaptime = 1000L,
  Cm = 3,
  metric = "Euclidean",
  noise_multiplier = 1,
  shared_density = FALSE,
  alpha = 0.1,
  k = 0,
  minweight = 0
)

get_shared_density(x, use_alpha = TRUE)

change_alpha(x, alpha)

## S3 method for class 'DSC_DBSTREAM'
plot(
  x,
  dsd = NULL,
  n = 500,
  col_points = NULL,
  dim = NULL,
  method = "pairs",
  type = c("auto", "micro", "macro", "both", "none"),
  shared_density = FALSE,
  use_alpha = TRUE,
  assignment = FALSE,
  ...
)

DSOutlier_DBSTREAM(
  formula = NULL,
  r,
  lambda = 0.001,
  gaptime = 1000L,
  Cm = 3,
  metric = "Euclidean",
  outlier_multiplier = 2
)

Arguments

formula

NULL to use all features in the stream or a model formula of the form ~ X1 + X2 to specify the features used for clustering. Only ., + and - are currently supported in the formula.

r

The radius of micro-clusters.

lambda

The lambda used in the fading function.

gaptime

weak micro-clusters (and weak shared density entries) are removed every gaptime points.

Cm

minimum weight for a micro-cluster.

metric

metric used to calculate distances.

noise_multiplier, outlier_multiplier

multiplier for radius r to declare noise or outliers.

shared_density

Record shared density information. If set to TRUE then shared density is used for reclustering, otherwise reachability is used (overlapping clusters with less than r * (1 - alpha) distance are clustered together).

alpha

For shared density: The minimum proportion of shared points between to clusters to warrant combining them (a suitable value for 2D data is .3). For reachability clustering it is a distance factor.

k

The number of macro clusters to be returned if macro is true.

minweight

The proportion of the total weight a macro-cluster needs to have not to be noise (between 0 and 1).

x

A DSC_DBSTREAM object to get the shared density information from.

use_alpha

only return shared density if it exceeds alpha.

dsd

a data stream object.

n

number of plots taken from the dsd to plot.

col_points

color used for plotting.

dim

an integer vector with the dimensions to plot. If NULL then for methods "pairs" and "pc" all dimensions are used and for "scatter" the first two dimensions are plotted.

method

plot method.

type

Plot micro clusters (type="micro"), macro clusters (type="macro"), both micro and macro clusters (type="both"), outliers(type="outliers"), or everything together (type="all"). type="auto" leaves to the class of DSC to decide.

assignment

logical; show assignment area of micro-clusters.

...

further arguments are passed on to plot or pairs in graphics.

Details

The DBSTREAM algorithm checks for each new data point in the incoming stream, if it is below the threshold value of dissimilarity value of any existing micro-clusters, and if so, merges the point with the micro-cluster. Otherwise, a new micro-cluster is created to accommodate the new data point.

Although DSC_DBSTREAM is a micro clustering algorithm, macro clusters and weights are available.

update() invisibly return the assignment of the data points to clusters. The columns are .class with the index of the strong micro-cluster and .mc_id with the permanent id of the strong micro-cluster.

plot() for DSC_DBSTREAM has two extra logical parameters called assignment and shared_density which show the assignment area and the shared density graph, respectively.

predict() can be used to assign new points to clusters. Points are assigned to a micro-cluster if they are within its assignment area (distance is less then r times noise_multiplier).

DSOutlier_DBSTREAM classifies points as outlier/noise if they that cannot be assigned to a micro-cluster representing a dense region as a outlier/noise. Parameter outlier_multiplier specifies how far a point has to be away from a micro-cluster as a multiplier for the radius r. A larger value means that outliers have to be farther away from dense regions and thus reduce the chance of misclassifying a regular point as an outlier.

Value

An object of class DSC_DBSTREAM (subclass of DSC, DSC_R, DSC_Micro).

Author(s)

Michael Hahsler and Matthew Bolanos

References

Michael Hahsler and Matthew Bolanos. Clustering data streams based on shared density between micro-clusters. IEEE Transactions on Knowledge and Data Engineering, 28(6):1449–1461, June 2016

See Also

Other DSC_Micro: DSC_BICO(), DSC_BIRCH(), DSC_DStream(), DSC_Micro(), DSC_Sample(), DSC_Window(), DSC_evoStream()

Other DSC_TwoStage: DSC_DStream(), DSC_TwoStage(), DSC_evoStream()

Other DSOutlier: DSC_DStream(), DSOutlier()

Examples

set.seed(1000)
stream <- DSD_Gaussians(k = 3, d = 2, noise = 0.05)

# create clusterer with r = .05
dbstream <- DSC_DBSTREAM(r = .05)
update(dbstream, stream, 500)
dbstream

# check micro-clusters
nclusters(dbstream)
head(get_centers(dbstream))
plot(dbstream, stream)

# plot micro-clusters with assignment area
plot(dbstream, stream, type = "none", assignment = TRUE)


# DBSTREAM with shared density
dbstream <- DSC_DBSTREAM(r = .05, shared_density = TRUE, Cm = 5)
update(dbstream, stream, 500)
dbstream

plot(dbstream, stream)
# plot the shared density graph (several options)
plot(dbstream, stream, type = "micro", shared_density = TRUE)
plot(dbstream, stream, type = "none", shared_density = TRUE, assignment = TRUE)

# see how micro and macro-clusters relate
# each micro-cluster has an entry with the macro-cluster id
# Note: unassigned micro-clusters (noise) have an NA
microToMacro(dbstream)

# do some evaluation
evaluate_static(dbstream, stream, measure = "purity")
evaluate_static(dbstream, stream, measure = "cRand", type = "macro")

# use DBSTREAM also returns the cluster assignment
# later retrieve the cluster assignments for each point)
data("iris")
dbstream <- DSC_DBSTREAM(r = 1)
cl <- update(dbstream, iris[,-5], return = "assignment")
dbstream

head(cl)

# micro-clusters
plot(iris[,-5], col = cl$.class, pch = cl$.class)

# macro-clusters (2 clusters since reachability cannot separate two of the three species)
plot(iris[,-5], col = microToMacro(dbstream, cl$.class))

# use DBSTREAM with a formula (cluster all variables but X2)
stream <- DSD_Gaussians(k = 3, d = 4, noise = 0.05)
dbstream <- DSC_DBSTREAM(formula = ~ . - X2, r = .2)

update(dbstream, stream, 500)
get_centers(dbstream)

# use DBSTREAM for outlier detection
stream <- DSD_Gaussians(k = 3, d = 4, noise = 0.05)
outlier_detector <- DSOutlier_DBSTREAM(r = .2)

update(outlier_detector, stream, 500)
outlier_detector

plot(outlier_detector, stream)

points <- get_points(stream, 20)
points
which(is.na(predict(outlier_detector, points)))

stream documentation built on May 29, 2024, 9:43 a.m.