DSD_Gaussians: Mixture of Gaussians Data Stream Generator

Description

A data stream generator that produces a data stream with a mixture of static Gaussians.

Usage

DSD_Gaussians(
  k = 3,
  d = 2,
  p,
  mu,
  sigma,
  variance_limit = c(0.001, 0.002),
  separation = 6,
  space_limit = c(0, 1),
  noise = 0,
  noise_limit = space_limit,
  noise_separation = 3,
  separation_type = c("Euclidean", "Mahalanobis"),
  verbose = FALSE
)

Arguments

k

Determines the number of clusters.

d

Determines the number of dimensions.

p

A vector of probabilities that determines the likelihood of generating a data point from a particular cluster.

mu

A matrix of means for each dimension of each cluster.

sigma

A list of length k of covariance matrices.

variance_limit

Lower and upper limit for the randomly generated variance when creating cluster covariance matrices.

separation

Minimum separation distance between clusters (measured in standard deviations according to separation_type).

space_limit

Defines the space bounds. All constructs are generated inside these bounds. For clusters this means that their centroids must be within these space bounds.

noise

Noise probability between 0 and 1. Noise is uniformly distributed within noise_limit (see below).

noise_limit

A matrix with d rows and 2 columns. The first column contains the minimum values and the second column contains the maximum values for noise.

noise_separation

Minimum separation distance between cluster centers and noise points (measured in standard deviations according to separation_type). 0 means separation is ignored.

separation_type

The type of the separation distance calculation. It can be either Euclidean distance or Mahalanobis distance.

verbose

Report cluster and outlier generation process.
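
The two separation_type options listed above differ in how the distance to a cluster center is measured. The following base R sketch only illustrates that difference for a single point; the center, covariance matrix, and point are made-up values, and the way the package scales either distance into "standard deviations" is not shown here.

```r
# Illustrative values only (not taken from the package).
center <- c(0.5, 0.5)
S <- matrix(c(0.002, 0.0015, 0.0015, 0.002), nrow = 2)  # cluster covariance
x <- c(0.6, 0.6)

# Euclidean distance ignores the shape of the cluster.
euclid <- sqrt(sum((x - center)^2))

# stats::mahalanobis() returns the squared distance, so take the square root.
# The result is expressed in units of the cluster's own spread.
mahal <- sqrt(mahalanobis(x, center, S))

c(Euclidean = euclid, Mahalanobis = mahal)
```

Because the Mahalanobis distance accounts for the covariance, the same point can be near or far from a center depending on the direction in which the cluster is elongated.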

Details

DSD_Gaussians creates a mixture of k static clusters in a d-dimensional space. The cluster centers mu and the covariance matrices sigma can be supplied; otherwise they are randomly generated. The probability vector p defines for each cluster the probability that the next data point will be chosen from it (defaults to equal probability). Separation between generated clusters (and outliers; see below) can be imposed using Euclidean or Mahalanobis distance, as selected by the separation_type parameter. The separation value itself is supplied in the separation parameter. The generation method is similar to the one suggested by Jain and Dubes (1988).
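
To make the roles of p, mu, and sigma concrete, here is a minimal base R sketch of sampling from a static Gaussian mixture. It is an illustration under stated assumptions, not the package's implementation: sample_gaussian_mixture is a hypothetical helper, mu is assumed to have one row per cluster, and separation checks are left out.

```r
# Hypothetical helper, not part of the stream package.
sample_gaussian_mixture <- function(n, p, mu, sigma) {
  k <- nrow(mu)  # assumed layout: one row per cluster
  d <- ncol(mu)  # one column per dimension
  # Choose the generating cluster of each point with probabilities p.
  cl <- sample.int(k, size = n, replace = TRUE, prob = p)
  pts <- matrix(NA_real_, nrow = n, ncol = d)
  for (j in seq_len(k)) {
    idx <- which(cl == j)
    if (length(idx) == 0L) next
    # If z is standard normal, z %*% chol(S) has covariance S.
    z <- matrix(rnorm(length(idx) * d), ncol = d)
    pts[idx, ] <- sweep(z %*% chol(sigma[[j]]), 2, mu[j, ], "+")
  }
  list(points = pts, cluster = cl)
}

set.seed(1)
mu <- rbind(c(0.2, 0.2), c(0.8, 0.8))
sigma <- list(diag(0.001, 2), diag(0.002, 2))
res <- sample_gaussian_mixture(1000, p = c(0.1, 0.9), mu = mu, sigma = sigma)

# Share of points per cluster; should be close to p.
prop.table(table(res$cluster))
```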

Noise points, drawn uniformly from noise_limit, can be added.
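
As a rough base R sketch of the idea (not the package's code), each point can be flagged as noise with probability noise and then drawn uniformly from the box described by noise_limit. The limits, sample size, and variable names below are illustrative assumptions, and the noise_separation check is not modeled.

```r
# Illustrative settings; noise_limit has one row per dimension (min, max).
noise_limit <- rbind(c(-1, 1), c(0, 5))
noise <- 0.05
n <- 2000
d <- nrow(noise_limit)

set.seed(2)
# Flag each point as noise with probability `noise`.
is_noise <- runif(n) < noise
m <- sum(is_noise)

# Draw the noise points uniformly within the per-dimension limits.
# The matrix is filled column by column, so limits are repeated m times each.
noise_pts <- matrix(
  runif(m * d,
        min = rep(noise_limit[, 1], each = m),
        max = rep(noise_limit[, 2], each = m)),
  nrow = m, ncol = d
)

# Observed range per dimension; should fall inside noise_limit.
apply(noise_pts, 2, range)
```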

Outlier points can be added. The outlier spatial positions predefined_outlier_space_positions and the outlier stream positions predefined_outlier_stream_positions can be supplied or will be randomly generated. Cluster and outlier separation distance is determined by and outlier_virtual_variance parameters. The outlier virtual variance defines an empty space around outliers, which separates them from their surroundings. Unlike noise, outliers are data points of interest for end-users, and the goal of outlier detectors is to find them in data streams. For more details, read the "Introduction to stream" vignette.

Value

Returns an object of class DSD_Gaussians (subclass of DSD_R, DSD).

Author(s)

Michael Hahsler

References

Jain and Dubes (1988) Algorithms for clustering data, Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

See Also

Other DSD: DSD_BarsAndGaussians(), DSD_Benchmark(), DSD_Cubes(), DSD_MG(), DSD_Memory(), DSD_Mixture(), DSD_NULL(), DSD_ReadDB(), DSD_ReadStream(), DSD_Target(), DSD_UniformNoise(), DSD_mlbenchData(), DSD_mlbenchGenerator(), DSD(), DSF(), animate_data(), close_stream(), get_points(), plot.DSD(), reset_stream()

Examples

# Example 1: create data stream with three clusters in 3-dimensional data space
#            using the default separation of 6 standard deviations.
set.seed(1)
stream1 <- DSD_Gaussians(k = 3, d = 3)
stream1

get_points(stream1, n = 5)
plot(stream1, xlim = c(0, 1), ylim = c(0, 1))


# Example 2: create data stream with specified cluster positions,
# 5% noise in a given bounding box and
# with different densities (1 to 9 between the two clusters)
stream2 <- DSD_Gaussians(k = 2, d = 2,
    mu = rbind(c(-.5, -.5), c(.5, .5)),
    p = c(.1, .9),
    variance_limit = c(0.02, 0.04),
    noise = 0.05,
    noise_limit = rbind(c(-1, 1), c(-1, 1)))

get_points(stream2, n = 5)
plot(stream2, xlim = c(-1, 1), ylim = c(-1, 1))


# Example 3: create 4 clusters and noise separated by a Mahalanobis
# distance. The distance to noise is increased to 6 standard deviations to make
# the noise points easier to detect as outliers.
stream3 <- DSD_Gaussians(k = 4, d = 2,
  separation_type = "Mahalanobis",
  space_limit = c(5, 20),
  variance_limit = c(1, 2),
  noise = 0.05,
  noise_limit = c(0, 25),
  noise_separation = 6
  )
plot(stream3)

stream documentation built on March 7, 2023, 6:09 p.m.