synthetic_stream: Create a Synthetic Data Stream

View source: R/synthetic_stream.R

synthetic_stream    R Documentation



This function creates a synthetic data stream with data points in roughly [0, 1]^d by choosing points from k clusters following a sequence through these clusters. Each cluster has a density function following a d-dimensional normal distribution. Outliers are introduced in the test set only.


synthetic_stream(k = 10, d = 2, n_subseq = 100, p_transition = 0.5, p_swap = 0,
n_train = 5000, n_test = 1000, p_outlier = 0.01, rangeVar = c(0, 0.005))



k: number of clusters.

d: dimensionality of the data set.

n_subseq: length of the subsequence which is repeated to create the data set.

p_transition: probability that the next position in the subsequence belongs to a different cluster.

p_swap: probability that two data points are swapped. This represents measurement errors (e.g., data points arriving out of order) or a data stream that does not exactly follow the subsequence.

n_train: size of the training set (without outliers).

n_test: size of the test set (with outliers).

p_outlier: probability that a data point is replaced by an outlier (a randomly chosen point in [0,1]^d).

rangeVar: used to create the random covariance matrices for the clusters. See genPositiveDefMat() in clusterGeneration for details.


The data generation process creates a data set consisting of k clusters in roughly [0,1]^d. The data points for each cluster are drawn from a multivariate normal distribution with a random mean and a random variance/covariance matrix for each cluster. The temporal aspect is modeled by a fixed subsequence (of length n_subseq) through the k clusters. At each step in the subsequence, the next data point moves to a randomly chosen other cluster with probability p_transition and otherwise stays in the same cluster, so both slowly and quickly changing data can be created. For the complete sequence, the subsequence is repeated until n_train (training set) or n_test (test set) data points are created. The data set is generated by drawing a data point from the cluster corresponding to each position in the sequence. Outliers are introduced by replacing data points in the test set with probability p_outlier by randomly chosen data points in [0,1]^d. Finally, to introduce imperfection in the temporal sequence (e.g., because the data does not exactly follow a repeating sequence or because observations do not arrive in the correct order), we swap two consecutive observations with probability p_swap.
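The steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the range used for the cluster centers, the diagonal covariance matrices (a stand-in for genPositiveDefMat()), and the choice to swap cluster labels together with the points are all assumptions made for this sketch.

```r
set.seed(42)  # reproducibility of this sketch only

# Parameters named after the documented arguments (small values for illustration)
k <- 4; d <- 2; n_subseq <- 20; n <- 50
p_transition <- 0.5; p_outlier <- 0.1; p_swap <- 0.1

# Assumed cluster model: random centers and simple diagonal covariances.
# The diagonal form is a stand-in for genPositiveDefMat(), not the real call.
mu <- matrix(runif(k * d, min = 0.1, max = 0.9), nrow = k)
Sigma <- lapply(seq_len(k), function(i) diag(runif(d, 0.001, 0.005), nrow = d))

# 1. Fixed subsequence: with probability p_transition, move to another cluster.
subseq <- integer(n_subseq)
subseq[1] <- sample.int(k, 1)
for (i in seq_len(n_subseq - 1)) {
  if (runif(1) < p_transition) {
    others <- setdiff(seq_len(k), subseq[i])
    subseq[i + 1] <- others[sample.int(length(others), 1)]
  } else {
    subseq[i + 1] <- subseq[i]
  }
}

# 2. Repeat the subsequence until n positions are filled.
sequence <- rep(subseq, length.out = n)

# 3. Draw one point per position from that cluster's normal distribution.
draw_point <- function(cl) {
  as.vector(mu[cl, ] + t(chol(Sigma[[cl]])) %*% rnorm(d))
}
x <- t(vapply(sequence, draw_point, numeric(d)))

# 4. Replace points by uniform outliers in [0,1]^d with probability p_outlier.
outlier_position <- runif(n) < p_outlier
x[outlier_position, ] <- runif(sum(outlier_position) * d)

# 5. Swap consecutive observations with probability p_swap.
#    (Swapping the labels along with the points is an assumption.)
swap_idx <- which(runif(n - 1) < p_swap)
for (i in swap_idx) {
  x[c(i, i + 1), ] <- x[c(i + 1, i), ]
  sequence[c(i, i + 1)] <- sequence[c(i + 1, i)]
}
```

Lowering p_transition in this sketch produces longer runs within one cluster, which corresponds to the slowly changing data mentioned above.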


A list with the following elements:


test: test data.

train: training data.

sequence_test: sequence of the test data points through the clusters.

sequence_train: sequence of the training data points through the clusters.

swap_test: index where points are swapped (test data).

swap_train: index where points are swapped (training data).

outlier_position: logical vector for outliers in test data.

model: centers and covariance matrices for the clusters.
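To get a quick overview of the returned list, its top-level structure can be printed with str(). The snippet below is a small sketch that assumes the rEMM package is installed and does nothing otherwise. It relies only on element names that also appear in the examples for this function.

```r
# Assumes the rEMM package is installed; the block is skipped otherwise.
if (requireNamespace("rEMM", quietly = TRUE)) {
  # create only test data (with outliers), using the other defaults
  ds <- rEMM::synthetic_stream(n_train = 0)

  # show the top-level elements of the returned list
  str(ds, max.level = 1)

  # count regular points (FALSE) and outliers (TRUE) in the test data
  print(table(ds$outlier_position))
}
```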


## load the package that provides synthetic_stream()
library(rEMM)

## create only test data (with outliers)
ds <- synthetic_stream(n_train = 0)

## plot test data
plot(ds$test, pch = ds$sequence_test, col = "gray")
text(ds$model$mu[, 1], ds$model$mu[, 2], 1:10)

## mark outliers
points(ds$test[ds$outlier_position, ],
  pch = 3, lwd = 2, col = "red")

rEMM documentation built on May 29, 2024, 4:35 a.m.