SigmaIndex: Sigma-Index: a statistical indexing organizing structure
In dkrleza/SHClus: Statistical Hierarchical Clustering


View source: R/SI_Wrappers.R

The Sigma-Index statistical organizing structure.

SigmaIndex(theta=3, neighborhood=9, precision_switch=TRUE)
convertFromDSD(x, total_elements=500, theta=3, neighborhood=9, precision_switch=TRUE)

`theta`	(numeric) - A threshold that defines the bound of the statistical population. Used in statistical classification and inference.
`neighborhood`	(numeric) - A neighborhood statistical threshold, used to form the sigma-index DAG. Must be >`theta`.
`precision_switch`	(logical) - Whether to favor precision over speed.
`x`	(DSD_Gaussians) - Gaussians data stream definition from the stream package. Gaussian definitions and outliers are used to convert to the Sigma-Index DAG. The same DSD can be used later to test the created Sigma-Index.
`total_elements`	(integer) - When converting a Sigma-Index DAG from DSD, we need to supply a total number of data points, to calculate population elements and numbers, since the Sigma-Index is probability structured.

A Sigma-Index is a statistical organizing structure that aims to improve query and processing times for statistical algorithms when searching a space comprising statistical populations. Those interested in inspecting the Sigma-Index code should start from the following units:

SigmaIndex.hpp, SigmaIndex.cpp, SigmaIndex_Inline.cpp, SigmaIndexProxy.hpp, and SigmaIndexProxy.cpp - C++ units comprising the Sigma-Index implementation suitable for the Statistical Hierarchical Clusterer.
SHC.cpp - C++ unit that uses the Sigma-Index. After construction of the SHC object [1], we create and add a Sigma-Index using the void useSigmaIndex(int neighborhood_mutiplier=2, bool precision_switch=true); method. All processing is done through the SHC method shared_ptr<ClassificationResult> process(VectorXd *newElement, bool classifyOnly=false);, which is the primary place for Sigma-Index invocations.

The Sigma-Index computational cost reduction is covered by three (3) distinct tests:

A generic synthetic test - SHC_TestCase_ClustersAndOutliers_SigmaIndex
A synthetic test covering balancing theorems - SHC_TestCase_ClustersAndOutliers_SigmaIndex_Theorems
A real-life sensor dataset test - SHC_TestCase_Sensors_SigmaIndex

All methods here are detailed in the stream package.

addPopulation(x, id, mean, covariance, elements=1, ...)

Adds a new population to the SigmaIndex object x. A population is characterized by its identifier (id - (string)), a centroid (mean - (numeric vector)), a covariance matrix (covariance - (matrix)), and a number of elements (elements - (integer)). The covariance matrix must be non-singular, that is, invertible.
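Since a singular covariance matrix is not accepted, it can help to check a matrix before passing it on. The following is a minimal base-R sketch; is_valid_covariance is a hypothetical helper written for this illustration and is not part of SHClus, and the tolerance is an arbitrary choice.

```r
# Hypothetical helper (not part of SHClus): TRUE when `covariance` is a
# square, symmetric matrix that can be inverted numerically.
is_valid_covariance <- function(covariance, tol = 1e-10) {
  if (!is.matrix(covariance) || nrow(covariance) != ncol(covariance)) {
    return(FALSE)
  }
  if (!isSymmetric(covariance)) {
    return(FALSE)
  }
  # rcond() estimates the reciprocal condition number; values near zero
  # indicate a numerically singular matrix.
  rcond(covariance) > tol
}

good <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # full rank
bad  <- matrix(c(1, 2, 2, 4), nrow = 2)      # second column is twice the first

is_valid_covariance(good)
is_valid_covariance(bad)
```

The check uses the condition number rather than the determinant because a determinant close to zero depends on the scale of the data, while the condition number does not.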

addDataPoints(x, id, data_points, neighborhood, ...)

Updates a statistical population with one or more data points in the SigmaIndex object x. If adding one value, a caller must supply the population identifier (id - (string)) and a new data point (data_points - (numeric vector)). When supplying multiple data points, id must be a string vector containing population identifiers for all added data points, and data_points must be a data frame containing all data points.
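The two input shapes described above can be prepared as follows. This sketch only builds and checks the inputs in base R; the identifiers "A" and "B" and the coordinates are made up, and no SigmaIndex object is touched.

```r
# Single data point: one identifier and one numeric vector.
single_id    <- "A"
single_point <- c(1.2, 3.4)

# Multiple data points: one row per point, one identifier per row.
data_points <- data.frame(x = c(1.2, 0.9, 10.1),
                          y = c(3.4, 3.1, 20.5))
ids <- c("A", "A", "B")

# The identifier vector and the data frame have to line up row by row.
stopifnot(is.character(ids), length(ids) == nrow(data_points))

# Grouping the rows by identifier shows which points go to which population.
by_population <- split(data_points, ids)
by_population
```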

addDataPointsInc(x, data_points, query_results, ...)

Incrementally updates a statistical population with one or more data points in the SigmaIndex object x. Prior to calling this method, a query must be done on the same set of data points. Results of this query are then used to incrementally update the SigmaIndex DAG. See the example.

queryDataPoints(x, data_points, ...)

Performs a query on the SigmaIndex object x. The data_points parameter can be one value or a data frame comprising a set of values. Returns a list having the following members:

classified - a set of population identifiers for which a tested data point is closer (<=) than the theta threshold to the population centroid
neighborhood - a set of populations for which each tested data point is between the theta and neighborhood thresholds

During this query call, the SigmaIndex object collects computation cost reduction statistics.
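The split into classified and neighborhood can be sketched in base R. The sketch assumes that the thresholds are compared against the Mahalanobis distance from each population centroid; this is suggested by the inverted covariance matrix kept for each population but is not stated here. classify_point and the two populations are hypothetical, and the sketch scans every population sequentially instead of using the DAG.

```r
# Two illustrative populations (values are made up for the example).
populations <- list(
  A = list(mean = c(0, 0),   covariance = diag(2)),
  B = list(mean = c(10, 10), covariance = diag(2) * 4)
)

# Hypothetical helper: split populations into the two result sets for a point.
classify_point <- function(point, populations, theta = 3, neighborhood = 9) {
  # mahalanobis() returns the squared distance, hence the sqrt().
  d <- sapply(populations, function(p) {
    sqrt(mahalanobis(point, center = p$mean, cov = p$covariance))
  })
  list(
    classified   = names(d)[d <= theta],
    neighborhood = names(d)[d > theta & d <= neighborhood]
  )
}

res <- classify_point(c(1, 1), populations)
res$classified    # populations within theta of the point
res$neighborhood  # populations between theta and neighborhood
```

For the point (1, 1) the distance to A is about 1.41 and the distance to B is about 6.36, so A lands in classified and B in neighborhood under the default thresholds.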

getTotalPopulations(x, ...)

Returns the number of populations in the SigmaIndex object x.

print(x, ...)

Prints details and structure of the SigmaIndex object x.

getPopulations(x, ...)

Returns details of all populations in the SigmaIndex object x. Returns a named list having the following members:

the name of the element is the population identifier
mean (numeric vector) - the population centroid
icovariance (matrix) - the inverted covariance matrix of the population
elements (integer) - the number of the population elements
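Because the returned matrix is already inverted, it can be passed to the base-R mahalanobis() function with inverted = TRUE. The list below is a hand-made stand-in shaped like the documented return value, with made-up numbers; it is not actual package output.

```r
# Stand-in for the list returned by getPopulations(); values are made up.
pops <- list(
  A = list(mean = c(0, 0),
           icovariance = solve(matrix(c(4, 0, 0, 1), nrow = 2)),
           elements = 100L)
)

point <- c(2, 1)
p <- pops[["A"]]

# inverted = TRUE tells mahalanobis() that `cov` is already the inverse;
# the function returns the squared distance, hence the sqrt().
dist_A <- sqrt(mahalanobis(point, center = p$mean, cov = p$icovariance,
                           inverted = TRUE))
dist_A
```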

removePopulation(x, id, ...)

Removes the population having identifier id from the SigmaIndex object x.

resetStatistics(x, ...)

Resets the statistical counters of the SigmaIndex object x.

getHistogram(x, ...)

Returns a numeric vector comprising 101 integers, which represents a histogram of the SigmaIndex object x. The histogram contains the number of population elements related to each specific computation cost reduction.

getStatistics(x, ...)

Returns the statistical counters for the SigmaIndex object x. These counters can be reset by the resetStatistics call. This method returns a list comprising the following members:

totalCount - Total populations searched in all queries
classifiedNodes - Total number of populations that had queried data points in range <=neighborhood from their centroids
missedNodes - Total number of populations that had queried data points in range >neighborhood from their centroids
sequentialNodes - Total number of populations that would be searched in the sequential scan approach
computationCostReduction - The overall computational cost reduction compared to the sequential scan approach
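One plausible reading of how these counters relate is sketched below. The formula is an assumption made for illustration, since the exact computation used by the package is not given here, and the counter values are made up.

```r
# Made-up counters shaped like the list returned by getStatistics().
stat <- list(totalCount = 2500, sequentialNodes = 10000)

# Assumed formula: the share of population visits that the index avoided
# relative to scanning every population for every query.
reduction <- 1 - stat$totalCount / stat$sequentialNodes
reduction
```

Under this reading, a value of 0.75 would mean that the index visited a quarter of the populations a sequential scan would have visited.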

[1] Krleža D, Vrdoljak B, and Brčić M, Statistical hierarchical clustering algorithm for outlier detection in evolving data streams, Machine Learning, Sep. 2020

 
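library(stream)  # provides DSD_Gaussians and get_points
library(SHClus)

# Gaussian data stream definition with 40 clusters and 40 outliers
d <- DSD_Gaussians(k=40,outliers=40,separation_type="Mahalanobis",separation=4,
                   space_limit=c(0,150),variance_limit=4,
                   outlier_options=list(outlier_horizon=200000))
# build the Sigma-Index DAG from the stream definition
si <- convertFromDSD(d,200000,theta=3.2,neighborhood=12.9)
# query the index; cost reduction statistics are collected during the query
res <- queryDataPoints(si, get_points(d, 200000))
hist <- getHistogram(si)
stat <- getStatistics(si)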

# incremental update: query first, then reuse the query results
points <- get_points(d, 200)
query_res <- queryDataPoints(si, points)
addDataPointsInc(si, points, query_res)