Bernstein.compute.pvalues: Function to compute the stability indices and the p-values...
In mosclust: Model Order Selection for Clustering

Bernstein.compute.pvalues

R Documentation

Function to compute the stability indices and the p-values associated with a set of clusterings according to the Bernstein inequality.

Description

Given a similarity matrix, the function returns a list of stability indices, sorted in descending order from the most significant clustering to the least significant, together with the corresponding p-values, computed with a test based on the Bernstein inequality.

Usage

Bernstein.compute.pvalues(sim.matrix)

Bernstein.ind.compute.pvalues(sim.matrix)

Arguments

sim.matrix

a matrix that stores the similarity between pairs of clusterings across multiple numbers of clusters and multiple clusterings. Rows correspond to the different clusterings; columns to the n repeated clusterings for each number of clusters. Row 1 corresponds to a 2-clustering, row 2 to a 3-clustering, ..., row m to an (m+1)-clustering.
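As a rough illustration of the expected layout, the sketch below builds a small matrix by hand. The object names (toy.sim, k.for.row) and all the values are hypothetical; in practice the matrix would come from one of the do.similarity.* functions.

```r
# Hypothetical toy input: 3 rows (k = 2, 3, 4 clusters), 5 similarity values each.
# The values are made up for illustration only.
toy.sim <- rbind(
  c(0.99, 0.98, 0.97, 0.99, 0.98),  # row 1: similarities between 2-clusterings
  c(0.81, 0.79, 0.84, 0.80, 0.83),  # row 2: similarities between 3-clusterings
  c(0.90, 0.91, 0.88, 0.89, 0.92)   # row 3: similarities between 4-clusterings
)

# Row i holds the values for an (i + 1)-clustering.
k.for.row <- seq_len(nrow(toy.sim)) + 1

dim(toy.sim)  # number of rows (clusterings) and columns (repeated measures)
```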

Details

The stability index for a given clustering is computed as the mean of the similarity indices between pairs of k-clusterings obtained from the perturbed data. The similarity matrix given as input can be obtained from the functions do.similarity.resampling, do.similarity.projection and do.similarity.noise.

A list of p-values, ordered from the most significant clustering to the least significant, is computed with a test based on the Bernstein inequality. The test is based on the distribution of the similarity measures between pairs of clusterings performed on perturbed data but, unlike the chi-square based test (see Chi.square.compute.pvalues), it makes no assumptions about the "a priori" distribution of the similarity measures.

The function Bernstein.ind.compute.pvalues additionally assumes that the random variables represented by the means of the similarities between pairs of clusterings are independent, whereas Bernstein.compute.pvalues makes no such assumption. A low p-value means that there is a significant difference between the top-ranked clustering and the given clustering. See the papers cited in the References section for more technical details.
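The first steps described above (row means as stability indices, then sorting) can be sketched in base R. The last part applies a generic Bernstein-type tail bound to the gap between the top clustering and each of the others; this is an assumption made for illustration and is not claimed to be the exact statistic or correction that mosclust uses. The names sim, stab, bernstein.bound and bound are hypothetical, as are the values.

```r
# Toy similarity matrix (hypothetical values): rows are k = 2, 3, 4 clusters.
sim <- rbind(c(0.98, 0.97, 0.99, 0.96),
             c(0.80, 0.85, 0.78, 0.82),
             c(0.90, 0.88, 0.91, 0.89))

# Stability index: mean similarity per row; also keep the per-row variance.
stab <- rowMeans(sim)
vars <- apply(sim, 1, var)

# Order clusterings from most to least stable; row i is an (i + 1)-clustering.
ord <- order(stab, decreasing = TRUE)
ordered.k <- ord + 1

# ASSUMPTION: generic Bernstein-type bound exp(-n * d^2 / (2 * s2 + 2 * d / 3))
# for a gap d, a variance term s2 and n measurements. Illustrative only.
bernstein.bound <- function(delta, sigma2, n) {
  exp(-n * delta^2 / (2 * sigma2 + 2 * delta / 3))
}

# Gap between the top-ranked clustering and each clustering, in sorted order.
delta <- stab[ord[1]] - stab[ord]
bound <- bernstein.bound(delta, vars[ord[1]] + vars[ord], ncol(sim))
```

A larger gap between the top-ranked clustering and another one drives the bound towards zero, which matches the reading of low p-values given above.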

Value

a list with 4 components:

`ordered.clusterings`	a vector with the number of clusters ordered from the most significant to the least significant
`p.value`	a vector with the corresponding p-values computed according to Bernstein inequality and Bonferroni correction in descending order (their values correspond to the clusterings of the vector ordered.clusterings)
`means`	vector with the mean similarity for each clustering
`variance`	vector with the variance of the similarity for each clustering
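A minimal sketch of how the returned list might be used follows. The object res is a hand-built, hypothetical stand-in carrying only two of the four documented components, with made-up values; the threshold alpha is likewise an assumption chosen for illustration.

```r
# Hypothetical stand-in for the result of Bernstein.compute.pvalues();
# only the two components used below are filled in, with made-up values.
res <- list(
  ordered.clusterings = c(2, 4, 3, 5),
  p.value             = c(1, 0.4, 0.003, 0.00001)
)

# Illustrative significance level (an assumption, not a package default).
alpha <- 0.01

# Clusterings whose p-value is not below alpha are not significantly
# different from the top-ranked one, so they remain candidates.
candidates <- res$ordered.clusterings[res$p.value >= alpha]
candidates
```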

Author(s)

Giorgio Valentini valentini@di.unimi.it

References

W. Hoeffding, Probability inequalities for sums of independent random variables, J. Amer. Statist. Assoc., vol. 58, pp. 13-30, 1963.

A. Bertoni, G. Valentini, Discovering significant structures in clustered data through Bernstein inequality, CISI '06, Conferenza Italiana Sistemi Intelligenti, Ancona, Italia, 2006.

Examples


library("clusterv")
library("mosclust")  # package that provides the functions documented here
# Computation of the p-values according to the Bernstein inequality using
# resampling techniques and a hierarchical clustering algorithm
M <- generate.sample.h2(n = 20, l = 10, Delta.h = 4, Delta.v = 2, sd = 0.15)
S.HC <- do.similarity.resampling(M, c = 15, nsub = 20, f = 0.8, s = sFM,
                                 alg.clust.sim = Hierarchical.sim.resampling)
# Bernstein test with no assumption of independence
Bernstein.compute.pvalues(S.HC)
# Bernstein test with the assumption of independence
Bernstein.ind.compute.pvalues(S.HC)

mosclust documentation built on June 8, 2025, 11:23 a.m.