Bernstein.compute.pvalues: Function to compute the stability indices and the p-values...

Bernstein.compute.pvaluesR Documentation

Function to compute the stability indices and the p-values associated to a set of clusterings according to Bernstein inequality.

Description

For a given similarity matrix a list of stability indices, sorted by descending order, from the most significant clustering to the least significant is given, and the corresponding p-values, computed according to a Bernstein inequality based test are provided.

Usage

Bernstein.compute.pvalues(sim.matrix)

Bernstein.ind.compute.pvalues(sim.matrix)

Arguments

sim.matrix

a matrix that stores the similarity between pairs of clustering across multiple number of clusters and multiple clusterings. Rows correspond to the different clusterings; columns to the n repeated clusterings for each number of clusters. Row 1 corresponds to a 2-clustering, row 2 to a 3-clustering, ... row m to a m+1 clustering.

Details

The stability index for a given clustering is computed as the mean of the similarity indices between pairs of k-clusterings obtained from the perturbed data. The similarity matrix given as input can be obtained from the functions do.similarity.resampling, do.similarity.projection, do.similarity.noise. A list of p-values, sorted by descending order, from the most significant clustering to the least significant is given according to a test based on Bernstein inequality. The test is based on the distribution of the similarity measures between pairs of clustering performed on perturbed data, but differently from the chi-square based test (see Chi.square.compute.pvalues), no assumptions are made about the "a priori" distribution of the similarity measures. The function Bernstein.ind.compute.pvalues assumes also that the the random variables represented by the means of the similarities between pairs of clusterings are independent, while, on the contrary, the function Bernstein.compute.pvalues no assumptions are made. Low p-value mean that there is a significant difference between the top sorted and the given clustering. Please, see the papers cited in the reference section for more technical details.

Value

a list with 4 components:

ordered.clusterings

a vector with the number of clusters ordered from the most significant to the least significant

p.value

a vector with the corresponding p-values computed according to Bernstein inequality and Bonferroni correction in descending order (their values correspond to the clusterings of the vector ordered.clusterings)

means

vector with the mean similarity for each clustering

variance

vector with the variance of the similarity for each clustering

Author(s)

Giorgio Valentini valentini@di.unimi.it

References

W. Hoeffding, Probability inequalities for sums of independent random variables, J. Amer. Statist. Assoc. vol.58 pp. 13-30, 1963.

A.Bertoni, G. Valentini, Discovering significant structures in clustered data through Bernstein inequality, CISI '06, Conferenza Italiana Sistemi Intelligenti, Ancona, Italia, 2006.

See Also

Chi.square.compute.pvalues, Hypothesis.testing,

do.similarity.resampling, do.similarity.projection, do.similarity.noise

Examples


library("clusterv")
# Computation of the p-values according to Bernstein inequality using 
# resampling techniques and a hierarchical clustering algorithm
M <- generate.sample.h2 (n=20, l=10, Delta.h=4, Delta.v=2, sd=0.15);
S.HC <- do.similarity.resampling (M, c=15, nsub=20, f=0.8, s=sFM, 
                           alg.clust.sim=Hierarchical.sim.resampling);
# Bernstein test with no assumption of independence
Bernstein.compute.pvalues(S.HC)
# Bernstein test with  assumption of independence
Bernstein.ind.compute.pvalues(S.HC)


mosclust documentation built on June 8, 2025, 11:23 a.m.