statistical_test: Statistical test for clustering relevance


View source: R/statistical_test.R

Description

Statistical test for checking whether the clustering of a specified dataset is relevant. Several datasets are generated under a null hypothesis, and their distributions of nearest-neighbour distances are compared with that of the original dataset.

Usage

statistical_test(X, s, null_distrib = "gaussian")
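A possible call might look like the sketch below. It assumes the package is installed under the name clusterAnalysis (inferred from the repository name) and that statistical_test is exported; the toy data and the choice s = 100 are illustrative only. The call is guarded so the snippet still runs when the package is absent.

```r
# Toy data: two well-separated groups in 2 dimensions (illustrative only)
set.seed(42)
X <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
)

# Assumption: the package name is "clusterAnalysis" and the function is exported
if (requireNamespace("clusterAnalysis", quietly = TRUE)) {
  res <- clusterAnalysis::statistical_test(X, s = 100, null_distrib = "gaussian")
  print(res$pvalue)   # proportion described in the Value section
  print(head(res$U))  # first entry is the measure for the observed data
}
```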

Arguments

X

data matrix or data frame of size n x d, n observations and d features

s

number of reference datasets to generate

null_distrib

type of the null hypothesis. Can be "gaussian", "uniform" or "uniformity". "gaussian" draws observations from a multidimensional normal distribution with the same mean and variance as the original dataset for each feature. "uniform" draws observations uniformly within the range of each feature. "uniformity" draws observations from a uniform distribution as in the gap statistic (Tibshirani et al. 2001).
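The three null hypotheses can be sketched in base R as follows. This is not the package's code: draw_reference is a hypothetical helper, the "gaussian" branch assumes independent features (per-feature mean and standard deviation), and the "uniformity" branch assumes the principal-component variant of the gap statistic reference distribution, which may differ from what the package does.

```r
# Hypothetical helper (not part of the package): draws one reference dataset
draw_reference <- function(X, null_distrib = c("gaussian", "uniform", "uniformity")) {
  null_distrib <- match.arg(null_distrib)
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Uniform draws inside the per-column range of a matrix M
  runif_box <- function(M) {
    apply(M, 2, function(col) runif(nrow(M), min(col), max(col)))
  }

  switch(null_distrib,
    gaussian = {
      # Assumption: features are drawn independently
      mu <- colMeans(X)
      sigma <- apply(X, 2, sd)
      sapply(seq_len(d), function(j) rnorm(n, mean = mu[j], sd = sigma[j]))
    },
    uniform = runif_box(X),
    uniformity = {
      # Assumption: uniform box aligned with the principal components
      Xc <- scale(X, center = TRUE, scale = FALSE)
      V <- svd(Xc)$v
      Z <- runif_box(Xc %*% V)
      # Rotate back and restore the original column means
      sweep(Z %*% t(V), 2, colMeans(X), "+")
    }
  )
}

set.seed(1)
X <- matrix(rnorm(200), nrow = 100, ncol = 2)
ref_gauss <- draw_reference(X, "gaussian")
ref_unif  <- draw_reference(X, "uniform")
ref_pc    <- draw_reference(X, "uniformity")
```

Each call returns a matrix with the same dimensions as X; repeating it s times would give the s reference datasets.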

Details

The function plots the empirical distribution function of the nearest-neighbour distances of the observed data against the empirical distribution function under the null hypothesis. It also plots the identity line, which represents the case where both distributions are in perfect agreement. If the plotted curve rises quickly above the identity line, the clustering is likely to be relevant. A returned pvalue below 0.03 is a further hint that the dataset is likely to contain clusters.
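To make the comparison concrete, here is a minimal base R sketch of such a plot for a single reference dataset. It is an illustration under assumptions, not the package's implementation: nn_dist is a hypothetical helper, Euclidean distance is assumed, the reference is drawn uniformly within each feature's range, and placing the observed distribution on the vertical axis is a guess at the orientation.

```r
# Hypothetical helper: distance from each observation to its nearest neighbour
nn_dist <- function(X) {
  D <- as.matrix(dist(X))  # Euclidean distances (assumption)
  diag(D) <- Inf           # ignore the distance of a point to itself
  apply(D, 1, min)
}

set.seed(1)
# Toy data with two compact groups (illustrative only)
X <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
)
# One reference dataset, uniform within the range of each feature
ref <- apply(X, 2, function(col) runif(length(col), min(col), max(col)))

d_obs <- nn_dist(X)
d_ref <- nn_dist(ref)
F_obs <- ecdf(d_obs)
F_ref <- ecdf(d_ref)

# Evaluate both empirical distribution functions on a common grid
grid <- sort(c(d_obs, d_ref))
plot(F_ref(grid), F_obs(grid), type = "s",
     xlab = "ECDF under the null", ylab = "ECDF of observed data")
abline(0, 1, lty = 2)  # identity line: both distributions agree
```

With clustered data the observed nearest-neighbour distances tend to be smaller than under the null, so the curve would be expected to sit above the dashed identity line.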

Value

list of 2 components

U

vector containing the discrepancy measures. The first value is the measure for the observed data; the remaining s values are for the generated datasets.

pvalue

proportion of the discrepancy measures of the generated datasets that are at least as large as the discrepancy measure of the original dataset.
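Given the layout of U described above, the pvalue can be recomputed by hand as in the sketch below. The numbers are made up for illustration (s = 5) and are not output from the package.

```r
# Made-up discrepancy measures: first entry is the observed data,
# the remaining five stand in for the generated datasets
U <- c(0.42, 0.10, 0.15, 0.45, 0.08, 0.12)

# Proportion of generated measures at least as large as the observed one
pvalue <- mean(U[-1] >= U[1])
print(pvalue)  # 0.2, since only 0.45 is >= 0.42
```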

References


mattmail/clusterAnalysis documentation built on Nov. 4, 2019, 6:18 p.m.