uhclust: U-statistic based significance hierarchical clustering


Description

Hierarchical clustering method that partitions the data only when these partitions are statistically significant.

Usage

uhclust(md = NULL, data = NULL, alpha = 0.05, rep = 15,
  plot = TRUE)

Arguments

md

Matrix of squared Euclidean distances between all data points.

data

Data matrix. Each row represents an observation.

alpha

Significance level.

rep

Number of times to repeat optimization procedures. Important for problems with multiple optima.

plot

Logical. If TRUE, a p-value annotated dendrogram is plotted.

Details

This function implements the significance hierarchical clustering procedure of Valk and Cybis (2018). The data are repeatedly partitioned into two subgroups by the function uclust, following a hierarchical scheme. Partitioning stops when the resulting subgroups are homogeneous or have fewer than 3 elements. The function is intended for high-dimension, low-sample-size settings.
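The recursive scheme described above can be illustrated with a toy sketch. This is not the package's implementation: split_fn and is_significant are hypothetical stand-ins (a largest-gap split and a crude mean-difference rule) for the uclust partition and its U-statistic test, and the one-dimensional data are made up purely to show the control flow and the stopping rule.

```r
# Toy data: one-dimensional values with two well-separated clusters.
v <- c(0, 0.1, 0.2, 5, 5.1, 5.2, 5.3)

# Hypothetical splitter (stand-in for uclust): cut the sorted values
# at the largest gap between neighbours.
split_fn <- function(idx) {
  o <- idx[order(v[idx])]
  k <- which.max(diff(v[o]))
  list(o[seq_len(k)], o[(k + 1):length(o)])
}

# Hypothetical stand-in for the significance test:
# call a split "significant" if the group means differ by more than 1.
is_significant <- function(parts) {
  abs(mean(v[parts[[1]]]) - mean(v[parts[[2]]])) > 1
}

# Recursive bipartition: stop when a group has fewer than min_size
# elements or when its best split is not significant.
recursive_split <- function(idx, min_size = 3) {
  if (length(idx) < min_size) return(list(idx))
  parts <- split_fn(idx)
  if (!is_significant(parts)) return(list(idx))
  c(recursive_split(parts[[1]], min_size),
    recursive_split(parts[[2]], min_size))
}

groups <- recursive_split(seq_along(v))
groups  # two groups: indices 1 to 3 and 4 to 7
```

In uhclust itself the split and the test are both provided by uclust, and the significance level is corrected across nodes, so the sketch only mirrors the overall recursion.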

Either data or md should be provided. If data are entered directly, Bn will be computed using the squared Euclidean distance. If a distance matrix is entered instead, it must consist of squared Euclidean distances; otherwise the test results are invalid.
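As a minimal sketch of preparing the md argument: base R's dist() returns plain (unsquared) Euclidean distances, so the result has to be squared elementwise before it is passed on. The simulated data and the package name uclust in the guarded call are assumptions made for illustration.

```r
set.seed(1)
x <- matrix(rnorm(50 * 200), nrow = 50)  # 50 observations, 200 variables

# dist() gives Euclidean distances between rows; square them elementwise
md <- as.matrix(dist(x))^2

# Sanity check for one pair of observations
stopifnot(isTRUE(all.equal(md[1, 2], sum((x[1, ] - x[2, ])^2))))

# Only run the clustering if the package is available
# (assumes it is installed under the name "uclust")
if (requireNamespace("uclust", quietly = TRUE)) {
  res <- uclust::uhclust(md = md, plot = FALSE)
}
```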

The variance of Bn is estimated through resampling, so p-values may vary slightly between runs.
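The run-to-run variation of a resampling-based variance estimate, and how a fixed seed removes it, can be seen in a generic base R illustration. The bootstrap below is not the resampling scheme used by the package; boot_var is a hypothetical helper defined only for this example.

```r
set.seed(42)
y <- rnorm(30)

# Generic bootstrap estimate of the variance of the sample mean
# (illustration only; not the estimator used for Bn)
boot_var <- function(y, B = 200) {
  means <- replicate(B, mean(sample(y, replace = TRUE)))
  var(means)
}

v1 <- boot_var(y)
v2 <- boot_var(y)   # a second run gives a slightly different estimate

set.seed(123); v3 <- boot_var(y)
set.seed(123); v4 <- boot_var(y)   # same seed, identical estimate
```

Assuming uhclust draws from R's standard random number generator, calling set.seed() immediately before uhclust should likewise make its p-values reproducible.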

For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." arXiv preprint arXiv:1805.12179 (2018).

See also is_homo, uclust and Utest_class.

Value

Returns an object of class hclust with three additional attribute arrays:

Pvalues

P-values from uclust for the final data partition at each node of the dendrogram. This array is in the same order as height, and only contains values for tests that were performed.

alpha

Bonferroni-corrected significance levels used by uclust for the data partition at each node of the dendrogram. This array is in the same order as height, and only contains values for tests that were performed.

groups

Final group assignments.

Examples

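x <- matrix(rnorm(100000), nrow = 50)  # homogeneous Gaussian dataset: 50 observations, 2000 variables
res <- uhclust(data = x)

x[1:30, ] <- x[1:30, ] + 0.7  # heterogeneous dataset: shift the first 30 observations
x[1:10, ] <- x[1:10, ] + 0.4  # shift the first 10 further, giving three mean levels
res <- uhclust(data = x)
res$groups  # final group assignments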

gcybis/Uclust documentation built on May 8, 2019, 1:20 p.m.