is_homo: U-statistic based homogeneity test


Description

Homogeneity test based on the statistic Bn. The test assesses whether there exists a data partition for which group separation is statistically significant according to the U-test. The null hypothesis is overall sample homogeneity, and a sample is considered homogeneous if it cannot be divided into two statistically significant subgroups.

Usage

is_homo(md = NULL, data = NULL, rep = 10)

Arguments

md

Matrix of squared Euclidean distances between all data points.

data

Data matrix. Each row represents an observation.

rep

Number of times to repeat the optimization procedure. Important for problems with multiple optima.

Details

This is the homogeneity test of Cybis et al. (2017) extended to account for groups of size 1. The test is performed in two steps: an optimization procedure that finds the data partition maximizing the standardized Bn, and a test for the resulting maximal partition. It is intended for high-dimension, small-sample-size settings.

Either data or md should be provided. If data are entered directly, Bn is computed using the squared Euclidean distance. If a distance matrix is entered instead, it must consist of squared Euclidean distances; otherwise the test results are invalid.
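As a minimal sketch of the md requirement above, the base R code below builds a squared Euclidean distance matrix and checks it against the definition. The small data set is made up for illustration; only dist() and as.matrix() from base R are used, and the checks are suggestions, not something is_homo is known to perform itself.

```r
# Small illustrative data set: 6 observations (rows), 4 variables (columns).
set.seed(1)
x <- matrix(rnorm(24), nrow = 6)

# dist() returns plain Euclidean distances; squaring them gives the
# squared Euclidean distances that the md argument expects.
md <- as.matrix(dist(x))^2

# Sanity check against the definition for one pair of observations.
d12 <- sum((x[1, ] - x[2, ])^2)
stopifnot(isTRUE(all.equal(md[1, 2], d12)))

# A valid md is square, symmetric, and has a zero diagonal.
stopifnot(nrow(md) == ncol(md), isSymmetric(md), all(diag(md) == 0))
```

Passing as.matrix(dist(x)) without squaring would silently supply the wrong kind of distance, so a check like the one above can be worth running before calling the test.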

The variance of Bn is estimated through resampling, so p-values may vary slightly between runs.

For more details, see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018), and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." arXiv preprint arXiv:1805.12179 (2018).

Value

Returns a list with the following elements:

minFobj

Test statistic. Minimum of the objective function for optimization (-stdBn).

group1

Elements in group 1 in the maximal partition. Note that this is not the best partition for the data; see uclust.

group2

Elements in group 2 in the maximal partition.

p.MaxTest

P-value for the homogeneity test.

Rep.Fobj

Minimum values of the objective function found in each of the rep optimization runs.

bootB

Resampling variance estimate for partitions with groups of size n/2 (or (n-1)/2 and (n+1)/2 if n is odd).

bootB1

Resampling variance estimate for partitions with one group of size 1.
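The elements listed above can be read from the returned list by name. The sketch below uses a hypothetical helper, summarize_homo, which is not part of the package, and a hand-made list with made-up values standing in for real is_homo output. The 0.05 significance level is an assumption chosen for illustration.

```r
# Hypothetical helper (not part of the package): condense an is_homo()
# result into a decision, the test statistic, and the two group sizes.
summarize_homo <- function(res, alpha = 0.05) {
  list(
    # Fail to reject homogeneity when the p-value is at least alpha.
    homogeneous = res$p.MaxTest >= alpha,
    statistic   = res$minFobj,
    sizes       = c(group1 = length(res$group1),
                    group2 = length(res$group2))
  )
}

# Made-up values standing in for real output, for illustration only.
fake_res <- list(minFobj = -4.2, group1 = 1:30, group2 = 31:50,
                 p.MaxTest = 0.001)

s <- summarize_homo(fake_res)
s$homogeneous  # FALSE: homogeneity is rejected at the assumed 5% level
s$sizes        # sizes of the two groups in the maximal partition
```

Because group1 and group2 describe the maximal partition used by the test rather than the best clustering, the sizes are mainly useful as a diagnostic.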

Examples

x = matrix(rnorm(500000),nrow=50)  #creating homogeneous Gaussian dataset
res = is_homo(data=x)

x[1:30,] = x[1:30,]+0.15   #Heterogeneous dataset (first 30 samples have different mean)
res = is_homo(data=x)

md = as.matrix(dist(x)^2)   #squared Euclidean distances for the same data
res = is_homo(md)

# Multidimensional scaling plot of distance matrix
fit <- cmdscale(md, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x,y, main=paste("Homogeneity test: p-value =",res$p.MaxTest))

gcybis/Uclust documentation built on May 8, 2019, 1:20 p.m.