is_homo: U-statistic based homogeneity test
In gcybis/Uclust: Clustering and Classification Inference with U-Statistics


Homogeneity test based on the statistic Bn. The test assesses whether there exists a partition of the data into two groups whose separation is statistically significant according to the U-test. The null hypothesis is overall sample homogeneity: a sample is considered homogeneous if it cannot be divided into two statistically significant subgroups.

is_homo(md = NULL, data = NULL, rep = 10)

`md`	Matrix of squared Euclidean distances between all data points.
`data`	Data matrix. Each row represents an observation.
`rep`	Number of times to repeat the optimization procedure. Important for problems with multiple optima.

This is the homogeneity test of Cybis et al. (2017), extended to account for groups of size 1. The test is performed in two steps: an optimization procedure that finds the data partition maximizing the standardized Bn, followed by a test of the resulting maximal partition. It is intended for high-dimension, small-sample-size settings.

Either data or md should be provided. If data is supplied directly, Bn is computed using the squared Euclidean distance. If a distance matrix is supplied instead, it must consist of squared Euclidean distances; otherwise the test results are invalid.
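As a minimal sketch of how a matrix suitable for md might be built, the base R snippet below squares the output of dist(), which returns plain (unsquared) Euclidean distances, and cross-checks one entry against the definition. The toy data and the variable names are illustrative assumptions, not part of the package.

```r
# Toy data (assumption for illustration): 5 observations (rows) in 20 dimensions.
set.seed(1)
x <- matrix(rnorm(100), nrow = 5)

# dist() returns plain Euclidean distances, so square them elementwise.
md <- as.matrix(dist(x))^2

# Cross-check one entry against the definition:
# the sum of squared coordinate differences between rows 1 and 2.
manual_12 <- sum((x[1, ] - x[2, ])^2)
all.equal(md[1, 2], manual_12)

# Basic sanity checks for a distance matrix: symmetric with a zero diagonal.
isSymmetric(md)
all(diag(md) == 0)
```

A matrix built this way can then be passed as is_homo(md = md).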

The variance of Bn is estimated through resampling, so p-values may vary slightly between runs.

For more detail see Cybis, Gabriela B., Marcio Valk, and Sílvia RC Lopes. "Clustering and classification problems in genetics through U-statistics." Journal of Statistical Computation and Simulation 88.10 (2018) and Valk, Marcio, and Gabriela Bettella Cybis. "U-statistical inference for hierarchical clustering." arXiv preprint arXiv:1805.12179 (2018).

Returns a list with the following elements:

minFobj: Test statistic. Minimum of the objective function for optimization (-stdBn).
group1: Elements in group 1 in the maximal partition. (Note: this is not the best partition for the data; see uclust.)
group2: Elements in group 2 in the maximal partition.
p.MaxTest: P-value for the homogeneity test.
Rep.Fobj: Values for the minimum objective function on all rep optimization runs.
bootB: Resampling variance estimate for partitions with groups of size n/2 (or (n-1)/2 and (n+1)/2 if n is odd).
bootB1: Resampling variance estimate for partitions with one group of size 1.
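The returned list can be inspected with ordinary list access. The sketch below uses a hypothetical helper, report_homo (not part of Uclust), and a mock result with made-up values that only mimics the documented element names. The 0.05 significance level is likewise an assumption for illustration.

```r
# Hypothetical helper (not part of Uclust): summarize an is_homo() result.
# `alpha` is an assumed significance level, not a package default.
report_homo <- function(res, alpha = 0.05) {
  list(
    # Fail to reject homogeneity when the p-value is at least alpha.
    homogeneous = res$p.MaxTest >= alpha,
    # Test statistic: minimum of the objective function (-stdBn).
    statistic = res$minFobj,
    # Sizes of the two groups in the maximal partition.
    group_sizes = c(length(res$group1), length(res$group2))
  )
}

# Mock result with made-up values, only to show the expected shape.
mock <- list(minFobj = -4.2, group1 = 1:30, group2 = 31:50, p.MaxTest = 0.001)
out <- report_homo(mock)
out$homogeneous
out$group_sizes
```

With a real fit, the same helper could be called as report_homo(is_homo(data = x)).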

x = matrix(rnorm(500000),nrow=50)  #creating homogeneous Gaussian dataset
res = is_homo(data=x)

x[1:30,] = x[1:30,]+0.15   #Heterogeneous dataset (first 30 samples have different mean)
res = is_homo(data=x)

md = as.matrix(dist(x)^2)   #squared Euclidean distances for the same data
res = is_homo(md)

# Multidimensional scaling plot of distance matrix
fit <- cmdscale(md, eig = TRUE, k = 2)
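# Use separate names for the plot coordinates so the data matrix x is not overwritten
mds_x <- fit$points[, 1]
mds_y <- fit$points[, 2]
plot(mds_x, mds_y, main=paste("Homogeneity test: p-value =", res$p.MaxTest))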