mvbinary.test: Test two multivariate binary datasets

View source: R/binary.R

mvbinary.test    R Documentation

Test two multivariate binary datasets

Description

Performs a two-sample test on two multivariate binary datasets, testing H_0: the underlying probability vectors are the same vs. H_1: they are different.

Usage

mvbinary.test(x, y = NULL, numPerms = 5000)

Arguments

x, y

Matrices (or data frames) containing multiple integer vector observations as rows. x and y must be the same type and dimension. Alternatively, x can be a list of two matrices (or data frames) to be compared, in which case y should be left as NULL (the default).

numPerms

Number of permutations to use to calculate the p-value. Default value is 5000.

Details

The statistic is T = \sum_{j=1}^{d} D_j^2 I(|D_j| \ge \delta(d)), where d is the dimension of the data and I(\cdot) is the indicator function. Additionally:

  • D_j = (\hat{p}_{1j} - \hat{p}_{2j}) / \sqrt{\hat{p}_j (1 - \hat{p}_j)(1/n_1 + 1/n_2)}, where n_1 and n_2 are the numbers of observations (rows) in the two groups.

  • \hat{p}_{cj} is the estimate of p_{cj} for the c^{th} group, calculated as the j^{th} column mean.

  • \hat{p}_j is the pooled estimate for the j^{th} variable.

  • \delta(d) = \sqrt{2 \log(a_d d)}, where a_d = (\log d)^{-2}.
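The definitions above can be sketched directly in R. This is a minimal illustration under stated assumptions, not the package's implementation: the name mvbinary_stat_sketch is hypothetical, the pooled estimate is taken to be the column mean over both groups combined, and columns with no variation (where D_j would be 0/0) are assumed to contribute nothing.

```r
# Hypothetical helper (not part of the package): computes T from the
# formulas in Details for two binary matrices with the same columns.
mvbinary_stat_sketch <- function(x, y) {
  x <- as.matrix(x) * 1  # coerce logical entries to 0/1
  y <- as.matrix(y) * 1
  stopifnot(ncol(x) == ncol(y))
  n1 <- nrow(x)
  n2 <- nrow(y)
  d <- ncol(x)

  p1 <- colMeans(x)                                  # \hat{p}_{1j}
  p2 <- colMeans(y)                                  # \hat{p}_{2j}
  p_pooled <- (colSums(x) + colSums(y)) / (n1 + n2)  # \hat{p}_j (assumed form)

  se <- sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
  D <- (p1 - p2) / se
  D[se == 0] <- 0  # assumption: constant columns contribute nothing

  a_d <- log(d)^(-2)
  delta <- sqrt(2 * log(a_d * d))  # threshold \delta(d)

  # Sum the squared D_j that reach the threshold
  sum(D^2 * (abs(D) >= delta))
}

# Toy data: column 1 differs completely between groups, column 2 is identical
x_toy <- cbind(c(1, 1, 1, 1), c(1, 0, 1, 0))
y_toy <- cbind(c(0, 0, 0, 0), c(1, 0, 1, 0))
t_toy <- mvbinary_stat_sketch(x_toy, y_toy)
print(t_toy)  # approximately 8: only column 1 passes the threshold
```

In the toy data, D_1 = 1 / \sqrt{0.125} and D_2 = 0, so only the first column exceeds \delta(2) and T = D_1^2 = 8.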

The p-value associated with the statistic is calculated using the permutation method. The observation vectors are repeatedly shuffled between the two groups, and the statistic is recalculated after each shuffle. The resulting values form a null distribution, which is used to calculate the p-value.
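The permutation procedure can be sketched generically as follows. The names perm_pvalue_sketch and toy_stat are hypothetical, the toy statistic is a stand-in rather than the statistic T, and treating "more extreme" as "greater than or equal to the observed value" is an assumption about the tail convention.

```r
# Hypothetical helper (not part of the package): permutation p-value for any
# two-sample statistic stat_fn(x, y) where larger values are more extreme.
perm_pvalue_sketch <- function(x, y, stat_fn, num_perms = 1000) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  pooled <- rbind(x, y)
  n1 <- nrow(x)
  n <- nrow(pooled)

  observed <- stat_fn(x, y)

  # Shuffle rows between groups and recompute the statistic each time
  null_stats <- replicate(num_perms, {
    idx <- sample.int(n, n1)
    stat_fn(pooled[idx, , drop = FALSE], pooled[-idx, , drop = FALSE])
  })

  list(statistic = observed,
       null.statistics = null_stats,
       # assumption: ">=" counts ties as at least as extreme
       pvalue = mean(null_stats >= observed))
}

# Stand-in statistic for illustration only: squared distance of column means
toy_stat <- function(a, b) sum((colMeans(a) - colMeans(b))^2)

set.seed(42)
x_sim <- matrix(rbinom(30 * 5, 1, 0.5), nrow = 30)
y_sim <- matrix(rbinom(30 * 5, 1, 0.5), nrow = 30)
res <- perm_pvalue_sketch(x_sim, y_sim, toy_stat, num_perms = 200)
res$pvalue
```

Passing the statistic as a function keeps the shuffling logic separate from the choice of statistic, so the same sketch applies to any two-sample statistic with a one-sided rejection region.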

Value

A list containing the computed statistic, a list of statistics (null.statistics) used to construct the null distribution (from the permutation method), and the associated pvalue. The pvalue is the proportion of null.statistics that are more extreme than the statistic computed from the original dataset.

Warning

As described in the reference below, this method may not perform well (low power) on highly correlated variables.

Also, note that for large values of numPerms, run time may be long. However, larger values of numPerms produce more accurate estimates of the p-value.
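To give a rough sense of the trade-off, the sketch below uses the standard binomial approximation for the Monte Carlo standard error of an estimated p-value, sqrt(p (1 - p) / numPerms). The helper name pvalue_mc_se is hypothetical, and the approximation assumes the permutations are drawn independently at random.

```r
# Hypothetical helper: approximate Monte Carlo standard error of a
# permutation p-value estimate when the true p-value is p.
pvalue_mc_se <- function(p, num_perms) sqrt(p * (1 - p) / num_perms)

# Standard error near p = 0.05 for several permutation counts
num_perms <- c(100, 1000, 5000, 20000)
se_table <- data.frame(numPerms = num_perms,
                       se = pvalue_mc_se(0.05, num_perms))
print(se_table)
```

Under this approximation, the default of 5000 permutations gives a standard error of about 0.003 near p = 0.05, and halving the error requires four times as many permutations.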

See Also

Amanda Plunkett & Junyong Park (2017), Two-sample Tests for Sparse High-Dimensional Binary Data, Communications in Statistics - Theory and Methods, 46:22, 11181-11193

Examples

# Binarize the twoNewsGroups dataset:
data(twoNewsGroups)
binData <- list(twoNewsGroups[[1]] > 0, twoNewsGroups[[2]] > 0)
names(binData) <- names(twoNewsGroups)

# Perform the test:
result <- mvbinary.test(binData, numPerms = 100)
result$pvalue

# The following are equivalent to the previous test
# (the native pipe |> requires R >= 4.1):
result <- mvbinary.test(binData[[1]], binData[[2]], numPerms = 100)
result <- binData |> mvbinary.test(numPerms = 100)


AmandaRP/hddtest documentation built on March 18, 2023, 5:53 p.m.