kb.test: Kernel-based quadratic distance (KBQD) Goodness-of-Fit tests

kb.test {QuadratiK}    R Documentation

Kernel-based quadratic distance (KBQD) Goodness-of-Fit tests

Description

This function performs the kernel-based quadratic distance goodness-of-fit tests. It includes tests for multivariate normality, two-sample tests and k-sample tests.

Usage

kb.test(
  x,
  y = NULL,
  h = NULL,
  method = "subsampling",
  B = 150,
  b = NULL,
  Quantile = 0.95,
  mu_hat = NULL,
  Sigma_hat = NULL,
  centeringType = "Nonparam",
  K_threshold = 10,
  alternative = "skewness"
)

## S4 method for signature 'ANY'
kb.test(
  x,
  y = NULL,
  h = NULL,
  method = "subsampling",
  B = 150,
  b = 0.9,
  Quantile = 0.95,
  mu_hat = NULL,
  Sigma_hat = NULL,
  centeringType = "Nonparam",
  K_threshold = 10,
  alternative = "skewness"
)

## S4 method for signature 'kb.test'
show(object)

Arguments

x

Numeric matrix or vector of data values.

y

Numeric matrix or vector of data values. Depending on the input y, the corresponding test is performed:

  • if y = NULL, the function performs the test for normality on x;

  • if y is a data matrix with the same number of columns as x, the function performs the two-sample test between x and y;

  • if y is a numeric or factor vector indicating the group memberships for each observation, the function performs the k-sample test.

h

Bandwidth for the kernel function. If a value is not provided, the algorithm for the selection of an optimal h is performed automatically. See the function select_h for more details.

method

The method used for critical value estimation: "subsampling", "bootstrap", or "permutation" (default: "subsampling").

B

The number of iterations to use for critical value estimation (default: 150).

b

The size of the subsamples used in the subsampling algorithm (default: 0.9).

Quantile

The quantile to use for critical value estimation (default: 0.95).

mu_hat

Mean vector for the reference distribution.

Sigma_hat

Covariance matrix of the reference distribution.

centeringType

String indicating the method used for centering the normal kernel ('Param' or 'Nonparam').

K_threshold

Maximum number of groups allowed (default: 10). This is a control parameter; increase it when there are more than 10 samples.

alternative

Family of alternatives used for selecting h, one of "location", "scale" or "skewness" (used only if h is not provided).

object

Object of class kb.test.

Details

The function kb.test performs the kernel-based quadratic distance tests using the Gaussian kernel with bandwidth parameter h. Depending on the input y, the function performs the test of multivariate normality, the non-parametric two-sample test or the k-sample test.

The quadratic distance between two probability distributions F and G is defined as

d_{K}(F,G)=\iint K(x,y)d(F-G)(x)d(F-G)(y),

where G is a distribution whose goodness of fit we wish to assess and K denotes the Normal kernel defined as

K_{{h}}(\mathbf{s}, \mathbf{t}) = (2 \pi)^{-d/2} \left(\det{\mathbf{\Sigma}_h}\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(\mathbf{s} - \mathbf{t})^\top \mathbf{\Sigma}_h^{-1}(\mathbf{s} - \mathbf{t})\right\},

for every \mathbf{s}, \mathbf{t} \in \mathbb{R}^d \times \mathbb{R}^d, with covariance matrix \mathbf{\Sigma}_h=h^2 I and tuning parameter h.
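With \mathbf{\Sigma}_h = h^2 I, the kernel above reduces to a product of univariate Gaussian densities. A minimal R sketch of this formula (illustrative only, not the package's internal implementation):

```r
# Illustrative evaluation of the Normal kernel K_h with Sigma_h = h^2 * I;
# det(Sigma_h)^(-1/2) simplifies to h^(-d)
norm_kernel <- function(s, t, h) {
  d <- length(s)
  (2 * pi)^(-d / 2) * h^(-d) * exp(-sum((s - t)^2) / (2 * h^2))
}
norm_kernel(c(0, 0), c(0, 0), 1)  # equals (2 * pi)^(-1) at s = t
```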

  • Test for Normality:
    Let x_1, x_2, ..., x_n be a random sample with empirical distribution function \hat F. We test the null hypothesis of normality, i.e. H_0:F=G=\mathcal{N}_d(\mu, \Sigma).

    We consider the U-statistic estimate of the sample KBQD

    U_{n}=\frac{1}{n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1} K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),

    then the first test statistics is

    T_{n}=\frac{U_{n}}{\sqrt{Var(U_{n})}},

    with Var(U_n) computed exactly following Lindsay et al.(2014), and the V-statistic estimate

    V_{n} = \frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{n}K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),

    where K_{cen} denotes the Normal kernel K_h with parametric centering with respect to the considered normal distribution G = \mathcal{N}_d(\mu, \Sigma).

    The asymptotic distribution of the V-statistic is an infinite combination of weighted independent chi-squared random variables with one degree of freedom. The cutoff value is obtained using the Satterthwaite approximation c \cdot \chi_{DOF}^2, where c and DOF are computed exactly following the formulas in Lindsay et al.(2014).

    For the U-statistic the cutoff is determined empirically:

    • Generate data from the considered normal distribution;

    • Compute the test statistic for B Monte Carlo (MC) replications;

    • Compute the 95th quantile of the empirical distribution of the test statistic.
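    The three steps above amount to simulating the statistic under the null and taking an empirical quantile. A generic R sketch, where Tstat is a placeholder statistic rather than the package's U-statistic:

```r
# Empirical Monte Carlo cutoff for a generic statistic under H0
# (Tstat is an illustrative stand-in, not the package's Tn)
set.seed(123)
B <- 500
Tstat <- function(x) abs(mean(x)) * sqrt(length(x))  # placeholder statistic
mc_values <- replicate(B, Tstat(rnorm(50)))          # B MC replications under H0
cutoff <- quantile(mc_values, 0.95)                  # 95th quantile as cutoff
```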

  • k-sample test:
    Consider k random samples of i.i.d. observations \mathbf{x}^{(i)}_1, \mathbf{x}^{(i)}_{2},\ldots, \mathbf{x}^{(i)}_{n_i} \sim F_i, i = 1, \ldots, k. We test if the samples are generated from the same unknown distribution, that is H_0: F_1 = F_2 = \ldots = F_k versus H_1: F_i \not = F_j, for some 1 \le i \not = j \le k.
    We construct a matrix distance \hat{\mathbf{D}}, with off-diagonal elements

    \hat{D}_{ij} = \frac{1}{n_i n_j} \sum_{\ell=1}^{n_i} \sum_{r=1}^{n_j}K_{\bar{F}}(\mathbf{x}^{(i)}_\ell,\mathbf{x}^{(j)}_r), \qquad \mbox{ for }i \not= j

    and in the diagonal

    \hat{D}_{ii} = \frac{1}{n_i (n_i -1)} \sum_{\ell=1}^{n_i} \sum_{r\not= \ell}^{n_i} K_{\bar{F}}(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(i)}_r), \qquad \mbox{ for }i = j,

    where K_{\bar{F}} denotes the Normal kernel K_h centered non-parametrically with respect to

    \bar{F} = \frac{n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k}{n}, \quad \mbox{ with } n=\sum_{i=1}^k n_i.

    We compute the trace statistic

    \mathrm{trace}(\hat{\mathbf{D}}_n) = \sum_{i=1}^{k}\hat{D}_{ii}

    and the statistic D_n, derived by considering all possible pairwise comparisons under the k-sample null hypothesis, given as

    D_n = (k-1) \mathrm{trace}(\hat{\mathbf{D}}_n) - 2 \sum_{i=1}^{k}\sum_{j> i}^{k}\hat{D}_{ij}.

    We compute the empirical critical value by employing numerical techniques such as the bootstrap, permutation and subsampling algorithms:

    • Generate k-tuples of total size n_B from the pooled sample, following one of the sampling methods;

    • Compute the k-sample test statistic;

    • Repeat B times;

    • Select the 95^{th} quantile of the obtained values.

  • Two-sample test:
    Let x_1, x_2, ..., x_{n_1} \sim F and y_1, y_2, ..., y_{n_2} \sim G be random samples from the distributions F and G, respectively. We test the null hypothesis that the two samples are generated from the same unknown distribution, that is H_0: F=G vs H_1:F\not=G. The test statistics coincide with the k-sample test statistics when k=2.

Kernel centering

The arguments mu_hat and Sigma_hat indicate the normal model considered for the normality test, that is H_0: F = N(mu_hat, Sigma_hat). For the two-sample and k-sample tests, mu_hat and Sigma_hat can be used for the parametric centering of the kernel, with centeringType = "Param", in case we want to specify the reference distribution. Parametric centering is the default when the test for normality is performed. The normal kernel centered with respect to G \sim N_d(\mathbf{\mu}, \mathbf{V}) can be computed as

K_{cen(G)}(\mathbf{s}, \mathbf{t}) = K_{\mathbf{\Sigma_h}}(\mathbf{s}, \mathbf{t}) - K_{\mathbf{\Sigma_h} + \mathbf{V}}(\mathbf{\mu}, \mathbf{t}) - K_{\mathbf{\Sigma_h} + \mathbf{V}}(\mathbf{s}, \mathbf{\mu}) + K_{\mathbf{\Sigma_h} + 2\mathbf{V}}(\mathbf{\mu}, \mathbf{\mu}).
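In the univariate case each K term above is a normal density, so the parametrically centered kernel can be evaluated directly. A sketch under that assumption (function names are illustrative, not the package's API):

```r
# Univariate sketch of the parametrically centered kernel for G = N(mu, v);
# each K_{Sigma} term is a normal density with the stated variance
k_var <- function(s, t, var) dnorm(s, mean = t, sd = sqrt(var))
k_cen_param <- function(s, t, h, mu, v) {
  k_var(s, t, h^2) - k_var(mu, t, h^2 + v) -
    k_var(s, mu, h^2 + v) + k_var(mu, mu, h^2 + 2 * v)
}
k_cen_param(0.3, -0.2, h = 0.5, mu = 0, v = 1)  # symmetric in s and t
```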

We consider the non-parametric centering of the kernel with respect to \bar{F}=(n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k)/n, where n=\sum_{i=1}^k n_i, with centeringType = "Nonparam", for the two- and k-sample tests. Let \mathbf{z}_1,\ldots, \mathbf{z}_n denote the pooled sample. For any \mathbf{s},\mathbf{t} \in \{\mathbf{z}_1,\ldots, \mathbf{z}_n\}, it is given by

K_{cen(\bar{F})}(\mathbf{s},\mathbf{t}) = K(\mathbf{s},\mathbf{t}) - \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{s},\mathbf{z}_i) - \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{z}_i,\mathbf{t}) + \frac{1}{n(n-1)}\sum_{i=1}^{n} \sum_{j \not=i}^{n} K(\mathbf{z}_i,\mathbf{z}_j).
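The non-parametric centering can likewise be sketched in R for a univariate pooled sample z; this is illustrative only, since kb.test performs the centering internally on the pooled data:

```r
# Univariate sketch of the non-parametric centering over a pooled sample z
k_cen_pooled <- function(s, t, z, h) {
  K <- function(a, b) dnorm(a, mean = b, sd = h)  # Gaussian kernel, bandwidth h
  n <- length(z)
  Kzz <- outer(z, z, K)
  diag(Kzz) <- 0                                  # exclude i = j terms
  K(s, t) - mean(K(s, z)) - mean(K(z, t)) + sum(Kzz) / (n * (n - 1))
}
```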

Value

An S4 object of class kb.test containing the results of the kernel-based quadratic distance tests, based on the normal kernel. The object contains the following slots:

  • method: Description of the kernel-based quadratic distance test performed.

  • x: Data list of samples X (and Y).

  • Un: The value of the U-statistic.

  • H0_Un: A logical value indicating whether or not the null hypothesis is rejected according to Un.

  • CV_Un: The critical value computed for the test Un.

  • Vn: The value of the V-statistic (if available).

  • H0_Vn: A logical value indicating whether or not the null hypothesis is rejected according to Vn (if available).

  • CV_Vn: The critical value computed for the test Vn (if available).

  • h: List with the value of the bandwidth parameter used for the normal kernel function. If select_h is used, the matrix of computed power values and the corresponding power plot are also provided.

  • B: Number of bootstrap/permutation/subsampling replications.

  • var_Un: Exact variance of the kernel-based U-statistic.

  • cv_method: The method used to estimate the critical value (one of "subsampling", "permutation" or "bootstrap").

Note

For the two- and k-sample tests, the slots Vn, H0_Vn and CV_Vn are empty, while both computed statistics (the trace statistic and D_n) are reported in the slots Un, H0_Un and CV_Un.

A U-statistic is a type of statistic used to estimate a population parameter. It is based on averaging over all possible distinct combinations of a fixed size from a sample. A V-statistic considers all possible tuples of a certain size, not just distinct combinations, and can be used in contexts where unbiasedness is not required.
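To make the distinction concrete, a toy R sketch on a symmetric matrix with placeholder entries: the U-type average uses only the off-diagonal (distinct-pair) entries, while the V-statistic as defined in the Details section sums all entries scaled by 1/n:

```r
# Toy illustration of U- vs V-type averaging (placeholder matrix, not the
# centered kernel matrix that kb.test builds internally)
n <- 4
Kc <- matrix(0.1, n, n)
diag(Kc) <- 1
Un <- (sum(Kc) - sum(diag(Kc))) / (n * (n - 1))  # distinct pairs only
Vn <- sum(Kc) / n                                # all pairs, including i = j
c(Un = Un, Vn = Vn)
```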

References

Markatou, M. and Saraceno, G. (2024). "A Unified Framework for Multivariate Two- and k-Sample Kernel-based Quadratic Distance Goodness-of-Fit Tests." https://doi.org/10.48550/arXiv.2407.16374

Lindsay, B.G., Markatou, M. and Ray, S. (2014) "Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests", Journal of the American Statistical Association, 109:505, 395-410, DOI: 10.1080/01621459.2013.836972

See Also

kb.test for the class definition.

Examples

# Generate sample data
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100), ncol = 2)

# Normality test
my_test <- kb.test(x, h = 0.5)
my_test

# Two-sample test
my_test <- kb.test(x, y, h = 0.5, method = "subsampling", b = 0.9,
                   centeringType = "Nonparam")
my_test

# k-sample test
z <- matrix(rnorm(100, 2), ncol = 2)
dat <- rbind(x, y, z)
group <- rep(c(1, 2, 3), each = 50)
my_test <- kb.test(x = dat, y = group, h = 0.5, method = "subsampling", b = 0.9)
my_test


QuadratiK documentation built on Oct. 29, 2024, 5:08 p.m.