kb.test: Kernel-based quadratic distance (KBQD) Goodness-of-Fit tests

kb.test {QuadratiK}    R Documentation

Kernel-based quadratic distance (KBQD) Goodness-of-Fit tests

Description

This function performs the kernel-based quadratic distance goodness-of-fit tests. It includes tests for multivariate normality, two-sample tests and k-sample tests.

Usage

kb.test(
  x,
  y = NULL,
  h = NULL,
  method = "subsampling",
  B = 150,
  b = NULL,
  Quantile = 0.95,
  mu_hat = NULL,
  Sigma_hat = NULL,
  centeringType = "Nonparam",
  K_threshold = 10,
  alternative = "skewness"
)

## S4 method for signature 'ANY'
kb.test(
  x,
  y = NULL,
  h = NULL,
  method = "subsampling",
  B = 150,
  b = 0.9,
  Quantile = 0.95,
  mu_hat = NULL,
  Sigma_hat = NULL,
  centeringType = "Nonparam",
  K_threshold = 10,
  alternative = "skewness"
)

## S4 method for signature 'kb.test'
show(object)

Arguments

x

Numeric matrix or vector of data values.

y

Numeric matrix or vector of data values. Depending on the input y, the corresponding test is performed:

  • if y = NULL, the function performs the test for normality on x;

  • if y is a data matrix with the same number of columns as x, the function performs the two-sample test between x and y;

  • if y is a numeric or factor vector indicating the group memberships for each observation, the function performs the k-sample test.

h

Bandwidth for the kernel function. If a value is not provided, the algorithm for the selection of an optimal h is performed automatically. See the function select_h for more details.

method

The method used for critical value estimation: "subsampling", "bootstrap", or "permutation" (default: "subsampling").

B

The number of iterations to use for critical value estimation (default: 150).

b

The size of the subsamples used in the subsampling algorithm (default: 0.9).

Quantile

The quantile to use for critical value estimation (default: 0.95).

mu_hat

Mean vector for the reference distribution.

Sigma_hat

Covariance matrix of the reference distribution.

centeringType

String indicating the method used for centering the normal kernel ('Param' or 'Nonparam').

K_threshold

Maximum number of groups allowed (default: 10). This is a control parameter; increase it when there are more than 10 samples.

alternative

Family of alternatives used for selecting h, one of "location", "scale" or "skewness" (used only if h is not provided).

object

Object of class kb.test.

Details

The function kb.test performs the kernel-based quadratic distance tests using the Gaussian kernel with bandwidth parameter h. Depending on the input y, the function performs the test of multivariate normality, the non-parametric two-sample test or the k-sample test.

The quadratic distance between two probability distributions F and G is defined as

d_{K}(F,G)=\iint K(x,y)d(F-G)(x)d(F-G)(y),

where G is a distribution whose goodness of fit we wish to assess and K denotes the Normal kernel defined as

K_{{h}}(\mathbf{s}, \mathbf{t}) = (2 \pi)^{-d/2} \left(\det{\mathbf{\Sigma}_h}\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(\mathbf{s} - \mathbf{t})^\top \mathbf{\Sigma}_h^{-1}(\mathbf{s} - \mathbf{t})\right\},

for every \mathbf{s}, \mathbf{t} \in \mathbb{R}^d \times \mathbb{R}^d, with covariance matrix \mathbf{\Sigma}_h=h^2 I and tuning parameter h.
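With \mathbf{\Sigma}_h = h^2 I, the kernel above reduces to a product of univariate Gaussian densities. A minimal R sketch of this formula (illustrative only, not the package's internal implementation):

```r
# Illustrative evaluation of the Normal kernel K_h with Sigma_h = h^2 * I;
# det(Sigma_h)^(-1/2) simplifies to h^(-d)
norm_kernel <- function(s, t, h) {
  d <- length(s)
  (2 * pi)^(-d / 2) * h^(-d) * exp(-sum((s - t)^2) / (2 * h^2))
}
norm_kernel(c(0, 0), c(0, 0), 1)  # equals (2 * pi)^(-1) at s = t
```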

  • Test for Normality:
    Let x_1, x_2, ..., x_n be a random sample with empirical distribution function \hat F. We test the null hypothesis of normality, i.e. H_0:F=G=\mathcal{N}_d(\mu, \Sigma).

    We consider the U-statistic estimate of the sample KBQD

    U_{n}=\frac{1}{n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1} K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),

    then the first test statistics is

    T_{n}=\frac{U_{n}}{\sqrt{Var(U_{n})}},

    with Var(U_n) computed exactly following Lindsay et al.(2014), and the V-statistic estimate

    V_{n} = \frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{n}K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),

    where K_{cen} denotes the Normal kernel K_h with parametric centering with respect to the considered normal distribution G = \mathcal{N}_d(\mu, \Sigma).

    The asymptotic distribution of the V-statistic is an infinite combination of weighted independent chi-squared random variables with one degree of freedom. The cutoff value is obtained using the Satterthwaite approximation c \cdot \chi_{DOF}^2, where c and DOF are computed exactly following the formulas in Lindsay et al.(2014).

    For the U-statistic the cutoff is determined empirically:

    • Generate data from the considered normal distribution;

    • Compute the test statistic for B Monte Carlo (MC) replications;

    • Compute the 95th quantile of the empirical distribution of the test statistic.
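    The three steps above amount to simulating the statistic under the null and taking an empirical quantile. A generic R sketch, where Tstat is a placeholder statistic rather than the package's U-statistic:

```r
# Empirical Monte Carlo cutoff for a generic statistic under H0
# (Tstat is an illustrative stand-in, not the package's Tn)
set.seed(123)
B <- 500
Tstat <- function(x) abs(mean(x)) * sqrt(length(x))  # placeholder statistic
mc_values <- replicate(B, Tstat(rnorm(50)))          # B MC replications under H0
cutoff <- quantile(mc_values, 0.95)                  # 95th quantile as cutoff
```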

  • k-sample test:
    Consider k random samples of i.i.d. observations \mathbf{x}^{(i)}_1, \mathbf{x}^{(i)}_{2},\ldots, \mathbf{x}^{(i)}_{n_i} \sim F_i, i = 1, \ldots, k. We test if the samples are generated from the same unknown distribution, that is H_0: F_1 = F_2 = \ldots = F_k versus H_1: F_i \not = F_j, for some 1 \le i \not = j \le k.
    We construct a matrix distance \hat{\mathbf{D}}, with off-diagonal elements

    \hat{D}_{ij} = \frac{1}{n_i n_j} \sum_{\ell=1}^{n_i} \sum_{r=1}^{n_j}K_{\bar{F}}(\mathbf{x}^{(i)}_\ell,\mathbf{x}^{(j)}_r), \qquad \mbox{ for }i \not= j

    and in the diagonal

    \hat{D}_{ii} = \frac{1}{n_i (n_i -1)} \sum_{\ell=1}^{n_i} \sum_{r\not= \ell}^{n_i} K_{\bar{F}}(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(i)}_r), \qquad \mbox{ for }i = j,

    where K_{\bar{F}} denotes the Normal kernel K_h centered non-parametrically with respect to

    \bar{F} = \frac{n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k}{n}, \quad \mbox{ with } n=\sum_{i=1}^k n_i.

    We compute the trace statistic

    \mathrm{trace}(\hat{\mathbf{D}}_n) = \sum_{i=1}^{k}\hat{D}_{ii}

    and the statistic D_n, derived by considering all possible pairwise comparisons under the k-sample null hypothesis, given as

    D_n = (k-1) \mathrm{trace}(\hat{\mathbf{D}}_n) - 2 \sum_{i=1}^{k}\sum_{j> i}^{k}\hat{D}_{ij}.

    We compute the empirical critical value by employing numerical techniques such as the bootstrap, permutation and subsampling algorithms:

    • Generate k-tuples of total size n_B from the pooled sample, following one of the sampling methods;

    • Compute the k-sample test statistic;

    • Repeat B times;

    • Select the 95^{th} quantile of the obtained values.

  • Two-sample test:
    Let x_1, x_2, ..., x_{n_1} \sim F and y_1, y_2, ..., y_{n_2} \sim G be random samples from the distributions F and G, respectively. We test the null hypothesis that the two samples are generated from the same unknown distribution, that is H_0: F=G vs H_1:F\not=G. The test statistics coincide with the k-sample test statistics when k=2.

Kernel centering

The arguments mu_hat and Sigma_hat indicate the normal model considered for the normality test, that is H_0: F = N(mu_hat, Sigma_hat). For the two-sample and k-sample tests, mu_hat and Sigma_hat can be used for the parametric centering of the kernel, with centeringType = "Param", in case we want to specify the reference distribution. Parametric centering is the default when the test for normality is performed. The normal kernel centered with respect to G \sim N_d(\mathbf{\mu}, \mathbf{V}) can be computed as

K_{cen(G)}(\mathbf{s}, \mathbf{t}) = K_{\mathbf{\Sigma_h}}(\mathbf{s}, \mathbf{t}) - K_{\mathbf{\Sigma_h} + \mathbf{V}}(\mathbf{\mu}, \mathbf{t}) - K_{\mathbf{\Sigma_h} + \mathbf{V}}(\mathbf{s}, \mathbf{\mu}) + K_{\mathbf{\Sigma_h} + 2\mathbf{V}}(\mathbf{\mu}, \mathbf{\mu}).
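In the univariate case each K term above is a normal density, so the parametrically centered kernel can be evaluated directly. A sketch under that assumption (function names are illustrative, not the package's API):

```r
# Univariate sketch of the parametrically centered kernel for G = N(mu, v);
# each K_{Sigma} term is a normal density with the stated variance
k_var <- function(s, t, var) dnorm(s, mean = t, sd = sqrt(var))
k_cen_param <- function(s, t, h, mu, v) {
  k_var(s, t, h^2) - k_var(mu, t, h^2 + v) -
    k_var(s, mu, h^2 + v) + k_var(mu, mu, h^2 + 2 * v)
}
k_cen_param(0.3, -0.2, h = 0.5, mu = 0, v = 1)  # symmetric in s and t
```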

We consider the non-parametric centering of the kernel with respect to \bar{F}=(n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k)/n, where n=\sum_{i=1}^k n_i, with centeringType = "Nonparam", for the two- and k-sample tests. Let \mathbf{z}_1,\ldots, \mathbf{z}_n denote the pooled sample. For any \mathbf{s},\mathbf{t} \in \{\mathbf{z}_1,\ldots, \mathbf{z}_n\}, it is given by

K_{cen(\bar{F})}(\mathbf{s},\mathbf{t}) = K(\mathbf{s},\mathbf{t}) - \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{s},\mathbf{z}_i) - \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{z}_i,\mathbf{t}) + \frac{1}{n(n-1)}\sum_{i=1}^{n} \sum_{j \not=i}^{n} K(\mathbf{z}_i,\mathbf{z}_j).
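The non-parametric centering can likewise be sketched in R for a univariate pooled sample z; this is illustrative only, since kb.test performs the centering internally on the pooled data:

```r
# Univariate sketch of the non-parametric centering over a pooled sample z
k_cen_pooled <- function(s, t, z, h) {
  K <- function(a, b) dnorm(a, mean = b, sd = h)  # Gaussian kernel, bandwidth h
  n <- length(z)
  Kzz <- outer(z, z, K)
  diag(Kzz) <- 0                                  # exclude i = j terms
  K(s, t) - mean(K(s, z)) - mean(K(z, t)) + sum(Kzz) / (n * (n - 1))
}
```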

Value

An S4 object of class kb.test containing the results of the kernel-based quadratic distance tests, based on the normal kernel. The object contains the following slots:

  • method: Description of the kernel-based quadratic distance test performed.

  • x: Data list of samples X (and Y).

  • Un: The value of the U-statistic.

  • H0_Un: A logical value indicating whether or not the null hypothesis is rejected according to Un.

  • CV_Un: The critical value computed for the test Un.

  • Vn: The value of the V-statistic (if available).

  • H0_Vn: A logical value indicating whether or not the null hypothesis is rejected according to Vn (if available).

  • CV_Vn: The critical value computed for the test Vn (if available).

  • h: List with the value of the bandwidth parameter used for the normal kernel function. If select_h is used, the matrix of computed power values and the corresponding power plot are also provided.

  • B: Number of bootstrap/permutation/subsampling replications.

  • var_Un: Exact variance of the kernel-based U-statistic.

  • cv_method: The method used to estimate the critical value (one of "subsampling", "permutation" or "bootstrap").

Note

For the two- and k-sample tests, the slots Vn, H0_Vn and CV_Vn are empty, while both computed statistics (the trace statistic and D_n) are reported in the slots Un, H0_Un and CV_Un.

A U-statistic is a type of statistic used to estimate a population parameter. It is based on averaging over all possible distinct combinations of a fixed size from a sample. A V-statistic considers all possible tuples of a certain size, not just distinct combinations, and can be used in contexts where unbiasedness is not required.
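To make the distinction concrete, a toy R sketch on a symmetric matrix with placeholder entries: the U-type average uses only the off-diagonal (distinct-pair) entries, while the V-statistic as defined in the Details section sums all entries scaled by 1/n:

```r
# Toy illustration of U- vs V-type averaging (placeholder matrix, not the
# centered kernel matrix that kb.test builds internally)
n <- 4
Kc <- matrix(0.1, n, n)
diag(Kc) <- 1
Un <- (sum(Kc) - sum(diag(Kc))) / (n * (n - 1))  # distinct pairs only
Vn <- sum(Kc) / n                                # all pairs, including i = j
c(Un = Un, Vn = Vn)
```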

References

Markatou, M. and Saraceno, G. (2024). "A Unified Framework for Multivariate Two- and k-Sample Kernel-based Quadratic Distance Goodness-of-Fit Tests." https://doi.org/10.48550/arXiv.2407.16374

Lindsay, B.G., Markatou, M. and Ray, S. (2014) "Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests", Journal of the American Statistical Association, 109:505, 395-410, DOI: 10.1080/01621459.2013.836972

See Also

kb.test for the class definition.

Examples

# Generate sample data
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100), ncol = 2)

# Normality test
my_test <- kb.test(x, h = 0.5)
my_test

# Two-sample test
my_test <- kb.test(x, y, h = 0.5, method = "subsampling", b = 0.9,
                   centeringType = "Nonparam")
my_test

# k-sample test
z <- matrix(rnorm(100, 2), ncol = 2)
dat <- rbind(x, y, z)
group <- rep(c(1, 2, 3), each = 50)
my_test <- kb.test(x = dat, y = group, h = 0.5, method = "subsampling", b = 0.9)
my_test


QuadratiK documentation built on Oct. 29, 2024, 5:08 p.m.