kb.test    R Documentation
Description:

This function performs the kernel-based quadratic distance goodness-of-fit tests. It includes tests for multivariate normality, two-sample tests and k-sample tests.

Usage:
kb.test(
x,
y = NULL,
h = NULL,
method = "subsampling",
B = 150,
b = NULL,
Quantile = 0.95,
mu_hat = NULL,
Sigma_hat = NULL,
centeringType = "Nonparam",
K_threshold = 10,
alternative = "skewness"
)
## S4 method for signature 'ANY'
kb.test(
x,
y = NULL,
h = NULL,
method = "subsampling",
B = 150,
b = 0.9,
Quantile = 0.95,
mu_hat = NULL,
Sigma_hat = NULL,
centeringType = "Nonparam",
K_threshold = 10,
alternative = "skewness"
)
## S4 method for signature 'kb.test'
show(object)
Arguments:

x: Numeric matrix or vector of data values.

y: Numeric matrix or vector of data values. Depending on the input y, the test for multivariate normality, the two-sample test or the k-sample test is performed.

h: Bandwidth for the kernel function. If a value is not provided, the algorithm for the selection of an optimal h is performed automatically. See the function select_h for more details.

method: The method used for critical value estimation ("subsampling", "bootstrap" or "permutation"); default: "subsampling".

B: The number of iterations to use for critical value estimation (default: 150).

b: The size of the subsamples used in the subsampling algorithm, as a fraction of the sample size (default: 0.9).

Quantile: The quantile to use for critical value estimation (default: 0.95).

mu_hat: Mean vector for the reference distribution.

Sigma_hat: Covariance matrix of the reference distribution.

centeringType: String indicating the method used for centering the normal kernel ("Param" or "Nonparam").

K_threshold: Maximum number of groups allowed (default: 10). It is a control parameter; increase it when more than 10 samples are provided.

alternative: Family of alternatives chosen for selecting h, among "location", "scale" and "skewness" (used only if h is not provided).

object: Object of class kb.test.
Details:

The function kb.test performs the kernel-based quadratic distance tests using the Gaussian kernel with bandwidth parameter h. Depending on the shape of the input y, the function performs the test of multivariate normality, the non-parametric two-sample test or the k-sample test.

The quadratic distance between two probability distributions F and G is defined as

d_{K}(F,G) = \iint K(x,y) \, d(F-G)(x) \, d(F-G)(y),

where G is a distribution whose goodness of fit we wish to assess and K denotes the Normal kernel defined as

K_{h}(\mathbf{s}, \mathbf{t}) = (2\pi)^{-d/2} \left(\det{\mathbf{\Sigma}_h}\right)^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(\mathbf{s} - \mathbf{t})^\top \mathbf{\Sigma}_h^{-1}(\mathbf{s} - \mathbf{t})\right\},

for every (\mathbf{s}, \mathbf{t}) \in \mathbb{R}^d \times \mathbb{R}^d, with covariance matrix \mathbf{\Sigma}_h = h^2 I and tuning parameter h.
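A minimal R sketch of this kernel, assuming \mathbf{\Sigma}_h = h^2 I so that det(\mathbf{\Sigma}_h)^{-1/2} = h^{-d} (the function name kernel_normal is illustrative, not part of the package):

kernel_normal <- function(s, t, h) {
  # K_h(s, t) = (2*pi)^(-d/2) * h^(-d) * exp(-||s - t||^2 / (2 * h^2))
  d <- length(s)
  (2 * pi)^(-d / 2) * h^(-d) * exp(-sum((s - t)^2) / (2 * h^2))
}
kernel_normal(c(0, 0), c(1, 1), h = 0.5)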
Test for Normality:

Let x_1, x_2, \ldots, x_n be a random sample with empirical distribution function \hat F. We test the null hypothesis of normality, i.e. H_0: F = G = \mathcal{N}_d(\mu, \Sigma).

We consider the U-statistic estimate of the sample KBQD

U_{n} = \frac{1}{n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1} K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j});

then the first test statistic is

T_{n} = \frac{U_{n}}{\sqrt{Var(U_{n})}},

with Var(U_n) computed exactly following Lindsay et al. (2014), and the V-statistic estimate

V_{n} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} K_{cen}(\mathbf{x}_{i}, \mathbf{x}_{j}),

where K_{cen} denotes the Normal kernel K_h with parametric centering with respect to the considered normal distribution G = \mathcal{N}_d(\mu, \Sigma).

The asymptotic distribution of the V-statistic is an infinite combination of weighted independent chi-squared random variables with one degree of freedom. The cutoff value is obtained using the Satterthwaite approximation c \cdot \chi_{DOF}^2, where c and DOF are computed exactly following the formulas in Lindsay et al. (2014).

For the U-statistic the cutoff is determined empirically (see the sketch after these steps):

1. Generate data from the considered normal distribution;
2. Compute the test statistic for B Monte Carlo (MC) replications;
3. Compute the 95th quantile of the empirical distribution of the test statistic.
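A minimal sketch of this Monte Carlo cutoff, where stat_fn stands for a generic statistic computed on a sample and the reference distribution is taken to be N_d(0, I) (both are illustrative assumptions, not the package's internal API):

mc_cutoff <- function(stat_fn, n, d, B = 150, q = 0.95) {
  # Simulate B samples from the reference normal, recompute the statistic,
  # and return the q-th quantile of the simulated values as the cutoff
  vals <- replicate(B, {
    z <- matrix(rnorm(n * d), ncol = d)  # sample of size n from N_d(0, I)
    stat_fn(z)
  })
  quantile(vals, q)
}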
k-sample test:

Consider k random samples of i.i.d. observations \mathbf{x}^{(i)}_1, \mathbf{x}^{(i)}_{2}, \ldots, \mathbf{x}^{(i)}_{n_i} \sim F_i, for i = 1, \ldots, k. We test whether the samples are generated from the same unknown distribution, that is H_0: F_1 = F_2 = \ldots = F_k versus H_1: F_i \not= F_j, for some 1 \le i \not= j \le k.

We construct a distance matrix \hat{\mathbf{D}}, with off-diagonal elements

\hat{D}_{ij} = \frac{1}{n_i n_j} \sum_{\ell=1}^{n_i}\sum_{r=1}^{n_j} K_{\bar{F}}(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(j)}_r), \qquad \mbox{for } i \not= j,

and diagonal elements

\hat{D}_{ii} = \frac{1}{n_i (n_i - 1)} \sum_{\ell=1}^{n_i}\sum_{r \not= \ell}^{n_i} K_{\bar{F}}(\mathbf{x}^{(i)}_\ell, \mathbf{x}^{(i)}_r), \qquad \mbox{for } i = j,

where K_{\bar{F}} denotes the Normal kernel K_h centered non-parametrically with respect to

\bar{F} = \frac{n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k}{n}, \quad \mbox{with } n = \sum_{i=1}^k n_i.

We compute the trace statistic

\mathrm{trace}(\hat{\mathbf{D}}_n) = \sum_{i=1}^{k}\hat{D}_{ii},

and D_n, derived by considering all the possible pairwise comparisons in the k-sample null hypothesis, given as

D_n = (k-1)\,\mathrm{trace}(\hat{\mathbf{D}}_n) - 2 \sum_{i=1}^{k}\sum_{j>i}^{k}\hat{D}_{ij}.
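Given the k x k matrix of the terms above (here called Dhat, an illustrative name), the two statistics reduce to a few lines of R:

trace_stat <- function(Dhat) sum(diag(Dhat))   # trace statistic
Dn_stat <- function(Dhat) {
  # Dn = (k - 1) * trace(Dhat) - 2 * sum of the off-diagonal terms with j > i
  k <- nrow(Dhat)
  (k - 1) * sum(diag(Dhat)) - 2 * sum(Dhat[upper.tri(Dhat)])
}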
We compute the empirical critical value by employing numerical techniques such as the bootstrap, permutation and subsampling algorithms (a sketch follows the steps):

1. Generate k-tuples, of total size n_B, from the pooled sample following one of the sampling methods;
2. Compute the k-sample test statistic;
3. Repeat B times;
4. Select the 95th quantile of the obtained values.
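A minimal sketch of the subsampling variant of these steps, where kstat_fn stands for a generic k-sample statistic computed on data and group labels (an illustrative placeholder, not the package's internal API):

subsample_cv <- function(data, groups, kstat_fn, b = 0.9, B = 150, q = 0.95) {
  # Draw subsamples of size b * n from the pooled data, recompute the
  # k-sample statistic B times, and take the q-th quantile as critical value
  n <- nrow(data)
  vals <- replicate(B, {
    idx <- sample(n, size = floor(b * n))
    kstat_fn(data[idx, , drop = FALSE], groups[idx])
  })
  quantile(vals, q)
}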
Two-sample test:

Let x_1, x_2, \ldots, x_{n_1} \sim F and y_1, y_2, \ldots, y_{n_2} \sim G be random samples from the distributions F and G, respectively. We test the null hypothesis that the two samples are generated from the same unknown distribution, that is H_0: F = G versus H_1: F \not= G. The test statistics coincide with the k-sample test statistics when k = 2.
The arguments mu_hat and Sigma_hat indicate the normal model considered for the normality test, that is H_0: F = N(mu_hat, Sigma_hat). For the two-sample and k-sample tests, mu_hat and Sigma_hat can be used for the parametric centering of the kernel, with centeringType = "Param", in case we want to specify the reference distribution. This is the default method when the test for normality is performed. The normal kernel centered with respect to G \sim N_d(\mathbf{\mu}, \mathbf{V}) can be computed as

K_{cen(G)}(\mathbf{s}, \mathbf{t}) = K_{\mathbf{\Sigma}_h}(\mathbf{s}, \mathbf{t}) - K_{\mathbf{\Sigma}_h + \mathbf{V}}(\mathbf{\mu}, \mathbf{t}) - K_{\mathbf{\Sigma}_h + \mathbf{V}}(\mathbf{s}, \mathbf{\mu}) + K_{\mathbf{\Sigma}_h + 2\mathbf{V}}(\mathbf{\mu}, \mathbf{\mu}).
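A minimal sketch of this parametric centering, built on a normal kernel with a general covariance matrix (kernel_cov and kernel_param_centered are illustrative names, not package functions):

kernel_cov <- function(s, t, S) {
  # Normal kernel with covariance matrix S
  d <- length(s)
  u <- s - t
  (2 * pi)^(-d / 2) * det(S)^(-1/2) * exp(-0.5 * drop(t(u) %*% solve(S) %*% u))
}

kernel_param_centered <- function(s, t, h, mu, V) {
  # Four-term centering formula with respect to G = N_d(mu, V)
  Sh <- h^2 * diag(length(s))
  kernel_cov(s, t, Sh) - kernel_cov(mu, t, Sh + V) -
    kernel_cov(s, mu, Sh + V) + kernel_cov(mu, mu, Sh + 2 * V)
}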
For the two- and k-sample tests, we consider the non-parametric centering of the kernel, with centeringType = "Nonparam", with respect to \bar{F} = (n_1 \hat{F}_1 + \ldots + n_k \hat{F}_k)/n, where n = \sum_{i=1}^k n_i. Let \mathbf{z}_1, \ldots, \mathbf{z}_n denote the pooled sample. For any \mathbf{s}, \mathbf{t} \in \{\mathbf{z}_1, \ldots, \mathbf{z}_n\}, it is given by

K_{cen(\bar{F})}(\mathbf{s},\mathbf{t}) = K(\mathbf{s},\mathbf{t}) - \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{s},\mathbf{z}_i) - \frac{1}{n}\sum_{i=1}^{n} K(\mathbf{z}_i,\mathbf{t}) + \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j \not= i}^{n} K(\mathbf{z}_i,\mathbf{z}_j).
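On the pooled sample this centering acts on the n x n kernel matrix K; a minimal sketch (center_kernel is an illustrative name):

center_kernel <- function(K) {
  # Subtract row and column means; add back the mean of the off-diagonal terms
  n <- nrow(K)
  offdiag_mean <- (sum(K) - sum(diag(K))) / (n * (n - 1))
  K - matrix(rowMeans(K), n, n) - matrix(colMeans(K), n, n, byrow = TRUE) + offdiag_mean
}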
Value:

An S4 object of class kb.test containing the results of the kernel-based quadratic distance tests, based on the normal kernel. The object contains the following slots:

method: Description of the kernel-based quadratic distance test performed.

x: Data list of samples X (and Y).

Un: The value of the U-statistic.

H0_Un: A logical value indicating whether or not the null hypothesis is rejected according to Un.

CV_Un: The critical value computed for the test Un.

Vn: The value of the V-statistic (if available).

H0_Vn: A logical value indicating whether or not the null hypothesis is rejected according to Vn (if available).

CV_Vn: The critical value computed for the test Vn (if available).

h: List with the value of the bandwidth parameter used for the normal kernel function. If select_h is used, the matrix of computed power values and the corresponding power plot are also provided.

B: Number of bootstrap/permutation/subsampling replications.

var_Un: Exact variance of the kernel-based U-statistic.

cv_method: The method used to estimate the critical value (one of "subsampling", "permutation" or "bootstrap").
For the two- and k-sample tests, the slots Vn, H0_Vn and CV_Vn are empty, while the computed statistics are both reported in the slots Un, H0_Un and CV_Un.
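For example, after running a test the slots can be inspected with the standard S4 @ operator (a minimal illustration):

res <- kb.test(matrix(rnorm(60), ncol = 2), h = 0.5)
res@Un      # value of the U-statistic
res@H0_Un   # TRUE if the null hypothesis is rejected according to Un
res@CV_Un   # critical value used for Un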
Note:

A U-statistic is a type of statistic that is used to estimate a population parameter. It is based on the idea of averaging over all possible distinct combinations of a fixed size from a sample. A V-statistic considers all possible tuples of a given size, not just distinct combinations, and can be used in contexts where unbiasedness is not required.
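Relating this to the formulas above, both estimates can be computed from a centered kernel matrix Kc (illustrative names, following the displayed formulas):

u_stat <- function(Kc) {
  # U-statistic: sum over distinct pairs j < i, scaled by 1 / (n * (n - 1))
  n <- nrow(Kc)
  sum(Kc[lower.tri(Kc)]) / (n * (n - 1))
}
v_stat <- function(Kc) sum(Kc) / nrow(Kc)  # V-statistic: all pairs, scaled by 1/n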
References:

Markatou, M. and Saraceno, G. (2024). "A Unified Framework for Multivariate Two- and k-Sample Kernel-based Quadratic Distance Goodness-of-Fit Tests." https://doi.org/10.48550/arXiv.2407.16374

Lindsay, B.G., Markatou, M. and Ray, S. (2014). "Kernels, Degrees of Freedom, and Power Properties of Quadratic Distance Goodness-of-Fit Tests." Journal of the American Statistical Association, 109(505), 395-410. https://doi.org/10.1080/01621459.2013.836972
See Also:

kb.test for the class definition.
Examples:

# create a kb.test object
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100), ncol = 2)

# Normality test
my_test <- kb.test(x, h = 0.5)
my_test

# Two-sample test
my_test <- kb.test(x, y, h = 0.5, method = "subsampling", b = 0.9,
                   centeringType = "Nonparam")
my_test

# k-sample test
z <- matrix(rnorm(100, 2), ncol = 2)
dat <- rbind(x, y, z)
group <- rep(c(1, 2, 3), each = 50)
my_test <- kb.test(x = dat, y = group, h = 0.5, method = "subsampling", b = 0.9)
my_test