MDgof-Methods

```r
knitr::opts_chunk$set(
  error = TRUE,
  collapse = TRUE,
  comment = "#>"
)
library(MDgof)
```

In the following discussion $F(\mathbf{x})$ will denote the cumulative distribution function of a random vector $\mathbf{x}$ and $\hat{F}(\mathbf{x})$ the empirical distribution function of a sample from it.

Except for the chi-square tests, none of the tests included in the package has a large-sample theory that would allow for finding p values, and so for all of them simulation is used.

Continuous data

Tests based on a comparison of the theoretical and the empirical distribution function.

A number of classical tests are based on a test statistic of the form $\psi(F,\hat{F})$, where $\psi$ is some functional measuring the "distance" between two functions. Unfortunately, in $d$ dimensions the number of evaluations of $F$ needed is generally of the order $n^d$, and therefore becomes computationally too expensive even for $d=2$ and moderately sized data sets. This is especially true because none of these tests has a large-sample theory for the test statistic, and therefore p values need to be found via simulation. MDgof includes four such tests, which are more in the spirit of "inspired by .." than actual implementations of the true tests. They are

Quick Kolmogorov-Smirnov test (qKS)

The Kolmogorov-Smirnov test is one of the best known and most widely used goodness-of-fit tests. It is based on

$$\psi(F,\hat{F})=\max\left\{\vert F(\mathbf{x})-\hat{F}(\mathbf{x})\vert:\mathbf{x} \in \mathbf{R}^d\right\}$$ In one dimension the maximum always occurs at one of the data points $\{x_1,..,x_n\}$. In $d$ dimensions, however, the maximum can occur at any point whose coordinates are a combination of coordinates of the points in the data set, and there are $n^d$ of those.

Instead, the test implemented in MDgof evaluates the maximum only at the data points:

$$TS=\max\left\{\vert F(\mathbf{x}_i)-\hat{F}(\mathbf{x}_i)\vert : 1\le i\le n\right\}$$ The KS test was first proposed in [@Kolmogorov1933] and [@Smirnov1939]. We use the notation qKS (quick Kolmogorov-Smirnov) to distinguish the test implemented in MDgof from the full test.
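As a concrete illustration (our own sketch, not the package's internal code), the following computes the qKS statistic for bivariate data under a null hypothesis of independent $U[0,1]$ margins and finds a p value by simulation; all object names are made up for this example.

```r
## qKS sketch: bivariate data, null hypothesis of independent U[0,1] margins
set.seed(17)
n <- 100
x <- cbind(runif(n), runif(n))      # data, here generated under the null

## empirical cdf evaluated at the data points
Fhat <- sapply(1:n, function(i) mean(x[, 1] <= x[i, 1] & x[, 2] <= x[i, 2]))
F0 <- x[, 1] * x[, 2]               # null cdf: F(x1, x2) = x1 * x2
TS <- max(abs(F0 - Fhat))

## p value by simulation under the null
B <- 1000
TS.sim <- replicate(B, {
  y <- cbind(runif(n), runif(n))
  Fh <- sapply(1:n, function(i) mean(y[, 1] <= y[i, 1] & y[, 2] <= y[i, 2]))
  max(abs(y[, 1] * y[, 2] - Fh))
})
mean(TS.sim >= TS)                  # simulated p value
```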

Quick Kuiper's test (qK)

This is a variation of the KS test proposed in [@Kuiper1960]:

$$\psi(F,\hat{F})=\max\left\{ F(\mathbf{x})-\hat{F}(\mathbf{x}):\mathbf{x} \in \mathbf{R}^d\right\}+\max\left\{\hat{F}(\mathbf{x})-F(\mathbf{x}):\mathbf{x} \in \mathbf{R}^d\right\}$$

$$TS=\max\left\{ F(\mathbf{x}_i)-\hat{F}(\mathbf{x}_i)\right\}+\max\left\{\hat{F}(\mathbf{x}_i)-F(\mathbf{x}_i)\right\}$$

Quick Cramer-vonMises test (qCvM)

Another classic test, using

$$\psi(F,\hat{F})=\int \left(F(\mathbf{x})-\hat{F}(\mathbf{x})\right)^2 d\mathbf{x}$$

$$TS=\sum_{i=1}^n \left(F(\mathbf{x}_i)-\hat{F}(\mathbf{x}_i)\right)^2$$ This test was first discussed in [@Anderson1962].

Quick Anderson-Darling test (qAD)

The Anderson-Darling test is based on the test statistic

$$\psi(F,\hat{F})=\int \frac{\left(F(\mathbf{x})-\hat{F}(\mathbf{x})\right)^2}{F(\mathbf{x})[1-F(\mathbf{x})]} d\mathbf{x}$$

$$TS=\sum_{i=1}^n \frac{\left(F(\mathbf{x}_i)-\hat{F}(\mathbf{x}_i)\right)^2}{F(\mathbf{x}_i)[1-F(\mathbf{x}_i)]}$$ and was first proposed in [@anderson1952].
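The other three quick statistics use the same ingredients $F(\mathbf{x}_i)$ and $\hat{F}(\mathbf{x}_i)$ as the qKS sketch above; a self-contained sketch (again ours, under an independent-uniform null, not the package code) is:

```r
## qK, qCvM and qAD statistics from the null cdf F0 and the empirical cdf Fhat
set.seed(17)
n <- 100
x <- cbind(runif(n), runif(n))
Fhat <- sapply(1:n, function(i) mean(x[, 1] <= x[i, 1] & x[, 2] <= x[i, 2]))
F0 <- x[, 1] * x[, 2]

TS.qK   <- max(F0 - Fhat) + max(Fhat - F0)        # quick Kuiper
TS.qCvM <- sum((F0 - Fhat)^2)                     # quick Cramer-vonMises
TS.qAD  <- sum((F0 - Fhat)^2 / (F0 * (1 - F0)))   # quick Anderson-Darling
```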

Bickel-Breiman Test (BB)

This test uses the density, not the cumulative distribution function.

Let $R_j=\min \left\{\Vert\mathbf{x}_i-\mathbf{x}_j\Vert:1\le i\ne j \le n\right\}$ be the distance from $\mathbf{x}_j$ to its nearest neighbor, where $\Vert\cdot\Vert$ is some distance measure in $\mathbf{R}^d$, not necessarily the Euclidean distance. Let $f$ be the density function under the null hypothesis and define

$$U_j=\exp\left[ -n\int_{\Vert\mathbf{x}-\mathbf{x}_j\Vert<R_j}f(\mathbf{x})d\mathbf{x}\right]$$ Then it can be shown that under the null hypothesis $U_1,..,U_n$ have a uniform distribution on $[0,1]$, and a goodness-of-fit test for univariate data such as Kolmogorov-Smirnov can be applied. This test was first discussed in [@bickel1983].
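Here is a minimal sketch of this construction for a bivariate standard normal null; it uses the fact that in this special case the ball integral reduces to a noncentral chi-square probability, so no numerical integration is needed. This is an illustration of the idea, not MDgof's implementation, and all names are ours.

```r
## Bickel-Breiman sketch for a bivariate standard normal null
set.seed(17)
n <- 100
x <- cbind(rnorm(n), rnorm(n))

## Euclidean nearest-neighbor distances R_j
D <- as.matrix(dist(x))
diag(D) <- Inf
R <- apply(D, 1, min)

## integral of the N(0, I_2) density over the ball of radius R_j around x_j:
## P(||Z - x_j|| < R_j) = P(chi^2_2(ncp = ||x_j||^2) < R_j^2)
p <- pchisq(R^2, df = 2, ncp = rowSums(x^2))
U <- exp(-n * p)

## under the null the U_j should look uniform on [0, 1]
ks.test(U, "punif")
```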

Bakshaev-Rudzkis test (BR)

This test proceeds by estimating the density via a kernel density estimator and then comparing it to the density specified in the null hypothesis. Details are discussed in [@bakshaev2015].
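One simple way to turn this idea into a statistic is sketched below, with a product Gaussian kernel estimate compared to the null density at the data points; this only illustrates the idea and is not necessarily the exact statistic of [@bakshaev2015] or the one used in MDgof.

```r
## density-comparison sketch: product Gaussian kernel estimate vs null density
set.seed(17)
n <- 200
x <- cbind(rnorm(n), rnorm(n))
h <- apply(x, 2, bw.nrd)                       # one bandwidth per coordinate
fhat <- sapply(1:n, function(i)
  mean(dnorm(x[, 1], x[i, 1], h[1]) * dnorm(x[, 2], x[i, 2], h[2])))
f0 <- dnorm(x[, 1]) * dnorm(x[, 2])            # null density: standard normal
TS <- mean((fhat - f0)^2)
```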

Kernel Stein Discrepancy (KSD)

Based on the kernel Stein discrepancy, a measure of the distance between two probability distributions. For details see [@Liu2016].
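The following is a minimal sketch of the KSD U-statistic for the special case of a standard normal null (score function $s(\mathbf{x})=-\mathbf{x}$) and a Gaussian kernel; the bandwidth choice and all names are ours, and MDgof's implementation may differ.

```r
## KSD U-statistic sketch: standard normal null, Gaussian kernel
set.seed(17)
n <- 100; d <- 2
x <- cbind(rnorm(n), rnorm(n))

D2 <- as.matrix(dist(x))^2                # squared pairwise distances
h2 <- median(D2[upper.tri(D2)])           # "median heuristic" bandwidth^2
K  <- exp(-D2 / (2 * h2))                 # kernel matrix k(x_i, x_j)

## u(x, y) = k(x, y) * ( x'y - ||x - y||^2 / h2 + d / h2 - ||x - y||^2 / h2^2 )
U <- K * (tcrossprod(x) - D2 / h2 + d / h2 - D2 / h2^2)
diag(U) <- 0
TS <- sum(U) / (n * (n - 1))              # squared kernel Stein discrepancy
```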

Tests based on the Rosenblatt transform.

The Rosenblatt transform is a generalization of the probability integral transform. It transforms a random vector $(X_1,..,X_d)$ into $(U_1,..,U_d)$, where the $U_i$ are independent $U[0,1]$ random variables. It uses

$$
\begin{aligned}
&U_1 = F_{X_1}(x_1)\\
&U_2 = F_{X_2|X_1}(x_2|x_1)\\
&\;\;\vdots\\
&U_d = F_{X_d|X_1,..,X_{d-1}}(x_d|x_1,..,x_{d-1})
\end{aligned}
$$
and so requires knowledge of the conditional distributions. In our case of a goodness-of-fit test, however, these will generally not be known. One can show, though, that

$$
\begin{aligned}
&F_{X_1}(x_1) = F(x_1, \infty,..,\infty)\\
&F_{X_2|X_1}(x_2|x_1) = \frac{\frac{d}{dx_1}F(x_1, x_2,\infty,..,\infty)}{\frac{d}{dx_1}F(x_1, \infty,..,\infty)}\\
&\;\;\vdots\\
&F_{X_d|X_1,..,X_{d-1}}(x_d|x_1,..,x_{d-1}) = \frac{\frac{d^{d-1}}{dx_1 \cdots dx_{d-1}}F(x_1,.., x_d)}{\frac{d^{d-1}}{dx_1 \cdots dx_{d-1}}F(x_1,..,x_{d-1},\infty)}
\end{aligned}
$$
Unfortunately, for a general cdf $F$ these derivatives have to be found numerically, and for $d>2$ this is not feasible because of calculation times and numerical instabilities. For these reasons these methods are only implemented for bivariate data.
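For concreteness, here is the transform worked out for a bivariate normal null with correlation $\rho$, where the conditional cdf is available in closed form so that no numerical derivatives are needed; this example is ours, not taken from the package.

```r
## Rosenblatt transform for a bivariate normal null with correlation rho:
## X2 | X1 = x1  ~  N(rho * x1, 1 - rho^2)
set.seed(17)
n <- 200; rho <- 0.5
x1 <- rnorm(n)
x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)   # data generated under the null

u1 <- pnorm(x1)
u2 <- pnorm((x2 - rho * x1) / sqrt(1 - rho^2))
## under the null (u1, u2) are independent U[0,1]; a bivariate uniformity
## test can now be applied to cbind(u1, u2)
```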

MDgof includes two tests based on the Rosenblatt transform:

Fasano-Franceschini test (FF)

This implements a version of the KS test after a Rosenblatt transform. It is discussed in [@Fasano1987].

Ripley's K test (Rk)

This test finds the number of observations within a radius $r$ of a given observation, for different values of $r$. After the Rosenblatt transform the data should (if the null hypothesis is true) be independent uniforms, and so the theoretical value of Ripley's K function is the area of a circle of radius $r$, namely $\pi r^2$. The estimated and theoretical values are then compared via the mean squared difference. This test was proposed in [@ripley1976]. It is implemented in MDgof using the R library spatstat [@baddeley2005].
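A rough sketch of this comparison using spatstat is given below; the edge correction, the grid of radii and the exact form of the statistic are our choices, not necessarily those of MDgof.

```r
## Ripley's K sketch on Rosenblatt-transformed data in the unit square
library(spatstat)
set.seed(17)
u1 <- runif(100); u2 <- runif(100)            # transformed data under the null
pp <- ppp(u1, u2, window = owin(c(0, 1), c(0, 1)))
K  <- Kest(pp)                                # estimates of K(r) plus theo = pi * r^2
TS <- mean((K$iso - K$theo)^2)                # mean squared difference
```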

Discrete data

Methods for discrete (or histogram) data are implemented only for dimension 2 because for higher dimensions the sample sizes required would be too large. The methods are

Methods based on the empirical distribution function.

These are discretized versions of the Kolmogorov-Smirnov test (KS), Kuiper's test (K), the Cramer-vonMises test (CvM) and the Anderson-Darling test (AD). Note that unlike in the continuous case these tests are implemented using the full theoretical ideas and are not based on shortcuts.
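As an illustration of the discretized empirical distribution function idea (our own sketch, not the package code), a discretized KS statistic compares two-dimensional cumulative sums of the observed and expected proportions:

```r
## discretized KS sketch for 2d bin counts
set.seed(17)
k <- 5; n <- 500
p <- outer(rep(1 / k, k), rep(1 / k, k))            # null: uniform on a k x k grid
O <- matrix(rmultinom(1, n, as.vector(p)), k, k)    # observed counts
E <- n * p                                          # expected counts

cum2 <- function(m) t(apply(apply(m, 2, cumsum), 1, cumsum))  # 2d cumulative sums
TS.KS <- max(abs(cum2(O / n) - cum2(E / n)))
```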

Methods based on the density

These are methods that directly compare the observed bin counts $O_{i,j}$ with the theoretical ones $E_{i,j}=nP(X_1=x_i,X_2=y_j)$ under the null hypothesis. They are

Pearson's chi-square

$$TS=\sum_{ij} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

Total Variation

$$TS =\frac1{n^2}\sum_{ij} \left(O_{ij}-E_{ij}\right)^2$$

Kullback-Leibler

$$TS =\frac1{n}\sum_{ij} O_{ij}\log\left(O_{ij}/E_{ij}\right)$$

Hellinger

$$TS =\frac1{n}\sum_{ij} \left(\sqrt{O_{ij}}-\sqrt{E_{ij}}\right)^2$$
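All four statistics are straightforward to compute from the observed and expected counts; a small self-contained sketch (the data generation and all names are ours) is:

```r
## density-based statistics from observed counts O and expected counts E
set.seed(17)
k <- 5; n <- 500
p <- outer(rep(1 / k, k), rep(1 / k, k))            # null: uniform on a k x k grid
O <- matrix(rmultinom(1, n, as.vector(p)), k, k)
E <- n * p

TS.chisq     <- sum((O - E)^2 / E)                          # Pearson's chi-square
TS.tv        <- sum((O - E)^2) / n^2                        # total variation
TS.kl        <- sum(ifelse(O > 0, O * log(O / E), 0)) / n   # Kullback-Leibler (0 log 0 = 0)
TS.hellinger <- sum((sqrt(O) - sqrt(E))^2) / n              # Hellinger
```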

References


