# disc_ks_test: Computes the p-value for a one-sample two-sided...

In KSgeneral: Computing P-Values of the K-S Test for (Dis)Continuous Null Distribution

## Description

Computes the p-value P(D_{n} ≥ d_{n}) when F(x) is purely discrete, where d_{n} is the value of the KS test statistic computed from a data sample \{x_{1}, ..., x_{n}\}. The computation uses the Exact-KS-FFT method, which expresses the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process and then evaluates it efficiently using FFT (see Dimitrova, Kaishev, Tan (2020)).

## Usage

```r
disc_ks_test(x, y, ..., exact = NULL, tol = 1e-08, sim.size = 1e+06, num.sim = 10)
```

## Arguments

- `x`: a numeric vector of data sample values \{x_{1}, ..., x_{n}\}.
- `y`: a pre-specified discrete cdf, F(x), under the null hypothesis. Note that `y` should be a step function within the class `stepfun`, of which `ecdf` is a subclass!
- `...`: values of the parameters of the cdf, F(x), specified (as a character string) by `y`.
- `exact`: logical variable specifying whether to compute the exact p-value P(D_{n} ≥ d_{n}) using the Exact-KS-FFT method (`exact = TRUE`) or an approximate p-value P(D_{n} ≥ d_{n}) using the simulation-based algorithm of Wood and Altavela (1978) (`exact = FALSE`). When `exact = NULL` and `n <= 100000`, the exact P(D_{n} ≥ d_{n}) is computed using the Exact-KS-FFT method. Otherwise, the asymptotic complementary cdf is computed based on Wood and Altavela (1978). By default, `exact = NULL`.
- `tol`: the value of ε that is used to compute the values of A_{i} and B_{i}, i = 1, ..., n, as detailed in Step 1 of Section 2.1 in Dimitrova, Kaishev and Tan (2020) (see also (ii) in the Procedure Exact-KS-FFT therein). By default, `tol = 1e-08`. Note that a value of `NA` or `0` will lead to an error!
- `sim.size`: the required number of simulated trajectories in order to produce one Monte Carlo estimate (one MC run) of the asymptotic p-value using the algorithm of Wood and Altavela (1978). By default, `sim.size = 1e+06`.
- `num.sim`: the number of MC runs, each producing one estimate (based on `sim.size` trajectories), which are then averaged to produce the final estimate of the asymptotic p-value. Averaging reduces the variance of the final estimate. By default, `num.sim = 10`.
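To illustrate the requirement on `y`, the following base-R sketch builds a discrete null cdf with `stepfun` and checks its class. The Binomial(5, 0.5) distribution and the name `F0` are illustrative assumptions, not taken from the package documentation.

```r
# Illustrative null cdf: Binomial(size = 5, prob = 0.5) on the support 0:5.
# stepfun(x, y) needs length(y) == length(x) + 1; the leading 0 is the
# value of the cdf to the left of the first support point.
support <- 0:5
F0 <- stepfun(support, c(0, pbinom(support, size = 5, prob = 0.5)))

# ecdf objects inherit from "stepfun", so both kinds of object qualify as y.
inherits(F0, "stepfun")
inherits(ecdf(1:10), "stepfun")

# With the default right = FALSE the step function is right-continuous,
# so F0(k) equals P(X <= k) at each support point k.
F0(2)
pbinom(2, size = 5, prob = 0.5)
```

An object built this way could then be passed as the second argument, for example `disc_ks_test(x, F0, exact = TRUE)`.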

## Details

Given a random sample \{X_{1}, ..., X_{n}\} of size `n` with an empirical cdf F_{n}(x), the two-sided Kolmogorov-Smirnov goodness-of-fit statistic is defined as D_{n} = \sup | F_{n}(x) - F(x) | , where F(x) is the cdf of a prespecified theoretical distribution under the null hypothesis H_{0}, that \{X_{1}, ..., X_{n}\} comes from F(x).
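As a minimal sketch of this definition, the statistic can be evaluated in base R when F(x) is a right-continuous step function. The helper name `ks_stat_discrete` and the small sample are hypothetical and are not part of KSgeneral. Because F_{n}(x) and F(x) are then both piecewise constant and right-continuous, the supremum is attained at one of their jump points.

```r
# Hypothetical helper (not part of KSgeneral): two-sided KS statistic
# for a purely discrete null cdf supplied as a stepfun/ecdf object.
ks_stat_discrete <- function(x, F0) {
  Fn <- ecdf(x)                                     # empirical cdf F_n
  pts <- sort(unique(c(knots(Fn), knots(F0))))      # all jump points
  max(abs(Fn(pts) - F0(pts)))                       # sup |F_n(x) - F(x)|
}

x  <- c(1, 1, 2, 5, 5, 5, 9, 10)   # illustrative sample, n = 8
F0 <- ecdf(1:10)                   # discrete Uniform[1, 10] null cdf
d_n <- ks_stat_discrete(x, F0)
d_n                                # 0.25, attained at x = 5 (6/8 vs 5/10)
```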

The function `disc_ks_test` implements the Exact-KS-FFT method expressing the p-value as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using FFT (see Dimitrova, Kaishev, Tan (2020)). It represents an accurate and fast (run time O(n^{2}log(n))) alternative to the function `ks.test` from the package dgof, which computes a p-value P(D_{n} ≥ d_{n}), where d_{n} is the value of the KS test statistic computed based on a user provided data sample \{x_{1}, ..., x_{n}\}, assuming F(x) is purely discrete.

In the function `ks.test`, the p-value for a one-sample two-sided KS test is calculated by combining the approaches of Gleser (1985) and Niederhausen (1981). However, the function `ks.test`, due to Arnold and Emerson (2011), only provides exact p-values for `n` ≤ 30 since, as noted by the authors, numerical instabilities may occur when `n` is large. For larger `n`, `ks.test` uses simulation to approximate p-values, which may be rather slow and inaccurate (see Table 6 of Dimitrova, Kaishev, Tan (2020)).

Thus, making use of the Exact-KS-FFT method, the function `disc_ks_test` provides an exact and highly computationally efficient (alternative) way of computing the p-value P(D_{n} ≥ d_{n}), when F(x) is purely discrete.

Lastly, the function `disc_ks_test` also incorporates the MC simulation-based method of Wood and Altavela (1978) for estimating the asymptotic p-value of D_{n}. With the default `exact = NULL`, this is the method used by `disc_ks_test` when the sample size `n` exceeds 100000.
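The sketch below contrasts the two modes on one sample, using only the arguments listed under Usage. The sample, the seed, and the reduced `sim.size` and `num.sim` values are illustrative assumptions chosen to keep the run short. The simulated p-value is a Monte Carlo estimate of the asymptotic p-value, so it varies between runs and need not match the exact one closely at moderate `n`.

```r
# Runs only if KSgeneral is installed; otherwise the block is skipped.
if (requireNamespace("KSgeneral", quietly = TRUE)) {
  set.seed(1)                                  # illustrative seed
  x  <- sample(1:10, 200, replace = TRUE)      # illustrative sample
  F0 <- ecdf(1:10)                             # discrete Uniform[1, 10] cdf

  # Exact p-value via the Exact-KS-FFT method.
  p_exact <- KSgeneral::disc_ks_test(x, F0, exact = TRUE)$p.value

  # Simulation-based estimate (Wood and Altavela, 1978), with fewer
  # trajectories and runs than the defaults (an assumption for speed).
  p_mc <- KSgeneral::disc_ks_test(x, F0, exact = FALSE,
                                  sim.size = 1e4, num.sim = 5)$p.value

  print(c(exact = p_exact, simulated = p_mc))
}
```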

## Value

A list with class "htest" containing the following components:

- `statistic`: the value of the statistic.
- `p.value`: the p-value of the test.
- `alternative`: "two-sided".
- `data.name`: a character string giving the name of the data.

## References

Arnold T.A., Emerson J.W. (2011). "Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions". The R Journal, 3(2), 34-39.

Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.

Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.

Niederhausen H. (1981). "Sheffer Polynomials for Computing Exact Kolmogorov-Smirnov and Renyi Type Distributions". The Annals of Statistics, 58-64.

Wood C.L., Altavela M.M. (1978). "Large-Sample Results for Kolmogorov-Smirnov Statistics for Discrete Distributions". Biometrika, 65(1), 235-239.

## See Also

`ks.test`

## Examples

```r
# Comparison of results obtained from dgof::ks.test
# and KSgeneral::disc_ks_test, when F(x) follows the discrete
# Uniform[1, 10] distribution as in Example 3.5 of
# Dimitrova, Kaishev, Tan (2020)

# When the sample size is larger than 100, the
# function dgof::ks.test will be numerically
# unstable

x3 <- sample(1:10, 25, replace = TRUE)
KSgeneral::disc_ks_test(x3, ecdf(1:10), exact = TRUE)
dgof::ks.test(x3, ecdf(1:10), exact = TRUE)
KSgeneral::disc_ks_test(x3, ecdf(1:10), exact = TRUE)$p.value -
  dgof::ks.test(x3, ecdf(1:10), exact = TRUE)$p.value

x4 <- sample(1:10, 500, replace = TRUE)
KSgeneral::disc_ks_test(x4, ecdf(1:10), exact = TRUE)
dgof::ks.test(x4, ecdf(1:10), exact = TRUE)
KSgeneral::disc_ks_test(x4, ecdf(1:10), exact = TRUE)$p.value -
  dgof::ks.test(x4, ecdf(1:10), exact = TRUE)$p.value

# Using stepfun() to specify the same discrete distribution as defined by ecdf():

steps <- stepfun(1:10, cumsum(c(0, rep(0.1, 10))))
KSgeneral::disc_ks_test(x3, steps, exact = TRUE)
```