# Fasano-Franceschini Test: an Implementation of a 2-Dimensional Kolmogorov-Smirnov test in R


# Abstract {#sec:abstact}

The univariate Kolmogorov-Smirnov (KS) test is a non--parametric
statistical test designed to assess whether two samples come from the
same underlying distribution. The versatility of the KS test has made it
a cornerstone of statistical analysis across the scientific disciplines.
However, the test proposed by Kolmogorov and Smirnov does not naturally
extend to multidimensional distributions. Here, we present the
fasano.franceschini.test package, an **R** implementation of the 2-D
KS two--sample test as defined by Fasano and Franceschini [@Fasano1987].
The fasano.franceschini.test package provides three improvements over
the current 2-D KS test on the Comprehensive **R** Archive Network
(CRAN): (i) the Fasano and Franceschini test has been shown to run in
$O(n^2)$ versus the Peacock implementation which runs in $O(n^3)$; (ii)
the package implements a procedure for handling ties in the data; and
(iii) the package implements a parallelized permutation procedure for
improved significance testing. Ultimately, the
fasano.franceschini.test package presents a robust statistical test
for analyzing random samples defined in two dimensions.

# Introduction {#sec:intro}

The Kolmogorov--Smirnov (KS) test is a non--parametric, univariate
statistical test designed to assess whether a set of data is consistent
with a given probability distribution (or, in the two-sample case,
whether the two samples come from the same underlying distribution).
First derived by Kolmogorov and Smirnov in a series of papers
[@Kolmogorov1933; @Kolmogorov1933a; @Smirnov1936; @Smirnov1937;
@Smirnov1939; @Smirnov1944; @Smirnov1948], the one-sample KS test
defines the distribution of the quantity $D_{KS}$, the maximal absolute
difference between the empirical cumulative distribution function (CDF)
of a set of values and a reference probability distribution. Kolmogorov
and Smirnov's key insight was proving that the distribution of $D_{KS}$ is
independent of the CDFs being tested. Thus, the test can effectively be
used to compare any univariate empirical data distribution to any
continuous univariate reference distribution. The two-sample KS test
can further be used to compare any two univariate empirical data
distributions against each other to determine if they are drawn from the
same underlying univariate distribution.

The nonparametric versatility of the univariate KS test has made it a
cornerstone of statistical analysis, and it is commonly used across the
scientific disciplines [@Atasoy2017; @Chiang2018; @Hahne2018;
@Hargreaves2020; @Wong2020; @Kaczanowska2021]. However, the KS test as
proposed by Kolmogorov and Smirnov does not naturally extend to
distributions in more than one dimension. Fortunately, a solution to the
dimensionality issue was articulated by Peacock [@Peacock1983] and later
extended by Fasano and Franceschini [@Fasano1987].

Currently, only the Peacock implementation of the 2-D two-sample KS test
is available in **R** [@R] with the Peacock.test package via the
peacock2() function, but this has been shown to be markedly slower
than the Fasano and Franceschini algorithm [@Lopes2007]. A **C**
implementation of the Fasano--Franceschini test is available in
[@numericalRecipes]; however, the validity of that implementation has been
questioned on the grounds that the test is not distribution-free
[@Babu2006]. Furthermore, in the **C** implementation, statistical
testing is based on a fit to Monte Carlo simulation that is only valid
for significance levels $\alpha \lessapprox 0.20$.

Here we present the fasano.franceschini.test package as an **R**
implementation of the 2-D two-sample KS test described by Fasano and
Franceschini [@Fasano1987]. The fasano.franceschini.test package
provides two improvements over the current 2-D KS test available on the
Comprehensive **R** Archive Network (CRAN): (i) the Fasano and Franceschini
test has been shown to run in $O(n^2)$ versus the Peacock implementation
which runs in $O(n^3)$; and (ii) the package implements a permutation
procedure that improves significance testing and mitigates the
limitations of the test noted by @Babu2006.
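The idea behind a permutation procedure can be sketched in a few lines of base R. This is a minimal, serial illustration and not the package's implementation: the helper names `perm_pvalue()` and `ks_stat()`, the argument `n_perm`, and the use of the 1-D KS statistic as a stand-in for a 2-D statistic are all assumptions made for the example.

```r
# Permutation p-value for a generic two-sample statistic stat_fn(a, b).
# All names here are hypothetical and chosen for illustration only.
perm_pvalue <- function(a, b, stat_fn, n_perm = 199) {
  observed <- stat_fn(a, b)
  pooled <- c(a, b)
  n_a <- length(a)
  perm_stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n_a)   # random relabelling of the pooled data
    stat_fn(pooled[idx], pooled[-idx])
  })
  # The +1 terms count the observed labelling as one of the permutations.
  (1 + sum(perm_stats >= observed)) / (n_perm + 1)
}

# 1-D KS statistic used as a stand-in test statistic.
ks_stat <- function(a, b) {
  pts <- c(a, b)
  max(abs(ecdf(a)(pts) - ecdf(b)(pts)))
}

set.seed(3)  # arbitrary seed, for reproducibility only
p_same  <- perm_pvalue(rnorm(50), rnorm(50), ks_stat)            # same distribution
p_shift <- perm_pvalue(rnorm(50), rnorm(50, mean = 2), ks_stat)  # shifted mean
```

Because the null distribution is built from the data themselves, the resulting $p$ value does not rely on an asymptotic approximation to the distribution of the statistic.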

# Models and software {#sec:models}

## 1-D Kolmogorov--Smirnov Test

The Kolmogorov--Smirnov (KS) test is a non--parametric method for
determining whether a sample is consistent with a given probability
distribution [@Stephens1992a]. In one dimension, the Kolmogorov--Smirnov
statistic ($D_{KS}$) is defined as the maximum absolute difference
between the cumulative distribution functions of the data and model
(one--sample), or between the two data sets (two--sample), as
illustrated in **Figure [1](#fig:kstest1D){reference-type="ref"
reference="fig:kstest1D"}**.

<center>

![**Figure 1** \| **LEFT:** Probability density function (PDF) of two
normal distributions: orange sample 1,
$\mathcal{N}(\mu = 0,\,\sigma^{2} = 1)$; blue sample 2,
$\mathcal{N}(\mu = 5,\,\sigma^{2} = 1)$. **RIGHT:** Cumulative distribution
functions (CDFs) of the two PDFs; the black dotted line represents the
maximal absolute difference between the CDFs
($D_{KS}$).](pdfvsCDF.png){#fig:kstest1D width="75%"}

</center>
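As a rough illustration of the two-sample statistic, the sketch below draws samples from the two distributions of Figure 1 and computes the largest gap between their empirical CDFs. The sample size, the seed, and the helper name `ks_statistic()` are assumptions made for the example; `ecdf()` and `ks.test()` are from base R's stats package.

```r
# Two samples drawn from the distributions shown in Figure 1.
set.seed(1)  # arbitrary seed, for reproducibility only
s1 <- rnorm(200, mean = 0, sd = 1)
s2 <- rnorm(200, mean = 5, sd = 1)

# Two-sample KS statistic: the largest absolute gap between the two
# empirical CDFs, evaluated at every observed point.
ks_statistic <- function(a, b) {
  pts <- sort(c(a, b))
  max(abs(ecdf(a)(pts) - ecdf(b)(pts)))
}

d_ks <- ks_statistic(s1, s2)

# Cross-check against the statistic reported by stats::ks.test().
d_ref <- unname(ks.test(s1, s2)$statistic)
```

With means this far apart the two empirical CDFs barely overlap, so `d_ks` should come out close to 1.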

In the large--sample limit ($n \geq 80$), it can be shown [@Kendall1946]
that $D_{KS}$ converges in distribution to

$$D_{KS} \overset{d}{\rightarrow} \Phi(\lambda) = 2 \sum_{k=1}^{\infty} (-1)^{k-1}e^{-2k^2\lambda^2} \,. (\#eq:1)$$

In the one-sample case with a sample of size $n$, the $p$ value is given by

$$\mathbb{P}(D > \text{observed}) = \Phi \left( D\sqrt{n} \right)\,; (\#eq:2)$$

in the two-sample case, the $p$ value is given by

$$\mathbb{P}(D > \text{observed}) = \Phi \left( D\sqrt{\frac{n_1n_2}{n_1+n_2}} \right)\,, (\#eq:3)$$

where $n_1$ and $n_2$ are the numbers of observations in the first and second samples, respectively.
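The asymptotic formulas above can be evaluated numerically by truncating the series for $\Phi(\lambda)$. The following is a sketch only: the function names and the truncation parameter `n_terms` are hypothetical, and the truncated sum is not reliable for $\lambda$ very close to zero, where the series converges slowly.

```r
# Truncated series Phi(lambda) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2).
# n_terms is an assumed truncation point; 100 terms is ample unless
# lambda is very close to 0.
ks_phi <- function(lambda, n_terms = 100) {
  k <- seq_len(n_terms)
  2 * sum((-1)^(k - 1) * exp(-2 * k^2 * lambda^2))
}

# One-sample asymptotic p value: Phi(D * sqrt(n)).
ks_pvalue_one_sample <- function(d, n) ks_phi(d * sqrt(n))

# Two-sample asymptotic p value: Phi(D * sqrt(n1 * n2 / (n1 + n2))).
ks_pvalue_two_sample <- function(d, n1, n2) {
  ks_phi(d * sqrt(n1 * n2 / (n1 + n2)))
}

# Example: D = 0.2 with 100 observations per sample gives lambda = sqrt(2).
p_two <- ks_pvalue_two_sample(d = 0.2, n1 = 100, n2 = 100)
```

For continuous data, the result can be compared with `ks.test(x, y, exact = FALSE)$p.value`, which also relies on the limiting distribution.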

## Higher dimensional variations: Peacock Test (1983) and Fasano--Franceschini Test (1987)

Extending the above to two or more dimensions is complicated by the fact that CDFs are not well-defined in more than one dimension. In 2-D, there are 4 ways (3 independent) of defining the cumulative distribution, since the direction in which we order the $x$ and $y$ points is arbitrary (**Figure [2](#fig:kstest2Dissue){reference-type="ref" reference="fig:kstest2Dissue"}**); more generally, in $k$-dimensional space there are $2^{k}-1$ independent ways of defining the cumulative distribution function [@Peacock1983].

{#fig:kstest2Dissue width="75%"}
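The four orderings can be made concrete by counting the fraction of a sample that falls in each quadrant around a chosen point. The sketch below is an illustration under stated assumptions, not the package's algorithm: the helper name `quadrant_fractions()` is hypothetical, and points lying exactly on an axis through the chosen point are simply left out rather than handled as ties.

```r
# Fraction of the sample (x, y) in each of the four open quadrants
# centred on the point (x0, y0). Points exactly on an axis are not
# counted, which is a simplification and not a tie-handling rule.
quadrant_fractions <- function(x0, y0, x, y) {
  n <- length(x)
  c(
    upper_right = sum(x > x0 & y > y0) / n,
    upper_left  = sum(x < x0 & y > y0) / n,
    lower_left  = sum(x < x0 & y < y0) / n,
    lower_right = sum(x > x0 & y < y0) / n
  )
}

set.seed(2)  # arbitrary seed, for reproducibility only
x <- rnorm(500)
y <- rnorm(500)
fr <- quadrant_fractions(0, 0, x, y)
```

Each entry of `fr` corresponds to one way of accumulating probability in 2-D. When no point lies on an axis, the four fractions sum to one, so only three of them are independent, in line with $2^{k}-1$ for $k = 2$.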

# Summary and discussion {#sec:summary}

The fasano.franceschini.test package is an R implementation of the 2-D two-sample KS test as defined by Fasano and Franceschini [@Fasano1987]. It improves upon existing packages by implementing a fast algorithm and a parallelized permutation procedure for improved statistical testing. Complete package documentation and source code are available via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/ and the package website at https://nesscoder.github.io/fasano.franceschini.test/.

# Computational details {#computational-details .unnumbered}

The results in this paper were obtained using R 4.0.3 with the fasano.franceschini.test 1.0.0 package. R itself and all package dependencies (methods 4.0.3; parallel 4.0.3) are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/.

# Acknowledgments {#acknowledgments .unnumbered}

Research reported in this publication was supported by the NSF-Simons Center for Quantitative Biology at Northwestern University, an NSF-Simons MathBioSys Research Center. This work was supported by a grant from the Simons Foundation/SFARI (597491-RWC) and the National Science Foundation (1764421). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation and Simons Foundation.

E.N.C. developed the fasano.franceschini.test package and produced the tutorials/documentation; E.N.C. and R.B. wrote the paper.

# References

fasano.franceschini.test documentation built on Sept. 5, 2021, 6:02 p.m.