knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The package discretefit
implements fast Monte Carlo simulations for goodness-of-fit (GOF) tests for discrete distributions. This includes tests based on the root-mean-square statistic, the Chi-squared statistic, the log-likelihood-ratio ($G^2$) statistic, the Freeman-Tukey (Hellinger-distance) statistic, the Kolmogorov-Smirnov statistic, and the Cramer-von Mises statistic.
Simulations are written in C++ (utilizing Rcpp
) and are considerably faster than the simulated Chi-squared GOF test in the R stats
package.
You can install discretefit
from CRAN.
install.packages("discretefit")
You can also install the development version from GitHub:
devtools::install_github("josh-mc/discretefit")
The GOF tests in discretefit
function on a vector of counts, x, and a vector of probabilities, p. In the below example, x represents a vector of counts for five categories, and p represents a vector of probabilities for each corresponding category. The GOF tests provides p-values for the null hypothesis that x is a random sample of the discrete distribution defined by p.
library(discretefit) library(bench) x <- c(42, 0, 13, 2, 109) p <- c(0.2, 0.05, 0.1, 0.05, 0.6) chisq_gof(x, p) rms_gof(x, p) g_gof(x, p) ft_gof(x, p) ks_gof(x, p) cvm_gof(x, p)
The simulated Chi-squared GOF test in discretefit
produces identical p-values to the simulated Chi-squared GOF test in the stats
package that is part of base R.
set.seed(499) chisq_gof(x, p, reps = 2000)$p.value set.seed(499) chisq.test(x, p = p, simulate.p.value = TRUE)$p.value
However, because Monte Carlo simulations in discretefit
are implemented in C++, chisq_gof
is much faster than chisq.test
, ~20 times faster for the following case.
speed <- bench::mark(min_iterations = 1000, check = FALSE, chisq_gof = chisq_gof(x, p, reps = 2000), chisq.test = chisq.test(x, p = p, simulate.p.value = TRUE, B = 2000) ) ggplot2::autoplot(speed) + ggplot2::theme_minimal() + ggplot2::xlab(NULL) + ggplot2::ylab(NULL)
Additionally, the simulated GOF tests in base R is vectorized, so for large vectors attempting a large number of simulations may not be possible because of memory constraints. Since the functions in discretefit
are not vectorized, memory use is minimized.
In a surprising number of cases, a simulated GOF test based on the root-mean-square statistic outperforms the Chi-squared test and other tests in the Cressie-Read power divergence family. This has been demonstrated by Perkins, Tygert, and Ward (2011). They provide the following toy example.
Take a discrete distribution with 50 bins (or categories). The probability for the first and second bin is 0.25. The probability for each of the remaining 48 bins is 0.5 / 48 (~0.0104).
Now take the observed counts of 15 for the first bin, 5 for the second bin, and zero for each of the remaining 48 bins. It's obvious that these observations are very unlikely to occur for a random sample from the above distribution. However, the Chi-squared, $G^2$, and Freeman-Tukey tests fail to reject the null hypothesis.
x <- c(15, 5, rep(0, 48)) p <- c(0.25, 0.25, rep(1/(2 * 50 -4), 48)) chisq_gof(x, p) g_gof(x, p) ft_gof(x, p)
By contrast, the root-mean-square test convincingly rejects the null hypothesis.
rms_gof(x, p)
For additional examples, see Perkins, Tygert, and Ward (2011) and Ward and Carroll (2014).
Several other packages implement GOF tests for discrete distributions.
As noted above, the stats
package in base R implements a simulated Chi-squared GOF test.
I'm not aware of an R package that implements a simulated $G^2$ GOF test but the packages RVAideMemoire
and DescTools
implement GOF tests that utilize approximations based on the Chi-squared distribution.
The dgof
package (Anderson and Emerson 2011) implements simulated Kolmogorov-Smirnov GOF tests and simulated Cramer-von Mises GOF tests . The cvmdisc
package also implements a simulated Cramer-von Mises GOF test.
The KSgeneral
package (Dimitrova, Kaishev, and Tan, 2020) implements exact Kolmogorov-Smirnov GOF tests and fast, simulated GOF tests utilizing the algorithm introduced by Wood and Altavela (1978).
I'm not aware of another R package that implements a root-mean-square GOF test.
Arnold, Taylor B. and John W. Emerson. “Nonparametric goodness-of-fit tests for discrete null distributions.” R Journal. https://doi.org/10.32614/rj-2011-016
Dimitrova, Dimitrina S., Vladimir K. Kaishev, and Senren Tan. "Computing the Kolmogorov-Smirnov distribution when the underlying CDF is purely discrete, mixed, or continuous." Journal of Statistical Software, 2020. https://doi.org/10.18637/jss.v095.i10
Eddelbuettel, Dirk and Romain Francois. "Rcpp: Seamless R and C++ Integration." Journal of Statistical Software, 2011. https://www.jstatsoft.org/article/view/v040i08
Perkins, William, Mark Tygert, and Rachel Ward. "Computing the confidence levels for a root-mean-square test of goodness-of-fit." Applied Mathematics and Computation, 2011. https://doi.org/10.1016/j.amc.2011.03.124
Ward, Rachel and Raymond J. Carroll. "Testing Hardy–Weinberg equilibrium with a simple root-mean-square statistic." Biostatistics, 2014. https://doi.org/10.1093/biostatistics/kxt028
Wood, Constance L., and Michele M. Altavela. “Large-Sample Results for Kolmogorov–Smirnov Statistics for Discrete Distributions.” Biometrika, 1978. https://doi.org/10.1093/biomet/65.1.235
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.