View source: R/generic_gof_tests.R
gof_test_sim (R Documentation)
Many statistical tests have null hypotheses that assume the distribution is fully specified, with its parameters known in advance. In practice it is common to estimate the parameters from the data, and in that case a general way to adapt the test is to use Monte Carlo simulation to produce a simulated distribution of the test statistic and derive the p-value from that distribution. This approach is used in the LcKS function of the KScorrect package. However, the implementation in LcKS supports only the KS test and a closed list of distributions, because it contains bespoke code for each supported distribution to estimate the parameters and to simulate values from them.
This function generalises the approach in LcKS, as explained in the documentation of gof_test_sim_uniparam. It is a higher-level function that configures gof_test_sim_uniparam appropriately given the name of the distribution being tested against. It is still necessary to provide, via fn_estimate_params, an estimation function adapted to the distribution being tested, including whether or not the data are overlapping.
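As a rough illustration of this general approach (an assumed sketch, not the code used by gof_test_sim or LcKS), a parametric bootstrap of the KS statistic against a fitted normal distribution might look like this:

# Illustrative sketch: parameters are re-estimated on every simulated
# sample, so the simulated statistics reflect the estimation step.
mc_ks_norm <- function(x, nreps = 999) {
  stat_fn <- function(y) ks.test(y, "pnorm", mean = mean(y), sd = sd(y))$statistic
  obs <- stat_fn(x)
  sim <- replicate(nreps, stat_fn(rnorm(length(x), mean(x), sd(x))))
  (1 + sum(sim >= obs)) / (nreps + 1)   # simulated p-value
}
mc_ks_norm(rnorm(100))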
gof_test_sim(
x,
test_type = c("KS", "AD"),
dist = "norm",
noverlap = 1,
fn_estimate_params = estimate_mean_sd_ol,
nreps = 999,
parallelise = FALSE,
ncores = NULL,
bs_ci = NULL,
nreps_bs_ci = 10000
)
x: The data being tested.

test_type: The type of the test. Either a character string ("KS" and "AD" are supported) or a function implementing a different test statistic with the same signature as the built-in test statistic functions.

dist: The name of a distribution, such that it can be prefixed by "p", "q" and "r" to give the names of the CDF, inverse CDF and simulation functions (e.g. "norm" for pnorm, qnorm and rnorm).

noverlap: The extent of any overlap in the data.

fn_estimate_params: A function that takes the data and the extent of the overlap in the data, and returns a single object holding the estimated parameters of the distribution being fitted. The method of estimation should be unbiased. Note that for many distributions MLE gives only asymptotically unbiased parameters; users should check that their estimation functions are unbiased and, if necessary, adjust the threshold p-value to compensate. (A sketch of such a function appears after this list.)

nreps: The number of repetitions of the simulation to use.

parallelise: Flag indicating whether or not to parallelise the calculations.

ncores: The number of cores to use when parallelising.

bs_ci: The width of a confidence interval around the p-value, which will be calculated using a non-parametric bootstrap.

nreps_bs_ci: The number of iterations used in the bootstrapped confidence interval.
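As an illustration of the requirements on fn_estimate_params (a hedged sketch, not part of the package), an estimation function for dist = "exp" with non-overlapping data might be:

# The MLE of the exponential rate, 1 / mean(x), is only asymptotically
# unbiased; (n - 1) / sum(x) is exactly unbiased for the rate.
estimate_exp_rate <- function(x, noverlap = 1) list(rate = (length(x) - 1) / sum(x))
gof_test_sim(rexp(100, rate = 2), dist = "exp", fn_estimate_params = estimate_exp_rate)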
As far as possible this function abstracts from the lower-level implementation details of gof_test_sim_uniparam, by taking the name of a distribution (e.g. "norm") and creating uniparameter versions of the CDF (the "p" function, pnorm in this example) and of the simulation function (the "r" function, here rnorm).
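The wrapping step might be sketched as follows (an assumed illustration of the idea, not the package's internals; make_uniparam is a hypothetical helper):

# Build one-argument "p" and "r" functions from a distribution name and
# a named list of estimated parameters.
make_uniparam <- function(dist, params) {
  pfun <- match.fun(paste0("p", dist))
  rfun <- match.fun(paste0("r", dist))
  list(cdf = function(q) do.call(pfun, c(list(q), params)),
       sim = function(n) do.call(rfun, c(list(n), params)))
}
fns <- make_uniparam("norm", list(mean = 0, sd = 1))
fns$cdf(1.96)   # approximately 0.975
fns$sim(5)      # five simulated values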
Where overlapping data are used (noverlap > 1), it will simulate from a Gaussian copula to induce the same autocorrelation structure as overlapping data (see calc_theo_sercor), convert this to autocorrelated random uniform values, and apply a uniparameter version of the inverse CDF (the "q" function) for dist.
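The copula step might look roughly like the following (an assumed sketch only: the package derives the correlation structure from calc_theo_sercor, whereas here an AR(1)-style matrix and an exponential marginal are used purely for illustration):

library(MASS)                             # for mvrnorm
n <- 200
rho <- 0.5                                # illustrative correlation, not calc_theo_sercor
sigma <- rho ^ abs(outer(1:n, 1:n, "-"))  # AR(1)-style correlation matrix
z <- MASS::mvrnorm(1, mu = rep(0, n), Sigma = sigma)  # Gaussian copula draw
u <- pnorm(z)                             # autocorrelated uniforms
sim <- qexp(u, rate = 2)                  # inverse CDF ("q" function) of the fitted distribution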
Notwithstanding this abstraction, it is necessary to supply an appropriate parameter estimation function for fn_estimate_params. estimate_mean_sd_ol is used by default, to match the default value of dist = "norm".
The framework can in principle also be used where the parameters are known in advance rather than estimated from the data, by making the estimation function return the pre-specified parameters. There is little value in this use case, however, as Monte Carlo simulation is rarely necessary when the parameters are known (and is certainly not necessary for the KS and AD tests).
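For completeness, a hypothetical illustration of that use case, assuming (as with the default estimate_mean_sd_ol) that the returned list is named to match the arguments of pnorm and rnorm:

# Test against a fully specified N(0, 1) by making the "estimation"
# function ignore the data and return fixed parameters.
fixed_norm_params <- function(x, noverlap = 1) list(mean = 0, sd = 1)
gof_test_sim(rnorm(100), fn_estimate_params = fixed_norm_params)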
Optionally, the calculations can be parallelised over multiple cores. This is useful when the number of simulations is large and estimation of the parameters is slow, for example when using MLE to estimate the parameters of a generalised hyperbolic distribution.
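For example (a usage sketch based on the arguments above; the core count is chosen arbitrarily):

gof_test_sim(rnorm(500), nreps = 9999, parallelise = TRUE, ncores = 2)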
Since Monte Carlo simulation is used, the function can optionally estimate the simulation uncertainty arising from the finite number of simulations, using a non-parametric (resampling with replacement) approach applied to the distribution of simulated test statistics produced.
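A minimal sketch of that bootstrap idea (not the package's code; obs and sim are assumed to be the observed statistic and the vector of simulated statistics):

boot_ci_pvalue <- function(obs, sim, width = 0.95, nboot = 10000) {
  # Resample the simulated statistics with replacement, recompute the
  # p-value each time, and take quantiles of the resulting p-values.
  pvals <- replicate(nboot, {
    s <- sample(sim, replace = TRUE)
    (1 + sum(s >= obs)) / (length(s) + 1)
  })
  quantile(pvals, c((1 - width) / 2, 1 - (1 - width) / 2))
}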
A list with five components:

- The test statistic.
- The p-value for the test statistic, derived by simulation.
- The number of NA values produced in the simulation of the test statistic. These generally indicate that the parameter estimation failed.
- If bs_ci is not NULL, the lower end of the confidence interval around the p-value, calculated using a non-parametric bootstrap with nreps_bs_ci repetitions; otherwise NA.
- If bs_ci is not NULL, the upper end of the confidence interval around the p-value, calculated using a non-parametric bootstrap with nreps_bs_ci repetitions; otherwise NA.
gof_test_sim(rnorm(100))  # defaults: KS test against a normal fit with estimated mean and sd
# A custom estimation function for the uniform distribution:
estimate_unif <- function(x, noverlap = 1) list(min = min(x), max = max(x))
gof_test_sim(runif(100), dist = "unif", fn_estimate_params = estimate_unif)
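Further usage sketches based on the arguments documented above (values are illustrative):

gof_test_sim(rnorm(100), test_type = "AD")                   # Anderson-Darling variant
gof_test_sim(rnorm(100), bs_ci = 0.95, nreps_bs_ci = 10000)  # bootstrap CI around the p-value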