gof_test_sim: Perform a goodness of fit test using simulation

View source: R/generic_gof_tests.R

gof_test_sim    R Documentation


Description

Many statistical tests have null hypotheses that assume the distribution is fully specified, with its parameters known in advance. In practice the parameters are often estimated from the data. A general method for adapting the test in this case is to use Monte Carlo simulation to produce a simulated distribution of the test statistic, and to derive the p-value from that distribution. This approach is used in the LcKS function of the KScorrect package. However, LcKS supports only the KS test and a closed list of distributions, because it has bespoke code for each supported distribution to estimate parameters and to simulate values from the estimated parameters.

This function generalises the approach in LcKS, as explained in the documentation of gof_test_sim_uniparam. It is a higher-level function that configures gof_test_sim_uniparam appropriately given the name of the distribution being tested against. It is still necessary to supply an estimation function for fn_estimate_params that is suited to the distribution being tested and that allows for whether or not the data are overlapping.
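
The Monte Carlo adjustment described above can be sketched in base R for the simplest case: a KS test against a normal distribution whose mean and standard deviation are estimated from the data. This is a minimal illustration, not the package's implementation. The helper names ks_stat_norm and mc_ks_pvalue are hypothetical, and the add-one form of the p-value is an assumed convention.

```r
# Sketch only: Monte Carlo p-value for a KS test with estimated parameters.
# ks_stat_norm and mc_ks_pvalue are illustrative names, not package functions.
ks_stat_norm <- function(x) {
  # Estimate the parameters from whichever sample is passed in
  m <- mean(x)
  s <- sd(x)
  unname(ks.test(x, "pnorm", mean = m, sd = s)$statistic)
}

mc_ks_pvalue <- function(x, nreps = 999) {
  ts_obs <- ks_stat_norm(x)
  m <- mean(x)
  s <- sd(x)
  # Simulate from the fitted distribution, re-estimating on every replicate
  ts_sim <- replicate(nreps, ks_stat_norm(rnorm(length(x), m, s)))
  # Assumed convention: add-one Monte Carlo p-value
  p <- (1 + sum(ts_sim >= ts_obs)) / (nreps + 1)
  list(ts = ts_obs, p_value = p)
}

set.seed(1)
res_mc <- mc_ks_pvalue(rnorm(50), nreps = 199)
res_mc$p_value
```

The key step is that the parameters are re-estimated on every simulated sample, so the simulated test statistics carry the same estimation effect as the observed one.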

Usage

gof_test_sim(
  x,
  test_type = c("KS", "AD"),
  dist = "norm",
  noverlap = 1,
  fn_estimate_params = estimate_mean_sd_ol,
  nreps = 999,
  parallelise = FALSE,
  ncores = NULL,
  bs_ci = NULL,
  nreps_bs_ci = 10000
)

Arguments

x

The data being tested.

test_type

The type of the test. Either a character string (KS and AD are supported) or a function that implements a different test statistic with the same signature as calc_ks_test_stat or calc_ad_test_stat.

dist

The name of a distribution, such that it can be prepended by "p" to get a probability function, and by "r" to get a random simulation function. For example "norm" or "unif".

noverlap

The extent of any overlap in the data. 1 means no overlap, in which case ordinary simulation is used. If noverlap > 1, autocorrelation consistent with the degree of overlap must be induced in the simulations to give unbiased test results; gof_test_sim does this automatically. fn_estimate_params must also allow for the degree of overlap.

fn_estimate_params

A function that takes the data and the extent of the overlap in the data, and returns a single object holding estimated parameters of the distribution being fitted. The method of estimation should be unbiased. Note that for many distributions, MLE only gives asymptotically unbiased parameters. Users should validate that their estimation functions are unbiased and if necessary adjust the threshold p-value to compensate for this.

nreps

The number of repetitions of the simulation to use.

parallelise

Flag indicating whether or not to parallelise the calculations.

ncores

The number of cores to use when parallelising. NULL means one fewer than the number of cores on the machine.

bs_ci

The width of a confidence interval around the p-value, which will be calculated using a non-parametric bootstrap. NULL means no confidence interval will be produced.

nreps_bs_ci

The number of iterations used in the bootstrapped confidence interval.

Details

As far as possible this function abstracts from the lower-level implementation details of gof_test_sim_uniparam, by taking the name of a distribution function (e.g. "norm") and creating uniparameter versions of the CDF ("p" function, so pnorm in this example) and simulation ("r" functions, here rnorm). Where overlapping data is used (noverlap > 1) it will simulate from a Gaussian copula to induce the same autocorrelation structure as overlapping data (see calc_theo_sercor), convert this to autocorrelated random uniform values, and apply a uniparameter version of the inverse CDF ("q" function) for "dist".
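
The two ideas in this paragraph, building single-parameter-object wrappers from a distribution name and simulating autocorrelated values through a Gaussian copula, can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code. make_uniparam and simulate_overlapping are hypothetical helpers, "uniparameter" is taken to mean that all distribution parameters travel in one named list, and the autocorrelation is induced with a moving sum of noverlap independent normal values, which is assumed to mimic the structure that calc_theo_sercor describes.

```r
# Sketch only: these helpers are illustrative, not the package's functions.
make_uniparam <- function(prefix, dist) {
  fn <- get(paste0(prefix, dist), mode = "function")  # e.g. "q" + "norm" -> qnorm
  # `arg` is the usual first argument; `params` is a named list of parameters
  function(arg, params) do.call(fn, c(list(arg), params))
}

simulate_overlapping <- function(n, noverlap, dist, params) {
  q_uni <- make_uniparam("q", dist)
  z <- rnorm(n + noverlap - 1)
  # Moving sum of `noverlap` consecutive terms (assumed overlap structure)
  ms <- stats::filter(z, rep(1, noverlap), sides = 1)
  # Drop the leading NAs and rescale to standard normal margins
  g <- as.numeric(ms)[noverlap:length(z)] / sqrt(noverlap)
  u <- pnorm(g)        # autocorrelated uniforms from the Gaussian copula
  q_uni(u, params)     # map to the target distribution via its inverse CDF
}

set.seed(42)
x_ol <- simulate_overlapping(5000, noverlap = 4, dist = "unif",
                             params = list(min = 0, max = 1))
lag1 <- cor(x_ol[-1], x_ol[-length(x_ol)])
```

With noverlap = 1 the moving sum reduces to the independent draws themselves, so the sketch falls back to ordinary simulation through the inverse CDF.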

Notwithstanding this abstraction, it is necessary to supply an appropriate parameter estimation function for fn_estimate_params. estimate_mean_sd_ol is used by default, to match with the default value of dist = "norm".
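
Because the estimation function is supplied by the user, it may help to check it for bias by simulation before relying on the test. The sketch below is one possible approach, not something the package provides. check_bias and estimate_exp are hypothetical names, and estimate_exp simply follows the (x, noverlap) signature described for fn_estimate_params. The exponential rate is used because 1 / mean(x) is known to have expectation n / (n - 1) times the true rate, so the check should report a small positive bias.

```r
# Sketch only: check_bias and estimate_exp are illustrative names.
# Hypothetical estimator with the (x, noverlap) signature; returns a named list
estimate_exp <- function(x, noverlap = 1) list(rate = 1 / mean(x))

check_bias <- function(fn_estimate, r_fn, true_params, n, nsims = 5000) {
  est <- replicate(nsims, {
    x <- do.call(r_fn, c(list(n), true_params))  # simulate with known parameters
    unlist(fn_estimate(x))                       # estimate them back
  }, simplify = FALSE)
  est <- do.call(rbind, est)                     # nsims x n_params matrix
  colMeans(est) - unlist(true_params)            # average estimate minus truth
}

set.seed(123)
bias_exp <- check_bias(estimate_exp, rexp, list(rate = 1), n = 20)
bias_exp
```

This check uses non-overlapping data only. An estimator intended for noverlap > 1 would need to be checked against simulations with the matching autocorrelation.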

The framework here can in principle also be used where parameters are known in advance rather than estimated from the data (by making the estimation function return the pre-specified parameters), but there is very little value to this use case, as Monte Carlo is rarely necessary when the parameters are known (and is certainly not necessary for the KS and AD tests).

Optionally, the calculations can be parallelised over multiple cores. This is useful when the number of simulations is large and estimation of parameters is slow, for example using MLE to estimate parameters from a generalised hyperbolic distribution.

Since Monte Carlo simulation is used, the function can optionally estimate the simulation uncertainty arising from a finite number of simulations, using a non-parametric bootstrap (resampling with replacement) of the simulated test statistics.
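
A non-parametric bootstrap of this kind might look like the following base R sketch. It is illustrative only. bootstrap_p_ci is a hypothetical helper, bs_ci is interpreted here as a two-sided confidence level such as 0.95, and the add-one p-value formula is an assumption rather than the package's documented behaviour.

```r
# Sketch only: bootstrap_p_ci is an illustrative name, not a package function.
bootstrap_p_ci <- function(ts_obs, ts_sim, bs_ci = 0.95, nreps_bs_ci = 10000) {
  ts_sim <- ts_sim[!is.na(ts_sim)]   # drop failed simulations
  n <- length(ts_sim)
  p_boot <- replicate(nreps_bs_ci, {
    # Resample the simulated test statistics with replacement
    resampled <- sample(ts_sim, n, replace = TRUE)
    (1 + sum(resampled >= ts_obs)) / (n + 1)   # assumed p-value convention
  })
  alpha <- 1 - bs_ci
  ci <- quantile(p_boot, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  list(p_value_lower = ci[1], p_value_upper = ci[2])
}

set.seed(7)
ts_sim_demo <- rexp(999)   # stand-in for simulated test statistics
ci_demo <- bootstrap_p_ci(ts_obs = 2, ts_sim = ts_sim_demo, nreps_bs_ci = 2000)
```

The interval reflects only the Monte Carlo error from using a finite number of simulated test statistics, not sampling uncertainty in the data being tested.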

Value

A list with five components:

ts

The test statistic.

p_value

The p-value for the test statistic, derived by simulation.

count_NA

The number of NA values produced in the simulation of the test statistic. These generally indicate that the parameter estimation failed.

p_value_lower

If bs_ci is not NULL, the lower end of the confidence interval around the p-value, calculated using a non-parametric bootstrap with nreps_bs_ci repetitions. Otherwise NA.

p_value_upper

If bs_ci is not NULL, the upper end of the confidence interval around the p-value, calculated using a non-parametric bootstrap with nreps_bs_ci repetitions. Otherwise NA.

Examples

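# Default settings: test simulated data against a normal distribution
gof_test_sim(rnorm(100))
# Uniform distribution with a user-supplied estimation function
estimate_unif <- function(x, noverlap = 1) list(min = min(x), max = max(x))
gof_test_sim(runif(100), dist = "unif", fn_estimate_params = estimate_unif)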

PaulMTeggin/practechniques documentation built on Aug. 19, 2023, 4:44 p.m.