gof_test_sim_uniparam: Perform a goodness of fit test using simulation with uniparameter plug-in functions. In PaulMTeggin/practechniques: Functions for the Practical Actuarial Techniques online book

 gof_test_sim_uniparam R Documentation

Perform a goodness of fit test using simulation with uniparameter plug-in functions

Description

Many statistical tests have null hypotheses that assume a distribution is fully specified (with its parameters known in advance). It is common to estimate parameters from data, and in this case a general method for adapting the statistical test is to use Monte Carlo to produce a simulated distribution of the test statistic, and derive the p-value from this distribution. This approach is used in the `LcKS` function of the `KScorrect` package. However, the implementation in `LcKS` only supports the KS test and a closed list of distributions, because it has bespoke code for each supported distribution for estimating parameters and simulating values using the estimated parameters. This function generalises the approach in `LcKS` by adopting the underlying `LcKS` algorithm and allowing general estimation, test statistic and simulation functions to be plugged into that algorithm.

Usage

```r
gof_test_sim_uniparam(
  x,
  fn_estimate_params,
  fn_calc_test_stat,
  fn_simulate,
  noverlap = 1,
  nreps = 999,
  parallelise = FALSE,
  ncores = NULL,
  bs_ci = NULL,
  nreps_bs_ci = 10000
)
```

Arguments

`x`

The data being tested.

`fn_estimate_params`

A function that takes the data and the extent of the overlap in the data, and returns a single object holding the estimated parameters of the distribution being fitted. The method of estimation should be unbiased. Note that for many distributions, MLE only gives asymptotically unbiased parameters. Users should validate that their estimation functions are unbiased and, if necessary, adjust the threshold p-value to compensate.

`fn_calc_test_stat`

A function that takes the data and the estimated parameters object, and calculates the test statistic for the distribution being tested.

`fn_simulate`

A function that takes the number of values to simulate, the estimated parameters object, and the extent of any overlap in the data, and returns that number of simulated values from the distribution being tested against.

`noverlap`

The extent of any overlap in the data. `1` means no overlap, and `fn_simulate` should operate by ordinary simulation. If `noverlap > 1` then autocorrelation consistent with the degree of overlap must be induced in the simulations, to give unbiased test results. This is done automatically when this function is called via `gof_test_sim`. `fn_estimate_params` must also allow for the degree of overlap.

`nreps`

The number of repetitions of the simulation to use.

`parallelise`

Flag indicating whether or not to parallelise the calculations.

`ncores`

The number of cores to use when parallelising. `NULL` means one fewer than the number of cores on the machine.

`bs_ci`

The width of a confidence interval around the p-value, which will be calculated using a non-parametric bootstrap. `NULL` means no confidence interval will be produced.

`nreps_bs_ci`

The number of iterations used in the bootstrapped confidence interval.

Details

This function uses the same general approach as `LcKS`, which is to:

• Estimate parameters from the input data `x`

• Calculate a test statistic for `x` against the specified distribution function with these parameters

• Use Monte Carlo simulation to produce a simulated distribution of potential alternative values for the test statistic.

• Derive a p-value by comparing the test statistic of `x` against the simulated distribution. The p-value is calculated as the proportion of Monte Carlo samples with test statistics at least as extreme as the test statistic of `x`. A value of 1 is added to both the numerator and denominator for the same reasons as `KScorrect`, which among other reasons has the benefit of avoiding estimated p-values that are precisely zero.
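The steps above can be sketched in a few lines of R. This is a simplified illustration of the general algorithm, not the package's implementation; `fn_est`, `fn_stat` and `fn_sim` stand in for the plug-in functions described below:

```r
# Simplified sketch of the simulated p-value algorithm.
# fn_est, fn_stat and fn_sim stand in for the plug-in functions.
sim_p_value <- function(x, fn_est, fn_stat, fn_sim, nreps = 999) {
  est <- fn_est(x)                 # estimate parameters from the data
  ts  <- fn_stat(x, est)           # test statistic for the observed data
  sim_ts <- replicate(nreps, {
    xs <- fn_sim(length(x), est)   # simulate a fresh sample
    fn_stat(xs, fn_est(xs))        # re-estimate and recompute the statistic
  })
  # +1 in numerator and denominator avoids p-values of exactly zero
  p <- (sum(sim_ts >= ts, na.rm = TRUE) + 1) / (sum(!is.na(sim_ts)) + 1)
  list(ts = ts, p_value = p)
}
```

Note the re-estimation of parameters on each simulated sample: this is what corrects the test for the fact that the null distribution was fitted to the data.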

However, this function is more generic:

• General distributions are supported, rather than the closed list used by `KScorrect`.

• Multiple statistical tests are supported, not just KS.

• Testing can be performed against distributions fitted to overlapping data, not just IID data, using the idea of a Gaussian copula to induce autocorrelation consistent with overlapping data suggested in section 4.2 of the 2019 paper by the Extreme Events Working Party of the UK Institute and Faculty of Actuaries.

The genericity is achieved by requiring all statistical functions involved to be 'uniparameter', i.e. to have all their parameters put into a single object. This entails wrapping (say) `pnorm` so the wrapper function takes a list containing the `mean` and `sd` parameters, and passes them on.
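For example, a uniparameter wrapper around `pnorm` might look like this (a minimal sketch; the wrapper name `fn_p_norm` is illustrative, not part of the package):

```r
# Uniparameter wrapper around pnorm: all distribution parameters
# arrive in a single list rather than as separate arguments.
fn_p_norm <- function(x, params) pnorm(x, mean = params$mean, sd = params$sd)

params <- list(mean = 0, sd = 1)
fn_p_norm(0, params)  # identical to pnorm(0, mean = 0, sd = 1), i.e. 0.5
```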

By making all functions take their parameters as single objects, the algorithm used in the `KScorrect` package can be abstracted from the functions for estimating parameters (`fn_estimate_params`), calculating test statistics (`fn_calc_test_stat`), and simulating values using those estimated parameters (`fn_simulate`). These functions are 'plugged in' to the algorithm and called at the appropriate points. They must be mutually compatible with each other.

For simplicity and to ensure compatibility, the function `gof_test_sim` sets up the plug-in functions automatically, based on the un-prefixed name of the distribution (e.g. `"norm"`). This has a slight performance hit as it uses `do.call`, but this can be avoided if performance is key, by hand-writing the wrapper function.
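A `do.call`-based wrapper builder might be sketched as follows (`make_fn_p` is a hypothetical helper for illustration, not the mechanism `gof_test_sim` actually uses):

```r
# Hypothetical sketch: build a uniparameter CDF wrapper from a
# distribution's un-prefixed name using do.call. The dynamic
# dispatch is slightly slower than a hand-written wrapper.
make_fn_p <- function(dist_name) {
  pfun <- get(paste0("p", dist_name))  # e.g. "norm" -> pnorm
  function(x, params) do.call(pfun, c(list(x), params))
}

fn_p <- make_fn_p("norm")
fn_p(0, list(mean = 0, sd = 1))  # same as pnorm(0, mean = 0, sd = 1)
```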

Similarly, adapting to overlapping data requires the simulation to be done in a way that induces the autocorrelation consistent with overlapping data. This function can perform testing on overlapping data by suitable choice of the plug-in function `fn_simulate`. In this case the estimation function `fn_estimate_params` should also allow for bias in parameter estimation induced by the overlap. There is no need to adapt the test statistic function `fn_calc_test_stat` to overlapping data.
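The copula idea can be sketched as follows. This is a hedged illustration of the general approach only, with `fn_simulate_overlap` as a hypothetical plug-in: overlapping sums of iid normals produce the moving-average autocorrelation of overlapping data, `pnorm` maps them to correlated uniforms (a Gaussian copula), and the fitted distribution's quantile function maps those uniforms to simulated values (here a normal, so `qnorm`):

```r
# Hedged sketch of overlap-consistent simulation via a Gaussian copula.
fn_simulate_overlap <- function(N, est_params, noverlap = 1) {
  z <- rnorm(N + noverlap - 1)
  # Overlapping sums of noverlap iid normals, standardised to N(0,1)
  # marginals; adjacent sums share noverlap - 1 terms, inducing the
  # autocorrelation structure of overlapping data.
  zz <- vapply(seq_len(N),
               function(i) sum(z[i:(i + noverlap - 1)]),
               numeric(1)) / sqrt(noverlap)
  u <- pnorm(zz)                            # correlated uniforms (copula)
  qnorm(u, est_params$mean, est_params$sd)  # fitted-distribution quantiles
}
```

With `noverlap = 1` this reduces to ordinary iid simulation; with `noverlap = k` the lag-1 autocorrelation is approximately `(k - 1) / k`.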

For some distributions the estimation of parameters may occasionally fail within the simulation. In this case the test statistic is set to `NA` and disregarded when calculating p-values. Warnings produced in parameter estimation are suppressed as (e.g. when using `MASS::fitdistr`) these often arise from estimating the uncertainty around the estimated parameters, which is not used here.
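The handling of failed estimations can be illustrated with a small sketch (the helper `safe_estimate` and the numbers are illustrative only):

```r
# Sketch: guard parameter estimation so failures yield NULL rather
# than aborting the whole simulation, and suppress warnings from
# the estimator (e.g. MASS::fitdistr's uncertainty calculations).
safe_estimate <- function(x, fn_estimate_params) {
  tryCatch(suppressWarnings(fn_estimate_params(x)),
           error = function(e) NULL)
}

# Failed estimations give NA test statistics, which are dropped:
sim_ts <- c(0.08, NA, 0.12, 0.05, NA, 0.20)  # 2 failures out of 6
ts <- 0.10
count_NA <- sum(is.na(sim_ts))
p <- (sum(sim_ts >= ts, na.rm = TRUE) + 1) / (sum(!is.na(sim_ts)) + 1)
```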

The framework here can in principle also be used where parameters are known in advance rather than estimated from the data (by making the estimation function return the pre-specified parameters), but there is limited value to this use case, as Monte Carlo is rarely necessary when the parameters are known (and is certainly not necessary for the KS and AD tests). It can be a useful approach for hybrid cases such as the 3-parameter Student's t distribution where the number of degrees of freedom is pre-specified but the location and scale parameters are not.

Optionally, the calculations can be parallelised over multiple cores using the `doParallel` package. This is useful when the number of simulations is large and estimation of parameters is slow, for example using MLE to estimate parameters from a generalised hyperbolic distribution.

Since Monte Carlo simulation is used, the function can optionally estimate the simulation uncertainty arising from a finite number of simulations, using a non-parametric (resampling with replacement) approach applied to the distribution of simulated test statistics produced.
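A bootstrap of this kind might be sketched as follows (`bootstrap_p_ci` is a hypothetical helper illustrating the idea, not the package's internal function):

```r
# Sketch: non-parametric bootstrap CI around the simulated p-value,
# obtained by resampling the simulated test statistics with replacement
# and recomputing the p-value on each resample.
bootstrap_p_ci <- function(ts, sim_ts, bs_ci = 0.95, nreps_bs_ci = 10000) {
  sim_ts <- sim_ts[!is.na(sim_ts)]  # drop failed estimations
  p_bs <- replicate(nreps_bs_ci, {
    bs <- sample(sim_ts, replace = TRUE)
    (sum(bs >= ts) + 1) / (length(bs) + 1)
  })
  alpha <- (1 - bs_ci) / 2
  quantile(p_bs, c(alpha, 1 - alpha), names = FALSE)
}
```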

Value

A list with five components:

ts

The test statistic.

p_value

The p-value for the test statistic, derived by simulation.

count_NA

The number of `NA` values produced in the simulation of the test statistic. These generally indicate that the parameter estimation failed. These values are disregarded when calculating p-values.

p_value_lower

If `bs_ci` is not `NULL`, the lower end of the confidence interval around the p-value, calculated using a non-parametric bootstrap with `nreps_bs_ci` repetitions. Otherwise `NA`.

p_value_upper

If `bs_ci` is not `NULL`, the upper end of the confidence interval around the p-value, calculated using a non-parametric bootstrap with `nreps_bs_ci` repetitions. Otherwise `NA`.

Examples

```r
fn_estimate_params <- function(x, noverlap = 1) list(mean = mean(x), sd = sd(x))
fn_p <- function(x, params) pnorm(x, params$mean, params$sd)
fn_test_statistic <- function(x, est_params) calc_ks_test_stat(x, est_params, fn_p)
fn_simulate <- function(N, est_params, noverlap = 1) rnorm(N, est_params$mean, est_params$sd)
gof_test_sim_uniparam(rnorm(100), fn_estimate_params, fn_test_statistic, fn_simulate)
```

PaulMTeggin/practechniques documentation built on Aug. 19, 2023, 4:44 p.m.