knitr::opts_chunk$set(error=TRUE, collapse = TRUE, comment = "#>" )
library(MDgof) B=200 st=function() knitr::knit_exit() examples=MDgof::hybrid.mdgof.vignette
Note the examples are not actually run but their results are stored in the included data set hybrid.mdgof.vignette. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.
ReRunExamples=FALSE
In a standard goodness-of-fit problem we have a data set $x$ and a probability model, given in form of the cumulative distribution function $F$, and want to test $H_0: X\sim F$, that is the data set was generated by $F$.
There are however circumstances where evaluation of the function $F(x)$ is difficult or even impossible, especially for high-dimensional data. It is however possible to generate data from $F$. Then a test can be run as follows:
In what follows we will call this a hybrid test.
One additional "feature" of this approach is that often we are free to choose the sample size of $y$. Generally on would expect the power of the tests to increase.
The companion R package MD2sample has a large number of methods already implemented and is used in routines that follow.
We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis $H_0:X\sim N(\mathbf{0}, \mathbf{I_2})$:
set.seed(123)
rnull=function() mvtnorm::rmvnorm(100, c(0, 0)) x=rnull()
examples[["ex1a"]]=hybrid_test(x, rnull, B=B)
examples[["ex1a"]]
$B$ is the number of simulation runs to estimated the p-value, $B=5000$ is the default but if the data set is large a smaller value should suffice.
Arguments of hybrid_test for continuous data
In this example we will test whether the data comes from bivariate normal distribution, with means and variance-covariance estimated from the data.
rnull=function(p) mvtnorm::rmvnorm(200, mean=p[1:2], sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2)) phat=function(x) c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1],x[,2]))) x=rnull(c(1, 2, 4, 9,0.6)) phat(x)
examples[["ex2a"]]=hybrid_test(x, rnull, phat, B=B)
examples[["ex2a"]]
In contrast in the next case we will generate data from a multivariate t distribution with 5 degrees of freedom. We also use a larger sample size.
y=mvtnorm::rmvt(300, sigma=diag(2), df=5)
examples[["ex2b"]]=hybrid_test(y, rnull, phat, B=B)
examples[["ex2b"]]
hybrid_test allows the user to supply their own (two-sample) test method, either one that finds its own p-value or a method that requires simulation.
The routine chiTS.cont (included in the MD2sample package) finds either the test statistic or the p value of a chi square test in two dimensions. First we need to set
TSextra=list(which="statistic", nbins=cbind(c(3,3), c(3,4)))
this list is passed to chiTS.cont and tells the routine to
In this example we use the same setup as in example 2.
examples[["ex3a"]]=hybrid_test(x, rnull, phat, TS=MD2sample::chiTS.cont, TSextra=TSextra, B=B)
examples[["ex3a"]]
examples[["ex3b"]]=hybrid_test(y, rnull, phat, TS=MD2sample::chiTS.cont, TSextra=TSextra, B=B)
examples[["ex3b"]]
Arguments and output of new test routine for continuous data
see the vignette MD2sample.
The routine hybrid_power allows the user to estimate the power of the tests via two-sample methods.
Let's say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector $(0,0)$, variances equal to 1 and a covariance $\rho$. We first need a function that generates these data sets:
rnull=function() mvtnorm::rmvnorm(100, c(0, 0)) ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0), sigma=matrix(c(1, s, s, 1), 2, 2))
Now we can run
examples[["ex4a"]]=hybrid_power(rnull, ralt, c(0, 0.8), B=B, maxProcessor=1)
examples[["ex4a"]]
Arguments of hybrid_power
Again the user can provide their own routine:
TSextra=list(which="pvalue", nbins=c(3,4))
examples[["ex4b"]]=hybrid_power(rnull, ralt, c(0, 0.8), TS=MD2sample::chiTS.cont, TSextra=TSextra, B=500, With.p.value=TRUE)
examples[["ex4b"]]
We consider the case of data from binomial distributions:
rnull=function() { x=rbinom(1000, 5, 0.5) y=rbinom(1000, 3, 0.5+x/20) MDgof::sq2rec(table(x, y)) } x=rnull()
examples[["ex5"]]=hybrid_test(x, rnull, B=B)
examples[["ex5"]]
The MDgof routine sq2rec above changes the format of the data from that output by the table command to what is required by hybrid_test.
The arguments of hybrid_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if x is a matrix with three columns and the the values in the third column are integers.
The routine chiTS.disc (included in the MD2sample package) does a chi-square test for discrete data:
TSextra=list(which="statistic")
examples[["ex6c"]]=hybrid_test(x, rnull, TS=MD2sample::chiTS.disc, TSextra=TSextra, B=B)
examples[["ex6c"]]
Power estimation for discrete data is done the same way as for continuous data:
We consider again the case of data from binomial distributions as in example 5. As an alternative we change the distribution of Y.
rnull=function() { x=rbinom(1000, 5, 0.5) y=rbinom(1000, 3, 0.5+x/20) MDgof::sq2rec(table(x, y)) } ralt=function(p) { x=rbinom(1000, 5, p) y=rbinom(1000, 3, 0.5+x/20) MDgof::sq2rec(table(x, y)) } x=ralt(0.475)
examples[["ex7b"]]= hybrid_power(rnull, ralt, c(0.5, 0.475), B=200)
examples[["ex7b"]]
or using a user-supplied test that find a p value:
TSextra=list(which="pvalue")
examples[["ex7d"]]=hybrid_power(rnull, ralt, c(0.5, 0.475), TS=MD2sample::chiTS.disc, TSextra=TSextra, With.p.value = TRUE, B=B)
examples[["ex7d"]]
saveRDS(examples, "hybrid.mdgof.vignette.rds")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.