knitr::opts_chunk$set(error = TRUE, collapse = TRUE, comment = "#>")
options(tinytex.clean = FALSE)
library(MDgof)
B = 200
st = function() knitr::knit_exit()
examples = MDgof::examples.mdgof.vignette
Note that the examples in this vignette are not actually run; instead their results are stored in the included data set examples.mdgof.vignette. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.
ReRunExamples=FALSE
This package brings together a number of routines for the goodness-of-fit problem for multivariate data. There is a data set $\mathbf{x}$ and a cumulative distribution function $F$ (possibly depending on parameters that have to be estimated from $\mathbf{x}$), and we want to test $H_0:\mathbf{X}\sim F$.
The highlights of this package are:
For descriptions of the included methods see the vignette MDgof-Methods.
We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis $H_0:\mathbf{X}\sim N(\mathbf{0}, \mathbf{I_2})$:
set.seed(123)
# cumulative distribution function under the null hypothesis
pnull = function(x) {
  f = function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
# function that generates data under the null hypothesis
rnull = function() mvtnorm::rmvnorm(100, c(0, 0))
x = rnull()
examples[["ex1a"]]=gof_test(x, pnull, rnull, B=B)
examples[["ex1a"]]
$B$ is the number of simulation runs used to estimate the p-value. The default is $B=5000$, but if the data set is large a smaller value should suffice.
If the density of the distribution under the null hypothesis is also known, it can be included and two more tests can be run:
dnull = function(x) {
  f = function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
examples[["ex1b"]]=gof_test(x, pnull, rnull, dnull=dnull, B=B)
examples[["ex1b"]]
Arguments of gof_test for continuous data
In this example we will test whether the data comes from a bivariate normal distribution, with the mean vector and the covariance matrix estimated from the data.
pnull = function(x, p) {
  f = function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x, mean = p[1:2],
                     sigma = matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull = function(p) mvtnorm::rmvnorm(200, mean = p[1:2],
                     sigma = matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
dnull = function(x, p) {
  f = function(x) mvtnorm::dmvnorm(x, mean = p[1:2],
                     sigma = matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
phat = function(x) c(apply(x, 2, mean), apply(x, 2, var), cov(x[,1], x[,2]))
x = rnull(c(1, 2, 4, 9, 0.6))
phat(x)
examples[["ex2a"]]=gof_test(x, pnull, rnull, phat, dnull=dnull, B=B)
examples[["ex2a"]]
In contrast, in the next case we generate data from a multivariate t distribution with 5 degrees of freedom:
y=mvtnorm::rmvt(200, sigma=diag(2), df=5)
examples[["ex2b"]]=gof_test(y, pnull, rnull, phat, dnull=dnull, B=B)
examples[["ex2b"]]
gof_test allows the user to supply their own test method, either one that finds its own p-value or a method that requires simulation.
As an example, the routine newTS (included in the package) finds either the test statistic or the p-value of a chi-square test in two dimensions, for either continuous or discrete data.
In this example we use the same setup as in example 2. newTS requires the following object TSextra (a list):
TSextra=list(Continuous=TRUE, WithEstimation=TRUE, Withpvalue=TRUE)
examples[["ex3a"]]=gof_test(x, pnull, rnull, phat, TS=newTS, TSextra=TSextra, B=B)
examples[["ex3a"]]
examples[["ex3b"]]=gof_test(y, pnull, rnull, phat, TS=newTS, TSextra=TSextra, B=B)
examples[["ex3b"]]
Arguments and output of new test routine for continuous data
The arguments have to be:
The output of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
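As an illustration, a user-supplied routine might look like the sketch below. The argument list shown is an assumption inferred from how newTS is called in the examples above; consult the package documentation for the exact required signature.

# Hypothetical test statistic routine (assumed signature): a
# Kolmogorov-Smirnov type statistic comparing the empirical cdf with
# the hypothesized cdf at the data points.
myTS = function(x, pnull, param, TSextra) {
  # empirical cdf evaluated at each observation
  Fhat = apply(x, 1, function(pt) mean(x[, 1] <= pt[1] & x[, 2] <= pt[2]))
  F0 = pnull(x, param)  # hypothesized cdf; drop param if nothing is estimated
  out = max(abs(Fhat - F0))
  names(out) = "myKS"  # the output has to be a named vector
  out
}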
If several tests are run, one can use the routine gof_test_adjusted_pvalue to find p-values that are adjusted for simultaneous testing:
examples[["ex3c"]]=gof_test_adjusted_pvalue(x, pnull, rnull, dnull=dnull, phat=phat, B=c(B,B))
a = examples[["ex3c"]]
for(i in seq_along(a)) print(paste(names(a)[i], ": ", round(a[i], 4)))
The BR test had the smallest p-value ($0.0193$), which is adjusted to $0.1817$ to account for the fact that $10$ tests were run simultaneously.
The routine gof_power allows the user to estimate the power of the tests.
Let's say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector $(0,0)$, variances equal to 1 and covariance $\rho$. First we need a function that generates such data sets:
pnull = function(x) {
  f = function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull = function() mvtnorm::rmvnorm(100, c(0, 0))
dnull = function(x) {
  f = function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
ralt = function(s) mvtnorm::rmvnorm(100, c(0, 0), sigma = matrix(c(1, s, s, 1), 2, 2))
Now we can run
examples[["ex4a"]]=gof_power(pnull, rnull, ralt, c(0, 0.8), dnull=dnull, B=B, maxProcessor=1)
examples[["ex4a"]]
Arguments of gof_power
Again the user can provide their own routine:
TSextra=list(Continuous=TRUE, WithEstimation=FALSE, Withpvalue=TRUE)
examples[["ex4b"]]=gof_power(pnull, rnull, ralt, c(0, 0.8), TS=newTS, TSextra=TSextra, B=500, With.p.value=TRUE)
examples[["ex4b"]]
First note that tests for discrete data are implemented only in two dimensions. The data set is a matrix with three columns: each row holds one value of the x variable, one value of the y variable, and the corresponding count.
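For example, with two possible values for each variable, such a matrix might look like this (the counts are made up for illustration):

# one row per (x, y) combination, with the observed count in column 3
dta = cbind(x = c(0, 0, 1, 1),
            y = c(0, 1, 0, 1),
            counts = c(12, 8, 15, 9))
dta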
We consider the case of data from binomial distributions:
pnull = function(x) {
  f = function(x) sum(dbinom(0:x[1], 5, 0.5) * pbinom(x[2], 3, 0.5 + 0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull = function() {
  x = rbinom(1000, 5, 0.5)
  y = rbinom(1000, 3, 0.5 + x/20)
  MDgof::sq2rec(table(x, y))
}
x = rnull()
x
examples[["ex5"]]=gof_test(x, pnull, rnull, B=B)
examples[["ex5"]]
The MDgof routine sq2rec above changes the format of the data from that output by the table command to what is required by gof_test.
The arguments of gof_test for discrete data are the same as those for continuous data. The routines decide that the data is discrete if x is a matrix with three columns and the values in the third column are integers.
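The rule just described is easy to verify by hand; here is a small helper (ours, not part of the package) that encodes it:

# checks the discrete-data convention described above
is_discrete_format = function(x) {
  is.matrix(x) && ncol(x) == 3 && all(x[, 3] == round(x[, 3]))
}
is_discrete_format(x)  # TRUE for the data set generated above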
One use of the discrete data methods is when the data is actually continuous but the data set is very large, so that the continuous methods take too long to run. In that case one can discretize the data and run a discrete method. In this example we have one million observations from a bivariate random variable, and we want to test whether they come from a pair of dependent Beta distributions, with a parameter to be estimated from the data. The null hypothesis specifies the model $X\sim Beta(a, a)$, $Y|X=x\sim Beta(x, x)$, with the parameter $a$. Here is an example of data drawn from this model, using $a=2$:
rnull_cont = function(p) {
  x = rbeta(250, p, p)
  y = rbeta(250, x, x)
  cbind(x = x, y = y)
}
dta_cont = rnull_cont(2)
ggplot2::ggplot(data = as.data.frame(dta_cont), ggplot2::aes(x, y)) +
  ggplot2::geom_point()
Here we have the joint density
$$
\begin{aligned}
f(x, y) &= f_X(x)f_{Y|X=x}(y|x) \\
&= \frac{\Gamma(2a)}{\Gamma(a)^2}\left[x(1-x)\right]^{a-1} \frac{\Gamma(2x)}{\Gamma(x)^2}\left[y(1-y)\right]^{x-1}
\end{aligned}
$$
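The tests below do not need the density, but the formula translates directly into code. A minimal sketch (the name dnull_cont and its signature are ours, mirroring the dnull functions used earlier):

# density of the model X ~ Beta(a, a), Y | X = x ~ Beta(x, x);
# dbeta supplies the Gamma-function normalizing constants
dnull_cont = function(x, p) {
  f = function(x) dbeta(x[1], p, p) * dbeta(x[2], x[1], x[1])
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
dnull_cont(dta_cont[1:3, ], 2)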
First we need an estimator of $a$. We can find the maximum likelihood estimator using the Newton-Raphson algorithm:
$$
\begin{aligned}
l(a) &= n\left[\log \Gamma(2a)-2\log \Gamma(a)\right]+(a-1)\sum\log [x_i(1-x_i)] \\
\frac{dl}{da} &= 2n\left[\psi(2a)-\psi(a)+\frac{1}{2n}\sum\log [x_i(1-x_i)]\right] \\
\frac{d^2l}{da^2} &= 2n\left[2\psi'(2a)-\psi'(a)\right]
\end{aligned}
$$
where $\psi$ (the digamma function) and $\psi'$ (the trigamma function) are the first and second derivatives of the log gamma function. Here is an implementation:
phat_cont = function(x) {
  n = nrow(x)
  sx = sum(log(x[,1]*(1-x[,1])))/2/n
  num = function(a) digamma(2*a) - digamma(a) + sx
  denom = function(a) 2*trigamma(2*a) - trigamma(a)
  anew = 2
  repeat {
    aold = anew
    anew = aold - num(aold)/denom(aold)
    if(abs(aold-anew) < 0.001) break
  }
  anew
}
phat_cont(dta_cont)
For the function pnull we make use of the following general calculation:
$$
\begin{aligned}
F(x,y) &= P(X\le x, Y\le y) \\
&= \int_{-\infty}^x \int_{-\infty}^y f(u,v)\, dv\, du \\
&= \int_{-\infty}^x \int_{-\infty}^y f_X(u)f_{Y|X=u}(v|u)\, dv\, du \\
&= \int_{-\infty}^x f_X(u) \left\{\int_{-\infty}^y f_{Y|X=u}(v|u)\, dv\right\} du \\
&= \int_{-\infty}^x f_X(u)\, F_{Y|X=u}(y|u)\, du
\end{aligned}
$$
which is implemented in
pnull = function(x, p) {
  f = function(x) {
    g = function(u) dbeta(u, p, p)*pbeta(x[2], u, u)
    integrate(g, 0, x[1])$value
  }
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
and with this we can run the test
examples[["ex6a"]]=gof_test(dta_cont, pnull, rnull_cont, phat=phat_cont, Ranges=matrix(c(0, 1,0, 1), 2, 2), maxProcessor = 1, B=B)
examples[["ex6a"]]
Now, though, let's assume that the data set has one million observations. In that case the routine above will be much too slow. Instead we can discretize the problem. In order to make this work we will have to discretize all the routines, except for pnull, as the distribution function is the same for discrete and continuous random variables.
Let's say we decide to discretize the data into a 50×30 grid of equal-size bins. As in example 5 we can generate the data using the binomial distribution. However, we also have to find the bin probabilities, essentially the density of the discretized random vector. This is done in
rnull_disc = function(p) {
  pnull = function(x, p) {
    f = function(x) {
      g = function(u) dbeta(u, p, p)*pbeta(x[2], u, u)
      integrate(g, 0, x[1])$value
    }
    if(!is.matrix(x)) return(f(x))
    apply(x, 1, f)
  }
  nb = c(50, 30)
  xvals = 1:nb[1]/nb[1]
  yvals = 1:nb[2]/nb[2]
  z = cbind(rep(xvals, nb[2]), rep(yvals, each = nb[1]), 0)
  prob = c(t(MDgof::p2dC(z, pnull, p)))
  z[, 3] = rbinom(prod(nb), 1e6, prob)
  z
}
dta_disc = rnull_disc(2)
dta_disc[1:10, ]
The MDgof routine p2dC finds the probabilities of the rectangles defined by the grid from the distribution function.
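The underlying calculation is the standard inclusion-exclusion identity for a bivariate cdf. As a minimal sketch for a single rectangle (rect_prob is a hypothetical helper, shown only to make the identity concrete):

# P((x1, x2] x (y1, y2]) = F(x2,y2) - F(x1,y2) - F(x2,y1) + F(x1,y1)
rect_prob = function(pnull, x1, x2, y1, y2, p) {
  pnull(rbind(c(x2, y2)), p) - pnull(rbind(c(x1, y2)), p) -
    pnull(rbind(c(x2, y1)), p) + pnull(rbind(c(x1, y1)), p)
}
rect_prob(pnull, 0.4, 0.5, 0.4, 0.5, 2)  # probability of one grid cell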
Next we need to rewrite the routine that finds the estimator, now based on the discrete data:
phat_disc = function(x) {
  nb = c(50, 30)
  n = sum(x[,3])  # sample size
  valsx = unique(x[,1]) - 1/nb[1]/2  # x values, centers of the bins
  x = tapply(x[,3], x[,1], sum)  # counts for x alone
  sx = sum(x*log(valsx*(1-valsx)))/2/n
  num = function(a) digamma(2*a) - digamma(a) + sx
  denom = function(a) 2*trigamma(2*a) - trigamma(a)
  anew = 2
  repeat {
    aold = anew
    anew = aold - num(aold)/denom(aold)
    if(abs(aold-anew) < 0.00001) break
  }
  anew
}
p = phat_disc(dta_disc)
p
We want to apply this test to the original continuous data set, so we need to discretize it as well. This can be done with the MDgof routine discretize. Then we can run the test
set.seed(111)
x = rbeta(1e6, 2, 2)
y = rbeta(1e6, x, x)
dta_cont = cbind(x = x, y = y)
dta_disc = MDgof::discretize(dta_cont, Range = matrix(c(0, 1, 0, 1), 2, 2),
                             nbins = c(50, 30))
head(dta_disc)
examples[["ex6b"]]=gof_test(dta_disc, pnull, rnull_disc, phat=phat_disc, B=B)
examples[["ex6b"]]
The routine newTS (included in the package) does a chi-square test for discrete data:
TSextra=list(Continuous=FALSE, WithEstimation=TRUE, Withpvalue=FALSE)
examples[["ex6c"]]=gof_test(dta_disc, pnull, rnull_disc, phat_disc, TS=newTS, TSextra=TSextra, B=B)
examples[["ex6c"]]
Power estimation for discrete data is done the same way as for continuous data. We consider again the case of data from binomial distributions as in example 5. As the alternative we change the success probability of the X variable:
pnull = function(x) {
  f = function(x) sum(dbinom(0:x[1], 5, 0.5) * pbinom(x[2], 3, 0.5 + 0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull = function() {
  x = rbinom(1000, 5, 0.5)
  y = rbinom(1000, 3, 0.5 + x/20)
  MDgof::sq2rec(table(x, y))
}
ralt = function(p) {
  x = rbinom(1000, 5, p)
  y = rbinom(1000, 3, 0.5 + x/20)
  MDgof::sq2rec(table(x, y))
}
x = ralt(0.475)
examples[["ex7a"]]=gof_test(x, pnull, rnull, B=B)
examples[["ex7a"]]
examples[["ex7b"]]= gof_power(pnull, rnull, ralt, c(0.5, 0.475),B=200)
examples[["ex7b"]]
or using a user-supplied test:
TSextra=list(Continuous=FALSE, WithEstimation=FALSE, Withpvalue=TRUE)
examples[["ex7c"]]=gof_test(x, pnull, rnull, TS=newTS, TSextra=TSextra, B=B)
examples[["ex7c"]]
examples[["ex7d"]]=gof_power(pnull, rnull, ralt, c(0.5, 0.475), TS=newTS, TSextra=TSextra, With.p.value = TRUE, B=B)
examples[["ex7d"]]
The routine run.studies allows the user to quickly compare the power of a new test with the powers of the included tests for 30 different combinations of null hypothesis vs. alternative, for continuous or discrete data and with or without parameter estimation. It also allows the user to find the power in those case studies for different sample sizes and type I error probabilities.
Let's say that we want to determine whether method A or method B has better power for a specific combination of null hypothesis and alternative. If we find the powers of the tests for (say) $n=100, \alpha=0.05$ and method A turns out to be better, A will also be better for any other combination of $n$ and $\alpha$. For this reason checking just one combination suffices to determine the ranking of the methods (for this one case).
Say we want to compare the power of newTS for continuous data with parameter estimation with the powers of the included tests:
TSextra=list(Continuous=TRUE, WithEstimation=TRUE, Withpvalue=TRUE)
a = run.studies(study = 1:2,  # do first two
                Continuous = TRUE, WithEstimation = TRUE,
                TS = newTS, TSextra = TSextra,
                With.p.value = TRUE, B = B)
Arguments of run.studies
MDgof includes five data sets with the results for the included tests, using a sample size of 250, a true type I error probability of 0.05 and two values of the parameter under the alternative: one which makes the null hypothesis true and one which makes it false.
If the user wants different settings, they can run:
examples[["ex8"]]= run.studies(1:2, Continuous=TRUE, WithEstimation = FALSE, param_alt=cbind(c(0, 0.4), c(0, 0.4)), alpha=0.1, nsample=500, B=B)
examples[["ex8"]][["Null"]][1:2, ] examples[["ex8"]][["Pow"]][1:2, ]