\def\eqdef{\mathrel{\raise0.4pt\hbox{$:$}\hskip-2pt=}}
knitr::opts_chunk$set(echo = TRUE, out.width = "75%", fig.dim = c(6,5), fig.align = "center") is_check <- ("CheckExEnv" %in% search()) || any(c("_R_CHECK_TIMINGS_", "_R_CHECK_LICENSE_") %in% names(Sys.getenv())) knitr::opts_chunk$set(eval = !is_check) library(distfreereg)
The procedure implemented by distfreereg()
relies on asymptotic behavior for
accurate results. In particular, when the assumptions of the procedure are met,
the asymptotic distribution of the observed statistic is equal to the
distribution of the simulated statistics. How large the sample size must be for
this asymptotic behavior to be sufficiently achieved is dependent on the details
of each application. To facilitate the exploration of sample size requirements,
distfreereg
provides the function compare()
.
In the following example, we explore the required sample size when the true mean is $$f(X;\theta) = \theta_1 + \theta_2X_1 + \theta_3X_2,$$ $\theta=(2,5,-1)$, and the errors are independent standard normal errors. The covariate matrix is generated randomly. We start with a sample size of 10.
set.seed(20240913) n <- 10 func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2] theta <- c(2,5,-1) X <- matrix(rexp(2*n, rate = 1), nrow = n) comp_dfr <- compare(theta = theta, true_mean = func, test_mean = func, true_X = X, true_covariance = list(Sigma = 3), X = X, covariance = list(Sigma = 3), theta_init = rep(1, length(theta)))
This function generates a response vector Y
from true_mean
using the given
values of true_X
, true_covariance
(by default, the errors are multivariate
normal), and theta
. It then passes Y
and most of the remaining arguments to
distfreereg()
. The observed statistics and p-values corresponding to each
simulated data set are saved, and the process is repeated. The results are
returned in an object of class compare
.
These results can be explored graphically. Several options are available. The
default behavior of the compare
method for plot()
produces a comparison of
estimated cumulative distribution functions (CDFs) for the observed and
simulated statistics. If the sample size is sufficiently large, these two curves
should be nearly identical.
plot(comp_dfr)
This plot gives some cause for concern.
Another way of exploring this uses quantile--quantile plots. Below is a Q--Q plot that compares observed and simulated statistics. If the sample size is sufficiently large, these points should lie on the line $y=x$.
plot(comp_dfr, which = "qq")
The curvature indicates an insufficient sample size. Similar to this is the next plot, which is a Q--Q plot comparing p-values to uniform quantiles. Once again, if the sample size is sufficiently large, these points should lie on the line $y=x$.
plot(comp_dfr, which = "qqp")
These plots all indicate that this sample size is insufficient for the asymptotic behavior for both statistics.
More details on this plot method are discussed here.
The compare
method for ks.test()
, which compares the observed and simulated
distributions, confirms that they are still different:
ks.test(comp_dfr)
The default statistic is whatever is the first statistic appearing in the supplied object:
names(comp_dfr$obs) ks.test(comp_dfr, stat = "CvM")
Let us repeat the simulation with a sample size of 100.
n_2 <- 100 X_2 <- matrix(rexp(2*n_2, rate = 1), nrow = n_2) comp_dfr_2 <- update(comp_dfr, true_X = X_2, X = X_2)
The following plots indicate that this sample size is sufficient.
plot(comp_dfr_2) plot(comp_dfr_2, which = "qq") plot(comp_dfr_2, which = "qqp")
A formal test confirms this assessment.
ks.test(comp_dfr_2)
The examples using compare()
above all illustrate behavior of the function
when the null hypothesis is true. It is also important to investigate the
behavior when it is false to learn about the test's power against particular
alternative hypotheses. Using the example in [Comparing Observed and Simulated
Statistics], consider the case in which the true mean function is quadratic in
$X_2$: $f(x;\theta) = \theta_0 + \theta_1x_1 + \theta_2x_2 + x_2^2$. Having
verified above that a sample size of 100 is sufficient for the asymptotic
behavior to be sufficiently approximated, this can be investigated as follows.
alt_func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2] + 0.5*X[,2]^2 comp_dfr_3 <- update(comp_dfr_2, true_mean = alt_func, true_covariance = list(Sigma = 3))
plot(comp_dfr_3)
As hoped, these curves are clearly different. Power estimates can be obtained
with rejection()
.
rejection(comp_dfr_3)
Power for different significance levels can be found by changing the alpha
argument.
rejection(comp_dfr_3, alpha = 0.01)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.