GET.distrequal: Graphical n sample test of correspondence of distribution...

View source: R/appl_ecdf.r

GET.distrequal    R Documentation

Graphical n sample test of correspondence of distribution functions

Description

Compare the distributions of two (or more) samples.

Usage

GET.distrequal(
  x,
  stat = "ECDF",
  nsim,
  r = seq(min(unlist((lapply(x, min)))), max(unlist((lapply(x, max)))), length = 100),
  tau = seq(0.1, 0.9, length = 100),
  contrasts = FALSE,
  GET.args = NULL,
  density.args = NULL,
  approxfun.args = NULL,
  rq.args = NULL,
  savefuns = FALSE,
  ...
)

Arguments

x

A list of numeric vectors, one for each sample.

stat

Character string (or a vector of character strings) indicating which test statistic is to be used. See Details.

nsim

The number of random permutations.

r

The sequence of argument values at which the test functions are to be compared. The default is 100 equally spaced values between the minimum and maximum over all groups.

tau

The sequence of argument values for the QR test statistic. The default values are 100 equally spaced values between 0.1 and 0.9.

contrasts

Logical. FALSE and TRUE specify the two ECDF-based test functions described in the Details section of this help file.

GET.args

A named list of additional arguments to be passed to global_envelope_test, e.g. typeone specifies the type of multiple testing control, FWER or FDR. See global_envelope_test for the defaults and available options.

density.args

A named list of additional arguments to be passed for the estimation of the test statistic "DEN". For more details see density.

approxfun.args

A named list of additional arguments to be passed for the estimation of the test statistic "QQ". For more details see approxfun.

rq.args

A named list of additional arguments to be passed for the estimation of the test statistic "QR". For more details see the function rq of the package quantreg.

savefuns

Logical. If TRUE, then the functions from permutations are saved to the attribute simfuns.

...

Additional parameters to be passed to global_envelope_test. For example, the type of multiple testing control (FWER or FDR) is set by typeone, and if typeone = "fwer", the type of the global envelope can be chosen with the argument type. See global_envelope_test for the defaults and available options. (The test here uses alternative = "two.sided" and, when relevant, nstep = 1; all other specifications are to be given in ....)
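As an illustration of passing options through ..., the following is a minimal sketch only. The simulated data are hypothetical, and the choice type = "erl" assumes that this envelope type is among the options listed in the help of global_envelope_test; check that help page for the options available in your installed version.

```r
# Sketch under stated assumptions; runs only if the GET package is installed.
if (requireNamespace("GET", quietly = TRUE)) {
  set.seed(1)
  # Hypothetical two-sample data, for illustration only
  x <- list(a = rnorm(30), b = rnorm(40, mean = 0.5))
  # typeone and type are forwarded to global_envelope_test via '...'
  res <- GET::GET.distrequal(x, nsim = 999, typeone = "fwer", type = "erl")
  plot(res)
}
```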

Details

A global envelope test can be performed to investigate whether the n distribution functions differ from each other and, if so, how they differ. This test is a generalization of the two-sample Kolmogorov-Smirnov test with a graphical interpretation. We assume that the observations in sample i are an i.i.d. sample from the distribution F_i(r), i=1, \dots, n, and we want to test the hypothesis

F_1(r)= \dots = F_n(r).

If contrasts = FALSE (default), then the default test statistic ("ECDF") is taken to be

\mathbf{T} = (\hat{F}_1(r), \dots, \hat{F}_n(r))

where \hat{F}_i(r) = (\hat{F}_i(r_1), \dots, \hat{F}_i(r_k)) is the ecdf of the ith sample evaluated at argument values r = (r_1,\dots,r_k).
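The construction of this test vector can be sketched in base R as follows. This is an illustration under assumptions, not the package's internal code: the data are simulated, and the names x, r, Fhat and T_ecdf are chosen here for the example.

```r
# Hypothetical samples, for illustration only
set.seed(1)
x <- list(a = rnorm(30), b = rnorm(40, mean = 0.5))

# Grid in the style of the default r: 100 equally spaced values
# between the overall minimum and maximum
r <- seq(min(unlist(lapply(x, min))), max(unlist(lapply(x, max))),
         length.out = 100)

# Evaluate the ECDF of each sample on the common grid ...
Fhat <- lapply(x, function(xi) ecdf(xi)(r))
# ... and concatenate the n curves into one long vector of length n * k
T_ecdf <- unlist(Fhat, use.names = FALSE)
length(T_ecdf)
```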

Another possibility is given by stat = "DIFF", and then the test statistic is still based on the ECDFs and constructed from all pairwise differences,

\mathbf{T} = (\hat{F}_1(r)-\hat{F}_2(r), \hat{F}_1(r)-\hat{F}_3(r), \dots, \hat{F}_{n-1}(r)-\hat{F}_n(r))
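The pairwise-difference vector can be sketched in base R in the same spirit. Again this is only an illustration with simulated data and hypothetical names (pairs, diffs, T_diff); the package may organize the computation differently.

```r
# Hypothetical three-sample data, for illustration only
set.seed(1)
x <- list(a = rnorm(30), b = rnorm(40, mean = 0.5), c = rnorm(35, sd = 2))
r <- seq(min(unlist(x)), max(unlist(x)), length.out = 100)
Fhat <- lapply(x, function(xi) ecdf(xi)(r))

# All pairs (i, j) with i < j, in the order (1,2), (1,3), (2,3)
pairs <- combn(length(x), 2)
diffs <- lapply(seq_len(ncol(pairs)), function(j) {
  Fhat[[pairs[1, j]]] - Fhat[[pairs[2, j]]]
})
names(diffs) <- apply(pairs, 2, function(p) paste(names(x)[p], collapse = "-"))

# Concatenate the n(n-1)/2 difference curves into one vector
T_diff <- unlist(diffs, use.names = FALSE)
```

With n = 3 samples and k = 100 argument values, T_diff has 3 * 100 elements, one block per pair.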

Setting contrasts = TRUE together with stat = "ECDF" leads to the same test statistic as stat = "DIFF". For other (or multiple) values of stat, the argument contrasts is ignored.

The available test statistics are the following:

  1. "ECDF": The ECDFs of the n-samples, as specified above

  2. "DIFF": The pairwise differences between the ECDFs, as specified above

  3. "DEN": The kernel estimated density functions of the n-samples as the test statistic

  4. "QQ": The pairwise comparisons of empirical quantiles

  5. "SHIFT": The de-trended QQ-plot (shift plot)

  6. "QR": The quantile regression process, i.e. the \beta-coefficients of the quantile regression. By default, the reference category of this test statistic is the first sample.

The test statistics are described in detail in Konstantinou et al. (2024).
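As a rough idea of what a density-based test vector such as "DEN" looks like, the sketch below evaluates kernel density estimates of each sample on a common grid with base R's density. The grid, the bandwidth defaults and the names lo, hi, dens and T_den are assumptions made for this example; the exact estimator settings used by the package are controlled through density.args and may differ.

```r
# Hypothetical samples, for illustration only
set.seed(1)
x <- list(a = rnorm(30), b = rnorm(40, mean = 0.5))

# Common evaluation range over all samples (an assumption of this sketch)
lo <- min(unlist(x))
hi <- max(unlist(x))

# Kernel density estimate of each sample at 100 equally spaced points
dens <- lapply(x, function(xi) density(xi, from = lo, to = hi, n = 100)$y)

# Concatenate the n density curves into one test vector
T_den <- unlist(dens, use.names = FALSE)
```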

The simulations under the null hypothesis that the distributions are the same are obtained by permuting the individuals of the groups. If nsim is not specified, the default number of permutations is n \cdot 1000 - 1 for the case contrasts = FALSE and (n \cdot (n-1)/2) \cdot 1000 - 1 for the case contrasts = TRUE, where n is the length of x.
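The default number of permutations and a single permutation of the group labels can be sketched in base R as follows. The helper default_nsim and the other names are hypothetical and only mirror the formulas stated above; they are not functions of the package.

```r
# Default number of permutations as stated above (hypothetical helper)
default_nsim <- function(n, contrasts = FALSE) {
  if (contrasts) (n * (n - 1) / 2) * 1000 - 1 else n * 1000 - 1
}
default_nsim(2)                    # 1999
default_nsim(4, contrasts = TRUE)  # 5999

# One permutation under the null hypothesis: pool the observations and
# reassign them to groups at random, keeping the group sizes fixed
set.seed(1)
x <- list(a = rnorm(30), b = rnorm(40, mean = 0.5))  # hypothetical data
pooled <- unlist(x, use.names = FALSE)
group <- rep(names(x), lengths(x))
perm_group <- sample(group)
x_perm <- split(pooled, factor(perm_group, levels = names(x)))
lengths(x_perm)  # same group sizes as in x
```

Repeating the permutation step nsim times and recomputing the chosen test vector each time gives the simulated functions against which the observed test vector is compared.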

References

Konstantinou K., Mrkvička T. and Myllymäki M. (2024) Graphical n-sample tests of correspondence of distributions. arXiv:2403.01838 [stat.ME] https://doi.org/10.48550/arXiv.2403.01838

Examples

if(require("fda", quietly=TRUE)) {
  # Heights of boys and girls at age 10
  f.a <- growth$hgtf["10",] # girls at age 10
  m.a <- growth$hgtm["10",] # boys at age 10
  # Empirical cumulative distribution functions
  plot(ecdf(f.a))
  plot(ecdf(m.a), col='grey70', add=TRUE)
  # Create a list of the data
  fm.list <- list(Girls=f.a, Boys=m.a)
  
  res <- GET.distrequal(fm.list)
  plot(res)
  # If you want to change the labels:
  plot(res, scales = "free", labels = c("Girls", "Boys"))
  # If you want to change the x-label (y-label similarly):
  require("ggplot2")
  myxlab <- substitute(paste(italic(i), " (", j, ")", sep = ""),
                       list(i = "x", j = "Height in cm"))
  plot(res, scales = "free") + xlab(myxlab)
  # Use instead the test statistics QQ and DEN
  res <- GET.distrequal(fm.list, stat = c("QQ", "DEN"))
  plot(res, scales = "free")
  
  

  # Heights of boys and girls at age 14
  f.a <- growth$hgtf["14",] # girls at age 14
  m.a <- growth$hgtm["14",] # boys at age 14
  # Empirical cumulative distribution functions
  plot(ecdf(f.a))
  plot(ecdf(m.a), col='grey70', add=TRUE)
  # Create a list of the data
  fm.list <- list(Girls=f.a, Boys=m.a)
  
  res <- GET.distrequal(fm.list)
  plot(res) + xlab(myxlab)
  res <- GET.distrequal(fm.list, stat = c("QQ", "DEN"))
  plot(res, scales = "free") + xlab(myxlab)
  
  
}
if(require("datasets", quietly=TRUE)) {
  data("iris")
  virginica <- subset(iris, Species == "virginica")
  setosa <- subset(iris, Species == "setosa")
  versicolor <- subset(iris, Species == "versicolor")
  
  res <- GET.distrequal(x = list(virginica = virginica$Sepal.Length,
                                 setosa = setosa$Sepal.Length,
                                 versicolor = versicolor$Sepal.Length),
                        stat =  c("QQ", "DEN"))
  plot(res, scales = "free", ncol = 3)
  
  
}

GET documentation built on Sept. 11, 2024, 5:46 p.m.