ncMcTest: Perform a Monte Carlo test for the number of components in a...
In mixreg: Functions to Fit Mixtures of Regressions

Description Usage Arguments Details Value Author(s) References See Also Examples

Produces nsim simulated realizations of the likelihood ratio statistic, either parametrically or semi-parametrically, and calculates the corresponding p-value of the test.

1
2
3

ncMcTest(x, y, data=NULL, ncomp=2, ncincr=1, intercept=TRUE, nsim=99,
         seed=NULL, ts1=NULL, ts2=NULL, semiPar=FALSE,
         conditional=semiPar, verb=FALSE, progRep=TRUE, ...)

`x`	Either a formula specifying the regression model to be fitted or a predictor or matrix of predictors for the regression model. If `x` is a matrix it should NOT include an initial column of `1`s. If an intercept is desired then the `intercept` argument should be left equal to `TRUE`.
`y`	The vector of responses for the regression models. Ignored if `x` is a formula.
`data`	A list or data frame containing the variables in the regression model. Such variables will be looked for first in `data` and then in the global environment.
`ncomp`	The null-hypothesized number of components in the mixture.
`ncincr`	The increment from the null-hypothesized number of components in the mixture to the number under the alternative hypothesis; i.e. the number of components under the alternative hypothesis is `ncomp + ncincr`.
`intercept`	Logical argument indicating whether the regression models in the mixture should have intercept terms. Ignored if `x` is a formula.
`nsim`	The number of simulated replicates of the log likelihood ratio statistic to be produced.
`seed`	Positive integer scalar. The seed for random number generation. If left `NULL` it is obtained by sampling from `1:1e5`.
`ts1`	Starting values for fitting the `ncomp` component model. If ts1 is null, random starting values are used. (This is not recommended.)
`ts2`	Starting values for fitting the `ncomp+nincr` component model. If ts2 is null, random starting values are used. (This is not recommended.)
`semiPar`	Logical scalar; should semi-parametric bootstrapping should be used?
`conditional`	Logical scalar; should the component-selection probabilities be determined conditionally upon the observations?
`verb`	Logical argument indicating whether the fitting processes should be verbose (i.e. whether details should be printed out at each step of the EM algorithm). If TRUE a huge amount of screen output is produced.
`progRep`	Logical argument indicating whether the index, of the simulated statistic just constructed, should be printed out, to give an idea of how the process is progressing.
`...`	Further arguments to be passed to `mixreg()` to control the fitting procedure.

In this context the “parametric” procedure is to simulate data sets by generating data from the fitted ncomp model parameters, using Gaussian errors. In contrast, under the semiparametric bootstrapping procedure, the errors are generated by resampling from the residuals. Since at each predictor value there are ncomp residuals, one for each component of the model, the errors are selected at random from these ncomp possibilities. If the argument conditional is TRUE then the selection probabilities at this step are the conditional probabilities, of the observation being generated by each component of the model, given that observation. If conditional is FALSE then these probabilities are the corresponding entries of lambda (see Value. The residuals are sampled independently in either case. The procedure is termed semi-parametric since the sampling probabilities depend on the parameters of the model. Note that it makes no sense to specify conditional=TRUE if semiPar is FALSE. Doing so will generate an error.

It is important to be aware that the test conducted by this function is a Monte Carlo test and that the p-value produced is a Monte Carlo p-value. It is consequently an exact p-value in a sense which must be carefully understood. See for example Baddeley et al. 2015 (section 10.6) and Turner and Jeffs 2017 for explanation of the interpretation of Monte Carlo p-values and for some general discussion of Monte Carlo tests and of their advantages. Such tests effect substantial savings on computational costs with only marginal diminishment of power.

A list with components:

`lrs`	the likelihood ratio statistic for the test
`pval`	the (Monte Carlo) p-value of the test
`simStats`	a vector of the values of the likelihood ratio statistics of the simulated data sets
`aic.ncomp`	a vector of the aic values for the `ncomp` models fitted to the simulated data sets
`aic.ncomp+nincr`	a vector of the aic values for the `ncomp+nincr` models fitted to the simulated data sets. (Note that in any given instance `ncomp` and `nincr` are replaced by the actual numeric values that are used in that instance.)
`df`	the degrees of freedom that would be appropriate if the test statistic actually had a chi-squared distribution. Explicitly `df` is equal to the number of parameters in the alternative model minus the number of parameters in the null model
`screwUps`	a data frame with columns `seed`, `i` and `type`, containing respectively the random number generator seed, the index and the type of each screw-up. Note that if there were in fact no screw-ups then the `screwUps` entry of the returned value will not be present. Note also that it is possible for values in the `i` column of `screwUps` to be repeated. The entries of the `type` column take values in the set `{1,2,3,4,5}`. These values have the interpretation: 1: singularity in the likelihood surface for the `ncomp` model 2: the algorithm failed to converge when fitting the `ncomp` component model 3: singularity in the likelihood surface for the `ncomp+nincr` model 4: the algorithm converged for the `ncomp` component model but failed to converge for the `ncomp + nincr` model 5: the likelihood ratio statistic for the `ncomp` model was greater than that for the `ncomp+nincr` model (which is theoretically impossible) Note that if a screw-up does occur, the replicate is redone completely so that the returned value contains results for a full `nsim` simulations.

The returned value has an attribute "seed" which is the (initial) value of the random number generation seed that was used. This is either the value of the argument seed, or, if this was NULL, a randomly generated value.

Rolf Turner r.turner@auckland.ac.nz

T. Rolf Turner (2000). Estimating the rate of spread of a viral infection of potato plants via mixtures of regressions. Applied Statistics 49 Part 3, pp. 371 – 384.

Adrian Baddeley, Ege Rubak and Rolf Turner (2015). Spatial Point Patterns: Methodology and Applications with R. London: Chapman and Hall/CRC Press.

Rolf Turner and Celeste Jeffs (2017). A note on exact Monte Carlo hypothesis tests. Communications in Statistics: Simulation and Computation 46, pp. 6545 – 6558.

cband(), covMix(), mixreg(), plot.cband(), plot.mixresid(), qqMix(), residuals.mixreg()

    ## Not run: 
       tst12 <- ncMcTest(plntsInf ~ aphRel,ncomp=1,data=aphids,seed=42)
    
## End(Not run) # Monte Carlo p-value is 0.01; mixture model is called for.
    TS1 <- list(list(beta=c(3.0,0.1),sigsq=16,lambda=0.5),
                list(beta=c(0.0,0.0),sigsq=16,lambda=0.5))
    TS2 <- list(list(beta=c(3.0,0.1),sigsq=9,lambda=1/3),
                list(beta=c(1.5,0.05),sigsq=9,lambda=1/3),
                list(beta=c(0.0,0.0),sigsq=9,lambda=1/3))
    x <- aphids$aphRel
    y <- aphids$plntsInf
    ## Not run: 
      nsim <- 999
    
## End(Not run)
    
    tst23 <- ncMcTest(x,y,nsim=nsim,ts1=TS1,ts2=TS2)