mixreg: Fit a mixture of linear regressions.
In mixreg: Functions to Fit Mixtures of Regressions

Description Usage Arguments Value Warning Notes Author(s) References See Also Examples

Estimates the parameters for a mixture of linear regressions, assuming Gaussian errors, using the EM algorithm.

mixreg(x, y, ncomp = NULL, intercept = TRUE, eqVar = FALSE,
       thetaStart = NULL, itmax = 1000,eps = 1e-06, verb = TRUE,
       digits = 7, maxTry = 5, seed = NULL, data = NULL,
       covMat=FALSE,MC=FALSE,warn=TRUE,...)

`x`	Either a formula specifying the regression model in question or a matrix of predictors for that regression model. In the latter case `x` should NOT include an initial column of 1s. (If an intercept is required, leave `intercept` as `TRUE`!) If there is only one predictor, x may be a vector rather than a one-column matrix.
`y`	The vector of responses for the regression models. Not used and should not be specified if `x` is a formula.
`ncomp`	The number of components in the mixture. If `thetaStart` (see below) is supplied, then the number of components is determined from this argument and `ncomp` is ignored. If neither `ncomp` nor `thetaStart` is supplied then `ncomp` defaults to 2.
`intercept`	Logical scalar; should the linear regressions should have intercepts fitted? Ignored if `x` is a formula.
`eqVar`	Logical scalar; should the error variance should be the same for all component? (The alternative is that each component should be allowed a different error variance.)
`thetaStart`	Either a list or a matrix providing starting values for the estimation procedure. If it is a list each of its entries is in turn a list with entries `beta` (vector of linear coefficients), `sigsq` (variance) and `lambda` (mixing probability). If it is a matrix it must have `ncomp` rows and `ncoef + 2` columns, where `ncoef` is the number of regression coefficients in the model. The latter quantity is `ncol(x)` if `intercept` is `FALSE` and is `1 + ncol(x)` if `intercept` is `TRUE`. If `eqVar` is `TRUE` then it is sensible to have all the starting values of `sigsq` equal, but this is not strictly necessary. If `thetaStart` is not specified, starting values are produced “randomly”. This sometimes works, but sometimes doesn't. The function `visualFit()` provides a convenient means of determining starting values by a “visual” procedure. Note that if `object` is the value returned by `visualFit()` then the appropriate value for `thetaStart` is `object$theta`.
`itmax`	The maximum number of EM steps to be undertaken. If this maximum number of steps is exceeded then a warning is issued, but a fit (probably unreliable) is returned anyway. The returned value will, in this case, have the `converged` component equal to `FALSE`.
`eps`	A value specifying the convergence criterion for the EM algorithm. If the maximum absolute value of the change in the parameters is less than `eps` the algorithm is considered to have converged.
`verb`	Logical argument; if `verb` is `TRUE` then details of the progress of the algorithm are printed out at each EM step.
`digits`	The number of digits to which the details are printed out, when `verb` is `TRUE`.'
`maxTry`	If the algorithm encounters a singularity in the likelihood (as may possibly occur when eqVar is FALSE) the algorithm is restarted using new (randomly generated) starting values. The restart is attempted a maximum of maxTry times.
`seed`	A numeric scalar used to seed the random number generators if `thetaStart` is not specified (whence starting values are produced randomly). If `seed` is not an integer it gets rounded to the nearest integer (so `seed=pi` has the same impact as `seed=3`. If not supplied, `seed` is selected randomly from `1, 2, ..., 1e5`. The seed for random number generation is set to `seed` in `initRand()` before any random number generation is done. The argument `seed` is ignored if `thetaStart` is supplied.
`data`	A list or data frame in which the values of the data `x` and `y` may be stored. If `data` is not `NULL` then `mixreg()` looks for `x` and `y` first in `data` and then in the global environment.
`covMat`	Logical scalar; should the covariance matrix of the parameter estimates be calculated? (This can take some time.)
`MC`	Logical scalar; should the covariance matrix of the parameter estimates be calculated by a Monte Carlo procedure? Ignored if `covMat` is `FALSE`.
`warn`	Logical scalar. Should a warning be issued if the EM algorithm fails to converge in `itmax` iterations? Note that this failure is discernible anyway, from the `converged` component of the returned value, but it's probably best to be alerted to the failure. This argument is present mainly so that it can be set to `FALSE` internally in `ncMcTest()`, where such warnings are redundant.
`...`	Optional arguments which are passed to `covMixMC()` (if `covMat` and `MC` are both `TRUE`. These optional arguments may be `semiPar`, `conditional` or `cMseed`. (Any other argument being given to `mixreg()` will trigger an error.)

A list, of class mixreg, with components

`parmat`	The parameters of the fitted model arranged as a matrix, each row corresponding to one component of the mixture.
`theta`	The parameters of the fitted model as a list, each entry of the list being itself a list (like those in thetaStart) corresponding to one component of the mixture.
`log.like`	The log likelihood of the fitted model, based on Gaussian errors.
`aic`	The Akaike Information Criterion value for the fitted model; aic is equal to -2 * log.like + 2*M where M is the number of parameters in the model.
`intercept`	The value of the `intercept` argument.
`eqVar`	The value of the `eqVar` argument.
`nsteps`	The number of steps the EM algorithm took to converge.
`converged`	Logical scalar indicating whether the algorithm did indeed 'converge or stopped because it reach the itmax EM step.
`data`	A list or data frame providing the data `x` and `y` to which the model was fitted. It may be equal to the value of the `data` argument, or it may have been constructed, in whole or in part, from the `x` and `y` arguments.
`formula`	The formula used by `lm()` in fitting the regression models. Depending on circumstances it may be equal to the argument `fmla`, or it may have been constructed internally by `mixreg()`, in which case it is of the form `y ~ x` or, if `intercept` is `FALSE`, `y ~ x - 1` (with, of course, `x` and `y` replaced by the actual names of the objects provided as the corresponding arguments).

The returned value has an attribute "seed" specifying the seed for the random number generators that was used in producing random starting values. If thetaStart was specified then the attribute "seed" is NA.

If the x argument is a formula, then specifying y is not only meaningless, but will cause a (possibly mystifying) error to be thrown.

This function (and indeed the entire mixreg package) is structured so as to be able to deal with mixtures of regressions involving any number of predictors. However I know of no practical examples in which more than one predictor is involved, and it seems likely that models involving more than one predictor would present substantial difficulties. Some of the functions in this package (cband(), visualFit(), plot.mixreg(), ...) can cope only with single-predictor models.

The entries parmat and theta of the returned value contain the same information, presented in a different format.

Even if eqVar is TRUE, each entry of theta still has its own sigsq entry. The values of these will all be equal, however, if eqVar is TRUE.

For scalar mixtures, allowing the components to have different variances can have the effect of introducing singularities in the likelihood function. It is not clear to me what the impact of allowing different variances is in the mixtures-of-regressions setting. In respect of the scalar setting, Aitkin and Tunnicliffe Wilson (see References) assert that the singularities that may arise “do not cause any computational difficulty in practice”. However in the mixtures of regressions setting, I have observed strange anomalies which appear to be alleviated by setting eqVar=TRUE. See Examples. If results seem to be unsatisfactory, you may be well-advised to try setting eqVar=TRUE, to see if that makes an improvement.

Rolf Turner r.turner@auckland.ac.nz

T. Rolf Turner (2000). Estimating the rate of spread of a viral infection of potato plants via mixtures of regressions. Applied Statistics 49 Part 3, pp. 371 – 384.

A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society series B, 39, pp. 1 – 22.

M. Aitkin and G. Tunnicliffe Wilson (1980). Mixture models, outliers, and the EM algorithm. Technometrics 22, pp. 325 – 331.

ncMcTest(), cband(), qqMix(), residuals.mixreg()

# Aphids.
    fit1   <- mixreg(aphRel,plntsInf,ncomp=2,seed=42,data=aphids)
    plot(fit1)
    thStrt <- list(list(beta=c(3.0,0.1),sigsq=16,lambda=0.5),
                   list(beta=c(0.0,0.0),sigsq=16,lambda=0.5))
    fit2   <- mixreg(aphRel,plntsInf,ncomp=2,thetaStart=thStrt,data=aphids)
    fit3   <- mixreg(plntsInf ~ aphRel,ncomp=2,thetaStart=thStrt,data=aphids)
# Kilns.
## Not run: 
    vfit <- visualFit(y ~ x,ncomp=3, data=kilnAoneOut)
    fit  <- mixreg(y ~ x,ncomp=3,data=kilnAoneOut,thetaStart=vfit$theta)

## End(Not run)
    thStrt <- list(
                   list(beta=c(26.07,48808),sigsq=1.1573,lambda=0.33333333),
                   list(beta=c(23.48,32387),sigsq=1.8730,lambda=0.33333333),
                   list(beta=c(-0.0597,20760),sigsq=0.2478,lambda=0.33333333)
                 )
    # Roughly the value of vfit$theta.
    fit  <- mixreg(y ~ x,ncomp=3,data=kilnAoneOut,thetaStart=thStrt)
    plot(fit)
# Kilns, zero intercept model.
## Not run: 
    vfit <- visualFit(y ~ x - 1,ncomp=3, data=kilnAoneOut)
    fit  <- mixreg(y ~ x - 1,ncomp=3,data=kilnAoneOut,thetaStart=vfit$theta)

## End(Not run)
    thStrt <- list(list(beta=50900,sigsq=3.297,lambda=0.33333333),
                  list(beta=33800,sigsq=25.52,lambda=0.33333333),
                  list(beta=20755,sigsq=0.2477,lambda=0.33333333))
    # Roughly the value of vfit$theta.
    fit  <- mixreg(y ~ x - 1,ncomp=3,data=kilnAoneOut,thetaStart=thStrt)
    plot(fit) # Yikes!!!  (But a plot of vfit looks practically perfect.)
    fit  <- mixreg(y ~ x - 1,ncomp=3,data=kilnAoneOut,thetaStart=thStrt,eqVar=TRUE)
    plot(fit) # Looks fine.