# Fitting Generalized Linear Models for Multivariate Abundance Data

### Description

`manyglm`

is used to fit generalized linear models to high-dimensional data, such as multivariate abundance data in ecology. This is the base model-fitting function - see `plot.manyglm`

for assumption checking, and `anova.manyglm`

or `summary.manyglm`

for significance testing.

### Usage

1 2 3 4 5 6 | ```
manyglm(formula, family="negative.binomial", K=1, data=NULL, subset=NULL,
na.action=options("na.action"), theta.method = "PHI", model = FALSE,
x = TRUE, y = TRUE, qr = TRUE, cor.type= "I", shrink.param=NULL,
tol=sqrt(.Machine$double.eps), maxiter=25, maxiter2=10,
show.coef=FALSE, show.fitted=FALSE, show.residuals=FALSE,
show.warning=FALSE, offset, ... )
``` |

### Arguments

`formula` |
an object of class |

`family` |
a description of the distribution function to be used in the
in the model. The default is negative binomial regression (using a log
link, with unknown overdispersion parameter), the following family
functions are also accepted: binomial(), binomial(link="cloglog"),
poisson(), which can also be specified using character strings as
'binomial', 'cloglog' and 'poisson', respectively. In future we hope
to include other family functions as described in |

`K` |
number of trials in binomial regression. By default, K=1 for presence-absence data using logistic regression. |

`data` |
an optional data frame, list or environment (or object
coercible by |

`subset` |
an optional vector specifying a subset of observations to be used in the fitting process. |

`na.action` |
a function which indicates what should happen
when the data contain |

`theta.method` |
the method used for the estimation of the overdisperson
parameter theta, such that the mean-variance relationship is V=m+m^2/theta for
the negative binomial family. Here offers three options |

`model, x, y, qr` |
logicals. If |

`cor.type` |
the structure imposed on the estimated correlation
matrix under the fitted model. Can be "I"(default), "shrink", or "R".
See Details. This parameter is merely stored in |

`shrink.param` |
shrinkage parameter to be used if |

`tol` |
the tolerance used in estimation. |

`maxiter` |
maximum allowed iterations in the weighted least square estimation of beta. The default value is 25. |

`maxiter2` |
maximum allowed iterations in the internal ML estimations of negative binomial regression. The default value is 10. |

`show.coef, show.fitted, show.residuals, show.warning` |
logical. Whether to show model coefficients, fitted values, standardized pearson residuals, or operation warnings. |

`offset` |
this can be used to specify an _a priori_ known component to be included in the linear predictor during fitting. This should be 'NULL' or a numeric vector of length equal to NROW (i.e. number of observations) or a matrix of NROW times p (i.e. number of species). |

`...` |
further arguments passed to or from other methods. |

### Details

`manyglm`

is used to calculate the parameter estimates of generalised linear models fitted to each of many variables simultaneously as in Warton et. al. (2012) and Wang et.al.(2012). Models for `manyglm`

are specified symbolically. For details on how to specify a formula see the details section of `lm`

and `formula`

.

Generalised linear models are designed for non-normal data for which a distribution can be specified that offers a reasonable model for data, as specified using the argument `family`

. The `manyglm`

function currently handles count and binary data, and accepts either a character argument or a family argument for common choices of family. For binary (presence/absence) data, `family=binomial()`

can be used for logistic regression (logit link, "logistic regression"), or the complementary log-log link can be used `family=binomial("cloglog")`

, arguably a better choice for presence-absence data. Poisson regression `family=poisson()`

can be used for counts that are not "overdispersed" (that is, if the variance is not larger than the mean), although for multivariate abundance data it has been shown that the negative binomial distribution (`family="negative.binomial"`

) is usually a better choice (Warton 2005). In both cases, a log-link is used. If another link function or family is desired, this can be specified using the `manyany`

function, which accepts regular `family`

arguments.

In negative binomial regression, the overdispersion parameter (`theta`

) is estimated separately for each variable from the data, as controlled by `theta.method`

for negative binomial distributions. We iterate between updates of `theta`

and generalised linear model updates for regression parameters, as many as `maxiter2`

times.

`cor.type`

is the structure imposed on the estimated correlation
matrix under the fitted model. Possible values are:

`"I"`

(default) = independence is assumed (correlation matrix is the identity)

`"shrink"`

= sample correlation matrix is shrunk towards I to improve its stability.

`"R"`

= unstructured correlation matrix is used. (Only available when N>p.)

If `cor.type=="shrink"`

, a numerical value will be assigned to `shrink.param`

either through the argument or by internal estimation. The working horse function for the internal estimation is `ridgeParamEst`

, which is based on cross-validation (Warton 2008). The validation groups are chosen by random assignment, so some slight variation in the estimated values may be observed in repeat analyses. See `ridgeParamEst`

for more details. The shrinkage parameter can be any value between 0 and 1 (0="I" and 1="R", values closer towards 0 indicate more shrinkage towards "I").

### Value

`manyglm`

returns an object inheriting from `"manyglm"`

,
`"manylm"`

and "mglm".

The function `summary`

(i.e. `summary.manyglm`

) can be used to obtain or print a summary of the results and the function
`anova`

(i.e. `anova.manyglm`

) to produce an
analysis of variance table, although note that these functions use resampling so they can take a while to fit.

The generic accessor functions `coefficients`

,
`fitted.values`

and `residuals`

can be used to
extract various useful features of the value returned by `manyglm`

.

An object of class `"manyglm"`

is a list containing at least the
following components:

`coefficients` |
a named matrix of coefficients. |

`var.coefficients` |
the estimated variances of each coefficient. |

`fitted.values` |
the matrix of fitted mean values, obtained by transforming the linear predictors by the inverse of the link function. |

`linear.predictor` |
the linear fit on the scale of the linear predictor. |

`residuals` |
the matrix of |

`PIT.residuals` |
probability integral transform (PIT) residuals - for a model that fits well these should be approximately standard uniform values evenly scattered between 0 and 1. Their calculation involves some randomisation, so different fits will return slightly different values for PIT residuals. |

`sqrt.1_Hii` |
the matrix of scale terms used to standardize the Pearson reidusals. |

`var.estimator` |
the estimated variance of each observation, computed using the corresponding family function. |

`sqrt.weight` |
the matrix of square root of |

`theta` |
the estimated nuisance parameters accounting for overdispersion |

`two.loglike` |
two times the log likelihood. |

`deviance` |
up to a constant, minus twice the maximized log-likelihood. Where sensible, the constant is chosen so that a saturated model has deviance zero. |

`iter` |
the number of iterations of IWLS used. |

`data` |
a data frame storing the input data. |

`stderr.coefficients` |
the estimated standard error of each coefficient. |

`phi` |
the inverse of theta |

`tol` |
the tolerance used in estimations. |

`maxiter,maxiter2,family,theta.method,cor.type,formula` |
arguments supplied in the |

`aic` |
a vector returning Akaike's |

`AICsum` |
the sum of the AIC's over all variables. |

`shrink.param` |
the shrink parameter to be used in subsequent inference. |

`call` |
the matched call. |

`terms` |
the |

`rank` |
the numeric rank of the fitted linear model. |

`xlevels` |
(where relevant) a record of the levels of the factors used in fitting. |

`df.residual` |
the residual degrees of freedom. |

`x` |
if the argument |

`y` |
if the argument |

`model` |
if the argument |

`qr` |
if the argument |

`show.coef,show.fitted,show.residuals` |
arguments supplied in the |

`offset` |
the |

### Author(s)

Yi Wang, Ulrike Naumann and David Warton <David.Warton@unsw.edu.au>.

### References

Lawless, J. F. (1987)
*Negative binomial and mixed Poisson regression*,
Canadian Journal of Statistics 15, 209-225.

Hilbe, J. M. (2008)
*Negative Binomial Regression*,
Cambridge University Press, Cambridge.

Warton D.I. (2005)
*Many zeros does not mean zero inflation: comparing the
goodness of-fit of parametric models to multivariate abundance data*,
Environmetrics 16(3), 275-289.

Warton D.I. (2008). Penalized normal likelihood and ridge regularization
of correlation and covariance matrices. *Journal of the American
Statistical Association* 103, 340-349.

Warton D.I. (2011). Regularized sandwich estimators for analysis of high dimensional data using generalized estimating equations. *Biometrics*, 67(1), 116-123.

Warton D. I., Wright S., and Wang, Y. (2012). Distance-based multivariate analyses confound location and dispersion effects. *Methods in Ecology and Evolution*, 3(1), 89-101.

Wang Y., Neuman U., Wright S. and Warton D. I. (2012). mvabund: an R package for model-based analysis of multivariate abundance data. *Methods in Ecology and Evolution*. online 21 Feb 2012.

### See Also

`anova.manyglm`

, `summary.manyglm`

, `residuals.manyglm`

, `plot.manyglm`

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | ```
data(spider)
spiddat <- mvabund(spider$abund)
X <- spider$x
#To fit a log-linear model assuming counts are poisson:
glm.spid <- manyglm(spiddat~X, family="poisson")
glm.spid
summary(glm.spid, resamp="residual")
#To fit a binomial regression model to presence/absence data:
pres.abs <- spiddat
pres.abs[pres.abs>0] = 1
X <- data.frame(spider$x) #turn into a data frame to refer to variables in formula
glm.spid.bin <- manyglm(pres.abs~soil.dry+bare.sand+moss, data=X, family="binomial")
glm.spid.bin
drop1(glm.spid.bin) #AICs for one-term deletions, suggests dropping bare.sand
glm2.spid.bin <- manyglm(pres.abs~soil.dry+moss, data=X, family="binomial")
drop1(glm2.spid.bin) #backward elimination suggests settling on this model.
``` |