pglmm_compare: Phylogenetic Generalized Linear Mixed Model for Comparative...
In phyr: Model Based Phylogenetic Analysis

Description Usage Arguments Details Value Author(s) References See Also Examples

pglmm_compare performs linear regression for Gaussian, binomial and Poisson phylogenetic data, estimating regression coefficients with approximate standard errors. It simultaneously estimates the strength of phylogenetic signal in the residuals and gives an approximate conditional likelihood ratio test for the hypothesis that there is no signal. Therefore, when applied without predictor (independent) variables, it gives a test for phylogenetic signal. pglmm_compare is a wrapper for pglmm tailored for comparative data in which each value of the response (dependent) variable corresponds to a single tip on a phylogenetic tree. If there are multiple measures for each species, pglmm will be helpful.

pglmm_compare(
  formula,
  family = "gaussian",
  data = list(),
  phy,
  REML = TRUE,
  optimizer = c("nelder-mead-nlopt", "bobyqa", "Nelder-Mead", "subplex"),
  add.obs.re = TRUE,
  verbose = FALSE,
  cpp = TRUE,
  bayes = FALSE,
  reltol = 10^-6,
  maxit = 500,
  tol.pql = 10^-6,
  maxit.pql = 200,
  marginal.summ = "mean",
  calc.DIC = FALSE,
  prior = "inla.default",
  prior_alpha = 0.1,
  prior_mu = 1,
  ML.init = FALSE,
  s2.init = 1,
  B.init = NULL
)

`formula`	A two-sided linear formula object describing the fixed-effects of the model; for example, Y ~ X. Binomial data can be either presence/absence, or a two-column array of 'successes' and 'failures'. For both binomial and Poisson data, we add an observation-level random term by default via `add.obs.re = TRUE`.
`family`	Either "gaussian" for a Linear Mixed Model, or "binomial" or "poisson" for Generalized Linear Mixed Models. `family` should be specified as a character string (i.e., quoted). For binomial and Poisson data, we use the canonical logit and log link functions, respectively. Binomial data can be either presence/absence, or a two-column array of 'successes' and 'failures'. For both Poisson and binomial data, we add an observation-level random term by default via `add.obs.re = TRUE`. If `bayes = TRUE` there are two additional families available: "zeroinflated.binomial", and "zeroinflated.poisson", which add a zero inflation parameter; this parameter gives the probability that the response is a zero. The rest of the parameters of the model then reflect the "non-zero" part of the model. Note that "zeroinflated.binomial" only makes sense for success/failure response data.
`data`	A data frame containing the variables named in formula. It must has the tip labels of the phylogeny as row names; if they are not in the same order, the data frame will be arranged so that row names match the order of tip labels.
`phy`	A phylogenetic tree as an object of class "phylo".
`REML`	Whether REML or ML is used for model fitting the random effects. Ignored if `bayes = TRUE`.
`optimizer`	nelder-mead-nlopt (default), bobyqa, Nelder-Mead, or subplex. Nelder-Mead is from the stats package and the other optimizers are from the nloptr package. Ignored if `bayes = TRUE`.
`add.obs.re`	Whether to add observation-level random term for binomial and Poisson families. Normally it would be a good idea to add this to account for overdispersion, so `add.obs.re = TRUE` by default.
`verbose`	If `TRUE`, the model deviance and running estimates of `s2` and `B` are plotted each iteration during optimization.
`cpp`	Whether to use C++ function for optim. Default is TRUE. Ignored if `bayes = TRUE`.
`bayes`	Whether to fit a Bayesian version of the PGLMM using `r-inla`. We recommend against Bayesian fitting for non-Gaussian data unless sample sizes are large (>1000), because the phylogenetic variance tends to get trapped near zero.
`reltol`	A control parameter dictating the relative tolerance for convergence in the optimization; see `optim`.
`maxit`	A control parameter dictating the maximum number of iterations in the optimization; see `optim`.
`tol.pql`	A control parameter dictating the tolerance for convergence in the PQL estimates of the mean components of the GLMM. Ignored if `family = "gaussian"` or `bayes = TRUE`.
`maxit.pql`	A control parameter dictating the maximum number of iterations in the PQL estimates of the mean components of the GLMM. Ignored if `family = "gaussian"` or `bayes = TRUE`.
`marginal.summ`	Summary statistic to use for the estimate of coefficients when doing a Bayesian PGLMM (when `bayes = TRUE`). Options are: "mean", "median", or "mode", referring to different characterizations of the central tendency of the Bayesian posterior marginal distributions. Ignored if `bayes = FALSE`.
`calc.DIC`	Should the Deviance Information Criterion be calculated and returned, when doing a Bayesian PGLMM? Ignored if `bayes = FALSE`.
`prior`	Which type of default prior should be used by `pglmm`? Only used if `bayes = TRUE`. There are currently four options: "inla.default", which uses the default `INLA` priors; "pc.prior.auto", which uses a complexity penalizing prior (as described in Simpson et al. (2017)) designed to automatically choose good parameters (only available for gaussian and binomial responses); "pc.prior", which allows the user to set custom parameters on the "pc.prior" prior, using the `prior_alpha` and `prior_mu` parameters (Run `INLA::inla.doc("pc.prec")` for details on these parameters); and "uninformative", which sets a very uninformative prior (nearly uniform) by using a very flat exponential distribution. The last option is generally not recommended but may in some cases give estimates closer to the maximum likelihood estimates. "pc.prior.auto" is only implemented for `family = "gaussian"` and `family = "binomial"` currently.
`prior_alpha`	Only used if `bayes = TRUE` and `prior = "pc.prior"`, in which case it sets the alpha parameter of `INLA`'s complexity penalizing prior for the random effects.The prior is an exponential distribution where prob(sd > mu) = alpha, where sd is the standard deviation of the random effect.
`prior_mu`	Only used if `bayes = TRUE` and `prior = "pc.prior"`, in which case it sets the mu parameter of `INLA`'s complexity penalizing prior for the random effects.The prior is an exponential distribution where prob(sd > mu) = alpha, where sd is the standard deviation of the random effect.
`ML.init`	Only relevant if `bayes = TRUE`. Should maximum likelihood estimates be calculated and used as initial values for the bayesian model fit? Sometimes this can be helpful, but most of the time it may not help; thus, we set the default to `FALSE`. Also, it does not work with the zero-inflated families.
`s2.init`	An array of initial estimates of s2. If s2.init is not provided for `family="gaussian"`, these are estimated using `lm` assuming no phylogenetic signal. If `s2.init` is not provided for `family = "binomial"`, these are set to 0.25.
`B.init`	Initial estimates of B, a matrix containing regression coefficients in the model for the fixed effects. This matrix must have `dim(B.init) = c(p + 1, 1)`, where `p` is the number of predictor (independent) variables; the first element of `B` corresponds to the intercept, and the remaining elements correspond in order to the predictor (independent) variables in the formula. If `B.init` is not provided, these are estimated using `lm` or `glm` assuming no phylogenetic signal.

pglmm_compare in the package phyr is similar to binaryPGLMM in the package ape, although it has much broader functionality, including accepting more than just binary data, implementing Bayesian analyses, etc.

For non-Gaussian data, the function estimates parameters for the model

Pr(Y = 1) = θ

θ = inverse.link(b0 + b1 * x1 + b2 * x2 + … + ε)

ε ~ Gaussian(0, s2 * V)

where V is a covariance matrix derived from a phylogeny (typically under the assumption of Brownian motion evolution). Although mathematically there is no requirement for V to be ultrametric, forcing V into ultrametric form can aide in the interpretation of the model. This is especially true for binary data, because in regression for binary dependent variables, only the off-diagonal elements (i.e., covariances) of matrix V are biologically meaningful (see Ives & Garland 2014). The function converts a phylo tree object into a covariance matrix, and further standardizes this matrix to have determinant = 1. This in effect standardizes the interpretation of the scalar s2. Although mathematically not required, it is a very good idea to standardize the predictor (independent) variables to have mean 0 and variance 1. This will make the function more robust and improve the interpretation of the regression coefficients.

For Gaussian data, the function estimates parameters for the model

Y = b0 + b1 * x1 + b2 * x2 + … + ε)

ε ~ Gaussian(0, s2 * V + s2resid * I)

where s2resid * I gives the non-phylogenetic residual variance. Note that this is equivalent to a model with Pagel's lambda transformation.

An object (list) of class pglmm_compare with the following elements:

`formula`	the formula for fixed effects
`formula_original`	the formula for both fixed effects and random effects
`data`	the dataset
`family`	either `gaussian` or `binomial` or `poisson` depending on the model fit
`B`	estimates of the regression coefficients
`B.se`	approximate standard errors of the fixed effects regression coefficients. This is set to NULL if `bayes = TRUE`.
`B.ci`	approximate bayesian credible interval of the fixed effects regression coefficients. This is set to NULL if `bayes = FALSE`
`B.cov`	approximate covariance matrix for the fixed effects regression coefficients
`B.zscore`	approximate Z scores for the fixed effects regression coefficients. This is set to NULL if `bayes = TRUE`
`B.pvalue`	approximate tests for the fixed effects regression coefficients being different from zero. This is set to NULL if `bayes = TRUE`
`ss`	random effects' standard deviations for the covariance matrix sigma^2 V for each random effect in order. For the linear mixed model, the residual variance is listed last
`s2r`	random effects variances for non-nested random effects
`s2n`	random effects variances for nested random effects
`s2resid`	for linear mixed models, the residual variance
`s2r.ci`	Bayesian credible interval for random effects variances for non-nested random effects. This is set to NULL if `bayes = FALSE`
`s2n.ci`	Bayesian credible interval for random effects variances for nested random effects. This is set to NULL if `bayes = FALSE`
`s2resid.ci`	Bayesian credible interval for linear mixed models, the residual variance. This is set to NULL if `bayes = FALSE`
`logLik`	for linear mixed models, the log-likelihood for either the restricted likelihood (`REML=TRUE`) or the overall likelihood (`REML=FALSE`). This is set to NULL for generalised linear mixed models. If `bayes = TRUE`, this is the marginal log-likelihood
`AIC`	for linear mixed models, the AIC for either the restricted likelihood (`REML=TRUE`) or the overall likelihood (`REML=FALSE`). This is set to NULL for generalised linear mixed models
`BIC`	for linear mixed models, the BIC for either the restricted likelihood (`REML=TRUE`) or the overall likelihood (`REML=FALSE`). This is set to NULL for generalised linear mixed models
`DIC`	for bayesian PGLMM, this is the Deviance Information Criterion metric of model fit. This is set to NULL if `bayes = FALSE`.
`REML`	whether or not REML is used (`TRUE` or `FALSE`).
`bayes`	whether or not a Bayesian model was fit.
`marginal.summ`	The specified summary statistic used to summarise the Bayesian marginal distributions. Only present if `bayes = TRUE`
`s2.init`	the user-provided initial estimates of `s2`
`B.init`	the user-provided initial estimates of `B`
`Y`	the response (dependent) variable returned in matrix form
`X`	the predictor (independent) variables returned in matrix form (including 1s in the first column)
`H`	the residuals. For linear mixed models, this does not account for random terms, To get residuals after accounting for both fixed and random terms, use `residuals()`. For the generalized linear mixed model, these are the predicted residuals in the logit -1 space.
`iV`	the inverse of the covariance matrix. This is NULL if `bayes = TRUE`.
`mu`	predicted mean values for the generalized linear mixed model (i.e. similar to `fitted(merMod)`). Set to NULL for linear mixed models, for which we can use `fitted()`.
`Zt`	the design matrix for random effects. This is set to NULL if `bayes = TRUE`
`St`	diagonal matrix that maps the random effects variances onto the design matrix
`convcode`	the convergence code provided by `optim`. This is set to NULL if `bayes = TRUE`
`niter`	number of iterations performed by `optim`. This is set to NULL if `bayes = TRUE`
`inla.model`	Model object fit by underlying `inla` function. Only returned if `bayes = TRUE`

Anthony R. Ives

Ives, A. R. and Helmus, M. R. (2011) Generalized linear mixed models for phylogenetic analyses of community structure. Ecological Monographs, 81, 511–525.

Ives, A. R. and Garland, T., Jr. (2014) Phylogenetic regression for binary dependent variables. Pages 231–261 in L. Z. Garamszegi, editor. Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology. Springer-Verlag, Berlin Heidelberg.

pglmm; package ape and its function binaryPGLMM; package phylolm and its function phyloglm; package MCMCglmm

## Illustration of `pglmm_compare` with simulated data

# Generate random phylogeny

library(ape)

n <- 100
phy <- compute.brlen(rtree(n=n), method = "Grafen", power = 1)

# Generate random data and standardize to have mean 0 and variance 1
X1 <- rTraitCont(phy, model = "BM", sigma = 1)
X1 <- (X1 - mean(X1))/var(X1)

# Simulate binary Y
sim.dat <- data.frame(Y = array(0, dim = n), X1 = X1, row.names = phy$tip.label)
sim.dat$Y <- ape::binaryPGLMM.sim(Y ~ X1, phy = phy, data=sim.dat, s2 = 1,
                             B = matrix(c(0, .25), nrow = 2, ncol = 1), 
                             nrep = 1)$Y

# Fit model
pglmm_compare(Y ~ X1, family = "binomial", phy = phy, data = sim.dat)

# Compare with `binaryPGLMM`
ape::binaryPGLMM(Y ~ X1, phy = phy, data = sim.dat)

# Compare with `phyloglm`
summary(phylolm::phyloglm(Y ~ X1, phy = phy, data = sim.dat))

# Compare with `glm` that does not account for phylogeny
summary(glm(Y ~ X1, data = sim.dat, family = "binomial"))

# Compare with logistf() that does not account
# for phylogeny but is less biased than glm()
logistf::logistf(Y ~ X1, data = sim.dat)

## Fit model with bayes = TRUE
# pglmm_compare(Y ~ X1, family = "binomial", phy = phy, data = sim.dat, 
#               bayes = TRUE, calc.DIC = TRUE)

# Compare with `MCMCglmm`

V <- vcv(phy)
V <- V/max(V)
detV <- exp(determinant(V)$modulus[1])
V <- V/detV^(1/n)

invV <- Matrix::Matrix(solve(V),sparse = TRUE)
sim.dat$species <- phy$tip.label
rownames(invV) <- sim.dat$species

nitt <- 43000
thin <- 10
burnin <- 3000

prior <- list(R=list(V=1, fix=1), G=list(G1=list(V=1, nu=1000, alpha.mu=0, alpha.V=1)))
# commented out to save time
# summary(MCMCglmm::MCMCglmm(Y ~ X1, random = ~species, ginvers = list(species = invV),
#     data = sim.dat, slice = TRUE, nitt = nitt, thin = thin, burnin = burnin,
#    family = "categorical", prior = prior, verbose = FALSE))