anpan_pglmm: Run a phylogenetic generalized linear mixed model
In biobakery/anpan: Quantifying Microbial Strain-Host Associations

anpan_pglmm

R Documentation

Run a phylogenetic generalized linear mixed model

Description

Run a phylogenetic generalized linear mixed model

Usage

anpan_pglmm(
  meta_file,
  tree_file = NULL,
  cor_mat = NULL,
  outcome,
  covariates = NULL,
  offset = NULL,
  out_dir = NULL,
  trim_pattern = NULL,
  bug_name = NULL,
  omit_na = FALSE,
  ladderize = TRUE,
  family = "gaussian",
  show_plot_cor_mat = TRUE,
  show_plot_tree = TRUE,
  show_post = TRUE,
  show_yrep = FALSE,
  save_object = FALSE,
  verbose = TRUE,
  loo_comparison = TRUE,
  run_diagnostics = TRUE,
  reg_noise = TRUE,
  reg_gamma_params = c(1, 2),
  plot_ext = "pdf",
  beta_sd = NULL,
  int_prior_scale = 1,
  sigma_phylo_scale = 0.333,
  ...
)

Arguments

`meta_file`	either a data frame of metadata or a path to file containing the metadata
`tree_file`	either a path to a tree file readable by `ape::read.tree()` or an object of class "phylo" that is already read into R. Ignored if `cor_mat` is supplied.
`cor_mat`	a correlation matrix provided as an alternative to a tree.
`outcome`	the name of the outcome variable
`covariates`	covariates to account for (as a vector of strings)
`offset`	a variable to include as an offset
`out_dir`	if saving, directory where to save
`trim_pattern`	optional pattern to trim from tip labels of the tree
`omit_na`	logical indicating whether to omit incomplete cases of the metadata
`ladderize`	logical indicating whether to run `ape::ladderize()` on the tree before running the model
`family`	string giving the name of the distribution of the outcome variable (usually "gaussian" or "binomial")
`show_plot_cor_mat`	show a plot of the correlation matrix derived from the tree
`show_plot_tree`	show a plot of the tree overlaid with the outcome.
`show_post`	show a plot of the tree overlaid with the outcome and posterior distribution on phylogenetic effects.
`show_yrep`	show a plot of the tree overlaid with the outcome and the posterior predictive distribution for each observation if plotting the tree
`save_object`	logical indicating whether to save the model fit object
`loo_comparison`	logical indicating whether to compare the phylogenetic model against a base model (without the phylogenetic term) using `loo::loo_compare()`
`run_diagnostics`	logical indicating whether to run `cmdstanr::cmdstan_diagnose()` and `loo::pareto_k_table()` to check the MCMC and loo diagnostics respectively.
`reg_noise`	logical indicating whether to regularize the ratio of sigma_phylo to sigma_resid with a Gamma prior
`reg_gamma_params`	the shape and rate parameters of the Gamma prior on the noise term ratio. Default: c(1,2)
`plot_ext`	extension to use when saving plots
`beta_sd`	prior standard deviation parameters on the normal distribution for each covariate in the GLM component
`int_prior_scale`	standard deviation of the 0-centered normal prior for the intercept of the model (with centered covariates)
`sigma_phylo_scale`	standard deviation of half-normal prior on `sigma_phylo` for logistic PGLMMs when `family = 'binomial'`. Increasing this value can easily lead to overfitting.
`...`	other arguments to pass to `cmdstanr::sample()`

Details

the tip labels of the tree must be the sample ids from the metadata. You can use the trim_pattern argument to automatically trim off any consistent pattern from the tip labels if necessary.

The default error distribution for the outcome is "gaussian". You could change this to a phylogenetic logistic regression by changing family to "binomial" for example.

The prior for the intercept is a normal distribution centered on the mean of the outcome variable with a standard deviation of 3*sd(outcome variable).

If a prior scale on covariate effects isn't specified (i.e. beta_sd is left at the default NULL), the prior scale is set to 1 / (1 standard deviation) for each covariate. The values will be shown in a message. It's preferable to set beta_sd with domain knowledge. Consider the units/scale of your outcome variable if using family = "gaussian" or the range of plausible coefficient values on the log-odds scale if using family = "binomial".

If supplying beta_sd with a categorical predictor with >2 levels, only specify a single corresponding element in beta_sd. This appropriate element will get repeated as necessary.

sigma_phylo_scale is used to set the width of the half-normal prior on sigma_phylo when family = "binomial" (no effect when family = "gaussian"). Increasing this value can easily lead to overfitting. If sigma_phylo is high, then leafwise phylogenetic effects can get very far from zero (disregarding the correlation structure and covariate effects), which lets the model perfectly fit each observation. The default value keeps most of the prior mass on leafwise phylogenetic effects between +/- 1, which is still pretty liberal on the log-odds scale.

The model indexes the leaves according to the Cholesky factor of the correlation matrix. This likely won't be the same order as the input tree. The model_input that is returned as part of the output will have the samples in a factor with the model index order if needed.

It's normal to see some warnings during warmup, particularly about "Scale vector is inf".

This function tries to get the bug_name argument from the tree file, but if it's not provided you may need to set it yourself.

Common cmdstanr options that one might want to pass in via ... include refresh = 500 (show fewer MCMC progress updates), adapt_delta = .98 (avoid divergences at the cost of possibly needing more iterations to get convergence), show_messages = FALSE , and parallel_chains = 4 (run the MCMC chains in parallel).

If you want to use the PGLMM log-likelihood data frame with functions from the loo package, you'll need to convert it to a matrix with as.matrix(). It's converted to a tibble internally so that it prints nicely.

If int_prior_scale isn't specified, it defaults to 1 for binary outcomes and 1 standard deviation of the outcome for gaussian outcomes.

If offset is categorical, the offset value used is the empirical mean (continuous outcomes) or proportion (binary outcome) for each category. Using a categorical offset variable with a continuous outcome is strange and you probably shouldn't try it unless you know what you're doing.

Value

A list containing the model input (in the order passed to the model), estimated correlation matrix, the pglmm fit object, and (if loo_comparison is on) the base fit object and the associated loo objects.

Examples

meta = data.frame(x = rnorm(100), sample_id = paste0("t", 1:100))
tr = ape::rtree(100)
anpan_pglmm(meta, tr,
outcome = "x",
iter_sampling = 10, # Use more in practice!
iter_warmup = 10,
show_plot_cor_mat = FALSE,
show_plot_tree = FALSE)

biobakery/anpan documentation built on Jan. 3, 2025, 9:11 a.m.