anpan_pglmm: Run a phylogenetic generalized linear mixed model

View source: R/pglmm.R

anpan_pglmmR Documentation

Run a phylogenetic generalized linear mixed model

Description

Run a phylogenetic generalized linear mixed model

Usage

anpan_pglmm(
  meta_file,
  tree_file = NULL,
  cor_mat = NULL,
  outcome,
  covariates = NULL,
  offset = NULL,
  out_dir = NULL,
  trim_pattern = NULL,
  bug_name = NULL,
  omit_na = FALSE,
  ladderize = TRUE,
  family = "gaussian",
  show_plot_cor_mat = TRUE,
  show_plot_tree = TRUE,
  show_post = TRUE,
  show_yrep = FALSE,
  save_object = FALSE,
  verbose = TRUE,
  loo_comparison = TRUE,
  run_diagnostics = TRUE,
  reg_noise = TRUE,
  reg_gamma_params = c(1, 2),
  plot_ext = "pdf",
  beta_sd = NULL,
  int_prior_scale = 1,
  sigma_phylo_scale = 0.333,
  ...
)

Arguments

meta_file

either a data frame of metadata or a path to file containing the metadata

tree_file

either a path to a tree file readable by ape::read.tree() or an object of class "phylo" that is already read into R. Ignored if cor_mat is supplied.

cor_mat

a correlation matrix provided as an alternative to a tree.

outcome

the name of the outcome variable

covariates

covariates to account for (as a vector of strings)

offset

a variable to include as an offset

out_dir

if saving, directory where to save

trim_pattern

optional pattern to trim from tip labels of the tree

omit_na

logical indicating whether to omit incomplete cases of the metadata

ladderize

logical indicating whether to run ape::ladderize() on the tree before running the model

family

string giving the name of the distribution of the outcome variable (usually "gaussian" or "binomial")

show_plot_cor_mat

show a plot of the correlation matrix derived from the tree

show_plot_tree

show a plot of the tree overlaid with the outcome.

show_post

show a plot of the tree overlaid with the outcome and posterior distribution on phylogenetic effects.

show_yrep

show a plot of the tree overlaid with the outcome and the posterior predictive distribution for each observation if plotting the tree

save_object

logical indicating whether to save the model fit object

loo_comparison

logical indicating whether to compare the phylogenetic model against a base model (without the phylogenetic term) using loo::loo_compare()

run_diagnostics

logical indicating whether to run cmdstanr::cmdstan_diagnose() and loo::pareto_k_table() to check the MCMC and loo diagnostics respectively.

reg_noise

logical indicating whether to regularize the ratio of sigma_phylo to sigma_resid with a Gamma prior

reg_gamma_params

the shape and rate parameters of the Gamma prior on the noise term ratio. Default: c(1,2)

plot_ext

extension to use when saving plots

beta_sd

prior standard deviation parameters on the normal distribution for each covariate in the GLM component

int_prior_scale

standard deviation of the 0-centered normal prior for the intercept of the model (with centered covariates)

sigma_phylo_scale

standard deviation of half-normal prior on sigma_phylo for logistic PGLMMs when family = 'binomial'. Increasing this value can easily lead to overfitting.

...

other arguments to pass to cmdstanr::sample()

Details

the tip labels of the tree must be the sample ids from the metadata. You can use the trim_pattern argument to automatically trim off any consistent pattern from the tip labels if necessary.

The default error distribution for the outcome is "gaussian". You could change this to a phylogenetic logistic regression by changing family to "binomial" for example.

The prior for the intercept is a normal distribution centered on the mean of the outcome variable with a standard deviation of 3*sd(outcome variable).

If a prior scale on covariate effects isn't specified (i.e. beta_sd is left at the default NULL), the prior scale is set to 1 / (1 standard deviation) for each covariate. The values will be shown in a message. It's preferable to set beta_sd with domain knowledge. Consider the units/scale of your outcome variable if using family = "gaussian" or the range of plausible coefficient values on the log-odds scale if using family = "binomial".

If supplying beta_sd with a categorical predictor with >2 levels, only specify a single corresponding element in beta_sd. This appropriate element will get repeated as necessary.

sigma_phylo_scale is used to set the width of the half-normal prior on sigma_phylo when family = "binomial" (no effect when family = "gaussian"). Increasing this value can easily lead to overfitting. If sigma_phylo is high, then leafwise phylogenetic effects can get very far from zero (disregarding the correlation structure and covariate effects), which lets the model perfectly fit each observation. The default value keeps most of the prior mass on leafwise phylogenetic effects between +/- 1, which is still pretty liberal on the log-odds scale.

The model indexes the leaves according to the Cholesky factor of the correlation matrix. This likely won't be the same order as the input tree. The model_input that is returned as part of the output will have the samples in a factor with the model index order if needed.

It's normal to see some warnings during warmup, particularly about "Scale vector is inf".

This function tries to get the bug_name argument from the tree file, but if it's not provided you may need to set it yourself.

Common cmdstanr options that one might want to pass in via ... include refresh = 500 (show fewer MCMC progress updates), adapt_delta = .98 (avoid divergences at the cost of possibly needing more iterations to get convergence), show_messages = FALSE , and parallel_chains = 4 (run the MCMC chains in parallel).

If you want to use the PGLMM log-likelihood data frame with functions from the loo package, you'll need to convert it to a matrix with as.matrix(). It's converted to a tibble internally so that it prints nicely.

If int_prior_scale isn't specified, it defaults to 1 for binary outcomes and 1 standard deviation of the outcome for gaussian outcomes.

If offset is categorical, the offset value used is the empirical mean (continuous outcomes) or proportion (binary outcome) for each category. Using a categorical offset variable with a continuous outcome is strange and you probably shouldn't try it unless you know what you're doing.

Value

A list containing the model input (in the order passed to the model), estimated correlation matrix, the pglmm fit object, and (if loo_comparison is on) the base fit object and the associated loo objects.

See Also

anpan_pglmm_batch(), loo::loo(), cmdstanr::sample()

Examples

meta = data.frame(x = rnorm(100), sample_id = paste0("t", 1:100))
tr = ape::rtree(100)
anpan_pglmm(meta, tr,
outcome = "x",
iter_sampling = 10, # Use more in practice!
iter_warmup = 10,
show_plot_cor_mat = FALSE,
show_plot_tree = FALSE)

biobakery/anpan documentation built on Aug. 14, 2024, 8:19 a.m.