anpan_pglmm_batch: Run PGLMMs on a batch of tree files

anpan_pglmm_batch R Documentation

Run PGLMMs on a batch of tree files

Description

This function fits phylogenetic generalized linear mixed models on a batch of tree files, using the same outcome and covariate arguments.

Usage

anpan_pglmm_batch(
  meta_file,
  tree_dir,
  outcome,
  covariates = NULL,
  offset = NULL,
  out_dir = NULL,
  trim_pattern = NULL,
  omit_na = FALSE,
  ladderize = TRUE,
  family = "gaussian",
  show_plot_cor_mat = TRUE,
  show_plot_tree = TRUE,
  save_object = FALSE,
  verbose = TRUE,
  loo_comparison = TRUE,
  run_diagnostics = FALSE,
  reg_noise = TRUE,
  plot_ext = "pdf",
  show_yrep = FALSE,
  show_post = TRUE,
  reg_gamma_params = c(1, 2),
  beta_sd = NULL,
  sigma_phylo_scale = 0.333,
  seed = 123,
  ...
)
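A minimal sketch of a call, assuming the metadata carries a sample_id column whose values match the tree tip labels (an assumption about the expected input; see anpan_pglmm()). The column names y and age, the random toy trees, and the wrapper fit_batch() are hypothetical and exist only for illustration.

```r
# Toy inputs for illustration only; all names and values are hypothetical.
set.seed(1)
ids <- paste0("s", 1:20)
meta <- data.frame(
  sample_id = ids,               # assumed ID column matching tree tip labels
  y         = rnorm(20),         # hypothetical continuous outcome
  age       = runif(20, 20, 60)  # hypothetical covariate
)

# Write two random trees into a directory that holds nothing but tree files.
tree_dir <- file.path(tempdir(), "toy_trees")
dir.create(tree_dir, showWarnings = FALSE)
for (i in 1:2) {
  ape::write.tree(
    ape::rtree(20, tip.label = ids),
    file = file.path(tree_dir, paste0("tree", i, ".tre"))
  )
}

# Wrapper so the sketch can be sourced without fitting anything.
# Calling it requires anpan and a working CmdStan installation.
fit_batch <- function() {
  anpan::anpan_pglmm_batch(
    meta_file  = meta,
    tree_dir   = tree_dir,
    outcome    = "y",
    covariates = "age",
    family     = "gaussian",
    out_dir    = file.path(tempdir(), "pglmm_out")
  )
}
```

Calling fit_batch() would then be expected to return one row per tree that fit successfully.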

Arguments

meta_file

either a data frame of metadata or a path to a file containing the metadata

tree_dir

string giving the path to a directory of tree files

outcome

the name of the outcome variable

covariates

covariates to account for (as a vector of strings)

offset

a variable to include as an offset

out_dir

directory in which to save outputs, if saving

trim_pattern

optional pattern to trim from tip labels of the tree

omit_na

logical indicating whether to omit incomplete cases of the metadata

ladderize

logical indicating whether to run ape::ladderize() on the tree before running the model

family

string giving the name of the distribution of the outcome variable (usually "gaussian" or "binomial")

show_plot_cor_mat

show a plot of the correlation matrix derived from the tree

show_plot_tree

show a plot of the tree overlaid with the outcome.

save_object

logical indicating whether to save the model fit object

loo_comparison

logical indicating whether to compare the phylogenetic model against a base model (without the phylogenetic term) using loo::loo_compare()

run_diagnostics

logical indicating whether to run cmdstanr::cmdstan_diagnose() and loo::pareto_k_table() to check the MCMC and loo diagnostics respectively.

reg_noise

logical indicating whether to regularize the ratio of sigma_phylo to sigma_resid with a Gamma prior

plot_ext

extension to use when saving plots

show_yrep

show a plot of the tree overlaid with the outcome and the posterior predictive distribution for each observation (only applies when the tree is plotted)

show_post

show a plot of the tree overlaid with the outcome and posterior distribution on phylogenetic effects.

reg_gamma_params

the shape and rate parameters of the Gamma prior on the noise term ratio. Default: c(1,2)

beta_sd

prior standard deviation parameters on the normal distribution for each covariate in the GLM component

sigma_phylo_scale

standard deviation of half-normal prior on sigma_phylo for logistic PGLMMs when family = 'binomial'. Increasing this value can easily lead to overfitting.

seed

random seed to pass to furrr_options()

...

other arguments to pass to the cmdstanr $sample() method (e.g. parallel_chains)

Details

See anpan_pglmm() for details on most of the arguments.
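To get a feel for what the default prior arguments imply, the quantiles can be computed directly in base R. This is only a sketch: it assumes reg_gamma_params is read as (shape, rate), as stated in the argument description, and that sigma_phylo_scale is the scale of the normal distribution underlying the half-normal prior.

```r
# Gamma prior on the sigma_phylo / sigma_resid ratio (default reg_gamma_params).
shape <- 1
rate  <- 2
prior_mean <- shape / rate  # mean of a Gamma(shape, rate) distribution
ratio_q <- qgamma(c(0.025, 0.5, 0.975), shape = shape, rate = rate)

# Half-normal prior on sigma_phylo, used when family = "binomial".
# If Z ~ Normal(0, s), the p-th quantile of |Z| is s * qnorm((1 + p) / 2).
s <- 0.333  # default sigma_phylo_scale
sigma_phylo_q <- s * qnorm((1 + c(0.5, 0.95)) / 2)

print(prior_mean)               # prior mean of the ratio
print(round(ratio_q, 3))        # 2.5%, 50%, 97.5% quantiles of the ratio
print(round(sigma_phylo_q, 3))  # 50% and 95% quantiles of sigma_phylo
```

Repeating the calculation with a larger s shows how quickly the prior admits large phylogenetic effects, which is the overfitting risk mentioned for sigma_phylo_scale.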

tree_dir must contain ONLY tree files readable by ape::read.tree().
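One way to check this requirement before starting a long batch is to try reading every file up front. The helper check_tree_dir() below is hypothetical (it is not part of anpan) and only tests whether ape::read.tree() returns a tree object for each file.

```r
# Return a named logical vector: TRUE where ape::read.tree() yields a tree.
check_tree_dir <- function(tree_dir) {
  files <- list.files(tree_dir, full.names = TRUE)
  ok <- vapply(files, function(f) {
    tree <- tryCatch(
      suppressWarnings(ape::read.tree(f)),
      error = function(e) NULL
    )
    inherits(tree, c("phylo", "multiPhylo"))
  }, logical(1))
  names(ok) <- basename(files)
  ok
}

# Demo on a temporary directory with one valid Newick file and one empty file.
demo_dir <- file.path(tempdir(), "check_trees")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("((A:1,B:1):1,C:2);", file.path(demo_dir, "good.tre"))
file.create(file.path(demo_dir, "empty.txt"))

print(check_tree_dir(demo_dir))
```

Any file flagged FALSE can be moved out of the directory before running the batch.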

If any trees cause an error during fitting, they are recorded in a data frame saved to the file pglmm_errors.RData in the output directory.
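A sketch of how the error file could be inspected after a run. The name of the object stored inside pglmm_errors.RData is not documented here, so the hypothetical helper load_rdata() loads the file into a separate environment and returns whatever it contains. The stand-in file written below is for demonstration only; its object and column names are made up.

```r
# Load an .RData file into its own environment and return its objects by name.
load_rdata <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)
  mget(obj_names, envir = env)
}

# Stand-in output directory and error file (hypothetical contents).
out_dir <- file.path(tempdir(), "pglmm_out_demo")
dir.create(out_dir, showWarnings = FALSE)
error_df <- data.frame(tree_file = "tree2.tre", message = "toy error")
save(error_df, file = file.path(out_dir, "pglmm_errors.RData"))

# After a real run, point err_path at the out_dir given to anpan_pglmm_batch().
err_path <- file.path(out_dir, "pglmm_errors.RData")
if (file.exists(err_path)) {
  errs <- load_rdata(err_path)
  str(errs)
}
```

Loading into a separate environment avoids overwriting objects of the same name in the workspace.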

The Stan model fitting can't be parallelized via futures, so the most effective way to parallelize both the model fitting and the importance weight calculations is to use a nested future topology (e.g. plan(list(sequential, tweak(multisession, workers = 4)))) and set parallel_chains = 4. This runs sequentially over the trees, fits the model with 4 parallel chains for each tree, then computes the importance weights in the multisession future for each tree.
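The nested topology described above can be set up as follows. This is a sketch: the plan() call comes from the text, while the wrapper run_batch() and the outcome name y are hypothetical, and parallel_chains is assumed to reach the sampler through the ... argument.

```r
library(future)

# Outer level: iterate over the trees sequentially.
# Inner level: a multisession backend with 4 workers, used for the
# importance weight calculations within each tree.
plan(list(sequential, tweak(multisession, workers = 4)))

# Wrapper so the sketch can be sourced without fitting anything.
# Calling it requires anpan and a working CmdStan installation.
run_batch <- function(meta, tree_dir) {
  anpan::anpan_pglmm_batch(
    meta_file       = meta,
    tree_dir        = tree_dir,
    outcome         = "y",  # hypothetical outcome column
    parallel_chains = 4     # forwarded via ... to the Stan sampler
  )
}
```

Once the batch has finished, plan(sequential) restores the default backend.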

The tibble returned by this function contains many large objects in list columns, so it can be very large (several GB) when saved to disk in an RData file (and hard to read when not printed as a tibble). Be careful if you try to save the whole thing.
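One way to keep a lightweight copy of the results is to drop the list columns before saving. The sketch below uses a small stand-in data frame with made-up column names (tree, n_leaves, model_fit); the actual result columns may be named differently.

```r
# Stand-in for the batch result; column names are hypothetical.
res <- data.frame(tree = c("t1", "t2"), n_leaves = c(20L, 35L))
res$model_fit <- list(rnorm(1e4), rnorm(1e4))  # list column with large objects

# Keep only the atomic (non-list) columns.
is_list_col <- vapply(res, is.list, logical(1))
summary_tbl <- res[, !is_list_col, drop = FALSE]

# Save the slimmed-down table; the full object stays in memory if needed.
out_file <- tempfile(fileext = ".rds")
saveRDS(summary_tbl, out_file)

print(names(summary_tbl))
```

The same column filter works on a tibble, since is.list() is applied to each column in turn.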

Value

a tibble listing results for each tree file in the input directory that fit successfully. Columns give the number of leaves on the tree, diagnostic values, loo comparison values, formatted input data, correlation matrices, PGLMM and "base" model fits, and loo objects (in list columns where appropriate).

See Also

ape::read.tree, ape::write.tree, anpan_pglmm()


biobakery/anpan documentation built on Aug. 14, 2024, 8:19 a.m.