knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

It's difficult to find associations between microbial strains and host health outcomes due to their fine resolution and non-recurrence across individuals. This package, anpan, aims to make inferring those relationships a bit easier by providing an interface to our strain analysis functionality. This functionality covers three main points:

  1. gene models - identifying genes within a bug that associate with an outcome
  2. phylogenetic generalized linear mixed models (PGLMMs) - assessing whether a bug's phylogeny associates with an outcome
  3. pathway random effects models - identifying pathways within a bug that differ in abundance between groups

We'll cover each of these points in a top-level section. There is also an [Advanced topics] section with some additional information on a handful of more complicated diagnostics/methods/techniques.

Libraries

If you need installation instructions, please see the readme on Github.

We'll start by loading the library in this code chunk:

library(anpan)

The startup message points out that you can easily parallelize / show progress bars for most long-running computations in anpan by setting plan() and handlers() after loading the furrr and progressr packages. For most users plan(multisession, workers = 4) and handlers(global = TRUE) are probably close to what you want, though both the parallelization strategy and progress reporting are highly customizable.
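For example, a typical setup following the startup message's suggestion looks like the chunk below (4 workers is an arbitrary choice; scale it to your machine):

library(furrr)
library(progressr)

plan(multisession, workers = 4) # run parallelizable anpan code on 4 background R sessions
handlers(global = TRUE)         # show progress bars on long computations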

A couple points to know about anpan function names:

  1. the model-fitting functions all start with anpan: anpan(), anpan_pglmm(), anpan_pwy_ranef()
  2. functions that run a model across many bugs at once end in _batch: anpan_batch(), anpan_pglmm_batch(), anpan_pwy_ranef_batch()

This code chunk loads some other packages we'll use:

library(data.table)
library(ggplot2)
library(tibble)
library(dplyr)
library(ape)

Gene model

The gene model uses microbial functional profiles and sample metadata to identify which genes in which bugs best explain sample outcome variables, while accounting for additional covariates like age and gender. The functional profiles are typically generated by HUMAnN. In these profiles, rows are genes and columns are samples.

There are two key problems here: many samples won't contain the bug at all (so their near-zero gene profiles carry no information about gene effects), and each bug carries thousands of genes to test against the outcome.

The gene model (implemented in the function anpan()) handles these issues respectively by:

  1. adaptively filtering out samples where the bug does not appear to be present
  2. fitting a regression model for each gene, then applying an FDR correction across genes

In this section, you'll get some example data (either by reading the data included with the package or by simulating it yourself), run both steps with a single call to anpan() (or anpan_batch()), examine the filtering diagnostics, then examine the model results.

Get example data

In this section we'll obtain a small simulated dataset for a single bug that comes prepackaged with anpan. The simulated data doesn't look realistically noisy (see the plots in the README / manuscript for that), but it incorporates the generative features anpan aims to pick up on. The [Simulate example data] sub-section further down shows exactly how this data was simulated if you would like to improve your understanding of the relationships in the data. You can get the paths to the data included with the package with these two commands:

meta_path = system.file("extdata", "fake_metadata.tsv", 
                        package = "anpan", mustWork = TRUE)

bug_path  = system.file("extdata", "g__Madeuppy.s__Madeuppy_Fakerii.genefamilies.tsv.gz", 
                        package = "anpan", mustWork = TRUE)

Let's take a peek at the top of the metadata and the top-left corner of the gene profile data to see what they look like:

fread(meta_path, nrows = 5)

Notably, the metadata needs to have a column called "sample_id" that matches the column names of the gene file, plus any outcome / covariate information we might use. Later we will use only is_case as the outcome and no additional covariates, but you might also want to include age, gender, etc., in which case they need to be present in this file.

Now let's look at the gene data:

fread(bug_path, nrows = 5, select = 1:4)

The gene file needs to have gene IDs in the first column, formatted as gene_id|taxon_id e.g. UniRef90_ABC000001|g__Madeuppy.s__Madeuppy_Fakerii. The names of the remaining columns need to exactly match the entries in the sample_id column of the metadata. By default anpan will trim "_Abundance-RPKs" from the column names of the gene file since that part of the ID is usually added by HUMAnN and not present in the metadata.
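As a quick sanity check (our own suggestion, not a required anpan step), you can confirm that the sample IDs line up between the two files:

meta  = fread(meta_path)
genes = fread(bug_path)

# trim the HUMAnN suffix from the sample columns, like anpan does
gene_samples = sub("_Abundance-RPKs$", "", names(genes)[-1])

setdiff(meta$sample_id, gene_samples) # should come back empty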

Run the gene model

Now that we have the path to the gene data set in bug_path and the path to the metadata set in meta_path, we can run the filtering and modeling with anpan(). If you have multiple bug files in a single directory, use anpan_batch() and set the bug_dir argument in place of bug_file.

The code block below shows the anpan() function call. The outcome argument is set to is_case, the logical indicator of case status, which means that anpan() will run a logistic regression model for each gene. In this case, we have no covariates to include, so we set that argument to NULL. The output directory is set to a sub-directory of the tempdir() we've been using so far, but it could be somewhere more permanent if you like.

anpan_res = anpan(bug_file          = bug_path,
                  meta_file         = meta_path,
                  out_dir           = file.path(tempdir(), "anpan_output"),
                  covariates        = NULL,
                  outcome           = "is_case",
                  model_type        = "fastglm",
                  discretize_inputs = TRUE)

You can see from the first set of messages that it seems to have identified most of the ~50 samples that didn't have the bug present in the simulation.

A lot of stuff gets written out to out_dir, but let's look at the result that we get in R:

head(anpan_res[order(p.value)])

anpan() returns a data.table with regression statistics for the effect of each gene on the outcome. We can see that the two top-ranked hits (and the only two that pass a reasonable q-value threshold) are the strongest genes from the list of true effects. We don't get the second through fourth genes with (weaker) true effects, meaning we likely didn't have enough power in this simulation to detect those. The regression statistics for all terms in the model (including non-gene covariate terms) are written to the output directory in *all_terms.tsv.gz. The stats for just the gene terms (i.e. the same data frame returned by anpan() in R) are written to *gene_terms.tsv.gz. For anpan_batch() those will both be in the model_stats/ output directory, with a file called all_bug_gene_terms.tsv.gz in the top level of the output directory.
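Since the raw p-values are in the returned table, you can also recompute the FDR correction yourself if you prefer a different method. The q_manual column below is our own name, not an anpan output:

anpan_res[, q_manual := p.adjust(p.value, method = "BH")]

head(anpan_res[order(q_manual)])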

Examine filtering diagnostics

Before interpreting the model results, it's important to ensure that the filtering step didn't go awry. anpan() calls the function read_and_filter() internally, which applies the adaptive filtering: for each sample it computes the number of non-zero gene entries and the median log abundance of those entries, then uses k-means clustering on those two statistics to call the bug present or absent, discarding the bug-absent samples.

The outputs of the filtering step go into the filter_stats/ directory inside the specified out_dir. The final filtered data (which is the input to the modeling step) is in filter_stats/filtered_*.tsv.gz where * represents the bug name. The filter_stats/labels/ directory contains the calls of which samples contained the bug. The filter_stats/plots/ directory contains diagnostic plots from the filtering step. For a given bug, there are two plots to inspect: the gene profile lines plot and the k-means filtering dot plot. We'll look at the dot plot first:

knitr::include_graphics("vignette_figs/g__Madeuppy.s__Madeuppy_Fakerii_kmeans.png") #show lines and kmeans plots

Recall the original input file: genes are rows and columns are samples. Here, each dot represents a sample, meaning a column in the original input. The x axis is the number of non-zero entries in that column, and the y axis is the median log abundance of the non-zero entries. If a sample has many non-zero entries and high median log abundance, the bug in question is probably well-covered in the sample. A simple k-means clustering is applied to the dots on this scatterplot, which is used to define the presence/absence calls (i.e. provide the coloring of the dots). In this particular example, only two samples are misclassified. If the dot plot doesn't show two well-separated populations, it might make sense to do some initial manual curation of the data as appropriate and then set filtering_method = "none" when calling anpan() to turn off the k-means filtering.
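If you want the numbers behind this plot, you can compute the per-sample statistics directly from the gene file. The chunk below is just a sketch to illustrate the axes (we use the natural log here; the choice of log base doesn't change the clustering structure):

genes = fread(bug_path)
sample_cols = names(genes)[-1]

dot_stats = data.table(sample_id = sample_cols,
                       n_nonzero = sapply(genes[, ..sample_cols],
                                          function(x) sum(x > 0)),
                       med_labd  = sapply(genes[, ..sample_cols],
                                          function(x) median(log(x[x > 0]))))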

Next we'll look at the gene profile lines plot:

knitr::include_graphics("vignette_figs/g__Madeuppy.s__Madeuppy_Fakerii_labelled_lines.png") #show lines and kmeans plots

This shows a monotonically decreasing line for each sample. The x-axis is the rank of genes in that sample, while the y-axis is the log-abundance of the ith-ranked gene in that sample. Ideally there should be good separation between red and blue lines. In this simulated data the lines decrease roughly linearly, but in more realistic data the profile usually looks more like a plateau. If a particular sample contains a mixture of strains, there may even be multiple plateau steps as the profile decreases, representing different groups of genes dropping out between strains.

If the two groups of samples identified by the filtering step aren't clearly separated, that's cause for concern: you're potentially throwing out or including samples you shouldn't. If the automated filtering doesn't work out, you can filter your samples manually, then turn off the automated filtering with filtering_method = "none".
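A sketch of what that call looks like, assuming you've already curated the inputs at bug_path / meta_path down to the samples you trust:

anpan(bug_file          = bug_path,
      meta_file         = meta_path,
      out_dir           = file.path(tempdir(), "anpan_output_manual"),
      covariates        = NULL,
      outcome           = "is_case",
      model_type        = "fastglm",
      filtering_method  = "none",
      discretize_inputs = TRUE)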

Examine gene model results

We can visualize the results with plot_results() as in the code block below. This function is called automatically on each bug when using anpan_batch(). We first read in the model inputs from the filter_stats directory, then pass that and the model results to plot_results():

input_path = file.path(tempdir(), 
                       "anpan_output", "filter_stats",
                       "filtered_Madeuppy_Fakerii.tsv.gz")

model_input = fread(input_path)

plot_results(res            = anpan_res,
             model_input    = model_input,
             covariates     = NULL, 
             outcome        = "is_case",
             bug_name       = "Madeuppy_Fakerii",
             cluster        = "both",
             show_trees     = TRUE,
             n_top          = 20,
             q_threshold    = NULL,
             beta_threshold = NULL)

Things to know about the results plot:

Some additional outputs you'll get with anpan_batch():

Additional details

Additional details on the gene model.

Simulate example data

This section simulates some data for 200 samples in a synthetic bug with 500 genes. The simulation doesn't look realistically noisy (see the plots in the README / manuscript for that), but it incorporates the generative features anpan aims to pick up on.

First we set some of the simulation parameters and a random seed:

n_per_group = 100
n_gene      = 500

out_dir = tempdir()

set.seed(123)

Then we generate the simulated metadata. The has_bug column is an indicator of whether a given sample has the bug at all. Roughly a quarter of our samples don't. Of course you won't have this indicator in real data -- this is what the adaptive filtering step tries to figure out.

sim_meta = data.table(sample_id = paste0('sample_', 1:(2 * n_per_group)),
                      is_case   = rep(c(FALSE, TRUE), 
                                      each = n_per_group),
                      has_bug   = sample(x       = c(TRUE, FALSE),
                                         prob    = c(.75, .25), 
                                         size    = 2 * n_per_group,
                                         replace = TRUE))

meta_path = file.path(out_dir, "fake_metadata.tsv")

fwrite(sim_meta,
       file = meta_path,
       sep = "\t")

We simulate the gene-level data in the blocks below. A couple notes on what's going on here: only the first 5 of the 500 genes get non-zero true effects (drawn from a normal distribution), and those effects apply only to the case samples (via the is_case * effect term further down).

First we create the true gene effects and a grid of all gene-sample combinations:

true_effects = data.table(gene   = paste0("UniRef90_ABC", 1:(n_gene)),
                          effect = c(rnorm(5, sd = 2), 
                                     rep(0, (n_gene - 5))))

sim_grid = expand.grid(gene      = paste0("UniRef90_ABC", 1:(n_gene)),
                       sample_id = paste0(     "sample_", 1:(2*n_per_group))) |>
  as.data.table()

Then we simulate the log-abundances. A couple notes on how this is done: the log-abundances are drawn from a left-skewed normal distribution (defined by the helper function below), each sample's bug-level log abundance bug_labd propagates into its gene-level values, and afterwards the bottom third of all gene log-abundances is set to -Inf (i.e. zero abundance) to mimic genes absent from a given strain:

rskewnormal = function(n, location = 0, scale = 1, skew = 0) {
  # adapted from sn::rsn()

  delta = skew / sqrt(1 + skew^2)
  chi   = abs(rnorm(n))
  nrv   = rnorm(n)

  z = delta * chi + sqrt(1 - delta^2) * nrv

  return(location + scale * z)
}

sim_meta[, bug_labd := rskewnormal(n        = nrow(sim_meta),
                                   location = -2,
                                   skew     = -3) +
                       -2.5 * (!has_bug)]

sim_df = sim_grid[sim_meta, on = 'sample_id'][true_effects, on = 'gene']

sim_df[, gene_labd := rskewnormal(n        = nrow(sim_df),
                                  location = -1 + 0.5 * bug_labd + is_case * effect,
                                  skew     = -3,
                                  scale    = 3)]

sim_df$gene_labd[sim_df$gene_labd < quantile(sim_df$gene_labd,
                                             probs = .333)] = -Inf

The block below performs a bit of formatting to make the simulated data look like real HUMAnN data. It:

  1. exponentiates the log-abundances (turning the -Inf values into 0s)
  2. reshapes the data from long to wide format, with genes as rows and samples as columns
  3. appends the bug name to each gene ID
  4. renames the first column to "# Gene Family"

sim_df[, gene_abd := exp(gene_labd)]

sim_wide = dcast(sim_df[, .(gene, sample_id, gene_abd)],
                 gene ~ sample_id,
                 value.var = "gene_abd")

sim_wide[, gene := paste0(gene, "|g__Madeuppy.s__Madeuppy_Fakerii")]

names(sim_wide)[1] = "# Gene Family"

Then we write out the simulated data. The name of the file isn't important, but here we've chosen to follow the naming conventions of HUMAnN.

bug_path = file.path(out_dir, "g__Madeuppy.s__Madeuppy_Fakerii.genefamilies.tsv.gz")

fwrite(sim_wide,
       file = bug_path,
       sep  = "\t")

PGLMMs

Phylogenetic generalized linear mixed models (PGLMMs) are probabilistic models that account for phylogenetic structure. For a PGLMM with a linear regression as the base model, the model structure is as follows:

$$ y = X \beta + (1|\text{leaf}) + \epsilon $$

$$(1|\text{leaf}) \sim \text{MVNormal}(0, \sigma_{phylo}^2\Omega)$$

$$\epsilon \sim \text{Normal}(0, \sigma_{resid}^2)$$

The outcome $y$ is modeled as a function of some familiar terms: covariates $X$, coefficients $\beta$, and residual noise $\epsilon$. The key addition is the phylogenetic term $(1|\text{leaf})$, which contains a "random effect" for each observation in $y$. Unlike typical random effects (which are usually independent between different levels of the random effect variable), the values in the phylogenetic term follow a pre-specified correlation structure $\Omega$, which is derived from a tree. The variability of the phylogenetic term is scaled by the "phylogenetic noise" parameter $\sigma_{phylo}$.

To understand how a tree implies a correlation structure, consider the plot below showing a phylogenetic tree and the lower triangle of the correlation matrix it implies. You can see that the clade on the left (with low inter-leaf distances between its members) forms a bright sub-block of high correlation in the matrix. The tiny clade on the far right has high inter-leaf distance between its members and the rest of the tree, so its members generally have low correlation with the rest of the leaves, exhibited by the dark band on the right of the triangle. Note that the matrix is derived solely from the structure of the tree (the black lines) -- it is not influenced by the colored outcome dots. The correlation matrix numerically quantifies the "higher inter-leaf distance → lower correlation" principle.

knitr::include_graphics("vignette_figs/tree_hcm.png")

We can see in the synthetic example above that the tree clearly does impact the outcome: the two well-separated clades clearly show very different outcome values. A PGLMM evaluated on this data would easily detect such a blazing signal, so in the remainder of this section we'll simulate a tree with an outcome that's realistically noisier and less visually obvious. After that we'll fit the PGLMM with anpan_pglmm() then examine the results.

Simulate data

To demonstrate the PGLMM functionality, we'll simulate a tree, a single covariate x, and an outcome variable y. We'll use n = 120, sigma_phylo = 1 and sigma_resid = 1.

The best way to understand a generative model is to use it to simulate data. If you feel like you don't understand PGLMMs, step through these simulation blocks line by line (10 lines in total), look at the outputs, and make sure you understand what happens on that line. Try modifying it in different ways and see how the outputs and downstream inference change.

The code chunk below sets a randomization seed and the parameters we'll use, then generates a random tree:

set.seed(123)

n = 120
sigma_phylo = 1
sigma_resid = 1

tr = ape::rtree(n)

You can try plot(tr) if you want to take a quick look at your tree, but we'll examine it with some additional detail later.

The chunk below derives the correlation matrix implied by the tree:

cor_mat = ape::vcv.phylo(tr, corr = TRUE)

cor_mat[1:5,1:5]

The "t#" labels are the default sample IDs that ape::rtree() puts as tip labels.

The chunk below generates a normally distributed covariate covariate, the linear term contribution, the true phylogenetic effects we'll use, and the outcome variable outcome, all stored in a tibble called metadata. Note that the effect size of the linear covariate is explicitly set to 1:

covariate = rnorm(n)

linear_term = 1 * covariate + rnorm(n, mean = 0, sd = sigma_resid)

true_phylo_effects = sigma_phylo * MASS::mvrnorm(1, mu = rep(0, n), Sigma = cor_mat)

metadata = tibble(sample_id = colnames(cor_mat),
                  covariate = covariate,
                  outcome   = linear_term + true_phylo_effects)
# For some reason, putting the PGLMM random data generation after the gene model section causes the
# random number generation to not reproduce correctly. I think the future_map() call in anpan() is
# messing with the R sessions Rmarkdown is using.

# Just silently load the right objects as a hack.
load('vignette_figs/pglmm_objects.RData')
metadata

Now we have our tree and the metadata that goes along with it. A quick plot shows that the outcome correlates with the simulated covariate as desired:

ggplot(metadata, aes(covariate, outcome)) + 
  geom_point() + 
  labs(title = "The simulated covariate correlates with the outcome") + 
  theme_light()

We can use the function anpan::plot_outcome_tree() to visually inspect the tree and confirm that it also appears related to the outcome. The dot on each leaf is shaded according to the outcome:

plot_outcome_tree(tr,
                  metadata, 
                  covariates = NULL,
                  outcome    = "outcome")

You can sort of see by eye that the phylogeny correlates with the outcome. The middle clade is generally brighter, and close neighbors usually have about the same shade of color.

How do we quantitatively assess the impact of the phylogeny on the outcome while also building in the relationship with the covariate? Use a PGLMM.

Exercise: Change the simulation to produce a binary outcome. Hint: If you don't want to define an inverse logit function, you can use stats::plogis(). You may also want to use rbinom() and start by first simulating a standard logistic regression. The answer is given in the source Rmarkdown.

# One possible answer to the exercise. Note that we use the covariate by itself without the normal
# error (i.e. the term containing sigma_resid) because the error distribution is fixed (i.e. the
# binomial distribution).

# If you leave the other parameters at the values used in the tutorial, this usually won't yield detectable phylogenetic signal since binary outcomes contain much less information. To get that you would need to increase n and/or sigma_phylo.

inv_logit = function(x) 1 / (1 + exp(-x))

metadata = tibble(sample_id = colnames(cor_mat),
                  covariate = covariate,
                  prob      = inv_logit(true_phylo_effects + 1*covariate), 
                  outcome   = as.logical(rbinom(n    = n, 
                                                prob = prob, 
                                                size = 1)))

# stats::plogis does the same thing as inv_logit
# drop the `true_phylo_effects` from the `prob = ...` line and you have a standard logistic regression.

Fit the PGLMM

The code chunk below shows how to use anpan_pglmm() to fit a PGLMM that examines this data for phylogenetic patterns while adjusting for our covariate. It will take a couple minutes to run. By default anpan_pglmm regularizes the noise ratio sigma_phylo / sigma_resid with a Gamma(1,2) prior and runs a leave-one-out model comparison against a "base" model that doesn't have the phylogenetic component. The default family = "gaussian" argument means the residual error is normally distributed. This can be changed to family = "binomial" with a binary outcome to run a phylogenetic logistic regression.

result = anpan_pglmm(meta_file       = metadata,
                     tree_file       = tr,
                     outcome         = "outcome",
                     covariates      = "covariate",
                     family          = "gaussian",
                     bug_name        = "sim_bug",
                     reg_noise       = TRUE,
                     loo_comparison  = TRUE,
                     run_diagnostics = FALSE,
                     refresh         = 500,
                     show_plot_tree  = FALSE,
                     show_post       = FALSE)

Aside from fitting the model, this command prints a plot (and a lot of console output with messages about the progression of the MCMC sampler). This plot shows the correlation matrix implied by the tree -- this is how the PGLMM sees the correlation structure of the leaves. (Side note: the randomized tree name shows up because we didn't provide one explicitly and the function wasn't able to pull one from the tree input either.)

If you want to tl;dr the rest of the PGLMMs section, look at the model comparison result to decide if the phylogeny is informative beyond the base model:

result$loo$comparison

Leave-one-out model comparison is NOT a null hypothesis significance test.

The PGLMM is the best fitting model, the difference in ELPD (a predictive performance score) is large in magnitude (more than 4), and the difference in ELPD for the base model is more than two standard errors below zero. So in this case you can say the phylogeny clearly contributes to the outcome. Given that we simulated the outcome with $\sigma_{phylo} > 0$, that is the correct conclusion.

If you want to see how the fit sees the effect of the tree, you can examine the posterior distribution on the phylogenetic effects with a simple interval plot laid out below the tree using plot_tree_with_post() (or leave show_post at its default TRUE value in the above call to anpan_pglmm()):

p = plot_tree_with_post(tree_file  = tr,
                        meta_file  = metadata,
                        fit        = result$pglmm_fit,
                        covariates = "covariate",
                        outcome    = "outcome",
                        color_bars = TRUE,
                        labels     = metadata$sample_id)
p

You can see that the model fit confirms what we observed visually earlier: the phylogenetic effect generally shifts the outcome up in the middle clade and tightly correlated neighbors generally have very similar phylogenetic effects.

But how do we estimate clade effects? It's important to contextualize the results in terms of the model formulation. A PGLMM doesn't see the tree, it sees the correlation matrix $\Omega$. Human eyes can easily pick out blocks in the matrix as discrete clades, but the model sees only a single, constrained multivariate distribution. So a plot like this, where the phylogenetic effects are set against the tree, is a straightforward presentation of the raw model fit, but it can't draw the sharp dividing lines we might want.

If you are able to justifiably define a clade manually (with some articulable, non-circular reasoning beyond "this clade looks like it has a higher outcome"), you can use the function compute_clade_effects() to compare the average phylogenetic effect of clade members against that of non-members. However automating the selection of clades is non-trivial and beyond the scope of this package, so we won't be demonstrating that function here.

Exercise: Fit a PGLMM with a binary outcome. If you weren't able to modify the simulation to generate binary outcomes in the previous exercise, just set the outcome to random draws of TRUE/FALSE (in which case you should observe zero phylogenetic signal). The answer is given in the source Rmarkdown.

# Either re-run the simulation to get a binary outcome (see earlier exercise answer) or set metadata$outcome = sample(c(TRUE, FALSE), size = n, replace = TRUE). Then re-run anpan_pglmm() changing the following arguments:
# * family = "binomial" 
# * reg_noise = FALSE

# Generate a random binary outcome. Skip this if you already have a binary result from the previous exercise.
metadata$outcome = sample(c(TRUE, FALSE), 
                          size = n, replace = TRUE)
# ^ There is no relationship to the covariate nor phylogeny here. So the estimates for the covariate coefficient and sigma_phylo should be close to 0.

result = anpan_pglmm(meta_file       = metadata,
                     tree_file       = tr,
                     outcome         = "outcome",
                     covariates      = "covariate",
                     family          = "binomial",
                     bug_name        = "sim_bug",
                     reg_noise       = FALSE,
                     loo_comparison  = TRUE,
                     run_diagnostics = FALSE,
                     refresh         = 500,
                     show_plot_tree  = FALSE,
                     show_post       = FALSE)

Detailed interpretation

PGLMMs have a lot of moving parts. Let's look at some of the results more closely.

Messages

A couple points to mention on the messages produced by anpan_pglmm():

(1/3) Checking inputs.
Plotting correlation matrix...
Prior scale on covariate effects aren't specified. Setting to 1 standard deviation for each centered covariate. These values are:
   linear_term prior_sd
        <char>    <num>
1:   covariate    1.019
It would be better to set the beta_sd argument based on scientific background knowledge.

This message states the empirical priors on the covariate coefficients that anpan inferred from the data. It's usually better to set these yourself with the beta_sd argument based on background information.

Init values were only set for a subset of parameters. 
Missing init values for the following parameters:
 - chain 1: beta, centered_cov_intercept, std_phylo_effects
 - chain 2: beta, centered_cov_intercept, std_phylo_effects
 - chain 3: beta, centered_cov_intercept, std_phylo_effects
 - chain 4: beta, centered_cov_intercept, std_phylo_effects

These are benign -- the randomized initialization that Stan provides works fine for these parameters.

Chain 1 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 1 Exception: normal_id_glm_lpdf: Scale vector is inf, but must be positive finite! (in '/var/folders/g8/22sdpptx451fng3nbj7c45vw0000gq/T/RtmpSaBGri/model-2e4326e878c5.stan', line 37, column 2 to column 62)
Chain 1 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 1 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

This means that the noise parameter during one of the warmup iterations escaped the weak prior and got set to a really large value and the model didn't like it. As the message states, this isn't a problem if it only occurs sporadically during the warmup iterations. You can check the Stan outputs or set refresh = 1 to confirm this, but I've personally never seen this message occur outside of warmup.

Elements of the result value

result is a list with five elements. The ones we'll use in this vignette are pglmm_fit (the fitted PGLMM), loo (the model comparison results), and model_input (the formatted data that went into the model).

The pglmm_fit part of the result is the model fit produced by cmdstanr for the PGLMM. Any method applicable to CmdStanMCMC objects works on this, but of particular interest is the summary() method which we'll use to inspect model fit.
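For example, to pull the posterior summary of just the two noise parameters (the variables argument is part of cmdstanr's summary() interface; the parameter names here are the ones this vignette discusses, so double check them against your own fit):

result$pglmm_fit$summary(variables = c("sigma_phylo", "sigma_resid"))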

Check diagnostics

If you run anpan_pglmm(run_diagnostics = TRUE), the function will run the diagnostic checking functions for the MCMC and loo results as applicable. In the example above, the output will look something like this:

(4/4) Running diagnostics:
Processing csv files: /var/folders/g8/22sdpptx451fng3nbj7c45vw0000gq/T/RtmpZqhitZ/cont_pglmm_noise_reg-202211071429-1-59c385.csv, /var/folders/g8/22sdpptx451fng3nbj7c45vw0000gq/T/RtmpZqhitZ/cont_pglmm_noise_reg-202211071429-2-59c385.csv, /var/folders/g8/22sdpptx451fng3nbj7c45vw0000gq/T/RtmpZqhitZ/cont_pglmm_noise_reg-202211071429-3-59c385.csv, /var/folders/g8/22sdpptx451fng3nbj7c45vw0000gq/T/RtmpZqhitZ/cont_pglmm_noise_reg-202211071429-4-59c385.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.
Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     119   99.2%   720       
 (0.5, 0.7]   (ok)         1    0.8%   1654      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
Warning message:
Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.

You can see that the MCMC diagnostics all came out well in this case. You'll get a link to the Stan warnings guide if your sampler hits any issues. If you do get convergence issues, the two simplest things to try first are usually setting adapt_delta to a higher-than-default value (e.g. 0.95) and increasing the number of iterations with e.g. iter_warmup = 2000 and iter_sampling = 2000. Another common issue is that if you accidentally use family = "gaussian" where you meant family = "binomial", you'll get E-BFMI warnings.
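For example, a sketch of such a call (we're assuming here, as the advice above implies, that anpan_pglmm() forwards these arguments to the sampler):

result = anpan_pglmm(meta_file     = metadata,
                     tree_file     = tr,
                     outcome       = "outcome",
                     covariates    = "covariate",
                     adapt_delta   = 0.95, # raise toward 1 for tricky posteriors
                     iter_warmup   = 2000,
                     iter_sampling = 2000)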

In the example above we got one ok (but not good) Pareto k value, which is more-or-less fine. Running more iterations with a higher adapt_delta can sometimes help improve this, but for anpan_pglmm() the occurrence of bad / very bad Pareto k values is usually more indicative of a dataset problem than a model problem. Low N, correlation matrices that are close to the identity, and/or imbalanced binary outcomes can make it impossible to estimate accurate loo results. The Cross-validation FAQ has more information. The parameter of interest in a PGLMM is really sigma_phylo anyway (see next section), so loo results aren't absolutely necessary to assess the impact of the phylogeny on the outcome. If you aren't interested in the model comparison, you can just set loo_comparison = FALSE and ignore the issue.

Inspect model fit

We'll use the summary() method here to pull out the posterior means for several parameters (black) and compare them to their true values (red). These parameters include the covariate effect beta, the intercept, the residual noise term sigma_resid, the spread of phylogenetic effects sigma_phylo, and the per-leaf phylogenetic effects (which are indexed in the same order as model_input):

post_summary = result$pglmm_fit$summary() |> 
  filter(grepl("beta|^intercept|sigma|^phylo_effect", variable)) |> 
  mutate(true_value = c(1,1,1, true_phylo_effects, 0))

post_summary[-c(4:113),] |> 
  ggplot(aes(mean, variable)) + 
  geom_point() + 
  geom_segment(aes(x = q5, xend = q95,
                   y = variable, yend = variable)) + 
  geom_point(aes(x = true_value), 
             color = 'firebrick1',
             size = 3) + 
  labs(x = 'value',
       y = 'parameter',
       title = 'Posterior mean & 90% interval in black, true values in red') + 
  theme_light()

Most of the phylogenetic effects are cut from the plot for the sake of visibility. We can see that the posterior mean for beta[1] (the effect of the covariate) is fairly close to the true value (1), as are the intercept (0) and noise terms (1 and 1). Importantly, the 90% posterior intervals capture the true values.

Another thing to note on the above plot is the posterior interval of sigma_phylo. This parameter is the estimated marginal "spread" of leaf effects. It's about 1, and clearly quite far from zero. This means that leaf effects are allowed to be quite large, depending on where a leaf is in the tree. If the sigma_phylo estimate is near zero, that means leaf effects are all small, meaning the tree structure doesn't explain variation in the outcome well.

So we mostly recaptured our simulation parameters, but how well does the model really fit the data overall? We can use anpan::plot_tree_with_post_pred() to overlay the posterior predictive distribution for each leaf and show that the predictive distribution for each observation usually covers the true value:

anpan::plot_tree_with_post_pred(tree_file  = tr,
                                meta_file  = metadata,
                                covariates = "covariate",
                                outcome    = "outcome",
                                fit        = result$pglmm_fit,
                                labels     = metadata$sample_id,
                                verbose    = FALSE)

So when observations (the colored dots) are held out, the posterior predictive distribution for each leaf (the boxplots) generally capture the true value, showing that our model generally fits the data well.

What about the posterior on the linear model component? We can visualize that by taking some of the posterior draws (with cmdstanr::draws()) of the line (the parameters beta[1] and centered_cov_intercept) and overlay those onto the outcome ~ covariate scatterplot. The covariate is centered with scale(scale=FALSE) since that is how the Stan model uses the data:

line_draws = result$pglmm_fit$draws(format = 'data.frame') |> 
  as_tibble() |> 
  select(beta = `beta[1]`, intercept = centered_cov_intercept) |> 
  slice_sample(n = 40)

result$model_input |> 
  ggplot(aes(scale(covariate, scale = FALSE), outcome)) + 
  geom_point() + 
  geom_abline(data = line_draws,
              aes(slope = beta,
                  intercept = intercept),
              alpha = .2) + 
  theme_light()

There's a bit more uncertainty in the slope and intercept than you might expect given how visually clear the relationship is. This is because the model has to assess this relationship in the context of the very flexible phylogenetic component. If $\sigma_{phylo}$ is high and the phylogenetic component is explaining most of the residual variation here, the linear component is free to relax toward the prior distribution. Vice versa if the phylogenetic component explains the variation poorly.

loo interpretation

Let's look at the pglmm_loo result:

result$loo$pglmm_loo

PGLMM models are very flexible because they have a parameter for every leaf. In some cases, this can cause the naive leave-one-out importance weights calculated from raw log-likelihood values to be unstable. anpan generates importance weights for each observation by integrating the conditional likelihood for each observation at each posterior iteration as described in section 3.6.1 of @vehtari_glvm. This produces stable importance weights that we can use for model comparison. Note that unstable importance weights can still occur if the posterior on the phylogenetic effects is poorly constrained by the tree (i.e. the correlation matrix is close to the identity matrix) and/or the data (i.e. n is low, typically <50 for continuous outcomes or <100 for binary outcomes, though these cutoffs depend heavily on the tree as well).

The loo() result provides some diagnostics that can help confirm that the importance weights are stable and we can trust the downstream model comparison. In this case, we get a small number of Pareto k diagnostic values that are only "okay". If there are "bad" or "very bad" k values, that indicates that individual observations are having a strong effect on the model fit, and hence that the loo-based model comparison shouldn't be trusted. That scenario can happen more frequently without regularization of the noise ratio (i.e. when reg_noise = FALSE).

We only got a small number of non-good Pareto k values here (and none that were bad), so in this case the model comparison should be fine.

Let's look at the model comparison:

result$loo$comparison

The loo package prints model comparison objects by placing the model with the best leave-one-out predictive performance on the first row, in this case the PGLMM fit. The difference in expected log pointwise predictive density (ELPD) compared to the other model is shown in the first column (elpd_diff), along with a standard error of the difference in the second column (se_diff). Here the ELPD difference for the base model is both large in magnitude and clearly non-zero relative to its standard error. So here we would conclude that the phylogenetic component of the PGLMM clearly fits better than the base linear model. Given that we simulated the outcome based on the tree with $\sigma_{phylo} \neq 0$, that is the correct conclusion.

There's a lot more to be said about loo model comparison that is beyond the scope of this vignette. You can read more interpretation of loo results on the Cross-validation FAQ written by the loo authors.

Pathway random effects model

The pathway model uses a random effects model applied to the pathways in a single bug. It looks for pathways that differ substantially in abundance between two groups while accounting for the correlation with species abundance. Using the standard lme4 formula syntax, the model is:

$$ \textbf{log10(pwy\_abd)} \sim \textbf{log10(species\_abd) + (1|pwy) + (0 + group|pwy) + intercept}$$

The relevant functions to fit the model are anpan_pwy_ranef() and anpan_pwy_ranef_batch().

A few notes on this model: the log10(species_abd) term accounts for the fact that pathway abundance rises with overall species abundance, the (1|pwy) term gives each pathway its own baseline abundance, and the (0 + group|pwy) term gives each pathway its own group effect -- the group effects are what we're interested in.
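For readers who think in lme4 terms, the corresponding call would look roughly like the chunk below. This is an illustration only -- anpan_pwy_ranef() fits a Bayesian version of this model in Stan rather than calling lme4 -- and it uses the simulated bug_pwy_dat built in the next section:

lme4::lmer(log10_pwy_abd ~ log10_species_abd + (1 | pwy) + (0 + group | pwy),
           data = bug_pwy_dat)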

Simulate data

In this section we'll simulate pathway level data. For the sake of simplicity we'll simulate only 6 pathways, though many bugs usually have sufficient data for a dozen or more. Throughout this section we'll include "log10_" in the name of some variables because that's how they'll be calculated in practice (by taking logarithms of real abundance measurements), even though we're not calculating any logarithms here.

First we generate the log10 species abundance with a simple call to rnorm(n):

n = 200
n_pwy = 6

set.seed(123)

species_abd_df = data.table(log10_species_abd = rnorm(n),
                            group             = sample(c("case", "control"), 
                                                       size = n, 
                                                       replace = TRUE),
                            sample_id         = paste0("sample", 1:n))

Here we generate the true pathway intercepts and group effects. We apply a true group effect of +/-0.67 to the first two pathways (these will be the effects we try to detect) and 0 to everything else.

pwy_true_params = data.table(pwy              = paste0("pwy", 1:n_pwy),
                             pwy_intercept    = rnorm(n_pwy),
                             pwy_group_effect = c(c(0.67, -0.67),
                                                  rep(0, n_pwy - 2)))
pwy_true_params

Then we generate the log pathway abundances. Some things to note: the baseline log10 pathway abundance is -3, pathway abundance tracks species abundance (the 0.85 * log10_species_abd term), each pathway gets its own intercept, and the group effect applies only to the case samples:

pwy_grid = expand.grid(sample_id = paste0("sample", 1:n),
                       pwy       = paste0("pwy", 1:n_pwy)) |> 
  as.data.table()

bug_pwy_dat = pwy_grid[species_abd_df, on = "sample_id"][pwy_true_params, on = "pwy"]

bug_pwy_dat[, log10_pwy_abd := rnorm(n * n_pwy,
                                     mean = -3 + 
                                            0.85 * log10_species_abd +
                                            pwy_intercept + 
                                            (group == "case") * pwy_group_effect)]

Let's look at a plot of the simulated data:

bug_pwy_dat |> 
  ggplot(aes(log10_species_abd,
             log10_pwy_abd)) + 
  geom_point(aes(color = group), size = 1) + 
  facet_wrap("pwy") +
  scale_color_manual(values = c("case" = "red", 
                                "control" = "lightblue")) + 
  theme_bw()

Both effects are fairly visually obvious. We'll see how well the model can detect these effects in the next section.

Fit the pathway model

We first create a 0/1 indicator variable, then fit the model with anpan_pwy_ranef():

bug_pwy_dat[, group01 := as.numeric(group == "case")]

pwy_result = anpan_pwy_ranef(bug_pwy_dat = bug_pwy_dat,
                             group_ind = "group01")

Our result is a data frame with columns containing the fit and the summary. Let's look at the summary:

pwy_result$summary_df[[1]]

We can see in the hit column that pwy1 and pwy2 were both correctly called as hits.

Examine the results

There are two functions for plotting the results of the pathway model. plot_pwy_ranef_intervals() compactly plots 98% intervals for pathways (faceted by bug as appropriate). It will automatically highlight "hits" (as defined by the conditions in the help documentation of anpan_pwy_ranef()) in red.

plot_pwy_ranef_intervals(pwy_result)

The next plotting function is plot_pwy_ranef(). This creates a plot of the data overlaid with posterior draws of the estimated species:pwy relationship by pathway.

plot_pwy_ranef(bug_pwy_dat   = bug_pwy_dat,
               pwy_ranef_res = pwy_result,
               group_ind     = "group01",
               group_labels  = c("control", "case"))

You can see the clear visual separation between the two groups in the first two pathways, while the draws from the other four pathways tend to overlap.

Advanced topics

The devil is in the details.

Genome comparisons

Sometimes the gene model gives too many hits from a single bug. This can happen if two conditions are met: the bug divides into distinct sub-species (or contains outlier genomes) carrying large blocks of genes that are present/absent together, and that sub-species structure is unevenly distributed between cases and controls.

This will cause all of the genes characteristic of the sub-species to show up as significantly associated with the outcome. That is true in the statistical sense, but it's not a helpful result when looking for potentially causal genes.

The function plot_genome_intersections() can be used to plot overlaps of genes across isolate genomes in a unistrains file (which is used to define which genes belong to which species in the upstream gene profiling step). Multiple distinct blocks (as shown in the example below) can cause the problematic "too many hits" phenomenon described above.

unistrain_path = "/path/to/g__Faecalibacterium.s__Faecalibacterium_prausnitzii.unistrains.tsv.gz"
plot_genome_intersections(unistrain_path)
knitr::include_graphics("vignette_figs/fprau_ints.png")

Another method for detecting this phenomenon is to compute a phylogeny based on the gene profiles (see the [Estimating phylogenies from gene matrices] section), then look for a large-scale, species-wide pattern in the distribution of phylogenetic effects.

We leave the approach to handling this to the user. If the blocking is driven by a single outlier genome, it may make sense to remove the genes unique to that outlier. More complicated cases like the example shown above may require more ad hoc measures, such as separating out the separate genomes and running them through the gene model separately.

Splitting merged HUMAnN outputs

HUMAnN UniRef90 profiles may contain all species merged into a single huge tsv. anpan includes a small shell script that will split a file like this into the per-bug files that are needed to run the gene model. You can find this script on your system like so:

system.file('fsplit', package = 'anpan')

Depending on how you installed the package, you may need to chmod +x /path/to/fsplit in order to give it execute access.

In a shell (not R) you would then navigate to your big merged_humann_output.tsv and run

/path/to/fsplit merged_humann_output.tsv

This will:

  1. create an output folder called merged_humann_output_split/
  2. create an output file for each unique bug in the input
  3. add the original header line to each
  4. then scan the input with awk and append each line to the appropriate bug-specific file

It will probably take a few minutes for large input files. The split output folder can then be used as the bug_dir argument in anpan_batch().

The identifiers in the first column of the input must be formatted as gene_id|bug_id e.g. UniRef90_A0A009H1T1|g__Escherichia.s__Escherichia_coli

Lines corresponding to overall gene abundance (where the first column is just gene_id) or where the taxon is unclassified are discarded. The fsplit script isn't too complicated, so if you're reading this you can probably figure out how to tweak it if you need those lines.

Horseshoe gene model

The default model type for the gene model runs a simple glm for every gene in every bug, then FDR corrects the p-values after the fact. While conceptually and computationally simple, this approach has a few disadvantages: each gene is modeled in isolation (so correlated genes produce redundant hits), and the huge number of tests means the multiple-testing correction has to be severe.

By setting model_type = "horseshoe" in a call to anpan(), it is possible to use a "regularized horseshoe" model instead. This type of model is described in detail in @piironen_sparsity_2017. Basically, instead of running many small models then FDR correcting, it fits a single model per bug. This has a number of advantages: all genes are modeled jointly, the sparsity-inducing prior encodes the expectation that most genes have no effect, and there is no post-hoc multiple testing correction.

The horseshoe model does this by fitting a single regression model that includes all genes as predictors (as well as any additional covariates like age and gender), with a regularized horseshoe prior on the gene coefficients.

A full explanation of the regularized horseshoe is available in @piironen_sparsity_2017. The coefficients for the genes $\beta_j$ use the following model:

$$\beta_j \vert \lambda_j, \tau, c \sim N\left(0, \tau^2 \tilde{\lambda}^2_j \right)$$

$$\tilde{\lambda}^2_j = \frac{c^2 \lambda^2_j}{c^2 + \tau^2\lambda^2_j}$$

$$\lambda_j \sim \text{Cauchy}^+(0,1)$$

The Stan code used under the hood is adapted from code originally produced by the brms package (@burkner_brms).

anpan_res_hs = anpan(bug_file   = bug_path,
                     meta_file  = meta_path,
                     out_dir    = file.path(tempdir(), "anpan_horseshoe_output"),
                     covariates = NULL,
                     outcome    = "is_case",
                     model_type = "horseshoe")

The outputs of this model will look largely the same as model_type = "fastglm" except the per-gene statistics are posterior summaries of the gene coefficients (means and intervals) rather than p-values and q-values.

The main disadvantage of the horseshoe model is that it is much more computationally expensive.

Repeated measures models

If you have data with multiple samples per subject, you can run a subject-wise gene model using anpan_repeated_measures(), and a subject-wise PGLMM using anpan_subjectwise_pglmm(). These functions require providing a subject_sample_map data frame listing the subject_id that goes with each sample_id.
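The subject-sample map is just a two-column data frame with one row per sample. For example (column names from the description above, values hypothetical):

subject_sample_map = data.table(subject_id = c("subj_1", "subj_1", "subj_2", "subj_2"),
                                sample_id  = c("s1_wk0", "s1_wk12", "s2_wk0", "s2_wk12"))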

anpan_repeated_measures() performs the standard anpan filtering on all samples, then uses the subject-sample map to compute the proportion of samples with the bug. This gives a gene proportion matrix (instead of a presence/absence matrix) which is then passed to anpan_batch(..., filtering_method = "none", discretize_inputs = FALSE).

If you have a dataset where multiple samples come from the same individual, running a PGLMM on all the samples without accounting for this will return a strong, spurious signal because samples from the same individual will (usually) be right next to each other on the tree and all show the same outcome value. anpan_subjectwise_pglmm() aggregates samples by subject to derive a subject-wise correlation matrix, then runs a PGLMM on that.

Both of these functions aggregate covariate/outcome metadata by subject when applicable. Numerical covariates are averaged, while character, logical, and factor covariates are tabulated, with the most common value selected as the representative. Ties are broken by factor level order, which you can control by re-ordering the factor levels.

Estimating phylogenies from gene matrices

If you don't have a phylogeny for your data, you can get a very rough estimate in the following manner:

pc_scores = prcomp(gene_pres_abs_matrix)$x # gene_pres_abs_matrix: your samples-by-genes matrix

gene_tree = pc_scores |> 
  stats::dist() |> 
  ape::nj() # or ape::fastme.bal() or ape::bionj()

This process is sub-optimal in a number of ways (the gene matrix is low signal, Jaccard dissimilarity isn't the most thoughtful thing to use here, the correlation matrices can end up close to the identity, tree distance is based on functional profile similarity instead of real evolutionary relatedness, etc), but it will give you something that you can pass to anpan_pglmm(). Identifying which genes explain which branching points is left as an exercise for the reader.

The code chunk above won't run out of the box; you'll need to plug in your own gene presence/absence matrix.

If your gene measurements are noisy, it can also be helpful to run some sort of dimension reduction (e.g. PCA) on the input to the dist() function, in case the noise in your input matrix is overwhelming phylogenetic differences. The number of principal components to use is debatable, but choosing based on how the proportion of variance explained dies off can be reasonable. One might keep components up to the first eigenvalue that is less than one tenth of the first eigenvalue.
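That heuristic looks roughly like this in code (a sketch; gene_pres_abs_matrix is again a placeholder for your own samples-by-genes matrix):

pca_res     = prcomp(gene_pres_abs_matrix)
eigenvalues = pca_res$sdev^2

# keep components up to (and including) the first eigenvalue that drops below
# a tenth of the largest one
n_keep    = which(eigenvalues < eigenvalues[1] / 10)[1]
pc_scores = pca_res$x[, 1:n_keep]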

That being said, it's preferable to estimate phylogenies from the raw data themselves using any of the many tools available in the literature (e.g. StrainPhlAn) rather than doing something ad hoc from the gene matrices. There is a rich literature on the topic of phylogeny estimation, but the topic falls outside the scope of anpan. The preceding code block is mainly intended as a quick-and-dirty, "good enough" method to get a quick look at how relatedness affects the outcome overall.

Offset variables

anpan_pglmm() is able to take a single variable from the metadata and use it as an "offset". See ?offset() for additional description. Basically, an offset is a variable to be included as a covariate, but the coefficient is known to be exactly 1. The coefficient is not estimated as part of the model. If a given observation has an offset of +0.5, the linear predictor for that observation goes up by +0.5.

Categorical offsets can also be used with binary outcomes, in which case the offset value is taken as the logit(proportion) of the outcome variable by category. In data.table it's roughly model_input[, offset_value := qlogis(mean(outcome)), by = offset] (qlogis() being the base R logit function). For example an observation from a study with 85% cases will start off with a linear predictor of qlogis(0.85) = 1.73 before the terms for the global intercept & other covariates get added on.
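Here's that computation on a toy example (the column names follow the pseudocode above; the data is made up):

# 200 samples from two hypothetical studies: study_A is 85% cases, study_B is 30% cases
model_input = data.table(outcome = rep(c(TRUE, FALSE, TRUE, FALSE),
                                       times = c(85, 15, 30, 70)),
                         offset  = rep(c("study_A", "study_B"), each = 100))

model_input[, offset_value := qlogis(mean(outcome)), by = offset]

unique(model_input[, .(offset, offset_value)]) # study_A: 1.73, study_B: -0.85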

We originally developed this functionality for study_id variables. The proportion of cases in a given study isn't really an uncertain quantity, it's something set by the study designers. So if you want to include one study with mostly cases and another with mostly controls, it makes sense to build that difference into the model without adding additional parameters to estimate. You can set offset = "study_id" in the call to anpan_pglmm() and it will estimate the proportions and display them in a message.

Offsets show up on the color bars on plots after the covariates. They're always on a continuous scale because the offset value is always a continuous number.

Picking out clade members

This section not done

Tweaking patchwork plots

Most of the theme parameters used in anpan's plotting functions are set to reasonable defaults, but sometimes real data makes the plots look a bit wonky. Take the PGLMM posterior plot we generated in the [PGLMMs] section:

p

The labels are a bit too small. Another common problem with these plots is that for large datasets the tree gets too crowded to see the individual tips.

These can mostly be addressed after the fact with the usual means of altering ggplot. However, many of the plots produced by anpan are composite figures produced by the patchwork package. This adds an additional step to tweaking the plots and making them publication ready. patchwork plots can essentially be accessed as a list with the sub-plots as elements. So the code block below shows three common modifications:

# Thinner lines on the tree
p[[1]]$layers[[1]]$aes_params$size = .2

# Make the sample labels bigger
p[[3]]$theme$axis.text.x = element_text(size = 4.5, angle = 90, hjust = 0)

# Make the points bigger
p[[1]]$layers[[2]]$aes_params$size = 3

p

The details of finding which theme/aes parameters that are needed for additional tweaks are left as an exercise for the reader, but generally typing a dollar sign on a sub-plot p[[2]]$, then looking at the names suggested by RStudio auto-complete can get you pretty far.

You can also use the sub-plot accessor to pull out sub-plots that can then be rearranged in new ways with patchwork::plot_layout(). We have occasionally made huge mega plots with something like the code chunk below.

outcome_tree + 
effect_boxplots + 
gene_heatmap + 
half_correlation_matrix + 
plot_layout(ncol = 1)

Editing color bar palettes

The palettes used in the color bar can be edited by setting the appropriate scale on the patchwork object. Below I give an example where I change the colors used for gender on the color bar from red/blue to purple/green. With some slight edits you could change the color bar on tree plots as well. This GitHub exchange may also be helpful if you're looking for more information.

p = plot_results(res         = res,
                 covariates  = c("age", "gender", "study_name"),
                 outcome     = "crc",
                 model_input = mi)


color_bar_i = 1 # the color bar subplot in the patchwork
gender_i    = 4 # the scale used for gender on the plot

p[[color_bar_i]] # this should print the color bar only
p[[color_bar_i]]$scales$scales # examine the list of scales to find the one you need to set gender_i

aes_name = p[[color_bar_i]]$scales$scales[[gender_i]]$aesthetics
guide = p[[color_bar_i]]$scales$scales[[gender_i]]$guide
guide$available_aes = aes_name

p[[color_bar_i]]$scales$scales[[gender_i]] = scale_fill_discrete(aesthetics = aes_name,
                                                                 guide = guide,
                                                                 type = c("purple", "green")) 

p[[color_bar_i]] # Now gender is purple/green

Configuring parallelization and progress reporting

anpan uses the packages furrr and progressr to parallelize and show progress on long calculations, respectively. That means that internally some anpan calculations are written conceptually as "this code might run in parallel" and "this code might display progress". The user can configure the parallelization plan and progress reporting to their preference.

furrr

See the details on ?future::plan(), but the most common way to parallelize is via multisession:

plan(multisession, workers = 4)

This will start up 4 separate R sessions evaluating the appropriate parallelized code.

The exception to this is anpan_pglmm_batch(). This function parallelizes both over bugs and loo calculations for each bug, so typically the fastest way to run is to use Stan's parallel_chains argument to parallelize the MCMC and a so-called "nested future topology" to parallelize the loo:

plan(list(sequential, tweak(multisession, workers = 6)))
anpan_pglmm_batch(..., 
                  parallel_chains = 4)

This is sequential over bugs but parallelizes the 4 MCMC chains for each model and parallelizes the loo calculations over iterations. However depending on the ratio of MCMC time to loo time (and the amount of memory you have available), it might be faster to use a non-nested future topology and simply parallelize over bugs.

There are many future configuration methods that I haven't tried (e.g. future.batchtools::batchtools_slurm()), so let me know how it goes if you try out something exotic.

progressr

Turning on all global progress reporting is done with this command:

handlers(global = TRUE)

From there, you can set up additional handlers to customize the progress reporting. My favorite is:

handlers(handler_progress(format = "[:bar] :current/:total (:percent) in :elapsed ETA: :eta"),
         handler_beepr(initiate = NA,
                       update   = NA,
                       finish   = 1))

There are also some that are more exciting:

# let's do the fork in the garbage disposal!
handlers(handler_beepr(initiate      = NA,
                       update        = 1,
                       finish        = NA,
                       intrusiveness = 1)) 

Once you've set your preferred handler(s), you can test the reporting with this command:

progressr::slow_sum(1:5)

Other anpan functions

This vignette doesn't overview the usage of every single anpan function (it's already too long). Here are the other user-accessible anpan functions. They should each have a help page, and you can of course ask additional questions on the bioBakery help forum.

getNamespaceExports("anpan") |> sort()

References

Session Info

sessionInfo()

