knitr::opts_chunk$set(fig.width=12, fig.height=7,
                      echo=TRUE, warning=FALSE, message=FALSE,
                      tidy=TRUE, collapse = TRUE, comment = "#>")
set.seed(12345)
library(panstripe)
library(ggplot2)
library(patchwork)

Simulate a pangenome

We first simulate a pangenome presence/absence matrix and corresponding core genome phylogeny.

sim <- simulate_pan(rate = 1e-3)

Fit gene gain/loss model

We can now estimate the expected number of gene gain and loss events along each branch of the phylogeny and fit a tweedie regression model.

fit <- panstripe(sim$pa, sim$tree, quiet=TRUE)

Investigate results

We can test whether the expected gene gain/loss rate is significantly different from a model consisting of just errors by looking at the p-value of the fit.

fit$summary

We can also generate a plot of the resulting fit with confidence intervals.

plot_pangenome_params(fit)

It is useful to compare this with a plot of the data. Panstripe includes the option to generate a plot similar to that created by TempEst.

plot_pangenome_cumulative(fit)

We can use the patchwork package to combined these plots

plot_pangenome_params(fit) + plot_pangenome_cumulative(fit) + patchwork::plot_layout(nrow = 1)

Comparing pangenomes

For simplicity we consider a comparison with a simulated pangenome with very little accessory variation. This could reflect a species with a very stable set of genes such as M. tuberculosis.

sim_slow <- simulate_pan(rate = 1e-4)

Again we fit a Tweedie regression model after first estimating the gene gain/loss events per branch.

fit_slow <- panstripe(sim_slow$pa, sim_slow$tree)
fit_slow$summary

We can now test to see whether there is a significant difference between the dynamics of these two pangenomes

comp <- compare_pangenomes(fit, fit_slow, nboot = 100)
comp$summary

We can also plot the difference between the fits.

plot_pangenome_params(list(
  fast=fit, 
  slow=fit_slow)) +
  plot_pangenome_cumulative(list(
    fast=fit, 
    slow=fit_slow))

Investigate differences in the average size of gene gain/loss events

Sometimes we would like to know if it is the size rather than the rate of recombination events that is driving an apparent different in the rate of gene gain/loss. As we are fitting a tweedie regression model it is possible to tease apart the driving force of apparent accessory rate differences.

Simulate

First lets simulate and fit two pangenomes. One with a larger average recombination size (in number of genes).

sim_small_rec <- simulate_pan(rate = 1e-3, mean_trans_size = 3)
sim_large_rec <- simulate_pan(rate = 1e-3, mean_trans_size = 5)

res_small_rec <- panstripe(sim_small_rec$pa, sim_small_rec$tree)
res_large_rec <- panstripe(sim_large_rec$pa, sim_large_rec$tree)

We can first test whether the overall rate of gene gain/loss is different between the two pangenomes.

comp <- compare_pangenomes(res_small_rec, res_large_rec)
comp$summary

As expected given the large difference in the average size of gene gain/loss events there is a significant difference in the total rate between the two pangenomes.

We can now look closer into what is driving this difference.

plot_pangenome_params(list(
  large=res_large_rec, 
  small=res_small_rec))

Appendix

sessionInfo()


gtonkinhill/panstripe documentation built on Feb. 27, 2025, 9:01 p.m.