knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

polyafit

polyafit is an R package that provides tools to fit count data to a Dirichlet-multinomial distribution.

This distribution is sometimes called the multivariate Pólya distribution, from which the package name is derived. The package was created specifically for 16S rRNA marker gene sequence data from the human microbiome, but in principle could be applied to other types of data.

If you use this software in your research, please cite:

Charlson ES, Bittinger K, Chen J, Diamond JM, Li H, Collman RG, Bushman FD. Assessing bacterial populations in the lung by replicate analysis of samples from the upper and lower respiratory tracts. PLoS One. 2012;7(9):e42786. doi: 10.1371/journal.pone.0042786. Epub 2012 Sep 6. PMID: 22970118; PMCID: PMC3435383.

Installation

You can install polyafit from GitHub with:

# install.packages("devtools")
devtools::install_github("kylebittinger/polyafit")

Example

devtools::load_all()
library(tidyverse)

We will use a built-in example dataset, polyafit_example_data, to demonstrate how the package works. The example data is in the form of a matrix, with three observations and 50 features. In this example, each observation corresponds to a microbiome sample and each feature corresponds to a bacterial species. The numbers in the matrix represent the number of times each species was observed in the sample. Let's print out the example data to see how it looks.

polyafit_example_data

Looking down each column, we can see that the number of counts per species is approximately the same for the three samples, except for species_4. This species seems to be over-represented in sample_3 (150 counts) relative to the other samples (8 and 9 counts).

Our goals are to measure the overall correspondence in species abundances between the samples, and then to use this as a yardstick to determine if a species is over-represented in one of the samples.

Analysis: no features enriched

To start, we will select the first two samples and run a fit. When the example data set was generated, these samples were drawn from the same multinomial distribution. Therefore, we should not detect any features that are enriched in one sample relative to the other. Furthermore, we expect to observe a high degree of correspondence between the relative abundances, or in other words, a low degree of overdispersion relative to the multinomial distribution.

We select the two samples by name, giving a matrix with two rows and 50 columns.

example12 <- polyafit_example_data[c("sample_1", "sample_2"),]

We find the best-fit parameters of the distribution using the pfit() function.

fit12 <- example12 %>%
  pfit()

We can call plot() on this object to see a nice plot of the data.

fit12 %>%
  plot()

The feature_enrichment() function gives p-values and parameter estimates for enrichment of each feature relative to the overall distribution. For our first example, all p-values are greater than 0.05.

fit12 %>%
  feature_enrichment() %>%
  filter(p.value < 0.05)

The fitted object comes with an overdispersion parameter, theta. Theta tells us how much "extra variance" the estimated distribution has, beyond what we expect from a multinomial distribution. Because the data here were generated directly from a multinomial distribution, the estimated overdispersion parameter should be very small.

fit12$theta

Analysis: one outlier

In sample 3, we've adjusted one of the species to increase the number of counts. We should see this species, species_4, to be enriched in our analysis.

example13 <- polyafit_example_data[c("sample_1", "sample_3"),]

We'll go through the same sequence of steps to run the fit, plot the data, and test for feature enrichment.

fit13 <- example13 %>%
  pfit()
fit13 %>%
  plot()

Here, we can see that the abundance of species_4 is higher than expected, based on the overall distribution.

fit13 %>%
  feature_enrichment() %>%
  filter(p.value < 0.05)

Now, we re-calculate the overdispersion parameter, theta. It has increased by four orders of magnitude.

fit13$theta


kylebittinger/polyafit documentation built on Jan. 11, 2023, 8:53 a.m.