knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
polyafit
is an R package that provides tools to fit count data to a
Dirichlet-multinomial distribution.
This distribution is sometimes called the multivariate Pólya distribution, from which the package name is derived. The package was created specifically for 16S rRNA marker gene sequence data from the human microbiome, but in principle could be applied to other types of data.
If you use this software in your research, please cite:
Charlson ES, Bittinger K, Chen J, Diamond JM, Li H, Collman RG, Bushman FD. Assessing bacterial populations in the lung by replicate analysis of samples from the upper and lower respiratory tracts. PLoS One. 2012;7(9):e42786. doi: 10.1371/journal.pone.0042786. Epub 2012 Sep 6. PMID: 22970118; PMCID: PMC3435383.
You can install polyafit from GitHub with:
# install.packages("devtools") devtools::install_github("kylebittinger/polyafit")
devtools::load_all()
library(tidyverse)
We will use a built-in example dataset, polyafit_example_data
, to demonstrate
how the package works. The example data is in the form of a matrix, with three
observations and 50 features. In this example, each observation corresponds to
a microbiome sample and each feature corresponds to a bacterial species. The
numbers in the matrix represent the number of times each species was observed
in the sample. Let's print out the example data to see how it looks.
polyafit_example_data
Looking down each column, we can see that the number of counts per species is
approximately the same for the three samples, except for species_4
. This
species seems to be over-represented in sample_3
(150 counts) relative to the
other samples (8 and 9 counts).
Our goals are to measure the overall correspondence in species abundances between the samples, and then to use this as a yardstick to determine if a species is over-represented in one of the samples.
To start, we will select the first two samples and run a fit. When the example data set was generated, these samples were drawn from the same multinomial distribution. Therefore, we should not detect any features that are enriched in one sample relative to the other. Furthermore, we expect to observe a high degree of correspondence between the relative abundances, or in other words, a low degree of overdispersion relative to the multinomial distribution.
We select the two samples by name, giving a matrix with two rows and 50 columns.
example12 <- polyafit_example_data[c("sample_1", "sample_2"),]
We find the best-fit parameters of the distribution using the pfit()
function.
fit12 <- example12 %>% pfit()
We can call plot()
on this object to see a nice plot of the data.
fit12 %>% plot()
The feature_enrichment()
function gives p-values and parameter estimates for
enrichment of each feature relative to the overall distribution. For our first
example, all p-values are greater than 0.05.
fit12 %>% feature_enrichment() %>% filter(p.value < 0.05)
The fitted object comes with an overdispersion parameter, theta. Theta tells us how much "extra variance" the estimated distribution has, beyond what we expect from a multinomial distribution. Because the data here were generated directly from a multinomial distribution, the estimated overdispersion parameter should be very small.
fit12$theta
In sample 3, we've adjusted one of the species to increase the number of
counts. We should see this species, species_4
, to be enriched in our
analysis.
example13 <- polyafit_example_data[c("sample_1", "sample_3"),]
We'll go through the same sequence of steps to run the fit, plot the data, and test for feature enrichment.
fit13 <- example13 %>% pfit()
fit13 %>% plot()
Here, we can see that the abundance of species_4
is higher than expected,
based on the overall distribution.
fit13 %>% feature_enrichment() %>% filter(p.value < 0.05)
Now, we re-calculate the overdispersion parameter, theta. It has increased by four orders of magnitude.
fit13$theta
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.