knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "doc/"
)

PIME: Prevalence Interval for Microbiome Evaluation

PIME removes the within group variation found in metataxonomic surveys (16S rRNA datasets) by capturing only biological differences at high samples prevalence levels.

First steps: installing 'phyloseq' and creating a 'phyloseq' object

To install phyloseq start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("phyloseq")

Phyloseq aggregates, mainly, an otu_table (with OTUs abundances), tax_table (taxonomy), and a sample_data (sample information) objects.

Importing output from DADA2

seqtab = readRDS("path_to_file/sequence_table_final.rds")
tax= readRDS("path_to_file/tax_final.rds")
map <- "path_to_file/sample_data.txt"
ps <- phyloseq(otu_table(seqtab, taxa_are_rows=FALSE), 
               tax_table(taxa))
sample_metadata = import_qiime_sample_data(map)
physeq =merge_phyloseq(ps, sample_metadata)

Importing otu_table as a biom file with taxonomy

jsonbiomfile = "path_to_file/otu_table_fix.biom"
mapfile = "path_to_file/v35_map_uniquebyPSN.txt"
biom = import_biom(jsonbiomfile, mapfile, parseFunction=parse_taxonomy_greengenes)
map = import_qiime_sample_data(mapfile)
input = merge_phyloseq(biom,map)

For further data input methods and examples go to https://joey711.github.io/phyloseq/import-data.html

How to install PIME package

To install PIME first install the devtools package. Then load the library(devtools) and run install_github using the following commands.

#install.packages("devtools")
library(devtools)
install_github("microEcology/pime")

PIME uses a Phyloseq object as input. A description of the phyloseq object and a tutorial on how to create this file in R using OTU tables in many different formats is detailed into the Phyloseq website https://joey711.github.io/phyloseq/

Step-by-step example

The first step in PIME is to define if the microbial community presents a high relative abundance of taxa with low prevalence, which is considered as noise in PIME analysis. This is calculated by random forests analysis. In this example we run PIME using the restroom dataset (https://doi.org/10.1007%2Fs10482-017-0976-6) against the metadata variable called Environment (a variable with two categories: men’s and women’s restroom).

Prediction using random forests on full dataset. Results in Out of Bag error rate. The input file was rarefied at 500 sequences for the purpose of this example (speed up the analysis). Using a rarefied dataset is recommended at this step.

library(pime)
data("restroom")
pime.oob.error(restroom, "Environment")

The OOB error rate <=0.1, indicated the dataset present large differences, and pime might not remove much of the noise. Higher OOB error rate indicates that the next functions should be run to find the best prevalence interval for the dataset.

This function takes two parameters: The phyloseq object (restroom) and the predictor variable (Environment).

Split the dataset by predictor variable

Two parameters are required to run this function: The phyloseq object (restroom) and the predictor variable (Environment).

per_variable_obj= pime.split.by.variable(restroom, "Environment")
per_variable_obj

Calculate the highest possible prevalence intervals

This function calculates prevalence for different intervals by increments of 5. The input file is the output from the pime.split.by.variable(per_variable_obj)

prevalences=pime.prevalence(per_variable_obj)
head(prevalences)

Calculate the best prevalence interval for the dataset

This function will return a table with Out of Bag error from random forests for each prevalence interval. The number of taxa and the number of remaining sequences for each prevalence interval are also computed. The best prevalence interval value provides the clearest separation of communities while still including a majority of the taxa in the analysis. If true differences are present. It will be represented by the first interval in which the OOB error rate is zero or close to zero. The input file is the list of prevalences generated by the pime.prevalence (prevalences) and the predictor variable ("Environment"").

set.seed(42)
best.prev=pime.best.prevalence(prevalences, "Environment")

In addition, it also returns the results from the random forests classification for each prevalence level. It includes SequenceID (OTU/ASV), Mean Decrease Accurracy (MDA) for each sample group, that is how much that SequenceID was important for classification of that group. The Mean Decrease Impurity (Gini Importance) and taxonomy are also included.

To get the table with OTU/ASV importance of the chosen prevalence interval. Pime keeps only the top 30 OTUs/ASVs, whith highest MDA.

imp65=best.prev$`Importance`$`Prevalence 65`
head(knitr::kable(imp65, format="markdown"))

#To get the table with OOB error results.
#best.prev$`OOB error`

Within this dataset the best prevalence interval was 65%

To obtain the phyloseq object at this cutoff use the following command.

prevalence.65 = prevalences$`65`
prevalence.65

To obtain the confusion matrix from random forests classification use the folowing:

best.prev$Confusion$`Prevalence 65`

Estimating prediction error

To estimate error in prediction, we will use pime.error.prediction() to randomly assign treatments to samples and run random forests classification on each prevalence interval. The function returns a boxplot and a table with results of each classification error. For the purposes of this example we are running only 10 randomizations for saving time but we recommend at least 100 randomizations to obtain reliable results.

randomized=pime.error.prediction(restroom, "Environment", bootstrap = 10, parallel = TRUE, max.prev = 95)
randomized$Plot
randomized$'Table results'

It is also possible to estimate the variation of OOB error with each prevalence interval filtering. This is done by running the random forests classification for n times, determined by the user. This function will return a boxplot figure and a table for each classification error.

replicated.oob.error= pime.oob.replicate(prevalences, "Environment", bootstrap = 10, parallel = TRUE)

Getting Help

Please contact us if you need any help: contact@brmicrobiome.org

PIME Team

Luiz F. W. Roesch (Universidade Federal do Pampa - Brazil)

Priscila T. Dobbler (Universidade Federal do Pampa - Brazil)

Victor S. Pylro (Universidade Federal de Lavras - Brazil)

Bryan Kolaczkowski (University of Florida - United States of America)

Jennifer C. Drew (University of Florida - United States of America)

Eric W. Triplett (University of Florida - United States of America)

Citation

Roesch et al. (2018), PIME: including the concept of prevalence for uncovering differences in microbiome noised data. Frontiers, submitted.



microEcology/pime documentation built on Nov. 13, 2019, 11:16 p.m.