MDAT: microbe directory association tester


It is often of interest to evaluate the characteristics of the microbes that are differentially abundant in a system. The Microbe Directory is a valuable resource that records growth preferences, antimicrobial susceptibility, Gram staining and other characteristics for a large collection of microbes. It is browsable at the level of individual species, and the data are made available on GitHub for more systematic bioinformatics analyses. The package described here uses the data available on GitHub to perform enrichment analyses based on the characteristics present in The Microbe Directory.

The Microbe Directory contains both categorical (gram_stain, microbiome_location, antimicrobial_susceptibility, extreme_environment, biofilm_forming, animal_pathogen, spore_forming, plant_pathogen, pathogenicity) and quantitative (optimal_ph, optimal_temperature) characteristics and these are treated differently in terms of the statistics used to define enrichment. MDAT takes as input two R vectors:

1. A list of bacterial species of interest, e.g. those that are more abundant in a given setting (test_set)
2. A list of bacterial species that represent the comparator, e.g. those that do not change in the setting of interest (background_set)
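As a minimal sketch, the two inputs can be built as plain character vectors, one lineage string per species, in the semicolon-separated form shown in the example list in this README. The split between test and background here is arbitrary, and `check_lineage()` is a hypothetical helper written for this illustration; it is not part of MDAT.

```r
# Illustrative inputs; the assignment of species to each set is arbitrary.
test_set <- c(
  "pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminococcus;sRuminococcus_bromii",
  "pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_uniformis"
)
background_set <- c(
  "pBacteroidetes;cBacteroidia;oBacteroidales;fRikenellaceae;gAlistipes;sAlistipes_shahii",
  "pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_fragilis"
)

# check_lineage() is a hypothetical helper, not an MDAT function.
# It returns TRUE for names with six semicolon-separated ranks
# whose first letters are p, c, o, f, g, s in that order.
check_lineage <- function(x) {
  ranks <- strsplit(x, ";", fixed = TRUE)
  vapply(ranks, function(r) {
    length(r) == 6 && identical(substr(r, 1, 1), c("p", "c", "o", "f", "g", "s"))
  }, logical(1))
}

lineage_ok <- check_lineage(c(test_set, background_set))
```

Running a check like this before the analysis can catch truncated or reordered lineage strings early.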

These species lists are in the form shown below:

    pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminococcus;sRuminococcus_bromii
    pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_uniformis
    pBacteroidetes;cBacteroidia;oBacteroidales;fRikenellaceae;gAlistipes;sAlistipes_shahii
    pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gSubdoligranulum;sSubdoligranulum_unclassified
    pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_thetaiotaomicron
    pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_fragilis
    pBacteroidetes;cBacteroidia;oBacteroidales;fPorphyromonadaceae;gParabacteroides;sParabacteroides_johnsonii
    pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_finegoldii
    pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_vulgatus
    pBacteroidetes;cBacteroidia;oBacteroidales;fRikenellaceae;gAlistipes;sAlistipes_finegoldii
    ...

where each element in the list is a species name that carries all of its taxonomic information up to phylum level (this is to be compatible with The Microbe Directory taxonomic information).

Taxonomic naming schemes often differ between a user's own analysis and The Microbe Directory database. The most reliable way to deal with this is to go through the two lists passed to the function and check that the names are consistent with The Microbe Directory (output of metaphlan). Alternatively, MDAT can attempt a rough reconciliation of names (use guess_names=TRUE). It assumes that the names in the list and in The Microbe Directory agree up to family level and share the same species suffix (e.g. the coli of Escherichia coli), and then reconciles any genus-level discrepancies, which is where we have seen most of the mismatches. This is not ideal and may leave a number of species unannotated even though they are present in The Microbe Directory.
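The matching rule described above can be sketched as follows. This is an illustration of the idea only, not MDAT's implementation: `split_name()` and `guess_match()` are hypothetical helpers, and the genus `Oldgenus` is a made-up name standing in for a genus-level discrepancy.

```r
# split_name() and guess_match() are hypothetical helpers, not MDAT functions.

# Split one lineage string into its phylum-to-family prefix and species suffix.
split_name <- function(x) {
  ranks <- strsplit(x, ";", fixed = TRUE)[[1]]
  list(
    to_family = paste(ranks[1:4], collapse = ";"),  # phylum, class, order, family
    epithet   = sub("^.*_", "", ranks[6])           # text after the last underscore
  )
}

# Return the single directory name that shares the query's family-level prefix
# and species suffix, or NA when there is no match or the match is ambiguous.
guess_match <- function(query, directory_names) {
  q <- split_name(query)
  hits <- Filter(function(d) {
    s <- split_name(d)
    identical(s$to_family, q$to_family) && identical(s$epithet, q$epithet)
  }, directory_names)
  if (length(hits) == 1) hits[[1]] else NA_character_
}

directory_names <- c(
  "pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gRuminococcus;sRuminococcus_bromii",
  "pBacteroidetes;cBacteroidia;oBacteroidales;fBacteroidaceae;gBacteroides;sBacteroides_fragilis"
)

# "Oldgenus" is a made-up genus used to simulate a genus-level mismatch.
matched <- guess_match(
  "pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gOldgenus;sOldgenus_bromii",
  directory_names
)
unmatched <- guess_match(
  "pFirmicutes;cClostridia;oClostridiales;fRuminococcaceae;gOldgenus;sOldgenus_novel",
  directory_names
)
```

In this sketch the first query resolves to the Ruminococcus_bromii entry, while the second stays unannotated because no entry shares its species suffix. Returning NA for ambiguous matches is a design choice made here to avoid silently picking the wrong species.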

The steps of the analysis are straightforward: the test_set is compared to the background set in terms of annotation. For categorical variables, a 2 x n contingency table is built (n = the number of levels of the categorical variable) and tested using Fisher's exact test. For quantitative variables, the values for the test_set are compared to the values of the background set using a Wilcoxon rank sum test. Plots are produced for visualising the results.
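The two kinds of test can be sketched with base R's `fisher.test()` and `wilcox.test()`. The annotations below are made up for illustration and are not taken from The Microbe Directory; how MDAT builds its tables internally is not shown in this README, so treat this as an assumption-laden sketch of the statistics rather than the package's code.

```r
# Made-up annotations for illustration only; not Microbe Directory data.
annot <- data.frame(
  set = rep(c("test", "background"), times = c(8, 10)),
  gram_stain = c(rep("Negative", 7), "Positive",
                 rep("Negative", 3), rep("Positive", 7)),
  optimal_temperature = c(37, 37, 36, 38, 37, 39, 37, 36,
                          30, 28, 25, 30, 37, 26, 30, 28, 22, 30)
)

# Categorical variable: 2 x n contingency table (set by level),
# followed by Fisher's exact test.
tab <- table(annot$set, annot$gram_stain)
fisher_res <- fisher.test(tab)

# Quantitative variable: Wilcoxon rank sum test of test vs background values.
# exact = FALSE requests the normal approximation, which avoids the
# warning R raises when an exact p-value is requested for tied values.
wilcox_res <- wilcox.test(
  annot$optimal_temperature[annot$set == "test"],
  annot$optimal_temperature[annot$set == "background"],
  exact = FALSE
)

fisher_res$p.value
wilcox_res$p.value
```

For a variable with more than two levels, `table()` yields a 2 x n table and `fisher.test()` accepts it directly.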

The package can be installed using the devtools package.

    install.packages("devtools")
    devtools::install_github("nickilott/MDAT")

To perform enrichment analysis with MDAT simply provide the test_set and the background_set vectors:

    library(MDAT)
    results.mdat <- run_associations(test_set=test_set, background_set=background_set, guess_names=FALSE)

This will return a list where the first element is a dataframe with the results of the statistical analyses and the second is a grid of plots. To access each element you can type:

    get_results(results.mdat)

This prints the results table to the console:

|variable                     |test              | statistic|    pvalue|    qvalue|
|:----------------------------|:-----------------|---------:|---------:|---------:|
|gram_stain                   |Fisher            |       Inf| 0.0055231| 0.0439658|
|microbiome_location          |Fisher            |  7.149923| 0.0679340| 0.2264466|
|antimicrobial_susceptibility |Fisher            |  0.000000| 1.0000000| 1.0000000|
|extreme_environment          |Fisher            |       Inf| 1.0000000| 1.0000000|
|biofilm_forming              |Fisher            |  0.000000| 1.0000000| 1.0000000|
|animal_pathogen              |Fisher            |  0.000000| 0.4857143| 0.8095238|
|plant_pathogen               |Fisher            |  0.000000| 0.1000000| 0.2500000|
|pathogenicity                |Fisher            |  0.000000| 1.0000000| 1.0000000|
|optimal_ph                   |Wilcoxon rank sum | 11.500000| 0.3528967| 0.7057934|
|optimal_temperature          |Wilcoxon rank sum | 41.500000| 0.0087932| 0.0439658|
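This README does not say how the qvalue column is computed. The values in the table above are numerically consistent with a Benjamini-Hochberg adjustment across the ten tests, which can be checked with base R's `p.adjust()`. Whether MDAT calls `p.adjust()` internally is an assumption; the sketch only shows that the arithmetic agrees.

```r
# p-values copied from the results table above.
pvalue <- c(
  gram_stain = 0.0055231, microbiome_location = 0.0679340,
  antimicrobial_susceptibility = 1, extreme_environment = 1,
  biofilm_forming = 1, animal_pathogen = 0.4857143,
  plant_pathogen = 0.1, pathogenicity = 1,
  optimal_ph = 0.3528967, optimal_temperature = 0.0087932
)

# Benjamini-Hochberg false discovery rate adjustment over all ten tests.
qvalue_bh <- p.adjust(pvalue, method = "BH")

# q-values copied from the results table above, in the same order.
qvalue_table <- c(0.0439658, 0.2264466, 1, 1, 1,
                  0.8095238, 0.25, 1, 0.7057934, 0.0439658)

# Largest absolute difference between the recomputed and tabulated q-values.
max_diff <- max(abs(unname(qvalue_bh) - qvalue_table))
```

The remaining difference is at the level of rounding in the printed p-values.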

To display the plots, type:

    plot_results(results.mdat)

This displays the results for each variable analysed (stacked bars for categorical variables and boxplots with jittered points for quantitative variables).
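For readers who want to reproduce this style of figure outside the package, the sketch below draws one stacked bar chart and one boxplot with jittered points using base R graphics. The data are made up for illustration, and the choice of base graphics is an assumption; the plotting code MDAT itself uses is not described in this README.

```r
# Made-up data for illustration only; not Microbe Directory annotations.
set <- rep(c("test", "background"), times = c(6, 6))
gram_stain <- c("Negative", "Negative", "Negative", "Negative", "Positive", "Negative",
                "Positive", "Positive", "Negative", "Positive", "Positive", "Negative")
optimal_ph <- c(6.8, 7.0, 7.2, 6.5, 7.0, 6.9, 7.4, 7.1, 6.0, 7.5, 7.3, 7.0)

# Two panels side by side; the previous settings are restored at the end.
op <- par(mfrow = c(1, 2))

# Categorical variable: stacked bars of the proportion of each level per set.
props <- prop.table(table(gram_stain, set), margin = 2)
barplot(props, legend.text = TRUE, ylab = "Proportion of species", main = "gram_stain")

# Quantitative variable: boxplot per set with jittered points overlaid.
boxplot(optimal_ph ~ set, ylab = "optimal_ph", main = "optimal_ph", outline = FALSE)
stripchart(optimal_ph ~ set, vertical = TRUE, method = "jitter", add = TRUE, pch = 16)

par(op)
```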
