1.0 Introduction to the Marker Gene Data Profiling Module

Typically, the first question of microbiome data analysis is to determine if there are any patterns within the data. Such exploratory analyses are conducted via commonly used ecological techniques including alpha and beta diversity. Multivariate statistics such as PERMANOVA and ANOSIM can be employed to assess the robustness of such patterns in the microbial community composition. Once global differences are identified, the next logical step is to identify which taxa are responsible for this difference. Identification of important taxa is assessed with univariate statistics. Univariate statistics include classical and model-based approaches, which vary on their assumptions of microbiome data structure but ultimately compare taxa abundance between groups. While it is important to know which taxa are associated with phenotypic differences between samples, users may also wish to know if there are also potential functional differences between groups. MicrobiomeAnalyst implements two popular methods for predictive functional profiling - PICRUSt (1) and Tax4Fun (2), resulting in gene abundance tables that can be inputted in the SDP module.

1.1 Visual Exploration

Stacked bar/area plot

MicrobiomeAnalystR supports the creation of stacked bar (PlotTaxaAlphaBar) or area plots (PlotTaxaAlphaArea). Further color customization is available using the viridis R package. These color palettes (e.g. "viridis", "magma", "plasma", and "inferno") are robust to color blindness.

# Stacked bar plot, percentage abundance, genus level, viridis color palette
mbSet<-PlotTaxaAundanceBar(mbSet, "taxa_alpha_1","Genus","Group", "none", "barraw",10, "set3","sum",10, "bottom", "F", "png");

# Stacked area plot, class level, viridis color palette
mbSet<-PlotTaxaAbundanceArea(mbSet, "taxa_alpha_3","overview","Genus","Group", 10, "viridis","sum",10,"bottom", "F", "png");

Pie Charts

MicrobiomeAnalystR provides several functions to create pie charts from microbiome data. First, the PlotOverallPieGraph function plots a pie-chart of all samples at the user-specified taxonomic level. Second, the PlotGroupPieGraph plots a group-wise pie-chart of all samples within the user-specified group at the user-specified taxonomic level. Finally, the PlotSamplePieGraph function creates a pie-chart for a specific sample. To save all pie charts, first use the GetSeriesColors function followed by the SavePiechartImg function.

# Overall pie chart
mbSet<-PlotOverallPieGraph(mbSet, "Phylum", 10,"sum", 10, "bottom");
GetSeriesColors()
mbSet<-SavePiechartImg(mbSet, "Phylum","primary_piechart_0","png");

Heat Tree

The heat tree analysis (PrepareHeatTreePlot) uses the hierarchical structure of taxonomic classifications to quantitatively (using the median abundance) and statistically (using the non-parametric Wilcoxon Rank Sum test) depict taxonomic differences between microbial communities. It generates a differential heat tree using the metacoder R package that shows which taxa are more/less abundant in each group. The statistical comparison can be obtained from the "tax_diff_dm.csv" file outputted in your working directory. Please refer to the paper by Foster et al. for more details (PMID 28222096). The PrepareHeatTreePlot also incorporates the viridis R package for additional color palettes.

# Heat tree at the genus level comparing pediatric CD vs healthy controls. Only
# significant taxon (p < 0.05) are labelled. 
mbSet<-PrepareHeatTreePlot(mbSet, meta="Class", taxalvl="Genus", color="plasma", layoutOpt="reda", comparison="CD_vs_Control", 
                          wilcox.cutoff = 0.05, imgName="heat_tree_0", format="png", dpi=300)

1.2 Community Profiling

Alpha-diversity is a measure of within sample diversity, while beta-diversity is a measure of between/among samples diversity. Alpha-diversity measures can be considered summary statistics of the diversity of single samples, while beta-diversity estimates can be considered similarity scores between pairs of samples. For the latter, these measures permit further analyses via clustering or dimension reduction techniques. Various statistical tests can be applied to the sample clusters to evaluate significant differences. More details are available below.

1.21 Alpha Diversity

Alpha-diversity summarizes both the species richness (feature variety) and/or evenness (feature distribution) within a sample (59, 60). Six alpha-diversity measures are currently supported in MicrobiomeAnalyst, each assessing different aspects of the community. Observed calculates the total number of features per sample, while ACE and Chao1 estimates taxa richness by accounting for undetected features due to low abundance. Meanwhile, Shannon and Simpson take species richness and evenness into account, with varying weight given to evenness. Finally, Fisher models the community abundance structure as a logarithmic series distribution.

# Create a dot plot of alpha diversity measures. Samples are grouped by the "Class" experimental factor and colored
# using the viridis color palette.
mbSet<-PlotAlphaData(mbSet, "filt","alpha_diver_0","Chao1","Class","OTU", "default", "png");

# Create summary box plot of alpha diversity measures using the viridis color palette.
mbSet<-PlotAlphaBoxData(mbSet, "alpha_diverbox_0","Chao1","Class","default", "png");

# Calculate alpha-diversity significance using the parametric t-test (two groups).
mbSet<-PerformAlphaDiversityComp(mbSet, "tt","Class");

1.22 Beta Diversity

Beta-diversity evaluates differences in the community composition between samples (i.e. distance). Resulting beta-diversity estimates can be combined into a distance matrix and used for ordination to visualize how the microbial diversity of samples between groups have changed. MicrobiomeAnalyst supports the five most commonly used beta-diversity measures. Bray-Curtis dissimilarity uses abundance data and calculates differences in feature abundance. Jensen-Shannon divergence assess the distance between two probability distributions that account for the presence and abundance of microbial features. Jaccard distance instead uses just the presence or absence of features to calculate differences in microbial composition. Unweighted and weighted UniFrac uses the phylogenetic distance between features. The former is based purely on phylogenetic distance, while the latter is weighted by the relative abundance of features.

Beta-diversity measures can be visualized using either principal coordinates analysis (PCoA) or nonmetric multidimensional scaling (NMDS). Both methods take the distance matrix as input, where PCoA maximizes the linear correlation between samples and NMDS maximizes the rank order correlation between samples (61). Note that NMDS is iterative and may return different results for the same dataset. Furthermore, MicrobiomeAnalyst calculates a stress value for the NMDS plot, which is a measure of goodness-of-fit. Generally, values greater than 0.2 suggest a poor fit while values less than 0.1 are good. Users should use PCoA if distances between samples are so close that a linear transformation would suffice. Also, NMDS is suggested if users wish to highlight the gradient structure within their data (61, 62).

Ordination measures between the groups are assessed for statistical significance using either permutational MANOVA (PERMANOVA), Analysis of group similarities (ANOSIM) (63), or Homogeneity of group dispersions (PERMDISP). Generally, these tests evaluate global differences in microbiome composition between groups. PERMANOVA tests if the centroids of all groups are equivalent, or rather that the sample clusters are in the same location. It uses the distances (or dissimilarity) between samples of the same group and compares it to the distances between groups (64, 65). This method is sensitive to multivariate dispersions, therefore PERMDISP should also be used to evaluate if the dispersion (or variation) between samples differs from the dispersion between groups (64, 65). Meanwhile, ANOSIM tests whether within-group distances are greater or equal to between-group distances, using the ranks of all pair-wise sample distances. In other words, it evaluates whether samples within a group are as clustered as samples belonging to different groups.

# Create a PCoA score plot using the viridis custom color palette, data points are colored
# by their group label.
mbSet<-PlotBetaDiversity(mbSet, plotNm="beta_diver_0", ordmeth="PCoA", distName="bray", colopt="expfac", metadata="Class",
                         showlabel="none", taxrank="OTU", taxa="null", alphaopt="Chao1", ellopt="yes", format="png",
                         dpi=72, custom_col="viridis");

# Creates a json file for 3D PCoA, use the web to view the interactive 3D PCoA plot
    mbSet<-PCoA3D.Anal(mbSet, "PCoA","bray","OTU","expfac","Group","","Chao1","beta_diver3d_0.json")

# Calculate the beta-diversity significance
mbSet<-PerformCategoryComp(mbSet, "OTU", "adonis","bray","Group");

1.23 Core Microbiome

The core microbiome refers to the set of taxa that are detected in a high fraction of the population above a given abundance threshold. The count data is transformed to compositional (relative) abundances in order to perform such analysis. Use the CoreMicrobeAnalysis to calculate and plot the core microbiome analysis. The function also outputs a "core_microbiome.csv" file in your working directory.

# Create a core microbiome plot using the viridis color palette
mbSet<-CoreMicrobeAnalysis(mbSet, "core_micro_0",0.2,0.01,"OTU","bwm","overview", "all_samples", "Group", "CD", "png");

1.3 Clustering Analysis

1.31 Heatmap Clustering

Use the PlotHeatmap to create a heatmap of microbiome data at a preferred taxonomic level. The function supports various clustering algorithms and distance measures for performing hierarchical clustering.

# Create a heatmap at the OTU level using the plasma color palette
mbSet<-PlotHeatmap(mbSet, "heatmap_0","euclidean","ward.D","bwm","Group","OTU","overview","F", "png","T","F");

1.32 Dendogram Analysis

# Create a dendogram of the microbiome data at the OTU level.
mbSet<-PlotTreeGraph(mbSet, "plot_tree_0","bray","ward.D","Group","OTU", "default", "png");

1.33 Correlation Analysis


1.34 Pattern Search

MicrobiomeAnalystR supports the identification of patterns in user's data. Users can define the pattern using a specific taxa, a predefined profile, or a custom profile. First, use the FeatureCorrelation function to identify the patterns. This will output the "correlation_feature.csv" file in your working directory. Next, use the PlotCorr function to plot the results.

# Identify and plot the pattern 
mbSet<-Match.Pattern(mbSet, "pearson", "1-2", "Genus", "Group")
mbSet<-PlotCorr(mbSet, "ptn_1", "png", width=NA)

1.4 Univariate Analysis

MicrobiomeAnalystR supports both classical and model-based methods for differential abundance analysis. All methods test if each taxon is associated with a dichotomous outcome (e.g. case vs. control). As many methods are available, the choice of differential abundance analysis depends on the quality and stricture of a user’s data. Further, it is advisable for users to compare the results from multiple methods and visualize the features with boxplots to increase one’s confidence.

1.41 Classical Univariate Analysis

Both parametric (T-test/ANOVA) and non-parametric (Mann-Whitney/Kruskall-Wallis) algorithms are available, their use dependent on whether microbiome features are normally distributed. The null hypothesis of the T-test and ANOVA is that the means of two or more populations are equal, whereas for Mann-Whitney and Kruskall-Wallis, the null hypothesis is that the mean ranks of the groups are the same (63). As microbiome data is typically skewed from the norm, Mann-Whitney is recommended and widely used (7). However, classical methods do not account for the compositionality of the data which can result in spurious correlations (7).

# Classical non-parametric univariate analysis: Mann-Whitney/Kruskall Wallis
mbSet<-PerformUnivarTest(mbSet, "Group",0.05,"NA","OTU","nonpar");

# Classical parametric univariate analysis: T-test/ANOVA
mbSet<-PerformUnivarTest(mbSet, "Group",0.05,"NA","OTU","tt");

1.42 Model-based Univariate Analysis

For RNA-seq data, differential analysis is applied to identify differentially expressed genes, paralleling the aim of identifying differentially abundant taxa from microbiome data. RNA-seq based methods, including DESeq2 (19) and edgeR (50), have since been readily applied to the microbiome (42). Both methods are parametric, based on generalized linear models (GLMs), and assume read counts follow a Negative Binomial distribution (7). Where they differ is in the way they normalize the data. DESeq2 uses size factors to account for varying library sizes and shrinkage to correct for high dispersion (7). Meanwhile, edgeR can use different normalization methods, the most common being TMM (trimmed mean of M-values). Compared to transcriptomics data however, microbiome data are sparse and RNA-seq methods may not be appropriate as these models cannot account for excess zeros at lower taxonomic resolutions (i.e. OTU level) (64). To overcome such issues, a Zero-inflated Gaussian (ZIG) mixture model, implemented in the metagenomeSeq R package, was proposed (65). Several comparisons of model-based methods have demonstrated that metagenomeSeq tends to have a higher false positive rate as compared to DESeq2 and edgeR (42, 66). However, a large-scale benchmarking study recommended its use as it showed greater retrieval power (67). Moreover, another study found that DESeq2 showed higher sensitivity for small datasets (<20 samples/group) and is more conservative than edgeR, however it has a high FDR with large or uneven library sizes (7). Finally, another benchmarking of differential abundance algorithms on metagenomics data deemed DESeq2 and edgeR were most suitable (66).

1.43 RNAseq Methods

# edgeR at the OTU level
mbSet<-PerformRNAseqDE(mbSet, "EdgeR",0.05,"Group","NA","OTU");

# deseq2 at OTU level
mbSet<-PerformRNAseqDE(mbSet, "EdgeR",0.05,"Group","NA","OTU");

1.44 metagenomeSeq

# metegenomeSeq using the zero-inflated Gaussian fit model
mbSet<-PerformMetagenomeSeqAnal(mbSet, "Group",0.05,"NA","OTU","zigfit");

1.5 Biomarker Analysis

Two commonly used methods to select important features are available in MicrobiomeAnalyst. LEfSe is a biomarker discovery method that was first developed for metagenomics data while Random Forest is an ensemble machine-learning method. More details are available below.

1.51 LEfSe

Linear discriminant analysis effect size (LEfSe) is a non-parametric statistical method to identify microbial taxa that are significantly different between groups. LEfSe first utilizes the Kruskal-Wallis test to identify taxon abundances that are significantly different between groups. Linear discriminant analysis (LDA) is then applied to taxa that meet the significance threshold (i.e. P < 0.05) to estimate their effect size (70, 71). This approach considers all taxonomic levels simultaneously and outputs a ranked list of taxa (by LDA score) in order of taxa most likely to explain group differences. A significance level of P < 0.05 and LDA score of 2 is often used to determine taxa that best characterize each phenotype.

# First perform LefSe analysis
mbSet<-PerformLefseAnal(mbSet, 0.1, "fdr", 2.0, "Group","F","NA","OTU");

# Plot LefSe results
mbSet<-PlotLEfSeSummary(mbSet, 15, "dot", "bar_graph_0","png");

1.52 Random Forest

Random Forest (RF) is a supervised machine-learning algorithm that has been applied to microbiome data to identify microbial taxa that differentiate between phenotypes (72, 73). RF is well-suited for large and noisy data such as from the microbiome as it is able to identify non-linear relationships, deal with variable interactions, and is hard to overfit (74). RF works by constructing multiple decisions trees using a randomly selected subset (i.e. bootstrapped) of the training data. In the end, all trees are merged (i.e. ensemble method) via a majority vote for a more accurate prediction. To evaluate classification accuracy, the “out-of-bag (OOB) error rate” is also calculated. During decision tree construction, 1/3 of samples are omitted from the training samples and are subsequently classified using the models. The OOB error rate is assessed by cross-validating results between the predictions and true groups. Additionally, RF calculates variable importance of all features for classification which is useful for users to identify biomarkers, or variables relevant to groups. Variable importance is calculated as the mean decrease in accuracy when a variable is excluded (74)

# First perform RF analysis
mbSet<-RF.Anal(mbSet, 500,7,1,"Group","OTU")

# Plot the RF classification results
mbSet<-PlotRF.Classify(mbSet, 15, "rf_cls_0","png", width=NA)

# Plot the RF Variable Importance Plot
mbSet<-PlotRF.VIP(mbSet, 15, "rf_imp_0","png", width=NA)

References



xia-lab/MicrobiomeAnalystR documentation built on April 17, 2024, 7:45 p.m.