knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, eval = FALSE)
In this notebook we aim to demonstrate how the MGnifyR
tool can be used to
fetch data of a MGnify microbiome data resource. We then showcase how to analyze
the datausing advanced microbiome data science tools, including estimating alpha
and beta diversity, as well as performing differential abundance analysis.
MGnifyR
is an R/Bioconductor package that provides a set of tools for easily accessing
and processing MGnify data in R, making queries to MGnify databases through the
MGnify API.
The benefit of MGnifyR
is that it streamlines data access, allowing users to
fetch data either in its "raw" format or directly as a
TreeSummarizedExperiment
(TreeSE
)
object. This enables seamless integration with custom workflows for analysis.
Utilizing TreeSE
provides access to a wide range of tools within
Bioconductor's SummarizedExperiment
(SE
) ecosystem. It also integrates
with the
mia
package, which offers
microbiome-specific methods within the SE
framework.
For more information on microbiome data science in Bioconductor, refer to Orchestrating Microbiome Analysis (OMA) online book.
# List of packages that we need packages <- c("ANCOMBC", "MGnifyR", "mia", "miaViz", "scater") # Get packages that are already installed packages_already_installed <- packages[ packages %in% installed.packages() ] # Get packages that need to be installed packages_need_to_install <- setdiff( packages, packages_already_installed ) # Loads BiocManager into the session. Install it if it is not already installed. if( !require("BiocManager") ){ install.packages("BiocManager") library("BiocManager") } # If there are packages that need to be installed, installs them with BiocManager # Updates old packages. if( length(packages_need_to_install) > 0 ) { install(packages_need_to_install, ask = FALSE) } # Load all packages into session. Stop if there are packages that were not # successfully loaded pkgs_not_loaded <- !sapply(packages, require, character.only = TRUE) pkgs_not_loaded <- names(pkgs_not_loaded)[ pkgs_not_loaded ] if( length(pkgs_not_loaded) > 0 ){ stop( "Error in loading the following packages into the session: '", paste0(pkgs_not_loaded, collapse = "', '"), "'") }
To interact with the MGnify database, we need to create a MgnifyClient
object.
This object allows us to store options for data fetching. For instance, we can
configure it to use a cache for improved efficiency.
#| output: false # Create the MgnifyClient object with caching enabled mg <- MgnifyClient( useCache = TRUE, cacheDir = "/home/training" # Set this to your desired cache directory )
In this workflow, we will fetch taxonomy annotations and metadata from the study "MGYS00005154". The dataset focuses on the human gut microbiome, analyzed across different geographic regions.
We can now search for all analyses associated with the certain study. The analysis refers to metagenomic runs performed to samples. Each sample can have multiple runs made, which is why we work with analyses and not with samples; analysis identifier points to a single entity.
#| output: false study_id <- "MGYS00005154" analysis_id <- searchAnalysis(mg, "studies", study_id)
Now we are ready to load the metadata on the analyses to get idea on what kind of data we are dealing with.
There are currently (17 Sep 2024) almost 1,000 analyses available. Downloading whole dataset will take some time, which is why we use memory cache.
metadata <- getMetadata(mg, accession = analysis_id)
We can see that there are analyses that are performed with different pipelines. Let's take only those analyses that are generated with the pipeline version 5.0.
metadata <- metadata[metadata[["analysis_pipeline-version"]] == "5.0", ]
We have now analyses that each point to unique sample. The final step is
to fetch abundance tables in TreeSummarizedExperiment
(TreeSE
) format.
tse <- getResult( mg, accession = metadata[["analysis_accession"]], get.func = FALSE ) tse
The fetched data is TreeSE
object, including taxonomy annotations. See
OMA online book
on how to handle the data in this format.
Below, we agglomerate the data to the Order level, meaning we summarize the abundances at this specific taxonomic rank. The OMA provides a detailed chapter explaining agglomeration in more depth.
tse_order <- agglomerateByRank(tse, rank = "Order")
Because of the unique properties of microbiome data, we have to apply transformations. Here, we perform relative transformation. You can find more information on transformations from OMA.
# Transform the main TreeSE tse <- transformAssay(tse, method = "relabundance") # Transform the agglomerated TreeSE tse_order <- transformAssay(tse_order, method = "relabundance")
Alpha diversity measures community diversity within a sample. Learn more on community diversity from here.
# Calculate alpha diversity tse <- addAlpha(tse) # Create a plot p <- plotColData( tse, y = "shannon_diversity", x = "sample_geographic.location..country.and.or.sea.region.", show_boxplot = TRUE ) p
We can test whether the diversity differences are statistically significant. We utilize Mann Whithney U test (or Wilcoxon test).
pairwise.wilcox.test( tse[["shannon_diversity"]], tse[["sample_geographic.location..country.and.or.sea.region."]], p.adjust.method = "fdr" )
To add p-values to the plot, see OMA.
We can assess the differences in microbial compositions between samples, aiming to identify patterns in the data that are associated with covariates.
To achieve this, we perform Principal Coordinate Analysis (PCoA) using Bray-Curtis dissimilarity.
# Perform PCoA tse <- runMDS( tse, FUN = getDissimilarity, method = "bray", assay.type = "relabundance" ) # Visualize PCoA p <- plotReducedDim( tse, dimred = "MDS", colour_by = "sample_geographic.location..country.and.or.sea.region." ) p
See community similarity chapter from OMA for more information.
In DAA, we analyze whether abundances of certain features vary between study groups. Again, OMA has a dedicated chapter also on this topic.
# Perform DAA res <- ancombc2( data = tse_order, assay.type = "counts", fix_formula = "sample_geographic.location..country.and.or.sea.region.", p_adj_method = "fdr", )
Next we visualize features that have the lowest p-values.
# Get the most significant features n_top <- 4 res_tab <- res[["res"]] res_tab <- res_tab[order(res_tab[["q_(Intercept)"]]), ] top_feat <- res_tab[seq_len(n_top), "taxon"] # Create a plot p <- plotExpression( tse_order, features = top_feat, assay.type = "relabundance", x = "sample_geographic.location..country.and.or.sea.region.", show_boxplot = TRUE, show_violin = FALSE, point_shape = NA ) + scale_y_log10() p
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.