library(BiocStyle)
knitr::opts_chunk$set(warning=FALSE, error=FALSE, message=FALSE)


{metabolomicsR}: Tools to process, analyze, and visualize metabolomic data.

metabolomicsR is a streamlined R package to preprocess, analyze, and visualize metabolomic data. We included a set of functions for sample and metabolite quality control, outlier detection, missing value imputation, dimensional reduction, normalization, data integration, regression analysis, annotation, and visualization of data and results. The metabolomicsR is designed to be a comprehensive R package that can be easily used by researchers with basic R programming skills. The framework designed here is also versatile and extensible to various methods and metabolomic platforms. Here, we demonstrate the step-by-step use of the main functions from this package.

Figure: Seamless workflow to preprocess, analyze, and visualize metabolomic data in metabolomicsR


Installation

| Type | Command | |-------------|-----------------------------------------------------| | Development | remotes::install_github("XikunHan/metabolomicsR") |



Data structure

We first designed a “Metabolite” class based on the object-oriented programming system S4 in R. For a particular “Metabolite” data, it will include “assayData” (eg. peak area data or batch-normalized data, samples in rows and metabolites in columns), “featureData” (metabolite annotation), “sampleData” (sample annotation), “featureID”, “sampleID”, “logs” (log information of data analysis process), and “miscData” (other ancillary data). For metabolites, current platforms typically measure ~1,000 features. It will be much easier to manipulate the data with samples in rows and metabolites in columns, rather than the {SummarizedExperiment} container where the columns represent samples.



Import data

To demonstrate the package, we obtained metabolomic data from Qatar Metabolomics Study on Diabetes, similar to the data format from non-targeted mass spectrometry by Metabolon. The dataset is available via figshare.

The data can be imported directly from excel, csv or other formats that imported into R.

library(metabolomicsR)
library(data.table)
library(ggplot2)
library(cowplot)
library(plotROC)

# Load the metabolomic dataset from an excel file

file_path <- base::system.file("extdata", "QMDiab_metabolomics_OrigScale.xlsx", package = "metabolomicsR", mustWork = TRUE)

df_plasma <- load_excel(path = file_path,
                        data_sheet = 1,
                        feature_sheet = 4,
                        sample_sheet = 8,
                        sampleID = "QMDiab-ID",
                        featureID = "BIOCHEMICAL"
                        )

# To save the data in disc. 
# save_data(df_plasma, file = "~/test")

In the “assayData”, the first column is the sample IDs to match with “sampleData”, the other columns are metabolite IDs to match with “featureData”.


click to show plasma data

df_plasma

The {update_Metabolite} function provides several methods to update the data, including "injection_order", "keep_feature", "remove_feature", "keep_sample", "remove_sample", "add_sample_annotation", and "change_featureID".

# change the feature ID to the column `COMP_IDstr`
df_plasma <- update_Metabolite(df_plasma, dataset = "COMP_IDstr", action = "change_featureID")


click to show plasma data

df_plasma

# load urine metabolomic data
df_urine <- load_excel(path = file_path,
                        data_sheet = 2,
                        feature_sheet = 5,
                        sample_sheet = 9,
                        sampleID = "QMDiab-ID",
                        featureID = "BIOCHEMICAL"
                        )

df_urine <- update_Metabolite(df_urine, dataset = "COMP_IDstr", action = "change_featureID")


click to show urine data

df_urine

# load saliva metatabolomic data
df_saliva <- load_excel(path = file_path,
                        data_sheet = 3,
                        feature_sheet = 6,
                        sample_sheet = 10,
                        sampleID = "QMDiab-ID",
                        featureID = "BIOCHEMICAL"
                        )

df_saliva <- update_Metabolite(df_saliva, dataset = "COMP_IDstr", action = "change_featureID")


click to show saliva data

df_saliva



Quality control pipeline

We provided a pipeline for metabolite and sample quality control (QC) procedures. In the QC pipeline, we included the following functions: remove metabolites or samples beyond a particular missing rate threshold (e.g., 0.5), detect outliers (e.g., ± 5 standard deviations) and replace outliers with missing values or winsorize outliers, and various popular methods to impute missing values (e.g., half of the minimum value, median, zero, or nearest neighbor averaging [kNN] method). All the steps can be customized that have been implemented in the “QC_pipeline” function or be used from each individual function (eg. “filter_column_missing_rate”, “replace_outlier”, and “impute”).

First, we display the QC metrics using the {plot_QC} function

p <- plot_QC(df_plasma)
p$p

click to show plasma QC metrics

p

df_plasma_QC <- QC_pipeline(df_plasma, replace_outlier_method = "winsorize", impute_method = NULL)

click to show plasma data after QC

df_plasma_QC

For the urine QC metrics:

p <- plot_QC(df_urine)
p$p
df_urine_QC <- QC_pipeline(df_urine, replace_outlier_method = "winsorize", impute_method = NULL)

click to show urine data after QC

df_urine_QC

p <- plot_QC(df_saliva)
p$p
df_saliva_QC <- QC_pipeline(df_saliva, replace_outlier_method = "winsorize", impute_method = NULL)

click to show saliva data after QC

df_saliva_QC



Transformation

Transformation of metabolites can alter the distribution of data and is an essential step for the following statistical analysis. We provided the following transformation methods: log (natural logarithm), pareto scale, scale, and rank-based inverse normal transformation.

df_plasma_QC <- impute(df_plasma_QC, method = "half-min")
df_plasma_scale <-  transformation(df_plasma_QC, method = "log")
df_plasma_scale <-  transformation(df_plasma_scale, method = "scale")

click to show plasma data after scaling

df_plasma_scale



Visualization

Boxplot: before transformation

# if no features were selected, randomly show 16 metabolites
plot_Metabolite(df_plasma_QC, plot = "boxplot", x = "T2D", color ="ETHNICITY", shape = "T2D")
# select three metabolites
plot_Metabolite(df_plasma_QC, x = "T2D", plot = "boxplot", feature_name = c("M43027",  "M11953", "M38002"))



Visualization: comparisons between groups

# comparisons between groups using `ggbetweenstats`
plot_Metabolite(df_plasma_QC, x = "T2D", plot = "betweenstats",  feature_name = c("M43027",  "M11953"))



Boxplot after transformation

plot_Metabolite(df_plasma_scale, plot = "boxplot", x = "T2D", color ="ETHNICITY", shape = "T2D")



Visualization: histogram

plot_Metabolite(df_plasma_scale, plot = "histogram", color = "T2D")

plot_Metabolite(df_plasma_scale, plot = "histogram", color = "ETHNICITY")



Normalization

Normalization of metabolomics data is an important step to remove systematic variation, and to facilitate the identification of real biological metabolites. We provide popular normalization methods: batch-based normalization, QC sample-based/nearest QC sample-based normalization, and LOESS normalization. The latter three methods are useful when the measurements of internal quality control samples are provided. Briefly, in the “batch_norm” function, we implemented a batch-based normalization method that the raw values were divided by the median values of samples in each instrument batch to force the median value to one in each batch. “QCmatrix_norm” function will use reference quality control samples to normalize raw metabolite measurement values.

click to show urine toy example for normalization

# Take the urine data for example, the batch of the samples are missing. To illustrate the usage of the normalization method, we assume that 120 samples in each batch. 
unique(featureData(df_urine_QC)$PLATFORM)

# add new columns by reference (column name uppercase)
sampleData(df_urine_QC)[, `GC/MS` := rep(1:3, each = 120)[1:359]]
sampleData(df_urine_QC)[, `LC/MS NEG` := rep(1:3, each = 120)[1:359]]
sampleData(df_urine_QC)[, `LC/MS POS` := rep(1:3, each = 120)[1:359]]

# we also added an injection order column (in the current order)
df_urine_QC <- update_Metabolite(df_urine_QC, sampleData(df_urine_QC)[, 1], action = "injection_order")

# and set one QC sample for each 10 samples. 

sampleData(df_urine_QC)[, QCsample := ifelse(1:359%%10 == 0, 1, 0)]
sampleData(df_urine_QC)[, QCsample := factor(QCsample, levels = c(1, 0))]

sampleData(df_urine_QC)[, sampleID := ifelse(1:359 %% 10 == 0, paste0("MTRX-", sampleID), sampleID)]

table(sampleData(df_urine_QC)$QCsample)

assayData(df_urine_QC)$sampleID <- sampleData(df_urine_QC)$sampleID # also change the ID in assayData
df_urine_QC

We then selected the first 16 metabolites to show boxplot

v_features <- featureData(df_urine_QC)$featureID[1:16]

plot_injection_order(df_urine_QC, color = "QCsample", shape = "GC/MS", feature_name = v_features)



Normalization 1: batch-norm

df_urine_QC_norm <- batch_norm(df_urine_QC)
p1 <- plot_injection_order(df_urine_QC_norm, color = "QCsample", shape = "GC/MS", feature_name = v_features)
p1


click to show other normalization methods

Normalization 2: QCmatrix_norm

df_urine_QC_norm <- QCmatrix_norm(df_urine_QC)
p2 <- plot_injection_order(df_urine_QC_norm, color = "QCsample", shape = "GC/MS", feature_name = v_features)
p2

Normalization 3: nearestQC_norm

df_urine_QC_norm <- nearestQC_norm(df_urine_QC)
p3 <- plot_injection_order(df_urine_QC_norm, color = "QCsample", shape = "GC/MS", feature_name = v_features)
p3

Normalization 4: modelling_norm

df_urine_QC_norm_LOESS <- modelling_norm(df_urine_QC, method = "LOESS")
p4 <- plot_injection_order(df_urine_QC_norm, color = "QCsample", shape = "GC/MS", feature_name = v_features)
p4

# to benchmark the performance
df_urine_QC_norm_KNN <- modelling_norm(df_urine_QC, method = "KNN", k = 3) 
p5 <- plot_injection_order(df_urine_QC_norm_KNN, color = "QCsample", shape = "GC/MS", feature_name = v_features)
p5


# test XGBoost
# df_urine_QC_norm_xgboost <- modelling_norm(df_urine_QC, method = "XGBoost") 
# p6 <- plot_injection_order(df_urine_QC_norm_xgboost, color = "QCsample", shape = "GC/MS", feature_name = v_features)
# p6



Dimensional reduction

Dimensional reduction strategies on metabolites data can be used to detect batch effects, sample outliers, and real biological subgroups. We included principal components analysis (PCA), manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (tSNE) methods.

df_plasma_PCA <- run_PCA(df_plasma_QC)

plot_PCA(df_plasma_PCA, color ="ETHNICITY", shape = "T2D")

plot_UMAP(df_plasma_QC, color ="ETHNICITY", shape = "T2D")

plot_tsne(df_plasma_QC, color ="ETHNICITY", shape = "T2D")



Correlation

# pairwise correlation of metabolites between two different fluids

# plasma vs urine
dd <- correlation(df_plasma_QC, df_urine_QC, method = "spearman")
dd <- merge(dd, featureData(df_plasma_QC), by.x = "term", by.y = "featureID")

p <- ggplot(dd, aes(x = SUPER_PATHWAY, y = r, fill = SUPER_PATHWAY)) +
  geom_boxplot() +
  geom_jitter( size=0.4, alpha=0.9) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(9, "Set1")) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))+
  labs(title = "plasma vs urine")

p

click to show plasma vs saliva & urine vs saliva

# plasma vs saliva
dd <- correlation(df_plasma_QC, df_saliva_QC, method = "spearman")
dd <- merge(dd, featureData(df_plasma_QC), by.x = "term", by.y = "featureID")

p2 <- ggplot(dd, aes(x = SUPER_PATHWAY, y = r, fill = SUPER_PATHWAY)) +
  geom_boxplot() +
  geom_jitter(size=0.4, alpha=0.9) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(9, "Set1")) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "plasma vs saliva")


# urine vs saliva
dd <- correlation(df_urine_QC, df_saliva_QC, method = "spearman")
dd <- merge(dd, featureData(df_urine_QC), by.x = "term", by.y = "featureID")

p3 <- ggplot(dd, aes(x = SUPER_PATHWAY, y = r, fill = SUPER_PATHWAY)) +
  geom_boxplot() +
  geom_jitter(size=0.4, alpha=0.9) +
   scale_fill_manual(values = RColorBrewer::brewer.pal(9, "Set1")) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  labs(title = "urine vs saliva")


p <- cowplot::plot_grid(p2, p3, nrow = 1)
p



Association analysis

Association analysis 1: linear regression

Association analysis between metabolites and interested outcomes was implemented in the {regression} function, supporting general linear regression, logistic regression, poisson regression, proportional hazards regression model, linear mixed-effects model, and logistic linear mixed-effects model, with or without covariates. All the regression models can be run for selected metabolites or all metabolites in a single job with the support of parallel computing. We also provided the option to run linear regression models where a metabolite is the outcome (an example is shown in the Section "## Extension 4: Nightingale’s NMR platform" below).

fit_lm <- regression(object = df_plasma_scale, phenoData = NULL, 
                     model = "lm", outcome = "BMI",
                     covars = c("AGE", "GENDER", "ETHNICITY"), 
                     factors = "ETHNICITY")

head(fit_lm)

dd <- merge(fit_lm, featureData(df_plasma_scale), by.x = "term", by.y = "featureID")

dd[, sig := ifelse(p.value.adj < 0.1, 1, 0)]
plot_volcano(dd, color = NULL, label = "BIOCHEMICAL")

Association analysis 1: linear regression -- annotation in HMDB

df_sig <- dd[sig == 1 & !is.na(HMDb_ID), ]


library(stringr)

# this function is used to change HMDB ID.
add_two_zeros <- function(x) {
  if(str_count(x) == 9) x <- str_replace(x, "HMDB", "HMDB00")
  return(x)
}

v_list <- sapply(df_sig$HMDb_ID, add_two_zeros)
v_list

df_sig$HMDB_anno <- anno_hmdb(v_list)
# df_sig

# explore the first entry
df_sig$HMDB_anno[[1]]@metabolite
df_sig$HMDB_anno[[1]]@diseases



Association analysis 2: logistic regression

fit_glm <- regression(object = df_plasma_scale, phenoData = NULL, 
                      model = "logistic", outcome = "T2D",
                      covars = c("AGE","GENDER", "ETHNICITY"), 
                      factors = "ETHNICITY")

head(fit_glm)
dd <- merge(fit_glm, featureData(df_plasma_scale), by.x = "term", by.y = "featureID")

dd[, sig := ifelse(p.value.adj < 0.1, 1, 0)]

plot_volcano(dd, label = "BIOCHEMICAL")


Association analysis 3: multiple outcomes, using model = "auto"

fit2 <- regression(object = df_plasma_scale, phenoData = NULL, 
                   model = "auto", outcome = c("BMI", "T2D"),
                   covars = c("AGE","GENDER", "ETHNICITY"), 
                   factors = "ETHNICITY")

head(fit2)
dd2 <- merge(fit2, featureData(df_plasma_scale), by.x = "term", by.y = "featureID")

dd2[, sig := ifelse(p.value.adj < 0.1, 1, 0)]
plot_volcano(dd2, color = "outcome", label = "BIOCHEMICAL")



ROC

# ROC for one variable, to add a wrapped for many metabolites separately
plot_ROC(object = df_plasma_scale, y = "T2D", x = "M00527")
plot_ROC(object = df_plasma_scale, y = "T2D", x = c("M00527", "BMI"))



ROC: compare two models

plot_ROC(object = df_plasma_scale, y = "T2D", 
         model_a = c("BMI", "AGE", "GENDER"), 
         model_b = c("M00527", "BMI", "AGE", "GENDER"))



Extension

The metabolomicsR framework is versatile and is extensible to other various methods, measurement platforms, and omics data types. Developers can customize new visualization and statistical methods, or provide an interface to deploy available statistical methods based on the metabolomicsR data structure.

Extension 1: Negative binomial generalized linear models

We have provided various regression models in the {metabolomicsR} package. Here we show how to extend the {regression} function to other models.

# for example, to fit a Negative Binomial Generalized Linear Model, we first define a `fit_glm.nb` function. 

library(MASS)

fit_glm.nb <- function(data = NULL, formula = NULL, keep = NULL) {
  v_var <- all.vars(formula)
  df <- data[, v_var, with = FALSE]
  fit <- tryCatch(
    do.call("glm.nb", args = list(data = df, formula = formula)),
    error = function(e) {
      cat(paste0("Failed to fit model: ", e), "\n")
    })
  if(inherits(fit, "glm")) {
    res  <- as.data.table(broom::tidy(fit))
    if(! is.null(keep)) res <- res[res$term %in% keep, ]
    res$n <- length(fit$residuals)
  } else {
    res <- list(estimate = NA)
  }
  return(res)
}

# the outcome 'toy_y' was only created to demonstrate the usage to extend regression models

sampleData(df_plasma_scale)[, toy_y := round(BMI, 0)]

# equivalent to:
if(0) {
  dd <- sampleData(df_plasma_scale) 
  dd[, toy_y := round(BMI, 0)]
  sampleData(df_plasma_scale)  <- dd
}

fit <- regression(object = df_plasma_scale, phenoData = NULL, model = "glm.nb", outcome = "toy_y",
                  covars = c("AGE","GENDER", "ETHNICITY"), factors = "ETHNICITY", 
                  feature_name = featureData(df_plasma_scale)$featureID[1:50])
dd <- merge(fit, featureData(df_plasma_scale), by.x = "term", by.y = "featureID")
dd[, sig := ifelse(p.value.adj < 0.1, 1, 0)]
plot_volcano(dd, color = NULL, label = "BIOCHEMICAL")


Extension 2: Call available methods and packages

Here, we show an extension example to detect genuine untargeted metabolic features based on a previous method (genuMet):

res_genuMet <- genuMet_makefeature(df_plasma)
str(res_genuMet)


Extension 3: Bridge to maplet package

In this section, we show two functions to bridge the data objects between R package maplet and our package metabolomicsR.

# load the data from maplet. 
# the maplet package is available at: https://github.com/krumsieklab/maplet 

# not run here
if(0) {
  library(maplet)

file_data <- system.file("extdata", "example_data/simulated_data.xlsx", package = "maplet")
file_data 

maplet_data <-
  maplet::mt_load_checksum(file=file_data, checksum = "80afcd72481c6cf3dcf83342e3513699") %>%
  maplet::mt_load_xls(file=file_data, sheet="data", samples_in_row=T, id_col="sample") %>%
  maplet::mt_anno_xls(file=file_data, sheet="metinfo",anno_type="features", anno_id_col="name", data_id_col = "name") %>%
  maplet::mt_anno_xls(file=file_data, sheet="clin", anno_type="samples", anno_id_col ="sample", data_id_col ="sample") 

maplet_data

d <- maplet2Metabolite(maplet_data)

d2 <- Metabolite2maplet(d)
d2
}

Extension 4: Nightingale’s NMR platform

In this section, we show the application of our metabolomicsR package to Nightingale’s NMR platform data, including data import, transformation, visualization, and association analysis.

# the Nightingale data were downloaded from https://nightingalehealth.github.io/ggforestplot/articles/nmr-data-analysis-tutorial.html

library(data.table)
df_data <- fread("~/Documents/test/NightingaleData/12345-Results.csv")
setnames(df_data, 1, "sampleID")

df_sample <- fread("~/Documents/test/NightingaleData/clinical_data.csv")
setnames(df_sample, 1, "sampleID")
df_sample$sampleID <- df_data$sampleID

df_feature <- data.table(featureID = names(df_data)[-1])

# create Metabolite object
df <- create_Metabolite(df_data, df_feature ,df_sample)

sampleData(df)[, obesity := ifelse(BMI >= 30, 1, 0)]

plot_Metabolite(df, plot = "boxplot", x = "obesity", color ="obesity", shape = "obesity", feature_name = "GlycA")


# Transformation
df <-  transformation(df, method = "log1p")
df <-  transformation(df, method = "scale")


# Association of BMI (as exposure) and each metabolite (as outcome)
fit <- regression_each_as_outcome(object = df, phenoData = NULL, exposure = "BMI", 
                                  feature_name = c("Total_C", "non_HDL_C", "Remnant_C", "VLDL_C", "Clinical_LDL_C"),
                                  covars = c("gender", "baseline_age"), factors = "gender")


# the results were the same as the "nmr-data-analysis-tutorial.html"
fit


References




How to cite metabolomicsR

Han, Xikun, and Liming Liang. 2022. “metabolomicsR: A Streamlined Workflow to Analyze Metabolomic Data in R.” Bioinformatics Advances 2 (1): vbac067. https://doi.org/10.1093/bioadv/vbac067

Session Info

sessionInfo()



XikunHan/metabolomicsR documentation built on Feb. 7, 2024, 1:17 p.m.