1. Introduction

The metabolome is well-known to be a sensitive measure of health and disease, reflecting alterations to the genome, proteome, and transcriptome, as well as changes in life-style and environment. As such, one common goal of metabolomic studies is biomarker discovery, aiming to identify a metabolite or a set of metabolites capable of classifying conditions or disease, with high sensitivity (true-positive rate) and specificity (true negative rate). This is achieved through building predictive models of one or multiple metabolites and evaluating the performance, or robustness of the model, to classify new patients into diseased or healthy categories. The main steps for biomarker analysis are as follows:

1) Biomarker selection

2) Performance evaluation

3) Model creation

MetaboAnalystR provides users with several options for single (classical) or multiple biomarker analysis, as well as for predictive biomarker model creation and evaluation.

For a comprehensive introductory tutorial and further details concerning biomarker analysis, please refer to Xia et al. 2013 (PMID: 23543913).

2. Importing data

The biomarker analysis module accepts either a compound concentration table, spectral binned data, or a peak intensity table. The format of the data must be specified, identifying whether the samples are in rows or columns, and whether or not the data is paired. The data may either be .csv or .txt files. For the biomarker analysis vignette, we will be using the plasma_nmr.csv file which can be found online. The data follows the uploading, processing, and filtering steps as per other modules. For information on these steps, please refer to the Introduction to MetaboAnalystR vignette for details.

The normalization step for biomarker analysis, however, includes an additional option for calculating ratios between metabolite concentrations. Ratios between two metabolite concentrations may provide more information than the two individual metabolite concentrations seperately. MetaboAnalystR will compute ratios between all possible metabolite pairs and then choose the top ranked ratios (based on p-values) to be included with the data for further biomarker analysis. Please note, there is a potential overfitting issue associated with this procedure. The main purpose of computing ratios of metabolite concentrations is to improve the chances of biomarker discovery, therefore users will need to validate their performance in future, independent studies. Log normalization of the data will be performed during the process. If you would like to perform data scaling, perform it separately in a second step.

For the tutorial, we will be using a dataset consisting of metabolite concentrations of 90 human plasma samples measured by 1H NMR. Phenotype labels: 0 - Controls; 1 - Patients.

library(MetaboAnalystR)

# Create objects for storing processed data from biomarker analysis 
mSet<-InitDataObjects("conc", "roc", FALSE)

# Read in data and fill in the dataSet list
mSet<-Read.TextData(mSet, "http://www.metaboanalyst.ca/MetaboAnalyst/resources/data/plasma_nmr_new.csv", "rowu", "disc")

# Sanity check, replace missing values, check if the sample size is too small
mSet<-SanityCheckData(mSet)
mSet<-ReplaceMin(mSet)
mSet<-IsSmallSmplSize(mSet)
mSet<-PreparePrenormData(mSet)

###### *** OPTION 1 FOR NORMALIZATION
# Perform no normalization, no ratio calculation
mSet<-Normalization(mSet, "NULL", "NULL", "NULL", ref=NULL, ratio=FALSE, ratioNum=20)

To perform normalization with ratio calculation use Option 2...

###### *** OPTION 2 FOR NORMALIZATION
# No normalization, and computeS metabolite ratios and includeS the top 20 
mSet<-Normalization(mSet, "NULL", "NULL", "NULL", "C01", ratio=TRUE, ratioNum=20)

# If ratio = TRUE: view the normalized dataset including the top ranked ratios
# The ratios will be towards the end of the matrix 
mSet$dataSet$norm

#If ratio = TRUE: view just the top ranked included ratios
mSet$dataSet$ratio

3. Classical ROC Curve Analysis

To perform classical ROC curve analysis, you must first set the analysis mode to "univ" using SetAnalysisMode. The second step is to prepare the data for biomarker analysis using PrepareROCData. The third step is to perform the univariate ROC curve analysis with Perform.UnivROC. This function must be called for each feature separately. You must specify whether or not to compute the 95% confidence interval band (isAUC), whether or not to compute the optimal cutoff (isOpt), and whether to calculate a full AUC or partial AUC (isPartial). Note that when performing univariate ROC curve analysis on metabolite ratio pairs, you cannot include a "\" in the image name (imgName).

Use the CalculateFeatureRanking function to calculate the feature importance, ranked by order of most to least important, based on AUC. This creates a CSV file titled "metaboanalyst_roc_univ.csv" including the AUC, p-value, and log-2-fold-change for each metabolite feature. Keep the name "feat.rank.mat" to ensure that the sweave report will work.

# Set the biomarker analysis mode to perform Classical ROC curve analysis ("univ")
mSet<-SetAnalysisMode(mSet, "univ")

# Prepare data for biomarker analysis
mSet<-PrepareROCData(mSet)

### OPTION 1 Perform univariate ROC curve analysis ###
mSet<-Perform.UnivROC(mSet, feat.nm = "Isoleucine", imgName = "Isoleucine", "png", dpi=300, isAUC=F, isOpt=T, optMethod="closest.topleft", isPartial=F, measure="sp", cutoff=0.2)

# Create box plot showing the concentrations of the selected compound between the groups
mSet<-PlotBoxPlot(mSet, "Isoleucine", "Isoleucineboxplot_0_", "png", 72, T, FALSE)

# Perform calculation of feature importance (AUC, p value, fold change)
mSet<-CalculateFeatureRanking(mSet)

Other options are also available to perform univariate ROC curve analysis, as exemplified below:

### OPTION 2 Perform univariate ROC curve analysis, resulting in a partial AUC with a 95% CI band ###
mSet<-Perform.UnivROC(mSet, feat.nm = "Valine", imgName = "Valine", "png", dpi=300, isAUC=T, isOpt=T, optMethod="closest.topleft", isPartial=T, measure="se", cutoff=0.2)

### OPTION 3 Perform univariate ROC curve analysis on a metabolite ratio pair, note that you cannot save an image with a "\" in the name ###
mSet<-Perform.UnivROC(mSet, feat.nm = "Isoleucine/Valine", imgName = "IsoleucineValine", "png", dpi=300, isAUC=T, isOpt=T, optMethod="closest.topleft", isPartial=T, measure="se", cutoff=0.2)

4. Multivariate ROC Curve Explorer

The aim of the multivariate exploratory ROC curve analysis is to evaluate the performance of biomarker models created through an automated important feature identification step. To perform multivariate ROC curve exploration, you must set the analysis mode to "explore" using SetAnalysisMode. The second step is to prepare the data for biomarker analysis using PrepareROCData. The third step is to perform the multivariate ROC curve analysis with PerformCV.explore. This function creates ROC curves generated through Monte-Carlo cross validation (MCCV) using balanced subsampling. In each MCCV, part of the samples are used to evaluate feature importance, and the remaining samples are used to validate the models created with the first step. The top ranking features (max top 100) in terms of importance are used to build the biomarker classification models. This is repeated several times to calculate the performance and confidence intervals of each model. At minimum, users must specify the classification method (cls.method) and the ranking method (rank.method). There are three multivariate algorithms used for ROC curve analysis, partial least squares discriminant analysis (PLS-DA), random forest, and support vector machines (SVM). For PLS-DA classification, users have the option to input the number of latent variables (lvNum), by default it is 2. As well, users may change the proportion of samples to use for training (propTraining). By default, propTraining = 2/3. For users with large datasets (> 1000 features), the univariate feature ranking method is recommended to avoid long computation times.

There are several options to visualize the multivariate ROC curve exploration. The PlotROC function creates a plot of ROC curves for all of the top biomarker models or a single biomarker model based on its average performance across all Monte-Carlo cross validation runs. For a single biomarker, the 95% confidence interval can be computed (show.conf = 1). Meanwhile, the PlotProbView function creates a plot of the predicted class probabilities of all samples using a single biomarker model. Users have the option to label wrongly classified samples. The PlotAccuracy function creates a plot of the predictive accuracy of the top biomarker models with an increasing number of features on the x-axis. Finally, the PlotImpVars function creates a plot of the top ranked important variables of a selected model. Users have the option to change the number of important features to include in the plot (feat.num) as well as the method to rank the features by (measure).

# Set the biomarker analysis mode to perform Multivariate exploratory ROC curve analysis ("explore")
mSet<-SetAnalysisMode(mSet, "explore")

# Prepare data for biomarker analysis
mSet<-PrepareROCData(mSet)

# Perform multivariate ROC curve analysis, using SVM classification and ranking
mSet<-PerformCV.explore(mSet, cls.method = "svm", rank.method = "svm", lvNum = 2)

### OPTION 1 Comparison plot of ROC curves of all models ###
mSet<-PlotROC(mSet, imgName = "ROC_all_models", format = "png", dpi = 300, mdl.inx= 0, avg.method = "threshold", show.conf = 0, show.holdout = 0, focus="fpr", cutoff=0.5)

# Plot predicted class probabilities for each sample for a selected model, not showing labels of wrongly classified samples
mSet<-PlotProbView(mSet, imgName = "multi_roc_prob", format = "png", dpi = 300, mdl.inx = -1, show = 0, showPred = 0)

# Plot the predictive accuracy of models with increasing number of features
mSet<-PlotAccuracy(mSet, imgName = "multi_roc_accuracy", format = "png", dpi = 300)

# Plot the most important features of a selected model ranked from most to least important
mSet<-PlotImpVars(mSet, imgName = "multi_roc_impvar", format="png", dpi=300, mdl.inx = -1, measure="freq", feat.num=15)

The above workflow plotted ROC curves for all models, to plot the ROC curve for a single model instead use:

### OPTION 2 Plot the ROC curve of a single selected model, in this case model 1 and display the confidence interval ###
mSet<-PlotROC(mSet, imgName = "ROC_model1", format = "png", dpi = 300, mdl.inx = 1, avg.method = "threshold", show.conf = 1, 0, "fpr", 0.2)

Some users may wish to extract a selected model. For instance, say the best model has 10 variables. To view the top-ranked 10 variables use the code below. This will return a 2-column matrix with all the variable ranked by how best they perform.

imp.feats <- GetImpFeatureMat(mSet, mSet$analSet$multiROC$imp.cv, 10)

head(imp.feats)

#                       Rank Freq. Importance
# Glycerol                     0.82 0.15103973
# Methionine                   0.60 0.08181467
# Acetate                      0.40 0.06299938
# Betaine                      0.36 0.06840905

5. ROC Curve Based Model Creation and Evaluation

The aim of ROC curve based model creation and evaluation is to allow users to manually select any combination of features to create biomarker models using any of the three algorithms mentioned previously (PLS-DA, SVM, RF). To do this, follow the selected.nms example in the R code chunk below, which selects the names of the features to use to build a ROC biomarker model. This module also allows users to withhold a subset of samples for extra validation purposes. Follow the selected.smpls example below to select the subset of samples to be withheld. Finally, this module also allows users to predict class labels of new samples (samples without class labels). Features should be selected based on the user's own judgement or prior knowledge. Note, selection of features based solely on their overall ranks (AUC, T-statistic or fold changes) increases the risk of overfitting. These features may be the best biomarkers for a user's own data, but not for new samples. Additionally, in order to get a decent ROC curve for validation, it is recommended that the hold-out data contains a balanced number of samples from both groups and that it contain at least 8 hold-out samples (i.e. 4 from each group).

For this module, we will be using the "plasma_nmr_new.csv" file, which has 77 human plasma samples, 12 of which have empty/unknown class labels. Their class will be predicted later using this module. To perform multivariate ROC curve based model creation and evaluation, set the analysis mode to "test" using SetAnalysisMode. Next, prepare the data for biomarker analysis using PrepareROCData. Use CalculateFeatureRanking to view the AUC, t-test, and fold-change for all features. Then users must select a subset of features and samples for model creation and validation. Finally, perform ROC curve analysis with PerformCV.test. With this function, users have the options to select one of three algorithms used for ROC curve analysis (method), to input the number of latent variables if using PLS-DA (lvNums), to choose the proportion of samples used for training (propTraining), and finally to select the the number of MCCV runs to perform (nRuns).
There are several options to visualize the ROC Curve Based Model Creation and Evaluation. The PlotROCTest function creates a plot of the ROC curve for the created biomarker model. The 95% confidence interval can be computed (show.conf = 1). Meanwhile, the PlotProbViewTest function creates a plot of the predicted class probabilities of all samples for the biomarker model. Users have the option to label wrongly classified samples. The PlotAccuracy function creates a box-plot of the predictive accuracy of the biomarker model with predictive accuracy on the y-axis. Users can use GetAccuracyInfo to view the accuracy of the model based on cross-validation and the accuracy of the model based on the held-out data. Using this, users can get a sense of whether or not their model is over-fitted. The typical rule of thumb is that if a model's performance on the cross-validated data is higher than on the held-out data, the model is likely over-fitted. The Perform.Permute function performs permutations tests using the area under the ROC curve or the predictive accuracy of the model as a measure of performance. Here, users have the option to select the number of permutations to be performed (perm.num) as well as the proportion of samples to be used for training (propTraining). Plot.Permutation will create a plot of the permutations, showing the AUC of all permutations, highlighting the actual observed AUC in blue, along with showing the empirical p-value. This allows users to evaluate the performance of their created model, as the p-value evaluates the fraction of AUCperm greater or equal to the observed AUC. Finally, to view the predicted class labels of unlabeled samples, use ROCPredSamplesTable, which will return a table showing the probability and the predicted class labels for each new sample.

# Set the biomarker analysis mode to perform ROC Curve Based Model Creation and Evaluation ("test")
mSet<-SetAnalysisMode(mSet, "test")

# Prepare data for biomarker analysis
mSet<-PrepareROCData(mSet)

# Perform calculation of feature importance (AUC, p value, fold change)
mSet<-CalculateFeatureRanking(mSet)

# Manually select a subset of features for ROC analysis to build a classifier 
selected.cmpds <- c("Betaine", "N,N-Dimethylglycine", "Quinolinate", "Glucose")

# Manually select a subset of samples for ROC analysis hold-out data for validation purposes
selected.smpls <- c("PIF_178", "PIF_087", "PIF_090", "PIF_102", "PIF_111", "PIF_112")

# Prepare the custom data for model creation and sample hold-out 
mSet<-SetCustomData(mSet, selected.cmpds, selected.smpls)

# Perform ROC curve analysis, using SVM classification 
mSet<-PerformCV.test(mSet, method = "svm", lvNum = 2)

# Plot the ROC curve for the created model
mSet<-PlotROC(mSet, imgName = "cls_roc_0_", format="png",  dpi=300, mdl.inx = 0, avg.method = "threshold", 0, 0, "fpr", 0.5)

# Plot the predicted class probabilities for each sample using the user-created classifier, not showing labels of wrongly classified samples
mSet<-PlotProbView(mSet, imgName = "cls_prob_0_", format="png",  dpi=300, mdl.inx =-1, show=0, showPred= 0)

# Plot the predictive accuracy of the model with increasing number of features
mSet<-PlotTestAccuracy(mSet, imgName = "cls_accu_0_", format="png",  dpi=300)

# Perform permutations tests using the area under the ROC curve as a measure of performance
mSet<-Perform.Permut(mSet, perf.measure = "auroc", perm.num = 500, propTraining = 2/3)

# Plot the results of the permutation tests
mSet<-Plot.Permutation(mSet, imgName = "roc_perm_1_", format="png",  dpi=300)

# View predicted classes of new samples (only applicable if samples with empty class labels were in the uploaded dataset)
mSet <- ROCPredSamplesTable(mSet) # Create table
To view the example of the results:
# View the the predictive accuracy results of the model
# >R GetAccuracyInfo(mSet)
[1] "The average accuracy based on 100 cross validations is 0.692. The accuracy for hold out data prediction is 0.667(4/6)."

# View table of new predictions
# >R mSet$analSet$ROCtest$pred.samples.table 

3. Sweave Report

To prepare the sweave report, please use the PreparePDFReport function. You must ensure that you have the nexessary Latex libraries to generate the report (i.e. pdflatex, LaTexiT). The object created must be named mSet, and specify the user name in quotation marks.

PreparePDFReport(mSet, "My Name")


xia-lab/MetaboAnalystR3.0 documentation built on May 6, 2020, 11:03 p.m.