README.md

MetEx

MetEx is an R package to extract and annotate metabolites from liquid chromatography–mass spectrometry data.

Introduction

Liquid chromatography–high resolution mass spectrometry (LC-HRMS) is the most popular platform for untargeted metabolomics, but annotating LC-HRMS data remains a challenge because of the limits of metabolome databases and annotation strategies. In this work, a new LC-HRMS database, MetExDB, was developed. It contains retention time (tR), MS1 and MS2 information for 24,674 compounds, with missing information supplied by machine learning predictions. In parallel, an untargeted LC-HRMS data annotation method based on the database, called MetEx, was proposed for targeted extraction and identification of compounds, using information entropy to help recognize real signals. The number of true positive compounds annotated by MetEx is 2.1 to 2.6 times that of software packages that use the traditional peak detection-based annotation method. MetEx achieves a false discovery rate lower than 0.7% using orthogonal information (tR and MS) when mixed standard solutions are used for validation. In addition, MetEx supports user-defined databases to suit more application scenarios and is provided as an open-source R package (https://github.com/zhengfj1994/MetEx).

Figure 1. The workflow of MetEx

Installation

  1. If R is not installed, install it first. >> R download here. Note: We developed MetEx in R 3.6.3 and have tested it in R 4.0.2. If you run into problems with other versions, please contact us. >> The old version of R
  2. We recommend installing RStudio, an integrated development environment (IDE) for R. >> RStudio download here
  3. Install the R package "devtools" and the other required packages, then install MetEx using the code below. >> The devtools package

    install.packages(c("devtools", "BiocManager"))
    BiocManager::install(c("xcms", "KEGGREST"), update = TRUE, ask = FALSE)
    devtools::install_github('zhengfj1994/MetEx')

  It will take a few minutes to download the packages.

  4. If the third step fails, download the project and install it offline as shown in Figures 2-4:

    Figure 2. Download the MetEx-master.zip from GitHub.

    Then, in RStudio, choose Packages, then Install:

Figure 3. Package installation in RStudio

Finally, choose to install from Package Archive File (.zip; .tar.gz), select MetEx-master.zip, and click Install.

Figure 4. Choose the MetEx-master.zip and install.

  5. Load MetEx to check that the installation was successful:

    library(MetEx)

Dependencies

MetEx depends on the following packages. If the installation fails and reports that one of them is missing, please install the missing packages manually: openxlsx, tcltk, doSNOW, stringr, xcms, do, KEGGREST, XML, progress, shinydashboard, shinycssloaders, shinyjs, ggrepel, DT, dplyr, foreach, jsonlite, snow, tidyr, BiocManager, knitr, shiny, ggplot2, RColorBrewer
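To find out which of the packages listed above are missing before retrying the installation, a short check such as the following may help. It is only a sketch: the variable names are made up for illustration, and the commented-out install line assumes that BiocManager itself is already installed.

```r
# Packages named in the dependency list of this README.
metexDeps <- c("openxlsx", "tcltk", "doSNOW", "stringr", "xcms", "do",
               "KEGGREST", "XML", "progress", "shinydashboard",
               "shinycssloaders", "shinyjs", "ggrepel", "DT", "dplyr",
               "foreach", "jsonlite", "snow", "tidyr", "BiocManager",
               "knitr", "shiny", "ggplot2", "RColorBrewer")

# Names of the packages that are not found in the current library paths.
installedPkgs <- rownames(installed.packages())
missingDeps <- setdiff(metexDeps, installedPkgs)
print(missingDeps)

# Uncomment to install what is missing. xcms and KEGGREST come from
# Bioconductor; BiocManager::install() can also install CRAN packages.
# tcltk ships with R itself and cannot be installed this way.
# if (length(missingDeps) > 0) BiocManager::install(missingDeps, ask = FALSE)
```

An empty result (`character(0)`) means every listed package is already available.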

The uniform database format

Supported database

Retention time prediction

  1. Molecular descriptor calculation. Mordred is used to calculate molecular descriptors, although other tools for molecular descriptor calculation can also be used. Mordred provides some examples of how to calculate molecular descriptors, and users can also see the example provided at https://github.com/zhengfj1994/Retention-time-prediction-in-MetEx
  2. Molecular descriptor processing.
  3. Retention time prediction.

How to use the Shiny App?

We provide a Shiny App; a screenshot is shown in Figure 6.

Figure 6. Screenshot of the Shiny App.

  1. Please confirm that you have installed the MetEx package in R.

  2. Open Rstudio.

  3. Enter the following line of code:

shiny::runApp(system.file("extdata/shinyApp", "app.R", package = "MetEx"))

  4. A new visualization window opens.

  5. There are four main tabs on the left. The first tab is the introduction, the second is the annotation workflow for a single file, the third is the annotation workflow for multiple files, and the fourth is the annotation workflow based on a peak detection result. The parameters of the modules are described separately in the next section.

5.2 The second tab, MetEx (Single file): annotation workflow for a single file with MetEx.

5.3 The third tab, MetEx (Multiple file): annotation workflow for multiple files with MetEx.

5.4 The fourth tab, Annotation from peak table: annotation workflow based on a peak detection result.

The main functions and their parameters in MetEx

1. dbImporter: Import a database saved in an xlsx file.

  1. dbFile: the path of the database (xlsx file).

  2. ionMode: the ion mode of the LC-MS run. Only two values are supported: "P" for positive ion mode and "N" for negative ion mode.

  3. CE: the collision energy of the MS/MS spectra. It depends on the experimental MS/MS conditions and on the CE values in the database. The default is "all".

2. retentionTimeCalibration: Use internal standard (IS) retention times to calibrate the retention times of metabolites in the database.

  1. is.tR.file: the xlsx file of IS retention times in the database and in your experiment.

  2. database.df: the imported database data frame.
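This README does not describe how the calibration is computed. The sketch below shows one common way to do it, assuming a piecewise-linear mapping between the IS retention times in the database and in the experiment. It is not MetEx's own code, and the data frames and column names (`isTable`, `tr.in.database`, `tr.in.experiment`, `tr`) are hypothetical.

```r
# Hypothetical internal-standard table: retention times (seconds) recorded
# in the database and measured in the user's own experiment.
isTable <- data.frame(
  tr.in.database   = c(60, 180, 300, 480, 600),
  tr.in.experiment = c(66, 190, 318, 495, 612)
)

# Hypothetical database entries whose retention times need calibration.
database.df <- data.frame(
  name = c("compound A", "compound B", "compound C"),
  tr   = c(120, 300, 540),
  stringsAsFactors = FALSE
)

# Piecewise-linear mapping from database tR to experimental tR.
# rule = 2 reuses the value of the nearest internal standard for
# compounds that elute outside the range covered by the standards.
database.df$tr.calibrated <- approx(
  x    = isTable$tr.in.database,
  y    = isTable$tr.in.experiment,
  xout = database.df$tr,
  rule = 2
)$y

print(database.df)
```

Linear interpolation keeps the calibrated value of each internal standard equal to its measured value; a smoother model could be substituted if the standards are noisy.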

3. targetExtraction.parallel: Targeted extraction of metabolites using m/z and retention time.

  1. msRawData: the untargeted LC-MS raw data in mzXML format, e.g. "D:/github/MetEx/Example Data/mzXML/Urine-30V.mzXML".
  2. dbData: the database imported by the function "dbImporter".
  3. deltaMZ: the m/z window for targeted extraction.
  4. deltaTR: the retention time window for targeted extraction.
  5. trRange: the retention time range used to calculate information entropy. The default is 30 (seconds).
  6. m: a parameter used for peak detection. The default is 200.
  7. cores: the number of CPU cores used for parallel computing.
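The idea behind targeted extraction with an m/z window, a retention time window and information entropy can be illustrated on toy data. The sketch below is not the package's implementation: `extractEIC` and `shannonEntropy` are hypothetical helpers, the data are simulated, and the same ±30 s window is used for extraction and for the entropy calculation because this README does not say how `deltaTR` and `trRange` are combined.

```r
set.seed(1)

# Toy MS1 data: one row per centroid (retention time in s, m/z, intensity).
rt <- seq(0, 120, by = 1)
peak <- data.frame(
  rt = rt,
  mz = 180.0634 + rnorm(length(rt), sd = 0.0005),
  intensity = 1e4 * exp(-(rt - 60)^2 / (2 * 3^2)) + 50   # Gaussian peak on a baseline
)
noise <- data.frame(
  rt = rt,
  mz = 255.2330,
  intensity = runif(length(rt), 40, 60)                  # flat noisy trace
)
ms1 <- rbind(peak, noise)

# Keep the points inside the m/z window and the retention time window.
extractEIC <- function(ms1, targetMZ, targetTR, deltaMZ, deltaTR) {
  keep <- abs(ms1$mz - targetMZ) <= deltaMZ & abs(ms1$rt - targetTR) <= deltaTR
  ms1[keep, c("rt", "intensity")]
}

# Shannon entropy (in bits) of the intensities normalised to sum to 1.
shannonEntropy <- function(intensity) {
  p <- intensity[intensity > 0] / sum(intensity)
  -sum(p * log2(p))
}

eicPeak  <- extractEIC(ms1, targetMZ = 180.0634, targetTR = 60,
                       deltaMZ = 0.01, deltaTR = 30)
eicNoise <- extractEIC(ms1, targetMZ = 255.2330, targetTR = 60,
                       deltaMZ = 0.01, deltaTR = 30)

print(c(peak  = shannonEntropy(eicPeak$intensity),
        noise = shannonEntropy(eicNoise$intensity)))
```

In this toy case the chromatogram containing a peak has a lower entropy than the flat noisy trace, because its intensity is concentrated in a few scans.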

4. extracResFilter: Filter the results of targeted extraction based on information entropy and peak height.

  1. targExtracRes: the result of the function "targetExtraction".
  2. classficationMethod: whether to use the SVM method. Set it to "SVM" to use SVM and to "NoSVM" otherwise. The default is "NoSVM".
  3. entroThre: the information entropy threshold. Only used when classficationMethod is "NoSVM".
  4. intThre: the peak height threshold. Only used when classficationMethod is "NoSVM".
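A threshold filter of the "NoSVM" kind might look like the sketch below. It assumes that a real peak has low entropy and high peak height, which this README does not state explicitly. The column names, the helper `thresholdFilter` and the threshold values are hypothetical.

```r
# Hypothetical extraction result: one row per database compound.
targExtracRes <- data.frame(
  name       = c("A", "B", "C", "D"),
  entropy    = c(3.1, 5.8, 2.4, 4.9),
  peakHeight = c(12000, 900, 150, 30000),
  stringsAsFactors = FALSE
)

# Keep rows whose entropy is at most entroThre and whose peak height is
# at least intThre (assumed direction of both comparisons).
thresholdFilter <- function(res, entroThre, intThre) {
  res[res$entropy <= entroThre & res$peakHeight >= intThre, , drop = FALSE]
}

filtered <- thresholdFilter(targExtracRes, entroThre = 4, intThre = 500)
print(filtered)
```

Only compound A passes here: B and D exceed the entropy threshold, and C falls below the peak height threshold.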

5. importMgf: Import the mgf file.

  1. mgfFile: the path of the mgf file.

6. batchMS2Score.parallel: MS/MS similarity calculation.

  1. ms1Info: the result of extracResFilter.
  2. ms1DeltaMZ: the m/z tolerance between MS1 and MS2.
  3. ms2DeltaMZ: the m/z tolerance between MS2 in database and experiment.
  4. deltaTR: the retention time tolerance between MS1 and MS2 (seconds).
  5. mgfMatrix: the mgf matrix generated by the function "importMgf" (mgfList$mgfMatrix).
  6. mgfData: the mgf R data generated by the function "importMgf" (mgfList$mgfData).
  7. MS2.sn.threshold: the MS2 S/N threshold. The default is 3.
  8. MS2.noise.intensity: the MS2 noise intensity, either "minimum" or a number.
  9. MS2.missing.value.padding: the padding method for missing MS2 values. Two options are available, "half" and "minimal.value". "half" follows MS-DIAL, while "minimal.value" is closer to the actual situation. We currently recommend "minimal.value".
  10. ms2Mode: the MS2 acquisition mode, IDA or DIA. The default is "ida"; the "dia" option is still under development.
  11. scoreMode: "obverse" means dot product, "reverse" means reverse dot product, "average" means the mean of dot product and reverse dot product.
  12. diaMethod: when ms2Mode is "dia", provide a txt file describing the DIA method. This function is still under development, so the default of diaMethod is "NA".
  13. cores: the number of CPU cores used for parallel computing.
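The three `scoreMode` options can be illustrated with a simplified cosine-based score. This is a sketch under stated assumptions, not the scoring code of MetEx: the helpers `alignSpectra`, `cosine` and `ms2Score` are hypothetical, peaks are matched greedily within `ms2DeltaMZ`, no intensity or m/z weighting is applied, and the reverse score is taken to use only the fragments present in the database spectrum.

```r
# Align two centroided spectra (two-column matrices: m/z, intensity) within
# a tolerance; unmatched peaks are paired with an intensity of 0.
alignSpectra <- function(expSpec, libSpec, ms2DeltaMZ) {
  expInt <- numeric(0)
  libInt <- numeric(0)
  usedExp <- rep(FALSE, nrow(expSpec))
  for (i in seq_len(nrow(libSpec))) {
    d <- abs(expSpec[, 1] - libSpec[i, 1])
    d[usedExp] <- Inf                      # each experimental peak is used once
    j <- which.min(d)
    if (length(j) == 1 && d[j] <= ms2DeltaMZ) {
      usedExp[j] <- TRUE
      expInt <- c(expInt, expSpec[j, 2])
    } else {
      expInt <- c(expInt, 0)
    }
    libInt <- c(libInt, libSpec[i, 2])
  }
  # Experimental peaks without a partner in the database spectrum.
  expInt <- c(expInt, expSpec[!usedExp, 2])
  libInt <- c(libInt, rep(0, sum(!usedExp)))
  list(exp = unname(expInt), lib = unname(libInt))
}

# Normalised dot product (cosine similarity) of two intensity vectors.
cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) return(0)
  sum(a * b) / denom
}

ms2Score <- function(expSpec, libSpec, ms2DeltaMZ, scoreMode = "average") {
  al <- alignSpectra(expSpec, libSpec, ms2DeltaMZ)
  obverse <- cosine(al$exp, al$lib)
  inLib <- al$lib > 0                      # positions with a database fragment
  reverse <- cosine(al$exp[inLib], al$lib[inLib])
  switch(scoreMode,
         obverse = obverse,
         reverse = reverse,
         average = (obverse + reverse) / 2,
         stop("unknown scoreMode"))
}

# Toy spectra: the experimental spectrum carries one extra fragment.
libSpec <- cbind(mz = c(85.03, 127.04, 145.05), intensity = c(30, 100, 60))
expSpec <- cbind(mz = c(85.031, 127.041, 145.049, 163.06),
                 intensity = c(28, 100, 55, 80))

scores <- sapply(c("obverse", "reverse", "average"),
                 function(mode) ms2Score(expSpec, libSpec, 0.01, mode))
print(scores)
```

In this example the reverse score is higher than the obverse score because the extra experimental fragment, which might come from a co-eluting compound, is ignored.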

7. identifiedResFilter: Filter the identification results and generate a new xlsx file to save them.

  1. csvFile: the path of the csv file to which the result of batchMS2Score is written.
  2. resFile: an xlsx file to save the identification result.
  3. MS2score: the MS2 score threshold (0-1).

8. MetExAnnotation: Integrates the above functions, so that one line of code can complete the targeted extraction and annotation of metabolites.

  1. dbFile: the path of the database (xlsx file). e.g. "D:/github/MetEx/Example Data/Database/MSMLS database.xlsx".
  2. ionMode: the ion mode of the LC-MS run. Only two values are supported: "P" for positive ion mode and "N" for negative ion mode.
  3. CE: the collision energy of the MS/MS spectra. It depends on the experimental MS/MS conditions and on the CE values in the database. The default is "all".
  4. tRCalibration: whether to calibrate retention time (T) or not (F). The default is F.
  5. is.tR.file: the xlsx file of IS retention times in the database and in your experiment. If tRCalibration is F, this parameter should be set to "NA".
  6. msRawData: the untargeted LC-MS raw data in mzXML format.
  7. MS1deltaMZ: the m/z window in targeted extraction.
  8. MS1deltaTR: the retention time window in targeted extraction.
  9. entroThre: the information entropy threshold. Only used when classficationMethod is "NoSVM".
  10. intThre: the peak height threshold. Only used when classficationMethod is "NoSVM".
  11. classficationMethod: whether to use the SVM method. Set it to "SVM" to use SVM and to "NoSVM" otherwise. The default is "NoSVM".
  12. mgfFile: the path of the mgf file.
  13. MS2.sn.threshold: the MS2 S/N threshold. The default is 3.
  14. MS2.noise.intensity: the MS2 noise intensity, either "minimum" or a number.
  15. MS2.missing.value.padding: the padding method for missing MS2 values. Two options are available, "half" and "minimal.value". "half" follows MS-DIAL, while "minimal.value" is closer to the actual situation. We currently recommend "minimal.value".
  16. MS1MS2DeltaMZ: the m/z tolerance between MS1 and MS2.
  17. MS2DeltaMZ: the m/z tolerance between MS2 in database and experiment.
  18. MS1MS2DeltaTR: the retention time tolerance between MS1 and MS2 (seconds).
  19. scoreMode: "obverse" means dot product, "reverse" means reverse dot product, "average" means the mean of dot product and reverse dot product.
  20. csvFile: the path of the csv file to which the result of batchMS2Score is written.
  21. xlsxFile: an xlsx file to save the identification result.
  22. MS2scoreFilter: the MS2 score threshold (0-1).
  23. parallel.Computing: whether to use parallel computing, T or F.
  24. cores: the number of CPU cores used for parallel computing.
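A call to MetExAnnotation with the 24 parameters listed above might be assembled as follows. This is only a sketch: the parameter names are taken from this list, but every numeric value, the mgf path and the output paths are placeholders chosen for illustration, not recommended settings. The call is guarded so that nothing runs unless MetEx and the input files are actually available.

```r
# Arguments named after the parameter list above. All values are
# illustrative placeholders and should be adapted to your own data.
metexArgs <- list(
  dbFile = "D:/github/MetEx/Example Data/Database/MSMLS database.xlsx",
  ionMode = "P",
  CE = "all",
  tRCalibration = FALSE,
  is.tR.file = "NA",
  msRawData = "D:/github/MetEx/Example Data/mzXML/Urine-30V.mzXML",
  MS1deltaMZ = 0.01,                 # placeholder
  MS1deltaTR = 120,                  # placeholder
  entroThre = 2,                     # placeholder
  intThre = 270,                     # placeholder
  classficationMethod = "NoSVM",
  mgfFile = "path/to/your.mgf",      # hypothetical path
  MS2.sn.threshold = 3,
  MS2.noise.intensity = "minimum",
  MS2.missing.value.padding = "minimal.value",
  MS1MS2DeltaMZ = 0.01,              # placeholder
  MS2DeltaMZ = 0.02,                 # placeholder
  MS1MS2DeltaTR = 12,                # placeholder
  scoreMode = "average",
  csvFile = "path/to/result.csv",    # hypothetical path
  xlsxFile = "path/to/result.xlsx",  # hypothetical path
  MS2scoreFilter = 0.6,              # placeholder
  parallel.Computing = TRUE,
  cores = 2
)

# Run only when the package and the input files exist.
inputsExist <- all(file.exists(metexArgs$dbFile, metexArgs$msRawData,
                               metexArgs$mgfFile))
if (requireNamespace("MetEx", quietly = TRUE) && inputsExist) {
  annotationRes <- do.call(MetEx::MetExAnnotation, metexArgs)
} else {
  message("MetEx or the input files are not available; nothing was run.")
}
```

Keeping the arguments in a named list makes it easy to save the settings used for a run alongside its results.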

Examples

MetEx provides two approaches to annotate metabolites. The first is a peak-detection-independent method and the second is a peak-detection-dependent method. The first approach is newly developed and can avoid the peak loss of conventional peak detection methods.

Maintainers

Fujian Zheng zhengfj@dicp.ac.cn or 2472700387@qq.com

Change Log

v1.0

The first version

Development Plan



zhengfj1994/MeTEA documentation built on June 29, 2021, 5:21 a.m.