In jaspershen/metIdentify: Metabolite identification based on MS1 and MS2 spectra

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction & Installation

metIdentify is a R package which can be used for database construction, and metabolite identification based on in-house and public database.

Please install it via github.

if(!require(devtools)){
  install.packages("devtools")
}
devtools::install_github("jaspershen/metIdentify")
##tinyTools is a dependent package for metIdentify
devtools::install_github("jaspershen/tinyTools")

Database construction

If you have in-house standards which have been acquired with MS2 spectra data, then you can construct the in-house MS2 spectra databases using metIdentify package.

Data preparation

Firstly, please transform your raw standard MS data (positive and negative modes) to mzXML format using ProteoWizard. The parameter setting is shown like figure below:

Data organization

Secondly, please organize your standard information as a table, and output as a csv or xlsx format. The format of stanford information can refer to our demo data in demoData package.

From column 1 to 11, the columns are "Lab.ID", "mz", "RT", "CAS.ID", "HMDB.ID", "KEGG.ID", "Formula", "mz.pos", "mz.neg", "Submitter", respectively. It is OK if you have another information for the standards. Like the demo data shows, there are other additional information, namely "Family", "Sub.pathway" and "Note".

mz: Accurate mass of compound.
RT: Retention time, unit is second.
mz.pos: Mass to change ratio of compound in positive mode, for example, M+H.
mz.neg: Mass to change ratio of compound in negative mode, for example, M-H.
Submitter: The name of person or organization.

Then creat a folder and put you mzXML format datasets (positive mode in 'POS' folder and negative mode in 'NEG' folder) and compound information in it.

Run `databaseConstruction` function

We use the demo data in demoData package to show how to use metIdentify. Please install it first.

devtools::install_github("jaspershen/demoData")

library(demoData)
library(metIdentify)
path <- system.file("database_construction", package = "demoData")
file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE)
new.path <- file.path("./database_construction")

test.database <- databaseConstruction(path = new.path, 
                                      version = "0.0.1",
                                      metabolite.info.name = "metabolite.info_RPLC.csv", 
                                      source = "Michael Snyder lab", 
                                      link = "http://snyderlab.stanford.edu/",
                                      creater = "Xiaotao Shen",
                                      email = "shenxt1990@163.com",
                                      rt = TRUE,
                                      mz.tol = 15,
                                      rt.tol = 30,
                                      threads = 3)

The arguments of databaseConstruction can be found using ?databaseConstruction.

test.database is a databaseClass object, you can print it to see its information.

test.database

Note: test.database is only a demo database (metIdentifyClass). We will don't use it for next metabolite identification.

Retention time correction

The metabolite retention time (RT) may shift in different batches. So if you spike internal standards into your standards and biological samples, you can correct the RTs in database using rtCor4database function.

Data preparation

Firstly, please prepare two internal standard (IS) tables for database and biological samples. The format of IS table is shown like figure below:

The IS table for database should be named as "database.is.table.xlsx" and the IS table for experiment should be named as "experiment.is.table.xlsx".

Run `rtCor4database` function

test.database2 <- rtCor4database(experiment.is.table = "experiment.is.table.xlsx", 
                                 database.is.table = "database.is.table.xlsx", 
                                 database = test.database, 
                                 path = new.path)

The database should be the database (databaseClass object) which you want to correct RTs.

Note: test.database2 is only a demo database (metIdentifyClass). We will don't use it for next metabolite identification.

Metabolite identification

Identify metabolites based on MS2 library

MS1 data preparation

The peak table must contain "name" (peak name), "mz" (mass to charge ratio) and "rt" (retention time, unit is second). It can be from any data processing software (XCMS, MS-DIAL and so on).

MS2 data preparation

The raw MS2 data from DDA or DIA should be transfered to msp, mgf or mzXML format files. You can use ProteoWizard software.

Data organization

Place the MS1 peak table, MS2 data and database which you want to use in one folder like below figure shows:

Run `metIdentify` function

We can use the demo data from demoData package to show how to use it.

library(metIdentify)
path <- system.file("ms2_identification_demo_data1", package = "demoData")
file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE)
new.path <- file.path("./ms2_identification_demo_data1")

data("msDatabase_rplc0.0.1", package = "metIdentify")
save(msDatabase_rplc0.0.1, file = file.path(new.path, "msDatabase_rplc0.0.1"))
ms2.data <- grep(pattern = "mgf", dir(new.path), value = TRUE)

result <- metIdentify(ms1.data = "ms1.peak.table.csv", ##csv format
           ms2.data = ms2.data,##only msp and mgf and mz(X)ML are supported
           ms1.ms2.match.mz.tol = 25,
           ms1.ms2.match.rt.tol = 10,
           ms1.match.ppm = 25,
           ms2.match.tol = 0.4,
           fraction.weight = 0.3,
           dp.forward.weight = 0.6,
           dp.reverse.weight = 0.1,
           rt.match.tol = 60,
           polarity = "positive", 
           ce = "all",
           column = "rp",
           ms1.match.weight = 0.25,
           rt.match.weight = 0.25,
           ms2.match.weight = 0.5,
           path = new.path,
           total.score.tol = 0,
           candidate.num = 3,
           database = "msDatabase_rplc0.0.1",
           threads = 3)

The argument of metIdentify can be found using ?metIdentify.

ms1.data: The MS1 peak table name. It must be the "csv" format.
ms2.data: The MS2 data name. It can be msp, mgf or mzXML format.
ms1.ms2.match.mz.tol: The m/z tolerance for MS1 peak and MS2 spectrum match. Default is 25 ppm.
ms1.ms2.match.rt.tol: The RT tolerance for MS1 peak and MS2 spectrum match. Default is 10 s.
ms1.match.ppm: The m/z tolerance for peak and database metabolite match.
ms2.match.tol: The MS2 similarity tolerance for peak and database metabolite match. The MS2 similarity refers to the algorithm from MS-DIAl. So if you want to know more information about it, please read this publication.

$$MS2\;Simlarity\;Score\;(SS) = Fragment\;fractionWeight_{fraction} + Dot\;product(forward) * Weight_{dp.reverse}+Dot\;product(reverse)Weight_{dp.reverse}$$

fraction.weight: The weight for fragment match fraction.

$$Fragment\;match\;fraction = \dfrac{Match\;fragement\;number}{All\;fragment\;number}$$

dp.forward.weight: The weight for dot product (forward)
dp.forward.weight: The weight for dot product (forward)

$$Dot\;product = \dfrac{\sum(wA_{act.}wA_{lib})^2}{\sum(wA_{act.})^2\sum(wA_{lib})^2}with\;w =1/(1+\dfrac{A}{\sum(A-0.5)})$$

result is metIdentifyClass object. You can print it out to see the identificaiton information.

result

In-house database The in-house database in our lab were provided, with RPLC and HILIC mode RT information. They were acquired using Thermo Fisher QE-plus. However, the LC system may be different with your experiments, so if you want to use our in-house database for metabolite identification, please set rt.match.tol as 100000000(no limitation). The in-house database can be downloaded in my github.

You can get processsing parameters from it.

parameters <- getParams(object = result)
tibble::as.tibble(parameters)

You can also get the identification table from it.

identification.table1 <- getIdentificationTable(object = result)

Filter

You can use filterIden function to filer identification results from the metIdentifyClass object according to m/z error, rt error, MS2 spectra similarity and total score.

result.new <- filterIden(object = result, ms1.match.ppm = 10)

MS2 spectra match plot output

You can also use ms2plot function to output the MS2 specra match plot for one, multiple or all peaks.

Output one MS2 spectra match plot.

##which peaks have identification
peak.name <- whichHasIden(object = result)
head(peak.name)
ms2.plot1 <- ms2plot(object = result, database = msDatabase_rplc0.0.1, 
                     which.peak = peak.name[1,1])
ms2.plot1

You can also output interactive MS2 spectra match plot by setting interaction.plot as TRUE.

##which peaks have identification
ms2.plot2 <- ms2plot(object = result, database = msDatabase_rplc0.0.1, 
                     which.peak = peak.name[1,1], interaction.plot = TRUE)
ms2.plot2

You can output all MS2 spectra match plots by setting which.peak as "all".

ms2plot(object = result, database = msDatabase_rplc0.0.1, 
                     which.peak = "all", path = file.path(new.path, "inhouse"))

Then all the MS2 spectra match plots will be output in the "inhouse" folder.

Result integration and output

You can also use other public database for metabolite identificaiton based on MS2 spectra. We provided four public database, which can be got from the demoData package.

MassBank database (massbankDatabase0.0.1)
MoNA database (monaDatabase0.0.1)
HMDB database (hmdbDatabase0.0.1)
Orbitrap database from MassBank (orbitrapDatabase0.0.1)

Here, we will use orbitrap database and HMDB database for metabolite identification. The metIdentify4all function can identify metabolites with several databases.

Prepare demo data

library(demoData)
library(metIdentify)
path <- system.file("ms2_database", package = "demoData")
file.copy(from = file.path(path, c("orbitrapDatabase0.0.1", "hmdbDatabase0.0.1")), 
          to = new.path, overwrite = TRUE, recursive = TRUE)
load(file.path(new.path, "orbitrapDatabase0.0.1"))
load(file.path(new.path, "hmdbDatabase0.0.1"))

Prepare parameter list for each database

We can use metIdentifyParam and mzIdentifyParam to get parameter list for different databases.

parameter.list1 <- metIdentifyParam(polarity = "positive", 
                                    candidate.num = 3, 
                                    column = "rp", ce = "all", 
                                    database = "orbitrapDatabase0.0.1", 
                                    threads = 3)

parameter.list2 <- metIdentifyParam(polarity = "positive", 
                                    candidate.num = 3, 
                                    column = "rp", ce = "all", 
                                    database = "hmdbDatabase0.0.1", 
                                    threads = 3)

Run `metIdentify4all` function

result2 <- metIdentify4all(ms1.data = "ms1.peak.table.csv", ##csv format  
                           ms2.data = ms2.data, parameter.list = c(parameter.list1, parameter.list2), 
                           path = new.path)

It will return a list of objects of identification results.

result2[[1]]
result2[[2]]

We can also identify metabolites only m/z match, which is level 3 identification according to MSI. We provid the HMDB metabolite database in the demoData package.

library(demoData)
library(metIdentify)
path <- system.file("hmdb_metabolite_database", package = "demoData")
file.copy(from = file.path(path, "HMDB.metabolite.data"), 
          to = new.path, overwrite = TRUE, recursive = TRUE)

result3 <- mzIdentify(ms1.data = "ms1.peak.table.csv", ##csv format 
           ms1.match.ppm = 25,
           polarity = "positive", 
           column = "rp",
           path = new.path,
           candidate.num = 3,
           database = "HMDB.metabolite.data",
           threads = 3)

For result1 using in-hosue database (m/z, RT and MS2 spectra), they are level 1 identifications. For result2 using public MS2 spectra database (m/z and MS2 spectra), they are level 2 identifications. For result3 using HMDB metabolite database (only m/z), they are level 3 identifications according to MSI. So you can integrate those three results together.

identification.table1 <- getIdentificationTable(object = result, candidate.num = 1, type = "new")
identification.table1 <- identification.table1[!is.na(identification.table1$Compound.name),,drop = FALSE]

identification.table2 <- getIdentificationTable(object = result2, candidate.num = 1, type = "new")
identification.table2 <- identification.table2[!is.na(identification.table2$Compound.name),,drop = FALSE]
identification.table2 <- identification.table2[!(identification.table2$name %in% identification.table1$name),,drop = FALSE]

identification.table3 <- getIdentificationTable2(object = result3, candidate.num = 1, type = "new")
identification.table3 <- identification.table3[!is.na(identification.table3$Compound.name),,drop = FALSE]
identification.table3 <- identification.table3[!(identification.table3$name %in% c(identification.table1$name,identification.table1$name)),,drop = FALSE]

identification.table1 <- data.frame(identification.table1, Level = 1, stringsAsFactors = FALSE)
identification.table2 <- data.frame(identification.table2, Level = 2, stringsAsFactors = FALSE)
identification.table3 <- data.frame(identification.table3[-11], 
                                    "mz.match.score" = NA,
                                    "RT.error" = NA,
                                    "RT.match.score" = NA, 
                                    "CE" = NA,  
                                    "SS" = NA,
                                    "Total.score" = NA,
                                    identification.table3[11],
                                    Level = 3, stringsAsFactors = FALSE)

identification.table <- rbind(identification.table1, 
                              identification.table2, 
                              identification.table3)

write.csv(identification.table, file = file.path(new.path, "identification.table.csv"), row.names = FALSE)

Note: Of course, you can also use metIdentify4all function to run all the databases(in-house, publica database and HMDB metabolite database) in one run like we have shown above. What should be noticed is that the parameter list for HMDB.metabolite.data must use mzIdentifyParam function to construct.

jaspershen/metIdentify documentation built on July 26, 2019, 8:54 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jaspershen/metIdentify
Metabolite identification based on MS1 and MS2 spectra

In jaspershen/metIdentify: Metabolite identification based on MS1 and MS2 spectra

Introduction & Installation

Database construction

Data preparation

Data organization

Run `databaseConstruction` function

Retention time correction

Data preparation

Run `rtCor4database` function

Metabolite identification

Identify metabolites based on MS2 library

MS1 data preparation

MS2 data preparation

Data organization

Run `metIdentify` function

Filter

MS2 spectra match plot output

Result integration and output

Prepare demo data

Prepare parameter list for each database

Run `metIdentify4all` function

R Package Documentation

Browse R Packages

We want your feedback!

jaspershen/metIdentify Metabolite identification based on MS1 and MS2 spectra

In jaspershen/metIdentify: Metabolite identification based on MS1 and MS2 spectra

Introduction & Installation

Database construction

Data preparation

Data organization

Run databaseConstruction function

Retention time correction

Data preparation

Run rtCor4database function

Metabolite identification

Identify metabolites based on MS2 library

MS1 data preparation

MS2 data preparation

Data organization

Run metIdentify function

Filter

MS2 spectra match plot output

Result integration and output

Prepare demo data

Prepare parameter list for each database

Run metIdentify4all function

R Package Documentation

Browse R Packages

We want your feedback!

jaspershen/metIdentify
Metabolite identification based on MS1 and MS2 spectra

Run `databaseConstruction` function

Run `rtCor4database` function

Run `metIdentify` function

Run `metIdentify4all` function