knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
metIdentify is a R package which can be used for database construction, and metabolite identification based on in-house and public database.
Please install it via github.
if(!require(devtools)){ install.packages("devtools") } devtools::install_github("jaspershen/metIdentify") ##tinyTools is a dependent package for metIdentify devtools::install_github("jaspershen/tinyTools")
If you have in-house standards which have been acquired with MS2 spectra data, then you can construct the in-house MS2 spectra databases using metIdentify
package.
Firstly, please transform your raw standard MS data (positive and negative modes) to mzXML format using ProteoWizard. The parameter setting is shown like figure below:
Secondly, please organize your standard information as a table, and output as a csv or xlsx format. The format of stanford information can refer to our demo data in demoData
package.
From column 1 to 11, the columns are "Lab.ID", "mz", "RT", "CAS.ID", "HMDB.ID", "KEGG.ID", "Formula", "mz.pos", "mz.neg", "Submitter", respectively. It is OK if you have another information for the standards. Like the demo data shows, there are other additional information, namely "Family", "Sub.pathway" and "Note".
mz: Accurate mass of compound.
RT: Retention time, unit is second.
mz.pos: Mass to change ratio of compound in positive mode, for example, M+H.
mz.neg: Mass to change ratio of compound in negative mode, for example, M-H.
Submitter: The name of person or organization.
Then creat a folder and put you mzXML format datasets (positive mode in 'POS' folder and negative mode in 'NEG' folder) and compound information in it.
databaseConstruction
functionWe use the demo data in demoData
package to show how to use metIdentify
. Please install it first.
devtools::install_github("jaspershen/demoData")
library(demoData) library(metIdentify) path <- system.file("database_construction", package = "demoData") file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE) new.path <- file.path("./database_construction") test.database <- databaseConstruction(path = new.path, version = "0.0.1", metabolite.info.name = "metabolite.info_RPLC.csv", source = "Michael Snyder lab", link = "http://snyderlab.stanford.edu/", creater = "Xiaotao Shen", email = "shenxt1990@163.com", rt = TRUE, mz.tol = 15, rt.tol = 30, threads = 3)
The arguments of databaseConstruction
can be found using ?databaseConstruction
.
test.database is a databaseClass
object, you can print it to see its information.
test.database
Note:
test.database
is only a demo database (metIdentifyClass). We will don't use it for next metabolite identification.
The metabolite retention time (RT) may shift in different batches. So if you spike internal standards into your standards and biological samples, you can correct the RTs in database using rtCor4database
function.
Firstly, please prepare two internal standard (IS) tables for database and biological samples. The format of IS table is shown like figure below:
The IS table for database should be named as "database.is.table.xlsx" and the IS table for experiment should be named as "experiment.is.table.xlsx".
rtCor4database
functiontest.database2 <- rtCor4database(experiment.is.table = "experiment.is.table.xlsx", database.is.table = "database.is.table.xlsx", database = test.database, path = new.path)
The database
should be the database (databaseClass object) which you want to correct RTs.
Note:
test.database2
is only a demo database (metIdentifyClass). We will don't use it for next metabolite identification.
The peak table must contain "name" (peak name), "mz" (mass to charge ratio) and "rt" (retention time, unit is second). It can be from any data processing software (XCMS, MS-DIAL and so on).
The raw MS2 data from DDA or DIA should be transfered to msp, mgf or mzXML format files. You can use ProteoWizard software.
Place the MS1 peak table, MS2 data and database which you want to use in one folder like below figure shows:
metIdentify
functionWe can use the demo data from demoData
package to show how to use it.
library(metIdentify) path <- system.file("ms2_identification_demo_data1", package = "demoData") file.copy(from = path, to = ".", overwrite = TRUE, recursive = TRUE) new.path <- file.path("./ms2_identification_demo_data1") data("msDatabase_rplc0.0.1", package = "metIdentify") save(msDatabase_rplc0.0.1, file = file.path(new.path, "msDatabase_rplc0.0.1")) ms2.data <- grep(pattern = "mgf", dir(new.path), value = TRUE) result <- metIdentify(ms1.data = "ms1.peak.table.csv", ##csv format ms2.data = ms2.data,##only msp and mgf and mz(X)ML are supported ms1.ms2.match.mz.tol = 25, ms1.ms2.match.rt.tol = 10, ms1.match.ppm = 25, ms2.match.tol = 0.4, fraction.weight = 0.3, dp.forward.weight = 0.6, dp.reverse.weight = 0.1, rt.match.tol = 60, polarity = "positive", ce = "all", column = "rp", ms1.match.weight = 0.25, rt.match.weight = 0.25, ms2.match.weight = 0.5, path = new.path, total.score.tol = 0, candidate.num = 3, database = "msDatabase_rplc0.0.1", threads = 3)
The argument of metIdentify
can be found using ?metIdentify
.
ms1.data: The MS1 peak table name. It must be the "csv" format.
ms2.data: The MS2 data name. It can be msp, mgf or mzXML format.
ms1.ms2.match.mz.tol: The m/z tolerance for MS1 peak and MS2 spectrum match. Default is 25 ppm.
ms1.ms2.match.rt.tol: The RT tolerance for MS1 peak and MS2 spectrum match. Default is 10 s.
ms1.match.ppm: The m/z tolerance for peak and database metabolite match.
ms2.match.tol: The MS2 similarity tolerance for peak and database metabolite match. The MS2 similarity refers to the algorithm from MS-DIAl. So if you want to know more information about it, please read this publication.
$$MS2\;Simlarity\;Score\;(SS) = Fragment\;fractionWeight_{fraction} + Dot\;product(forward) * Weight_{dp.reverse}+Dot\;product(reverse)Weight_{dp.reverse}$$
$$Fragment\;match\;fraction = \dfrac{Match\;fragement\;number}{All\;fragment\;number}$$
dp.forward.weight: The weight for dot product (forward)
dp.forward.weight: The weight for dot product (forward)
$$Dot\;product = \dfrac{\sum(wA_{act.}wA_{lib})^2}{\sum(wA_{act.})^2\sum(wA_{lib})^2}with\;w =1/(1+\dfrac{A}{\sum(A-0.5)})$$
result
is metIdentifyClass object. You can print it out to see the identificaiton information.
result
In-house database The in-house database in our lab were provided, with RPLC and HILIC mode RT information. They were acquired using Thermo Fisher QE-plus. However, the LC system may be different with your experiments, so if you want to use our in-house database for metabolite identification, please set
rt.match.tol
as 100000000(no limitation). The in-house database can be downloaded in my github.
You can get processsing parameters from it.
parameters <- getParams(object = result) tibble::as.tibble(parameters)
You can also get the identification table from it.
identification.table1 <- getIdentificationTable(object = result)
You can use filterIden
function to filer identification results from the metIdentifyClass
object according to m/z error, rt error, MS2 spectra similarity and total score.
result.new <- filterIden(object = result, ms1.match.ppm = 10)
You can also use ms2plot
function to output the MS2 specra match plot for one, multiple or all peaks.
Output one MS2 spectra match plot.
##which peaks have identification peak.name <- whichHasIden(object = result) head(peak.name) ms2.plot1 <- ms2plot(object = result, database = msDatabase_rplc0.0.1, which.peak = peak.name[1,1]) ms2.plot1
You can also output interactive MS2 spectra match plot by setting
interaction.plot
as TRUE.
##which peaks have identification ms2.plot2 <- ms2plot(object = result, database = msDatabase_rplc0.0.1, which.peak = peak.name[1,1], interaction.plot = TRUE) ms2.plot2
You can output all MS2 spectra match plots by setting
which.peak
as "all".
ms2plot(object = result, database = msDatabase_rplc0.0.1, which.peak = "all", path = file.path(new.path, "inhouse"))
Then all the MS2 spectra match plots will be output in the "inhouse" folder.
You can also use other public database for metabolite identificaiton based on MS2 spectra. We provided four public database, which can be got from the demoData
package.
MassBank database (massbankDatabase0.0.1)
MoNA database (monaDatabase0.0.1)
HMDB database (hmdbDatabase0.0.1)
Orbitrap database from MassBank (orbitrapDatabase0.0.1)
Here, we will use orbitrap database and HMDB database for metabolite identification. The metIdentify4all
function can identify metabolites with several databases.
library(demoData) library(metIdentify) path <- system.file("ms2_database", package = "demoData") file.copy(from = file.path(path, c("orbitrapDatabase0.0.1", "hmdbDatabase0.0.1")), to = new.path, overwrite = TRUE, recursive = TRUE) load(file.path(new.path, "orbitrapDatabase0.0.1")) load(file.path(new.path, "hmdbDatabase0.0.1"))
We can use metIdentifyParam
and mzIdentifyParam
to get parameter list for different databases.
parameter.list1 <- metIdentifyParam(polarity = "positive", candidate.num = 3, column = "rp", ce = "all", database = "orbitrapDatabase0.0.1", threads = 3) parameter.list2 <- metIdentifyParam(polarity = "positive", candidate.num = 3, column = "rp", ce = "all", database = "hmdbDatabase0.0.1", threads = 3)
metIdentify4all
functionresult2 <- metIdentify4all(ms1.data = "ms1.peak.table.csv", ##csv format ms2.data = ms2.data, parameter.list = c(parameter.list1, parameter.list2), path = new.path)
It will return a list of objects of identification results.
result2[[1]] result2[[2]]
We can also identify metabolites only m/z match, which is level 3 identification according to MSI. We provid the HMDB metabolite database in the demoData
package.
library(demoData) library(metIdentify) path <- system.file("hmdb_metabolite_database", package = "demoData") file.copy(from = file.path(path, "HMDB.metabolite.data"), to = new.path, overwrite = TRUE, recursive = TRUE) result3 <- mzIdentify(ms1.data = "ms1.peak.table.csv", ##csv format ms1.match.ppm = 25, polarity = "positive", column = "rp", path = new.path, candidate.num = 3, database = "HMDB.metabolite.data", threads = 3)
For result1 using in-hosue database (m/z, RT and MS2 spectra), they are level 1 identifications. For result2 using public MS2 spectra database (m/z and MS2 spectra), they are level 2 identifications. For result3 using HMDB metabolite database (only m/z), they are level 3 identifications according to MSI. So you can integrate those three results together.
identification.table1 <- getIdentificationTable(object = result, candidate.num = 1, type = "new") identification.table1 <- identification.table1[!is.na(identification.table1$Compound.name),,drop = FALSE] identification.table2 <- getIdentificationTable(object = result2, candidate.num = 1, type = "new") identification.table2 <- identification.table2[!is.na(identification.table2$Compound.name),,drop = FALSE] identification.table2 <- identification.table2[!(identification.table2$name %in% identification.table1$name),,drop = FALSE] identification.table3 <- getIdentificationTable2(object = result3, candidate.num = 1, type = "new") identification.table3 <- identification.table3[!is.na(identification.table3$Compound.name),,drop = FALSE] identification.table3 <- identification.table3[!(identification.table3$name %in% c(identification.table1$name,identification.table1$name)),,drop = FALSE] identification.table1 <- data.frame(identification.table1, Level = 1, stringsAsFactors = FALSE) identification.table2 <- data.frame(identification.table2, Level = 2, stringsAsFactors = FALSE) identification.table3 <- data.frame(identification.table3[-11], "mz.match.score" = NA, "RT.error" = NA, "RT.match.score" = NA, "CE" = NA, "SS" = NA, "Total.score" = NA, identification.table3[11], Level = 3, stringsAsFactors = FALSE) identification.table <- rbind(identification.table1, identification.table2, identification.table3) write.csv(identification.table, file = file.path(new.path, "identification.table.csv"), row.names = FALSE)
Note: Of course, you can also use
metIdentify4all
function to run all the databases(in-house, publica database and HMDB metabolite database) in one run like we have shown above. What should be noticed is that the parameter list for HMDB.metabolite.data must usemzIdentifyParam
function to construct.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.