metID.rtPred: Quantitative Structure-Retention Relationship modelling...


Description

Quantitative Structure-Retention Relationship (QSRR) modelling using molecular descriptors and a randomForest model.

Usage

metID.rtPred(object, ...)

Arguments

object

A "compMS2" class object.

...

additional arguments to nearZeroVar.

standardsTable

data.frame of standard compounds. The standard compounds should have been acquired using the same chromatographic method as the metabolomic dataset. If this argument is supplied, this table is used to calculate the randomForest retention time prediction model instead of the possible_identity annotations from the "metID comments" table. The table must contain at minimum the following 3 column names, otherwise an error is returned (case is ignored, e.g. both the column names SMILES and smiles are acceptable):

  1. compound "character" type of compound names.

  2. smiles "character" type of SMILES codes.

  3. RT "numeric" type of retention time values (in seconds)

N.B. The data.frame may also contain additional columns
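A minimal sketch of how such a table could be assembled and checked before it is passed as standardsTable. The retention times are made-up illustration values, and the validation shown here is an assumption about what a case-insensitive column check might look like, not the function's own code.

```r
# Hypothetical standards table: the SMILES codes are real, but the
# retention times are invented for illustration only.
standardsTable <- data.frame(
  compound = c("caffeine", "glucose", "L-alanine"),
  SMILES   = c("CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
               "OCC1OC(O)C(O)C(O)C1O",
               "CC(N)C(=O)O"),
  # RT must be in seconds, so retention times recorded in minutes
  # are converted here.
  RT       = c(2.35, 9.80, 8.10) * 60,
  stringsAsFactors = FALSE
)

# Case-insensitive check for the three required column names.
requiredCols <- c("compound", "smiles", "rt")
missingCols <- setdiff(requiredCols, tolower(names(standardsTable)))
if (length(missingCols) > 0) {
  stop("standardsTable is missing column(s): ",
       paste(missingCols, collapse = ", "))
}
```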

descriptors

character vector of molecular descriptor class names from get.desc.names. If NULL then all molecular descriptors will be considered.

removeOut

logical (default = TRUE). If TRUE, outliers identified by Tukey's method, that is, training set compounds whose retention time deviation is greater than 1.5 * the interquartile range, are removed and the QSRR model is recalculated.
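The outlier rule can be sketched in base R as follows. This assumes the deviation is the difference between predicted and actual retention time and that the usual Tukey fences (1.5 * IQR beyond the first and third quartiles) are meant; the values are invented and the variable names are hypothetical.

```r
# Made-up actual and predicted retention times (seconds) for a training set.
actualRt    <- c(60, 95, 130, 180, 240, 300, 365, 420, 480, 540)
predictedRt <- c(64, 92, 127, 185, 236, 304, 362, 600, 476, 545)
rtDev <- predictedRt - actualRt

# Tukey's fences: 1.5 * IQR below the first or above the third quartile.
q <- unname(quantile(rtDev, probs = c(0.25, 0.75)))
fence <- 1.5 * IQR(rtDev)
isOutlier <- rtDev < q[1] - fence | rtDev > q[2] + fence

# Indices of training compounds that would be dropped before refitting.
which(isOutlier)
```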

propMissing

numeric maximum proportion of missing values to include a molecular descriptor (values 0-1, default=0.1 i.e. maximum 10% missing values).

propZero

numeric maximum proportion of zero values to include a molecular descriptor (values 0-1, default=0.2 i.e. maximum 20% zero values).
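As a rough illustration of the propMissing and propZero thresholds, the following base R sketch filters a toy descriptor table. The table and the exact comparison (less than or equal to the threshold, with the zero proportion computed over non-missing values) are assumptions made for the example.

```r
# Toy descriptor table: rows are compounds, columns are descriptors.
desc <- data.frame(
  d1 = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9, 3.3, 1.8, 2.5, 4.4),
  d2 = c(NA, NA, 0.5, 0.7, 0.1, 0.9, 0.3, 0.2, 0.8, 0.6),  # 20% missing
  d3 = c(0, 0, 0, 1, 2, 0, 3, 1, 2, 1),                    # 40% zero
  d4 = c(0, 7, 5, 3, 9, 2, 8, 6, 4, 1)                     # 10% zero
)
propMissing <- 0.1
propZero <- 0.2

# Proportion of missing and of zero values per descriptor.
missFrac <- colMeans(is.na(desc))
zeroFrac <- colMeans(desc == 0, na.rm = TRUE)

# Keep descriptors that satisfy both thresholds.
keep <- missFrac <= propMissing & zeroFrac <= propZero
descClean <- desc[, keep, drop = FALSE]
names(descClean)
```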

corrPairWise

numeric minimum pair-wise Pearson product-moment correlation value (values 0-1, default = 0.9). If any molecular descriptors have a high pair-wise correlation, the variables with the largest mean absolute correlation of each group are removed.
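The idea behind the pair-wise correlation filter can be sketched in base R. This is a simplified greedy version written for illustration; it is not the algorithm used by findCorrelation in the caret package, and the simulated descriptors are hypothetical.

```r
set.seed(42)
n <- 50
a <- rnorm(n)
desc <- data.frame(
  a1 = a,
  a2 = a + rnorm(n, sd = 0.05),  # near-duplicate of a1
  b1 = rnorm(n),
  c1 = rnorm(n)
)
corrPairWise <- 0.9

# Absolute Pearson correlations, ignoring the diagonal.
absCor <- abs(cor(desc, method = "pearson"))
diag(absCor) <- NA
meanAbsCor <- colMeans(absCor, na.rm = TRUE)

# For each highly correlated pair, drop the member with the larger
# mean absolute correlation to all other descriptors.
toDrop <- character(0)
pairs <- which(upper.tri(absCor) & absCor >= corrPairWise, arr.ind = TRUE)
for (k in seq_len(nrow(pairs))) {
  pair <- colnames(absCor)[pairs[k, ]]
  if (any(pair %in% toDrop)) next
  toDrop <- c(toDrop, pair[which.max(meanAbsCor[pair])])
}
descClean <- desc[, setdiff(names(desc), toDrop), drop = FALSE]
```

Only one of the two near-duplicate descriptors is removed, so the information they share is retained once.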

verbose

logical. If TRUE, progress bars are displayed.

Details

Based on the method described in Cao et al. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419193/) and uses the caret package with the randomForest method for recursive feature selection (see tutorial: http://topepo.github.io/caret/rfe.html).

The function calculates a quantitative structure-retention relationship model. The default is to use the putative annotations included in the "metID comments" table of compMS2Explorer. The putative annotations in the possible_identity column of the "metID comments" interactive table must match perfectly the database entry names found in the "Best annotations" table (e.g. ensure correct matching by copying and pasting the possible compound identity into the possible_identity column of the "metID comments" table). The metID.rtPred function calculates molecular descriptors for all database entries in the "Best annotations" panel using the rcdk package.

The molecular descriptors are then cleaned in the following sequence:

  1. removing any molecular descriptors with greater than 10% missing values.

  2. removing any molecular descriptors with near zero variance using the function nearZeroVar from the caret package.

  3. a correlation matrix of remaining molecular descriptors is calculated and molecular descriptors with a standard deviation of zero are removed.

  4. finally any molecular descriptors with a high pair-wise correlation (>= 0.9 Pearson product-moment) are identified and the molecular descriptors with the largest mean absolute correlation of each group are removed. See the function findCorrelation from the caret package.
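To give a feel for what "near zero variance" means in step 2, here is a base R sketch of the two criteria that nearZeroVar is documented to use: the frequency ratio of the most common to the second most common value, and the percentage of unique values. The helper nearZeroVarSketch is hypothetical, and the cutoffs 95/5 and 10 are assumed to match the caret defaults; the real function should be used in practice.

```r
# Hypothetical helper: flags a descriptor as (near) zero variance.
# freqCut and uniqueCut are assumed to mirror caret's default cutoffs.
nearZeroVarSketch <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # A single distinct value means zero variance.
  if (length(counts) < 2) return(TRUE)
  freqRatio <- counts[[1]] / counts[[2]]
  percentUnique <- 100 * length(counts) / length(x)
  freqRatio > freqCut && percentUnique <= uniqueCut
}

desc <- data.frame(
  varied   = seq(0.5, 50, by = 0.5),  # 100 distinct values
  constant = rep(3, 100),             # zero variance
  nearZero = c(rep(0, 98), 1, 2)      # dominated by a single value
)
flagged <- vapply(desc, nearZeroVarSketch, logical(1))
descClean <- desc[, !flagged, drop = FALSE]
names(descClean)
```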

The calculation of molecular descriptors for a large number of database entries is potentially time-consuming. It therefore only needs to be conducted once, and the results are saved in the compMS2 object.

The caret package function rfe is then used to identify the optimum set of remaining molecular descriptors to predict retention time. A plot should appear showing the correlation between the actual and predicted retention times of the training set.
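The recursive feature elimination step might look roughly like the sketch below, which uses rfe with the random forest helper functions (rfFuncs) on simulated data. The descriptor matrix, the subset sizes and the 5-fold cross-validation are assumptions chosen for the example and are not taken from metID.rtPred itself. The block only runs if the caret and randomForest packages are installed.

```r
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  set.seed(1)
  n <- 60
  # Simulated descriptors; only desc1 and desc2 influence retention time.
  x <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
  names(x) <- paste0("desc", 1:6)
  rt <- 300 + 40 * x$desc1 - 25 * x$desc2 + rnorm(n, sd = 5)

  # Recursive feature elimination with random forest and 5-fold CV.
  ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
  fit <- rfe(x = x, y = rt, sizes = c(2, 4), rfeControl = ctrl)

  # Descriptors retained in the selected subset.
  print(fit$optVariables)

  # Training set predictions from the final random forest model;
  # plot(rt, predRt) would show actual against predicted values.
  predRt <- predict(fit$fit, x[, fit$optVariables, drop = FALSE])
  print(cor(rt, predRt))
}
```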

A possible workflow sequence would consist of initial examination of the results in compMS2Explorer with putative annotation of metabolites, followed by use of the metID.rtPred function. After the metID.rtPred function has been run for the first time, a new plot will appear in the compMS2Explorer GUI where the "Best annotations" closest to the randomForest model predicted retention times can be easily visualized. After more identifications have been made and additional putative annotations have been included in the "metID comments" table, the metID.rtPred function can be run a second time. It should be much faster than the first run, as molecular descriptors have already been calculated and cleaned for all entries.

Source

Cao et al. Predicting retention time in hydrophilic interaction liquid chromatography mass spectrometry and its use for peak annotation in metabolomics. Metabolomics 2015. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419193/

See Also

nearZeroVar, rfe, randomForest.

Examples

compMS2Example <- metID(compMS2Example, 'rtPred')

WMBEdmands/compMS2Miner documentation built on May 9, 2019, 10:04 p.m.