Quantitative Structure-Retention Relationship modelling (QSRR) using molecular descriptors and randomForest modelling
metID.rtPred(object, ...)
object | A "compMS2" class object.
... | additional arguments to nearZeroVar.
standardsTable | data.frame of standard compounds. The standard compounds should have been acquired using the same chromatographic method as the metabolomic dataset. If this argument is supplied, this table will be used to calculate the randomForest retention time prediction model rather than the possible_identity annotations from the "metID comments" table. The table must contain at minimum the following 3 column names, and an error will be returned if this is not the case (case is ignored, e.g. both SMILES and smiles are acceptable):
N.B. The data.frame may also contain additional columns.
descriptors | character vector of molecular descriptor class names from get.desc.names. If NULL, all molecular descriptors will be considered.
removeOut | logical (default = TRUE). If TRUE, outliers identified by Tukey's method (that is, any training set compound with a retention time deviation greater than 1.5 * the interquartile range) will be removed and the QSRR model will be recalculated.
propMissing | numeric maximum proportion of missing values allowed for a molecular descriptor to be included (values 0-1, default = 0.1, i.e. maximum 10% missing values).
propZero | numeric maximum proportion of zero values allowed for a molecular descriptor to be included (values 0-1, default = 0.2, i.e. maximum 20% zero values).
corrPairWise | numeric minimum pair-wise Pearson product-moment correlation value (values 0-1, default = 0.9). If any molecular descriptors have a high pair-wise correlation, the variables with the largest mean absolute correlation of each group are removed.
verbose | logical. If TRUE, display progress bars.
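The removeOut rule can be illustrated with a minimal base R sketch. It assumes the retention time deviation is the predicted minus the observed retention time and applies the standard Tukey fences (1.5 * the interquartile range beyond the first and third quartiles); the exact rule inside metID.rtPred may differ, and the retention times below are made-up values.

```r
# Hypothetical training-set retention times in minutes (made-up values).
observed  <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7, 8.1, 9.4, 10.2)
predicted <- c(1.4, 2.3, 3.0, 5.1, 4.8, 6.5, 7.5, 8.4, 9.1, 14.9)

# Assumed definition: deviation = predicted - observed retention time.
rtDev <- predicted - observed

# Tukey fences: 1.5 * IQR below the first and above the third quartile.
quartiles <- quantile(rtDev, probs = c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (quartiles[2] - quartiles[1])
isOutlier <- rtDev < quartiles[1] - fence | rtDev > quartiles[2] + fence

# Training compounds that would be kept before refitting the model.
keptCompounds <- which(!isOutlier)
```

Here only the last compound, whose prediction is several minutes off, falls outside the fences.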
Based on the method described in Cao et al. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419193/). Uses the caret package for the recursive feature selection with the randomForest method (see tutorial: http://topepo.github.io/caret/rfe.html).
Calculates a quantitative structure-retention relationship model.
The default is to use the putative annotations included in the "metID comments" table of compMS2Explorer.
The putative annotations in the possible_identity column of the "metID comments" interactive table must exactly match the database entry names found in the "Best annotations" table (e.g. ensure correct matching by copying and pasting the possible compound identity into the possible_identity column of the "metID comments" table). The metID.rtPred function calculates molecular descriptors for all database entries in the "Best annotations" panel using the rcdk package.
The molecular descriptors are then cleaned in the following sequence:
removing any molecular descriptors with greater than 10% missing values.
removing any molecular descriptors with near zero variance using the function nearZeroVar from the caret package.
a correlation matrix of the remaining molecular descriptors is calculated and molecular descriptors with a standard deviation of zero are removed.
finally, any molecular descriptors with a high pair-wise correlation (>= 0.9, Pearson product-moment) are identified and the molecular descriptors with the largest mean absolute correlation of each group are removed (see the function findCorrelation from the caret package).
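The cleaning sequence above can be approximated in base R. The sketch below is not the package's implementation and does not call nearZeroVar or findCorrelation: it uses simple stand-ins (a proportion-of-zeros filter, a zero standard deviation filter and a greedy pair-wise correlation filter), and the descriptor table, its column names and its values are hypothetical.

```r
set.seed(1)
n <- 50
# Hypothetical descriptor table; names and values are made up.
desc <- data.frame(
  a = rnorm(n),
  b = rnorm(n),
  mostlyMissing = c(rep(NA, 10), rnorm(n - 10)),  # 20% missing values
  mostlyZero = c(rep(0, 45), 1:5),                # 90% zero values
  constant = rep(3, n)                            # zero standard deviation
)
desc$aCopy <- desc$a + rnorm(n, sd = 0.01)        # near-duplicate of a

# 1. Drop descriptors with more than 10% missing values (cf. propMissing).
desc <- desc[, colMeans(is.na(desc)) <= 0.1, drop = FALSE]

# 2. Drop descriptors with more than 20% zero values (cf. propZero).
desc <- desc[, colMeans(desc == 0, na.rm = TRUE) <= 0.2, drop = FALSE]

# 3. Drop descriptors with a standard deviation of zero.
desc <- desc[, sapply(desc, sd, na.rm = TRUE) > 0, drop = FALSE]

# 4. Greedy stand-in for findCorrelation: while any pair has |r| >= 0.9,
#    drop the member of that pair with the larger mean absolute correlation.
repeat {
  corMat <- abs(cor(desc))
  diag(corMat) <- 0
  if (ncol(desc) < 2 || max(corMat) < 0.9) break
  pair <- which(corMat == max(corMat), arr.ind = TRUE)[1, ]
  meanAbsCor <- colMeans(corMat)[pair]
  desc <- desc[, -pair[which.max(meanAbsCor)], drop = FALSE]
}

remainingDescriptors <- names(desc)
```

With this toy table, the missing-value, zero-value and constant columns are dropped in steps 1 to 3, and step 4 keeps only one of the two near-identical columns.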
The calculation of molecular descriptors for a large number of database entries is a potentially time-consuming process. It therefore only needs to be conducted once, and the results are saved in the compMS2 object.
The rfe function from the caret package is then used to identify the optimum set of remaining molecular descriptors to predict retention time. A plot should appear showing the correlation between the actual and predicted retention times of the training set.
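As a rough illustration of this step, the sketch below runs caret's rfe with the built-in randomForest helper functions (rfFuncs) on simulated data. The descriptor names, the simulated retention times, the subset sizes and the 5-fold cross-validation are all assumptions made for the example, not the settings used by metID.rtPred. The code is skipped if caret or randomForest is not installed.

```r
set.seed(42)
n <- 60
# Simulated cleaned descriptors and retention times (made-up data).
descriptors <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(descriptors) <- paste0("desc", 1:6)
rt <- 5 + 2 * descriptors$desc1 - 1.5 * descriptors$desc2 + rnorm(n, sd = 0.3)

rfeFit <- NULL
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  # Recursive feature selection with random forest, 5-fold cross-validation.
  ctrl <- caret::rfeControl(functions = caret::rfFuncs, method = "cv", number = 5)
  rfeFit <- caret::rfe(x = descriptors, y = rt, sizes = c(2, 4), rfeControl = ctrl)

  # Descriptors retained in the best-performing subset.
  selected <- caret::predictors(rfeFit)

  # Training-set predictions from the final random forest model, and their
  # correlation with the actual retention times.
  predictedRt <- predict(rfeFit$fit, descriptors[, selected, drop = FALSE])
  trainingCor <- cor(rt, predictedRt)
}
```

Plotting rt against predictedRt would give a picture comparable to the actual versus predicted retention time plot described above.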
A possible workflow sequence would consist of an initial examination of the results in compMS2Explorer with putative annotation of metabolites, followed by use of the metID.rtPred function. After the metID.rtPred function has been run for the first time, a new plot will appear in the compMS2Explorer GUI where the "Best Annotations" closest to the randomForest model predicted retention times can be easily visualized. After more identifications have been made and additional putative annotations have been included in the "metID comments" table, the metID.rtPred function can be run a second time. It should be much faster than the first run, as molecular descriptors have already been calculated and cleaned for all entries.
Cao et al. Predicting retention time in hydrophilic interaction liquid chromatography mass spectrometry and its use for peak annotation in metabolomics. Metabolomics 2015. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419193/
nearZeroVar, rfe, randomForest.
compMS2Example <- metID(compMS2Example, 'rtPred')