TrainQCModel: Train a binary classification model to flag peaks with poor...

Description

The function wraps several functions from the caret package to train and optimize a binary predictive peak QC model on the provided training data. Twenty percent of the training dataset is randomly selected as a validation set and left out of the training process to estimate the performance of the models on unseen data. The features are mean-centered and scaled by dividing by the standard deviation before being used for training. Repeated 10-fold cross-validation (3 repeats) is applied to the remainder of the training set to minimize over-fitting. The model offering the highest accuracy is used and returned by the function.
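The hold-out and scaling steps described above can be sketched in base R. This is only an illustration, not the package's implementation: the toy data, the column names metric.a and metric.b, and the seed are made up, the split uses plain random sampling, and computing the scaling statistics from the training rows only is an assumption (the page does not say at which point the statistics are computed).

```r
# Illustrative base-R sketch; TrainQCModel itself delegates to caret.
set.seed(1000)  # arbitrary seed so that this sketch is reproducible

# Toy data: 50 peaks, two hypothetical QC metrics, and an analyst label
toy <- data.frame(
  metric.a = rnorm(50, mean = 10, sd = 2),
  metric.b = runif(50, min = 0, max = 1),
  Status   = factor(rep(c("ok", "flag"), length.out = 50))
)

# Hold out 20% of the rows as a validation set
n.validation   <- round(0.2 * nrow(toy))
validation.idx <- sample(seq_len(nrow(toy)), size = n.validation)
validation     <- toy[validation.idx, ]
training       <- toy[-validation.idx, ]

# Mean-center each feature and divide by its standard deviation,
# using statistics from the training rows only (assumption)
feature.cols <- c("metric.a", "metric.b")
train.scaled <- scale(training[, feature.cols])
centers      <- attr(train.scaled, "scaled:center")
sds          <- attr(train.scaled, "scaled:scale")

# Apply the same training-derived transformation to the validation rows
valid.scaled <- scale(validation[, feature.cols], center = centers, scale = sds)
```

After scaling, every training feature has mean 0 and standard deviation 1, while the validation features are only approximately standardized because they reuse the training statistics.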

Usage

TrainQCModel(data.merged, response.var = c("Status"),
  description.columns = c("Notes"), method = "RRF", tuneGrid = NULL,
  random.seed = NULL, export.model = FALSE, model.path = "", ...)

Arguments

data.merged

A data frame that contains peak identifiers (File, FileName, PeptideModifiedSequence, FragmentIon, IsotopeLabelType, PrecursorCharge and ProductCharge), the calculated QC metrics, and the Status assigned by the expert analyst to each transition pair. data.merged is the output of the MakeDataSet function (output$data.merged).

response.var

This variable indicates the name of the column that stores the "ok" and "flag" labels for the transition pairs in the training data.

description.columns

If the input data frame contains columns corresponding to description variables (such as Notes), they should be indicated here. Description and identifier columns will be removed from the data before training the model.

method

The machine learning algorithm for training the classifier. The algorithm can be chosen from the list of models available in caret: https://topepo.github.io/caret/available-models.html. The following have been tested: RRF, regLogistic, svmLinear3, svmPoly, kknn. Before using TrainQCModel with any of these methods, you first need to install the corresponding machine learning package using the install.packages command.

tuneGrid

Use this parameter if you want to specify tuneGrid for the caret train method. Otherwise, set tuneGrid to NULL. See the caret package help for more details: https://topepo.github.io/caret/model-training-and-tuning.html.
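As a small illustration of what a tuning grid looks like, the base R function expand.grid builds a data frame with one row per parameter combination. In caret, the column names must match the tuning parameter names of the chosen method, and one candidate model is evaluated per row. The values below are arbitrary and use the RRF parameter names that appear in the Examples section.

```r
# Build a candidate grid for the RRF method; the values are arbitrary.
rrf.grid <- expand.grid(mtry    = c(2, 10),
                        coefReg = c(0.5, 1),
                        coefImp = c(0))

# One row per combination: 2 (mtry) x 2 (coefReg) x 1 (coefImp) = 4 rows
n.candidates <- nrow(rrf.grid)
print(rrf.grid)
```

Larger grids increase training time, since every row is evaluated under the repeated cross-validation scheme.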

random.seed

To fix the random seeds used for splitting the dataset into training and validation sets and for the data splitting in cross-validation, provide a vector of length 2, e.g. random.seed = c(1000,2000). This is particularly useful if you want to compare multiple models with the same data split.
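The following base R sketch shows one way a two-element seed vector could make both splits reproducible. It is a hypothetical illustration: the helper SketchSplit is not part of the package, and the assumption that the first seed governs the validation split while the second governs the fold assignment is not confirmed by this page.

```r
# Hypothetical helper, for illustration only (not part of TargetedMSQC).
SketchSplit <- function(n, random.seed = NULL, k = 10) {
  # Assumed: the first seed controls the 20% validation hold-out
  if (!is.null(random.seed)) set.seed(random.seed[1])
  validation.idx <- sample(seq_len(n), size = round(0.2 * n))
  training.idx   <- setdiff(seq_len(n), validation.idx)

  # Assumed: the second seed controls the cross-validation fold assignment
  if (!is.null(random.seed)) set.seed(random.seed[2])
  folds <- sample(rep(seq_len(k), length.out = length(training.idx)))

  list(validation.idx = validation.idx,
       training.idx   = training.idx,
       folds          = folds)
}

# Two calls with the same seeds yield the same split and the same folds
a <- SketchSplit(100, random.seed = c(1000, 2000))
b <- SketchSplit(100, random.seed = c(1000, 2000))
same.split <- identical(a, b)
```

With identical splits, differences in validation performance between two methods can be attributed to the models rather than to the partitioning.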

export.model

A logical parameter indicating whether the model should be saved. If export.model = TRUE, the model will be saved in model.path.

model.path

Path to the directory where the model will be saved if export.model = TRUE.

Value

A list with the following objects:

model

Trained model to flag peaks with poor chromatography or interference.

performance.testing

Confusion matrix obtained by applying the model to the unseen validation data (the 20% of the training dataset held out from training).

model.file.path

If export.model = TRUE and the model is saved, the path and file name of the model are stored in this field.

Examples

rrf.grid <-  expand.grid(mtry = c(2,10),
                         coefReg = c(0.5,1),
                         coefImp = c(0))
model.rrf <- TrainQCModel(data.set.CSF$data.merged,
                          response.var = c("Status"),
                          description.columns = c("Notes"),
                          method = "RRF",
                          tuneGrid = rrf.grid,
                          random.seed = c(100,200))

shadieshghi/TargetedMSQC documentation built on May 13, 2019, 12:20 p.m.