Introduction

The MSiP is a computational approach to predict protein-protein interactions (PPIs) from large-scale affinity purification mass spectrometry (AP-MS) data. This approach includes both spoke and matrix models for interpreting AP-MS data in a network context. The "spoke" model considers only bait-prey interactions, whereas the "matrix" model assumes that each of the identified proteins (baits and prey) in a given AP-MS experiment interacts with each of the others. The spoke model has a high false-negative rate, whereas the matrix model has a high false-positive rate. Thus, although both statistical models have merits, a combination of both models has shown to increase the performance of machine learning classifiers in terms of their capabilities in discrimination between true and false positive interactions.

Load the package

library(MSiP)

Sample Data Description:

A demo AP-MS proteomics dataset is provided in this package to guide the users about data structure.

data("SampleDatInput")
head(SampleDatInput)

Scoring based on "spoke-model":

Comparative Proteomic Analysis Software Suite (CompPASS) is a robust statistical scoring scheme for assigning scores to bait-prey interactions. The output from CompPASS scoring includes Z-score, S-score, D-score, WD-score and other features. This function was optimized from the.

datScoring <- 
    cPASS(SampleDatInput)
head(datScoring)

Scoring based on "matrix-model":

The Dice coefficient was first applied by to score interaction between all identified proteins (baits and preys) in a given AP-MS expriment.

datScoring <- 
    diceCoefficient(SampleDatInput)
head(datScoring)

Alternatively, Jaccard, Simpson, and Overlap scores can be used to score the interaction between all the identified proteins in a given AP-MS experiment.

#Jaccard coefficient
datScoring <- 
    jaccardCoefficient(SampleDatInput)
head(datScoring)

#Simpson coefficient
datScoring <- 
    simpsonCoefficient(SampleDatInput)
head(datScoring)

#Overlap score
datScoring <- 
    simpsonCoefficient(SampleDatInput)
head(datScoring)

Finally, a weighted matrix model can also be employed to score interactions between identified proteins in a given AP-MS experiment. The output of the weighted matrix model includes the number of experiments for which the pair of proteins is co-purified (i.e., k) and $-1$*log(P-value) of the hypergeometric test (i.e., logHG) given the experimental overlap value, each protein's total number of observed experiments, and the total number of experiments.

datScoring <- 
Weighted.matrixModel(SampleDatInput)
head(datScoring)

Assign a confidence score to each instances using classifiers:

The labeled feature matrix can be used as input for Support Vector Machine (SVM) or Random Forest (RF) classifiers. The classifier then assigns each bait-prey pair a confidence score, indicating the level of support for that pair of proteins to interact. Hyperparameter optimization can also be performed to select a set of parameters that maximizes the model's performance. The RF and the SVM functions provided in this package also computes the areas under the precision-recall (PR) and ROC curve to evalute the performance of the classifier.

Import the demo data:

data("testdfClassifier")
head(testdfClassifier)

Run the RF classifier:

#only generate the pr.curve
predidcted_RF <- 
    rfTrain(testdfClassifier,impute = FALSE, p = 0.3, parameterTuning = FALSE,
            mtry  = seq(from = 1, to = 5, by = 1),
            min_node_size = seq(from = 1, to = 5, by = 1),
            splitrule =c("gini"),metric = "Accuracy",
            resampling.method = "repeatedcv",iter = 5,repeats = 5,
            pr.plot = TRUE, roc.plot = FALSE
    )

Output from RF classifier:

#positive score corresponds to  the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_RF)

Run the SVM classifier:

#only generate the ROC curve
predidcted_SVM <- 
    svmTrain(testdfClassifier,impute = FALSE,p = 0.3,parameterTuning = FALSE,
             cost = seq(from = 2, to = 10, by = 2),
             gamma = seq(from = 0.01, to = 0.10, by = 0.02),
             kernel = "radial",ncross = 10,
             pr.plot = FALSE, roc.plot = TRUE
    ) 

Output from SVM classifier:

#positive score corresponds to  the level of support for the pair of proteins to be true positive
#negative score corresponds to the level of support for the pair of proteins to be true negative
head(predidcted_SVM)


Babulab-bioc/MSiP documentation built on Dec. 17, 2021, 9:52 a.m.