PrInCE: Prediction of Interactomes from Co-Elution

View source: R/PrInCE.R

Description

PrInCE is a computational approach to infer protein-protein interaction networks from co-elution proteomics data, also called co-migration, co-fractionation, or protein correlation profiling. This family of methods separates interacting protein complexes on the basis of their diameter or biochemical properties. Protein-protein interactions can then be inferred for pairs of proteins with similar elution profiles. PrInCE implements a machine-learning approach to identify protein-protein interactions given a set of labelled examples, using features derived exclusively from the data. This allows PrInCE to infer high-quality protein interaction networks from raw proteomics data, without bias towards known interactions or functionally associated proteins, making PrInCE a unique resource for discovery.

Usage

PrInCE(
  profiles,
  gold_standard,
  gaussians = NULL,
  precision = NULL,
  verbose = FALSE,
  min_points = 1,
  min_consecutive = 5,
  min_pairs = 3,
  impute_NA = TRUE,
  smooth = TRUE,
  smooth_width = 4,
  max_gaussians = 5,
  max_iterations = 50,
  min_R_squared = 0.5,
  method = c("guess", "random"),
  criterion = c("AICc", "AIC", "BIC"),
  pearson_R_raw = TRUE,
  pearson_R_cleaned = TRUE,
  pearson_P = TRUE,
  euclidean_distance = TRUE,
  co_peak = TRUE,
  co_apex = TRUE,
  n_pairs = FALSE,
  classifier = c("NB", "SVM", "RF", "LR", "ensemble"),
  models = 1,
  cv_folds = 10,
  trees = 500
)

Arguments

profiles

the co-elution profile matrix, or a list of profile matrices if replicate experiments were performed. Can be a single numeric matrix, with proteins in rows and fractions in columns, or a list of matrices. Alternatively, can be provided as a single MSnSet object or a list of objects.

gold_standard

a set of 'gold standard' interactions, used to train the classifier. Can be provided either as an adjacency matrix, in which both rows and columns correspond to protein IDs in the co-elution matrix or matrices, or as a list of proteins in the same complex, which will be converted to an adjacency matrix by PrInCE. Zeroes in the adjacency matrix are interpreted by PrInCE as "true negatives" when calculating precision.

gaussians

optionally, provide Gaussian mixture models fit by the build_gaussians function. If profiles is a numeric matrix, this should be the named list output by build_gaussians for that matrix; if profiles is a list of numeric matrices, this should be a list of named lists

precision

optionally, return only interactions above the given precision; by default, all interactions are returned and the user can subsequently threshold the list using the threshold_precision function

verbose

if TRUE, print a series of messages about the stage of the analysis

min_points

filter profiles without at least this many total, non-missing points; passed to filter_profiles

min_consecutive

filter profiles without at least this many consecutive, non-missing points; passed to filter_profiles

min_pairs

minimum number of overlapping fractions between any given protein pair to consider a potential interaction

impute_NA

if true, impute single missing values with the average of neighboring values; passed to clean_profiles

smooth

if true, smooth the chromatogram with a moving average filter; passed to clean_profiles

smooth_width

width of the moving average filter, in fractions; passed to clean_profiles

max_gaussians

the maximum number of Gaussians to fit; defaults to 5. Note that Gaussian mixtures with more parameters than observed points (i.e., points that are neither zero nor NA) will not be fit. Passed to choose_gaussians

max_iterations

the number of times to try fitting the curve with different initial conditions; defaults to 50. Passed to fit_gaussians

min_R_squared

the minimum R-squared value to accept when fitting the curve with different initial conditions; defaults to 0.5. Passed to fit_gaussians

method

the method used to select the initial conditions for nonlinear least squares optimization (one of "guess" or "random"); see make_initial_conditions for details. Passed to fit_gaussians

criterion

the criterion to use for model selection; one of "AICc" (corrected AIC, and default), "AIC", or "BIC". Passed to choose_gaussians

pearson_R_raw

if true, include the Pearson correlation (R) between raw profiles as a feature

pearson_R_cleaned

if true, include the Pearson correlation (R) between cleaned profiles as a feature

pearson_P

if true, include the P-value of the Pearson correlation between raw profiles as a feature

euclidean_distance

if true, include the Euclidean distance between cleaned profiles as a feature

co_peak

if true, include the 'co-peak score' (that is, the distance, in fractions, between the single highest value of each profile) as a feature

co_apex

if true, include the 'co-apex score' (that is, the minimum Euclidean distance between any pair of fit Gaussians) as a feature

n_pairs

if TRUE, include the number of fractions in which both of a given pair of proteins were detected as a feature

classifier

the type of classifier to use: one of "NB" (naive Bayes), "SVM" (support vector machine), "RF" (random forest), "LR" (logistic regression), or "ensemble" (an ensemble of all four)

models

the number of classifiers to train and average across, each with a different k-fold cross-validation split

cv_folds

the number of folds to use for k-fold cross-validation

trees

for random forests only, the number of trees in the forest

Details

PrInCE takes as input a co-elution matrix, with detected proteins in rows and fractions as columns, and a set of 'gold standard' true positives and true negatives. If replicate experiments were performed, a list of co-elution matrices can be provided as input. PrInCE will construct features for each replicate separately and use features from all replicates as input to the classifier. The 'gold standard' can be either a data frame or adjacency matrix of known interactions (and non-interactions), or a list of protein complexes. For computational convenience, Gaussian mixture models can be pre-fit to every profile and provided separately to the PrInCE function. The matrix, or matrices, can be provided to PrInCE either as numeric matrices or as MSnSet objects.
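As an illustration of the two gold standard formats, the following minimal sketch converts a list of protein complexes into an adjacency matrix using base R only. The helper name `complexes_to_adjacency` and the protein IDs are hypothetical, and PrInCE's own conversion may differ in its details.

```r
# Hypothetical helper (not part of PrInCE): build a symmetric 0/1 adjacency
# matrix from a named list of protein complexes.
complexes_to_adjacency <- function(complexes) {
  proteins <- sort(unique(unlist(complexes)))
  adj <- matrix(0, nrow = length(proteins), ncol = length(proteins),
                dimnames = list(proteins, proteins))
  for (members in complexes) {
    members <- unique(members)
    # mark every pair of proteins within the same complex as interacting
    adj[members, members] <- 1
  }
  # a protein is not treated as interacting with itself
  diag(adj) <- 0
  adj
}

# toy complexes with made-up protein IDs
complexes <- list(complexA = c("P1", "P2", "P3"), complexB = c("P3", "P4"))
adj <- complexes_to_adjacency(complexes)
```

In this sketch, pairs that never share a complex (such as P1 and P4) keep a value of zero, which is the kind of entry PrInCE interprets as a true negative.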

PrInCE implements four different types of classifiers to predict protein-protein interaction networks: naive Bayes (the default), random forests, support vector machines, and logistic regression. The classifiers are trained on the gold standards using a ten-fold cross-validation procedure, training on 90% of the data and predicting on the remaining 10%. For protein pairs that are part of the training data, the held-out split is used to assign a classifier score, whereas for the remaining protein pairs, the median of all ten folds is used. Furthermore, to ensure the results are not sensitive to the precise cross-validation split used, multiple classifiers can be trained, each with a different split (the number is set by the models argument, which defaults to 1 in the usage above), and the classifier score is then averaged across classifiers. PrInCE can also ensemble across all four classifier types (classifier = "ensemble").
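The cross-validation scoring scheme can be sketched in base R as follows. This is a simplified illustration on simulated data, not PrInCE's implementation: it uses logistic regression via `glm` as a stand-in classifier and a single made-up feature, and all variable names are assumptions.

```r
set.seed(1)
n <- 200
# toy feature: higher values are more likely to be true interactions
feature <- rnorm(n)
# labels: 1 = known interaction, 0 = known non-interaction, NA = unlabelled
label <- ifelse(feature + rnorm(n) > 0, 1, 0)
label[sample(n, 80)] <- NA
pairs <- data.frame(feature = feature, label = label)

k <- 10
labelled <- which(!is.na(pairs$label))
# randomly assign each labelled pair to one of k folds
fold <- sample(rep(seq_len(k), length.out = length(labelled)))

# predictions[i, j]: score of pair i from the model that held out fold j
predictions <- matrix(NA_real_, nrow = n, ncol = k)
for (j in seq_len(k)) {
  train <- pairs[labelled[fold != j], ]
  fit <- glm(label ~ feature, data = train, family = binomial())
  predictions[, j] <- predict(fit, newdata = pairs, type = "response")
}

# unlabelled pairs: median score across all folds
score <- apply(predictions, 1, median)
# labelled pairs: score from the one model that did not see them in training
score[labelled] <- predictions[cbind(labelled, fold)]
```

Scoring each labelled pair only with the model that held it out keeps gold standard pairs from being scored by a classifier that was trained on them.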

By default, PrInCE calculates six features from each pair of co-elution profiles as input to the classifier, including conventional similarity metrics but also several features specifically adapted to co-elution proteomics. For example, one such feature is derived from fitting a Gaussian mixture model to each elution profile, then calculating the smallest Euclidean distance between any pair of fitted Gaussians. The complete set of features includes:

  1. the Pearson correlation between raw co-elution profiles;

  2. the p-value of the Pearson correlation between raw co-elution profiles;

  3. the Pearson correlation between cleaned profiles, which are generated by imputing single missing values with the mean of their neighbors, replacing remaining missing values with random near-zero noise, and smoothing the profiles using a moving average filter (see clean_profile);

  4. the Euclidean distance between cleaned profiles;

  5. the 'co-peak' score, defined as the distance, in fractions, between the maximum values of each profile; and

  6. the 'co-apex' score, defined as the minimum Euclidean distance between any pair of fit Gaussians.
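The first five features in the list above can be sketched for a single pair of profiles in base R. The helpers `impute_single_na` and `clean`, the noise range, the handling of window edges, and the choice to compute the co-peak score on raw profiles are all assumptions made for illustration; PrInCE's own functions may behave differently. The co-apex score is omitted because it depends on fitted Gaussian mixture models.

```r
# Hypothetical helpers, written in base R for illustration only.

# Replace a single missing value flanked by two observed values with their mean
impute_single_na <- function(x) {
  n <- length(x)
  if (n < 3) return(x)
  for (i in 2:(n - 1)) {
    if (is.na(x[i]) && !is.na(x[i - 1]) && !is.na(x[i + 1])) {
      x[i] <- mean(c(x[i - 1], x[i + 1]))
    }
  }
  x
}

clean <- function(x, smooth_width = 4, noise = 0.05) {
  x <- impute_single_na(x)
  # replace remaining missing values with small random noise (range assumed)
  missing <- is.na(x)
  x[missing] <- runif(sum(missing), min = 0, max = noise)
  # moving average filter; edges without a full window are left unsmoothed
  smoothed <- as.numeric(
    stats::filter(x, rep(1 / smooth_width, smooth_width), sides = 2)
  )
  edge <- is.na(smoothed)
  smoothed[edge] <- x[edge]
  smoothed
}

set.seed(1)
# two toy elution profiles across 12 fractions (NA = not detected)
a <- c(NA, 0.1, 0.5, 2.0, 4.0, NA, 3.0, 1.0, 0.2, NA, NA, NA)
b <- c(NA, NA, 0.4, 1.5, 3.5, 4.5, 2.5, NA, 0.3, 0.1, NA, NA)
ca <- clean(a)
cb <- clean(b)

# cor.test uses only the fractions in which both proteins were observed
ct <- cor.test(a, b)
features <- c(
  pearson_R_raw = unname(ct$estimate),
  pearson_P = ct$p.value,
  pearson_R_cleaned = cor(ca, cb),
  euclidean_distance = sqrt(sum((ca - cb)^2)),
  # distance, in fractions, between the maxima of the two profiles
  co_peak = abs(which.max(a) - which.max(b))
)
```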

The output of PrInCE is a ranked data frame, containing the classifier score for every possible protein pair. PrInCE also calculates the precision at every point in this ranked list, using the 'gold standard' set of protein complexes or binary interactions. Our recommendation is to select a threshold for the precision and use this to construct an unweighted protein interaction network.
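The precision calculation and thresholding step can be sketched as follows on a toy ranked list. The column names, the data, and the rule of keeping every pair up to the last rank that meets the threshold are assumptions for illustration; the threshold_precision function mentioned under Arguments is the package's own tool for this step.

```r
# toy ranked output, highest classifier score first;
# label is 1 (gold standard interaction), 0 (gold standard non-interaction),
# or NA (pair not covered by the gold standard)
ranked <- data.frame(
  protein_A = c("P1", "P1", "P2", "P3", "P2", "P4"),
  protein_B = c("P2", "P3", "P3", "P4", "P4", "P5"),
  score = c(0.99, 0.95, 0.90, 0.80, 0.60, 0.40),
  label = c(1, 1, NA, 0, 1, 0)
)

# cumulative true and false positives; unlabelled pairs count as neither
tp <- cumsum(ranked$label %in% 1)
fp <- cumsum(ranked$label %in% 0)
ranked$precision <- tp / (tp + fp)

# keep every pair up to the last rank at which precision meets the threshold
threshold <- 0.75
last <- max(which(ranked$precision >= threshold))
network <- ranked[seq_len(last), c("protein_A", "protein_B")]
```

The resulting `network` data frame is an unweighted edge list: the classifier scores are dropped and only the identity of each retained pair is kept.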

Value

a ranked data frame of interacting proteins, with the precision at each point in the list

References

\insertRef{stacey2017}{PrInCE}

\insertRef{scott2015}{PrInCE}

\insertRef{kristensen2012}{PrInCE}

\insertRef{skinnider2018}{PrInCE}

Examples

data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 500 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
ppi <- PrInCE(subset, gold_standard,
  gaussians = gauss, models = 1,
  cv_folds = 3
)

fosterlab/PrInCE-R documentation built on Dec. 11, 2020, 3:51 p.m.