PrInCE: Prediction of Interactomes from Co-Elution

PrInCE is a computational approach to infer protein-protein interaction networks from co-elution proteomics data, also called co-migration, co-fractionation, or protein correlation profiling. This family of methods separates interacting protein complexes on the basis of their diameter or biochemical properties. Protein-protein interactions can then be inferred for pairs of proteins with similar elution profiles. PrInCE implements a machine-learning approach to identify protein-protein interactions given a set of labelled examples, using features derived exclusively from the data. This allows PrInCE to infer high-quality protein interaction networks from raw proteomics data, without bias towards known interactions or functionally associated proteins, making PrInCE a unique resource for discovery.

PrInCE(profiles, gold_standard, gaussians = NULL, precision = NULL,
  verbose = FALSE, min_points = 1, min_consecutive = 5,
  impute_NA = TRUE, smooth = TRUE, smooth_width = 4,
  max_gaussians = 5, max_iterations = 50, min_R_squared = 0.5,
  method = c("guess", "random"), criterion = c("AICc", "AIC", "BIC"),
  pearson_R_raw = TRUE, pearson_R_cleaned = TRUE, pearson_P = TRUE,
  euclidean_distance = TRUE, co_peak = TRUE, co_apex = TRUE,
  classifier = c("NB", "SVM", "RF", "LR", "ensemble"), models = 10,
  cv_folds = 10, trees = 500)

`profiles`	the co-elution profile matrix, or a list of profile matrices if replicate experiments were performed. Can be a single numeric matrix, with proteins in rows and fractions in columns, or a list of matrices. Alternatively, can be provided as a single `MSnSet` object or a list of objects.
`gold_standard`	a set of 'gold standard' interactions, used to train the classifier. Can be provided either as an adjacency matrix, in which both rows and columns correspond to protein IDs in the co-elution matrix or matrices, or as a list of proteins in the same complex, which will be converted to an adjacency matrix by PrInCE. Zeroes in the adjacency matrix are interpreted by PrInCE as "true negatives" when calculating precision.
`gaussians`	optionally, provide Gaussian mixture models fit by the `build_gaussians` function. If `profiles` is a numeric matrix, this should be the named list output by `build_gaussians` for that matrix; if `profiles` is a list of numeric matrices, this should be a list of named lists, one per matrix.
`precision`	optionally, return only interactions above the given precision; by default, all interactions are returned and the user can subsequently threshold the list using the `threshold_precision` function
`verbose`	if `TRUE`, print a series of messages about the stage of the analysis
`min_points`	filter profiles without at least this many total, non-missing points; passed to `filter_profiles`
`min_consecutive`	filter profiles without at least this many consecutive, non-missing points; passed to `filter_profiles`
`impute_NA`	if true, impute single missing values with the average of neighboring values; passed to `clean_profiles`
`smooth`	if true, smooth the chromatogram with a moving average filter; passed to `clean_profiles`
`smooth_width`	width of the moving average filter, in fractions; passed to `clean_profiles`
`max_gaussians`	the maximum number of Gaussians to fit; defaults to 5. Note that Gaussian mixtures with more parameters than observed points (i.e., points that are neither zero nor NA) will not be fit. Passed to `choose_gaussians`
`max_iterations`	the number of times to try fitting the curve with different initial conditions; defaults to 50. Passed to `fit_gaussians`
`min_R_squared`	the minimum R-squared value to accept when fitting the curve with different initial conditions; defaults to 0.5. Passed to `fit_gaussians`
`method`	the method used to select the initial conditions for nonlinear least squares optimization (one of "guess" or "random"); see `make_initial_conditions` for details. Passed to `fit_gaussians`
`criterion`	the criterion to use for model selection; one of "AICc" (corrected AIC, and default), "AIC", or "BIC". Passed to `choose_gaussians`
`pearson_R_raw`	if true, include the Pearson correlation (R) between raw profiles as a feature
`pearson_R_cleaned`	if true, include the Pearson correlation (R) between cleaned profiles as a feature
`pearson_P`	if true, include the P-value of the Pearson correlation between raw profiles as a feature
`euclidean_distance`	if true, include the Euclidean distance between cleaned profiles as a feature
`co_peak`	if true, include the 'co-peak score' (that is, the distance, in fractions, between the single highest value of each profile) as a feature
`co_apex`	if true, include the 'co-apex score' (that is, the minimum Euclidean distance between any pair of fit Gaussians) as a feature
`classifier`	the type of classifier to use: one of `"NB"` (naive Bayes), `"SVM"` (support vector machine), `"RF"` (random forest), `"LR"` (logistic regression), or `"ensemble"` (an ensemble of all four)
`models`	the number of classifiers to train and average across, each with a different k-fold cross-validation split
`cv_folds`	the number of folds to use for k-fold cross-validation
`trees`	for random forests only, the number of trees in the forest
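
As a rough illustration of what the `impute_NA`, `smooth`, and `smooth_width` options describe, the base-R sketch below fills isolated missing values with the mean of their two neighbours and then applies a centred moving average. The helper names `impute_single_na` and `moving_average` are hypothetical, the edge handling is an assumption, and this is not the package's `clean_profiles` implementation.

```r
# Hypothetical helper: fill an NA only when both neighbours are observed.
impute_single_na <- function(x) {
  original <- x
  n <- length(x)
  if (n < 3) return(x)
  for (i in 2:(n - 1)) {
    if (is.na(original[i]) && !is.na(original[i - 1]) && !is.na(original[i + 1])) {
      x[i] <- mean(c(original[i - 1], original[i + 1]))
    }
  }
  x
}

# Hypothetical helper: centred moving average. The window reaches
# `width %/% 2` fractions to either side and is truncated at the edges
# (an assumption made for this sketch).
moving_average <- function(x, width = 3) {
  n <- length(x)
  half <- width %/% 2
  vapply(seq_len(n), function(i) {
    window <- x[max(1, i - half):min(n, i + half)]
    mean(window, na.rm = TRUE)
  }, numeric(1))
}

profile <- c(0, 2, NA, 6, NA, NA, 1)
imputed <- impute_single_na(profile)  # only the isolated NA is filled
smoothed <- moving_average(c(0, 3, 6, 3, 0), width = 3)
```

Runs of two or more missing values are deliberately left untouched here; how they are handled afterwards is outside the scope of this sketch.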

PrInCE takes as input a co-elution matrix, with detected proteins in rows and fractions as columns, and a set of 'gold standard' true positives and true negatives. If replicate experiments were performed, a list of co-elution matrices can be provided as input. PrInCE will construct features for each replicate separately and use features from all replicates as input to the classifier. The 'gold standard' can be either a data frame or adjacency matrix of known interactions (and non-interactions), or a list of protein complexes. For computational convenience, Gaussian mixture models can be pre-fit to every profile and provided separately to the PrInCE function. The matrix, or matrices, can be provided to PrInCE either as numeric matrices or as MSnSet objects.
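
To make the two input shapes concrete, the sketch below builds a toy co-elution matrix and converts a list of complexes into a symmetric adjacency matrix by hand. PrInCE performs this conversion itself when given a list, so the helper `complexes_to_adjacency` is purely illustrative; the protein IDs, the toy values, and the step that restricts the matrix to detected proteins are all assumptions.

```r
# Toy co-elution matrix: 4 proteins (rows) x 5 fractions (columns).
profiles <- matrix(
  c(0, 5, 9, 4, 0,
    0, 4, 8, 5, 1,
    7, 3, 0, 0, 0,
    6, 2, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3", "P4"), paste0("F", 1:5))
)

# Gold standard given as a list of complexes (made-up IDs).
complexes <- list(complexA = c("P1", "P2"),
                  complexB = c("P3", "P4", "P9"))

# Hypothetical helper: 1 for pairs sharing a complex, 0 otherwise.
complexes_to_adjacency <- function(complexes) {
  ids <- sort(unique(unlist(complexes)))
  adj <- matrix(0, length(ids), length(ids), dimnames = list(ids, ids))
  for (members in complexes) {
    adj[members, members] <- 1
  }
  diag(adj) <- 0  # a protein is not its own interaction partner
  adj
}

adj <- complexes_to_adjacency(complexes)

# Assumption for this sketch: keep only proteins present in the profiles.
detected <- intersect(rownames(adj), rownames(profiles))
adj_detected <- adj[detected, detected]
```

In this toy example the zero between `P1` and `P3` would be read as a non-interaction, while `P9` drops out because it was never detected.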

PrInCE implements four different types of classifiers to predict protein-protein interaction networks: naive Bayes (the default), support vector machines, random forests, and logistic regression. The classifiers are trained on the gold standards using a ten-fold cross-validation procedure, training on 90% of the data and scoring the remaining 10%. For protein pairs that are part of the training data, the held-out split is used to assign a classifier score, whereas for the remaining protein pairs, the median of all ten folds is used. Furthermore, to ensure the results are not sensitive to the precise cross-validation split used, multiple classifiers (ten, by default) are trained, each with a different split, and the classifier score is subsequently averaged across them. PrInCE can also ensemble across all four classifier types.
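
The scoring scheme described above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: it uses simulated data with a single feature, logistic regression via `glm` as a stand-in classifier, and a hypothetical helper `score_once`. Labelled pairs receive the score from the fold in which they were held out, unlabelled pairs receive the median across folds, and the whole procedure is repeated and averaged across models.

```r
set.seed(1)

# Simulated labelled pairs (one feature) and unlabelled pairs.
labelled <- data.frame(
  feature = c(rnorm(30, mean = 1), rnorm(30, mean = -1)),
  label = rep(c(1, 0), each = 30)
)
unlabelled <- data.frame(feature = rnorm(20))

# Hypothetical helper: one round of k-fold cross-validated scoring.
score_once <- function(labelled, unlabelled, cv_folds) {
  fold <- sample(rep(seq_len(cv_folds), length.out = nrow(labelled)))
  labelled_score <- numeric(nrow(labelled))
  fold_scores <- matrix(NA_real_, nrow(unlabelled), cv_folds)
  for (k in seq_len(cv_folds)) {
    held_out <- fold == k
    fit <- glm(label ~ feature, data = labelled[!held_out, ],
               family = binomial())
    # Labelled pairs: score only from the model that did not see them.
    labelled_score[held_out] <- predict(fit, labelled[held_out, ],
                                        type = "response")
    # Unlabelled pairs: scored by every fold's model.
    fold_scores[, k] <- predict(fit, unlabelled, type = "response")
  }
  list(labelled = labelled_score,
       unlabelled = apply(fold_scores, 1, median))
}

# Repeat with different random splits and average the scores.
models <- 2
runs <- lapply(seq_len(models), function(m) {
  score_once(labelled, unlabelled, cv_folds = 3)
})
final_labelled <- Reduce(`+`, lapply(runs, `[[`, "labelled")) / models
final_unlabelled <- Reduce(`+`, lapply(runs, `[[`, "unlabelled")) / models
```

Scoring labelled pairs only with held-out models keeps their scores from being inflated by the classifier having seen them during training.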

By default, PrInCE calculates six features from each pair of co-elution profiles as input to the classifier, including conventional similarity metrics but also several features specifically adapted to co-elution proteomics. For example, one such feature is derived from fitting a Gaussian mixture model to each elution profile, then calculating the smallest Euclidean distance between any pair of fitted Gaussians. The complete set of features includes:

the Pearson correlation between raw co-elution profiles;
the p-value of the Pearson correlation between raw co-elution profiles;
the Pearson correlation between cleaned profiles, which are generated by imputing single missing values with the mean of their neighbors, replacing remaining missing values with random near-zero noise, and smoothing the profiles using a moving average filter (see clean_profile);
the Euclidean distance between cleaned profiles;
the 'co-peak' score, defined as the distance, in fractions, between the maximum values of each profile; and
the 'co-apex' score, defined as the minimum Euclidean distance between any pair of fit Gaussians.
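
For a single pair of profiles, several of the features listed above can be computed directly in base R. The sketch below uses two made-up, already-cleaned profiles. For the co-apex score it assumes each fitted Gaussian is summarised by its centre (`mu`) and width (`sigma`) and that the distance is taken in that two-dimensional space; the Gaussian parameters are written by hand rather than fitted, and none of this is taken from the package's own code.

```r
# Two made-up elution profiles across eight fractions.
a <- c(0, 1, 4, 9, 4, 1, 0, 0)
b <- c(0, 0, 2, 8, 5, 1, 0, 0)

# Pearson correlation and its p-value.
pearson_r <- cor(a, b)
pearson_p <- cor.test(a, b)$p.value

# Euclidean distance between the two profiles.
euclidean <- sqrt(sum((a - b)^2))

# Co-peak: distance, in fractions, between the two maxima.
co_peak <- abs(which.max(a) - which.max(b))

# Co-apex (assumed form): minimum distance between any pair of Gaussians,
# each described here by hand-written (mu, sigma) values.
gauss_a <- data.frame(mu = c(4, 7), sigma = c(1, 0.5))
gauss_b <- data.frame(mu = 4.2, sigma = 1.1)
pair_dist <- outer(
  seq_len(nrow(gauss_a)), seq_len(nrow(gauss_b)),
  function(i, j) {
    sqrt((gauss_a$mu[i] - gauss_b$mu[j])^2 +
         (gauss_a$sigma[i] - gauss_b$sigma[j])^2)
  }
)
co_apex <- min(pair_dist)
```

Because a profile can contain several peaks, the co-apex score can be small even when the overall correlation is modest, as long as one Gaussian from each protein lines up.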

The output of PrInCE is a ranked data frame, containing the classifier score for every possible protein pair. PrInCE also calculates the precision at every point in this ranked list, using the 'gold standard' set of protein complexes or binary interactions. Our recommendation is to select a threshold for the precision and use this to construct an unweighted protein interaction network.
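
The precision calculation and thresholding step can be pictured with a small hand-made ranked list. In this sketch the column names, scores, and labels are assumptions for illustration (1 for a gold standard interaction, 0 for a gold standard non-interaction, `NA` for pairs outside the gold standard), and the cut-off rule is one plausible choice rather than a description of how `threshold_precision` behaves.

```r
# Made-up ranked list of protein pairs.
ranked <- data.frame(
  protein_A = c("P1", "P3", "P1", "P2", "P5", "P2"),
  protein_B = c("P2", "P4", "P6", "P3", "P6", "P4"),
  score = c(0.99, 0.95, 0.90, 0.80, 0.70, 0.60),
  label = c(1, 1, NA, 0, 1, 0)
)
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]

# Precision at each rank: true positives so far divided by
# gold standard pairs so far (unlabelled pairs are not counted).
is_labelled <- !is.na(ranked$label)
is_true_positive <- is_labelled & ranked$label == 1
ranked$precision <- cumsum(is_true_positive) / cumsum(is_labelled)

# Assumed cut-off rule: keep everything up to the last rank at which
# precision is still at or above the chosen threshold.
threshold <- 0.7
keep <- which(ranked$precision >= threshold)
network <- if (length(keep) > 0) {
  ranked[seq_len(max(keep)), c("protein_A", "protein_B")]
} else {
  ranked[0, c("protein_A", "protein_B")]
}
```

The resulting `network` is an unweighted edge list; pairs ranked below the cut-off are simply omitted.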

a ranked data frame of interacting proteins, with the precision at each point in the list

library(PrInCE)
data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 500 profiles
subset <- scott[seq_len(500), ]
# keep the pre-fit Gaussians for the proteins in the subset
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
# train a single model with three-fold cross-validation
ppi <- PrInCE(subset, gold_standard, gaussians = gauss, models = 1,
              cv_folds = 3)