PrInCE is a computational approach to infer protein-protein interaction networks from coelution proteomics data, also called comigration, cofractionation, or protein correlation profiling. This family of methods separates interacting protein complexes on the basis of their diameter or biochemical properties. Protein-protein interactions can then be inferred for pairs of proteins with similar elution profiles. PrInCE implements a machine-learning approach to identify protein-protein interactions given a set of labelled examples, using features derived exclusively from the data. This allows PrInCE to infer high-quality protein interaction networks from raw proteomics data, without bias towards known interactions or functionally associated proteins, making PrInCE a unique resource for discovery.
PrInCE(profiles, gold_standard, gaussians = NULL, precision = NULL,
       verbose = FALSE, min_points = 1, min_consecutive = 5,
       impute_NA = TRUE, smooth = TRUE, smooth_width = 4,
       max_gaussians = 5, max_iterations = 50, min_R_squared = 0.5,
       method = c("guess", "random"), criterion = c("AICc", "AIC", "BIC"),
       pearson_R_raw = TRUE, pearson_R_cleaned = TRUE, pearson_P = TRUE,
       euclidean_distance = TRUE, co_peak = TRUE, co_apex = TRUE,
       classifier = c("NB", "SVM", "RF", "LR", "ensemble"), models = 10,
       cv_folds = 10, trees = 500)

profiles 
the coelution profile matrix, or a list of profile matrices
if replicate experiments were performed. Can be a single numeric matrix,
with proteins in rows and fractions in columns, or a list of matrices.
Alternatively, can be provided as a single MSnSet object or a list of MSnSet objects.

gold_standard 
a set of 'gold standard' interactions, used to train the classifier. Can be provided either as an adjacency matrix, in which both rows and columns correspond to protein IDs in the coelution matrix or matrices, or as a list of proteins in the same complex, which will be converted to an adjacency matrix by PrInCE. Zeroes in the adjacency matrix are interpreted by PrInCE as "true negatives" when calculating precision. 
gaussians 
optionally, provide Gaussian mixture models fit by
the 
precision 
optionally, return only interactions above the given
precision; by default, all interactions are returned and the user can
subsequently threshold the list using the

verbose 
if 
min_points 
filter profiles without at least this many total,
nonmissing points; passed to 
min_consecutive 
filter profiles without at least this many
consecutive, nonmissing points; passed to 
impute_NA 
if true, impute single missing values with the average of
neighboring values; passed to clean_profile
smooth 
if true, smooth the chromatogram with a moving average filter;
passed to clean_profile
smooth_width 
width of the moving average filter, in fractions;
passed to clean_profile
max_gaussians 
the maximum number of Gaussians to fit; defaults to 5.
Note that Gaussian mixtures with more parameters than observed (i.e.,
nonzero or NA) points will not be fit. Passed to

max_iterations 
the number of times to try fitting the curve with
different initial conditions; defaults to 50. Passed to

min_R_squared 
the minimum R-squared value to accept when fitting the
curve with different initial conditions; defaults to 0.5. Passed to

method 
the method used to select the initial conditions for
nonlinear least squares optimization (one of "guess" or "random");
see 
criterion 
the criterion to use for model selection;
one of "AICc" (corrected AIC, and default), "AIC", or "BIC". Passed to

pearson_R_raw 
if true, include the Pearson correlation (R) between raw profiles as a feature 
pearson_R_cleaned 
if true, include the Pearson correlation (R) between cleaned profiles as a feature 
pearson_P 
if true, include the P-value of the Pearson correlation between raw profiles as a feature
euclidean_distance 
if true, include the Euclidean distance between cleaned profiles as a feature 
co_peak 
if true, include the 'co-peak score' (that is, the distance, in fractions, between the single highest value of each profile) as a feature
co_apex
if true, include the 'co-apex score' (that is, the minimum Euclidean distance between any pair of fit Gaussians) as a feature
classifier 
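the type of classifier to use: one of "NB" (naive Bayes, the default), "SVM" (support vector machine), "RF" (random forest), "LR", or "ensemble"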
models 
the number of classifiers to train and average across, each with a different k-fold cross-validation split
cv_folds
the number of folds to use for k-fold cross-validation
trees 
for random forests only, the number of trees in the forest 
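Several arguments in the usage above (method, criterion, classifier) take a character vector as their default. The sketch below shows how such defaults resolve to their first element under the common R idiom match.arg, which is consistent with "AICc" and naive Bayes being described as defaults. The function choose_options is hypothetical and is not part of PrInCE; whether PrInCE itself resolves its arguments this way is an assumption.

```r
# Hypothetical helper (not part of PrInCE) mimicking the vector-valued
# defaults that appear in the PrInCE() signature
choose_options <- function(method = c("guess", "random"),
                           criterion = c("AICc", "AIC", "BIC")) {
  # match.arg() picks the first element when the argument is left at its
  # default, and validates any user-supplied value against the choices
  method <- match.arg(method)
  criterion <- match.arg(criterion)
  list(method = method, criterion = criterion)
}

defaults <- choose_options()
custom <- choose_options(criterion = "BIC")
```

Under this idiom, a value outside the listed choices raises an error rather than being silently accepted.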
PrInCE takes as input a coelution matrix, with detected proteins in rows and
fractions as columns, and a set of 'gold standard' true positives and true
negatives. If replicate experiments were performed, a list of coelution
matrices can be provided as input. PrInCE will construct features for each
replicate separately and use features from all replicates as input to the
classifier. The 'gold standard' can be either a data frame or adjacency
matrix of known interactions (and non-interactions), or a list of protein
complexes. For computational convenience, Gaussian mixture models can be
pre-fit to every profile and provided separately to the PrInCE
function. The matrix, or matrices, can be provided to PrInCE either as
numeric matrices or as MSnSet objects.
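To illustrate the two gold standard formats described above, the following base R sketch converts a list of protein complexes into an adjacency matrix. The protein IDs and the helper complexes_to_adjacency are hypothetical, and zeroing the diagonal is an assumption; PrInCE's own conversion may differ in its details.

```r
# Hypothetical complexes; the protein IDs are made up for illustration
complexes <- list(
  complexA = c("P1", "P2", "P3"),
  complexB = c("P3", "P4")
)

# Build a symmetric 0/1 adjacency matrix over all proteins found in any
# complex: 1 = same complex (interaction), 0 = never in the same complex
complexes_to_adjacency <- function(complexes) {
  proteins <- sort(unique(unlist(complexes)))
  adj <- matrix(0, nrow = length(proteins), ncol = length(proteins),
                dimnames = list(proteins, proteins))
  for (members in complexes) {
    adj[members, members] <- 1
  }
  diag(adj) <- 0  # assumption: no self-interactions
  adj
}

adj <- complexes_to_adjacency(complexes)
```

Here P1 and P4 never share a complex, so their entry stays zero and would count as a "true negative" when precision is calculated.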
PrInCE implements three different types of classifiers to predict protein-protein interaction networks, including naive Bayes (the default), random forests, and support vector machines. The classifiers are trained on the gold standards using a ten-fold cross-validation procedure, training on 90% of the data and predicting on the remaining 10%. For protein pairs that are part of the training data, the held-out split is used to assign a classifier score, whereas for the remaining protein pairs, the median of all ten folds is used. Furthermore, to ensure the results are not sensitive to the precise classifier split used, an ensemble of multiple classifiers (ten, by default) is trained, and the classifier score is subsequently averaged across classifiers. PrInCE can also ensemble across a set of classifiers.
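The scoring scheme in the preceding paragraph can be sketched in base R as follows. This is an illustration under stated assumptions, not PrInCE's implementation: the data are simulated, the single feature is made up, cv_scores is a hypothetical helper, and logistic regression via glm stands in for the classifier so that the example needs no extra packages.

```r
set.seed(1)

# Simulated protein pairs with one made-up feature: 30 gold standard
# positives, 30 gold standard negatives, and 20 unlabelled pairs
labelled <- data.frame(
  feature = c(rnorm(30, mean = 1), rnorm(30, mean = -1)),
  label = rep(c(1, 0), each = 30)
)
unlabelled <- data.frame(feature = rnorm(20))

# Hypothetical helper: one round of k-fold cross-validation
cv_scores <- function(labelled, unlabelled, k = 10) {
  # Randomly assign each labelled pair to one of k folds
  fold <- sample(rep(seq_len(k), length.out = nrow(labelled)))
  labelled_score <- numeric(nrow(labelled))
  unlabelled_scores <- matrix(NA_real_, nrow = nrow(unlabelled), ncol = k)
  for (i in seq_len(k)) {
    held_out <- fold == i
    # Stand-in classifier, trained on the other k - 1 folds
    fit <- glm(label ~ feature, family = binomial(),
               data = labelled[!held_out, ])
    # Labelled pairs are scored only by the model that never saw them
    labelled_score[held_out] <- predict(
      fit, newdata = labelled[held_out, , drop = FALSE], type = "response")
    # Unlabelled pairs are scored by every fold's model
    unlabelled_scores[, i] <- predict(fit, newdata = unlabelled,
                                      type = "response")
  }
  # Unlabelled pairs receive the median score across the k models
  list(labelled = labelled_score,
       unlabelled = apply(unlabelled_scores, 1, median))
}

# Repeat with different random splits and average the scores, mirroring
# the role of the models argument
n_models <- 3
runs <- lapply(seq_len(n_models), function(m) cv_scores(labelled, unlabelled))
ensemble_labelled <- rowMeans(sapply(runs, `[[`, "labelled"))
ensemble_unlabelled <- rowMeans(sapply(runs, `[[`, "unlabelled"))
```

Scoring labelled pairs only with held-out models keeps the gold standard from inflating its own scores, while averaging over several splits reduces dependence on any single fold assignment.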
By default, PrInCE calculates six features from each pair of coelution profiles as input to the classifier, including conventional similarity metrics but also several features specifically adapted to coelution proteomics. For example, one such feature is derived from fitting a Gaussian mixture model to each elution profile, then calculating the smallest Euclidean distance between any pair of fitted Gaussians. The complete set of features includes:
the Pearson correlation between raw coelution profiles;
the p-value of the Pearson correlation between raw coelution profiles;
the Pearson correlation between cleaned profiles, which are generated by imputing single missing values with the mean of their neighbors, replacing remaining missing values with random near-zero noise, and smoothing the profiles using a moving average filter (see clean_profile);
the Euclidean distance between cleaned profiles;
the 'co-peak' score, defined as the distance, in fractions, between the maximum values of each profile; and
the 'co-apex' score, defined as the minimum Euclidean distance between any pair of fit Gaussians.
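As a rough illustration of the first five features in the list above, the base R sketch below computes them for two made-up elution profiles. The clean helper is a simplified, hypothetical stand-in for clean_profile (its noise range and its handling of the profile ends are assumptions), the co-peak score is computed on the raw profiles by assumption, and the co-apex score is omitted because it requires fitted Gaussian mixture models.

```r
set.seed(42)

# Two hypothetical elution profiles over 10 fractions (NA = not detected)
a <- c(NA, 0.1, 0.5, 2.0, 5.0, NA, 1.0, 0.2, NA, NA)
b <- c(NA, NA, 0.4, 1.5, 4.0, 3.5, 0.8, NA, NA, NA)

# Simplified cleaning, following the three steps described in the text
clean <- function(x, smooth_width = 4, noise_max = 0.05) {
  n <- length(x)
  # 1. Impute single missing values with the mean of their two neighbours
  gaps <- which(is.na(x))
  gaps <- gaps[gaps > 1 & gaps < n]
  single <- gaps[!is.na(x[gaps - 1]) & !is.na(x[gaps + 1])]
  x[single] <- (x[single - 1] + x[single + 1]) / 2
  # 2. Replace remaining missing values with random near-zero noise
  still_na <- is.na(x)
  x[still_na] <- runif(sum(still_na), min = 0, max = noise_max)
  # 3. Smooth with a centred moving average; the filter leaves NA at the
  #    ends, where the unsmoothed values are kept (an assumption)
  smoothed <- as.numeric(
    stats::filter(x, rep(1 / smooth_width, smooth_width), sides = 2))
  ends <- is.na(smoothed)
  smoothed[ends] <- x[ends]
  smoothed
}

clean_a <- clean(a)
clean_b <- clean(b)

# cor.test() drops fractions in which either raw profile is missing
raw_test <- cor.test(a, b)

features <- list(
  pearson_R_raw = unname(raw_test$estimate),
  pearson_P = raw_test$p.value,
  pearson_R_cleaned = cor(clean_a, clean_b),
  euclidean_distance = sqrt(sum((clean_a - clean_b)^2)),
  # which.max() ignores NA; both profiles peak in fraction 5
  co_peak = abs(which.max(a) - which.max(b))
)
```

In this example the two profiles peak in the same fraction, so the co-peak distance is zero, and the raw correlation is computed from only the four fractions in which both proteins were detected.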
The output of PrInCE is a ranked data frame, containing the classifier score for every possible protein pair. PrInCE also calculates the precision at every point in this ranked list, using the 'gold standard' set of protein complexes or binary interactions. Our recommendation is to select a threshold for the precision and use this to construct an unweighted protein interaction network.
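The precision calculation and thresholding step described above can be sketched in base R. The ranked data frame, its column names, and the scores are hypothetical; ignoring pairs absent from the gold standard and cutting at the last rank that meets the threshold are both assumptions about how such a calculation might be done, not a description of PrInCE's internals.

```r
# Hypothetical ranked output; label is 1 for a gold standard interaction,
# 0 for a gold standard non-interaction, NA for pairs not in the gold standard
ranked <- data.frame(
  protein_A = c("P1", "P1", "P2", "P3", "P2", "P1"),
  protein_B = c("P2", "P3", "P3", "P4", "P4", "P4"),
  score = c(0.95, 0.90, 0.80, 0.60, 0.40, 0.20),
  label = c(1, 1, NA, 0, 1, 0)
)
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]

# Running counts of true and false positives; unlabelled pairs count as neither
is_tp <- !is.na(ranked$label) & ranked$label == 1
is_fp <- !is.na(ranked$label) & ranked$label == 0
tp <- cumsum(is_tp)
fp <- cumsum(is_fp)

# Precision at each point in the ranked list
ranked$precision <- tp / (tp + fp)

# Keep every pair ranked at or above the last position where the
# precision still meets the chosen threshold
threshold <- 0.9
cutoff <- max(which(ranked$precision >= threshold))
network <- ranked[seq_len(cutoff), c("protein_A", "protein_B")]
```

The resulting network is an unweighted edge list. In this toy example it contains the three top-ranked pairs, including one pair that is absent from the gold standard and is therefore a novel prediction.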
a ranked data frame of interacting proteins, with the precision at each point in the list
library(PrInCE)
data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 500 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
ppi <- PrInCE(subset, gold_standard, gaussians = gauss, models = 1,
              cv_folds = 3)
