PrInCE is a computational approach to infer protein-protein interaction networks from co-elution proteomics data, also called co-migration, co-fractionation, or protein correlation profiling. This family of methods separates interacting protein complexes on the basis of their diameter or biochemical properties. Protein-protein interactions can then be inferred for pairs of proteins with similar elution profiles. PrInCE implements a machine-learning approach to identify protein-protein interactions given a set of labelled examples, using features derived exclusively from the data. This allows PrInCE to infer high-quality protein interaction networks from raw proteomics data, without bias towards known interactions or functionally associated proteins, making PrInCE a unique resource for discovery.
PrInCE(profiles, gold_standard, gaussians = NULL, precision = NULL,
       verbose = FALSE, min_points = 1, min_consecutive = 5,
       impute_NA = TRUE, smooth = TRUE, smooth_width = 4,
       max_gaussians = 5, max_iterations = 50, min_R_squared = 0.5,
       method = c("guess", "random"), criterion = c("AICc", "AIC", "BIC"),
       pearson_R_raw = TRUE, pearson_R_cleaned = TRUE, pearson_P = TRUE,
       euclidean_distance = TRUE, co_peak = TRUE, co_apex = TRUE,
       classifier = c("NB", "SVM", "RF", "LR", "ensemble"), models = 10,
       cv_folds = 10, trees = 500)
profiles |
the co-elution profile matrix, or a list of profile matrices
if replicate experiments were performed. Can be a single numeric matrix,
with proteins in rows and fractions in columns, or a list of matrices.
Alternatively, can be provided as a single
|
gold_standard |
a set of 'gold standard' interactions, used to train the classifier. Can be provided either as an adjacency matrix, in which both rows and columns correspond to protein IDs in the co-elution matrix or matrices, or as a list of proteins in the same complex, which will be converted to an adjacency matrix by PrInCE. Zeroes in the adjacency matrix are interpreted by PrInCE as "true negatives" when calculating precision. |
gaussians |
optionally, provide Gaussian mixture models fit by
the |
precision |
optionally, return only interactions above the given
precision; by default, all interactions are returned and the user can
subsequently threshold the list using the
|
verbose |
if |
min_points |
filter profiles without at least this many total,
non-missing points; passed to |
min_consecutive |
filter profiles without at least this many
consecutive, non-missing points; passed to |
impute_NA |
if true, impute single missing values with the average of
neighboring values; passed to |
smooth |
if true, smooth the chromatogram with a moving average filter;
passed to |
smooth_width |
width of the moving average filter, in fractions;
passed to |
max_gaussians |
the maximum number of Gaussians to fit; defaults to 5.
Note that Gaussian mixtures with more parameters than observed (i.e.,
non-zero or NA) points will not be fit. Passed to
|
max_iterations |
the number of times to try fitting the curve with
different initial conditions; defaults to 50. Passed to
|
min_R_squared |
the minimum R-squared value to accept when fitting the
curve with different initial conditions; defaults to 0.5. Passed to
|
method |
the method used to select the initial conditions for
nonlinear least squares optimization (one of "guess" or "random");
see |
criterion |
the criterion to use for model selection;
one of "AICc" (corrected AIC, and default), "AIC", or "BIC". Passed to
|
pearson_R_raw |
if true, include the Pearson correlation (R) between raw profiles as a feature |
pearson_R_cleaned |
if true, include the Pearson correlation (R) between cleaned profiles as a feature |
pearson_P |
if true, include the P-value of the Pearson correlation between raw profiles as a feature |
euclidean_distance |
if true, include the Euclidean distance between cleaned profiles as a feature |
co_peak |
if true, include the 'co-peak score' (that is, the distance, in fractions, between the single highest value of each profile) as a feature |
co_apex |
if true, include the 'co-apex score' (that is, the minimum Euclidean distance between any pair of fit Gaussians) as a feature |
classifier |
the type of classifier to use: one of |
models |
the number of classifiers to train and average across, each with a different k-fold cross-validation split |
cv_folds |
the number of folds to use for k-fold cross-validation |
trees |
for random forests only, the number of trees in the forest |
PrInCE takes as input a co-elution matrix, with detected proteins in rows and
fractions as columns, and a set of 'gold standard' true positives and true
negatives. If replicate experiments were performed, a list of co-elution
matrices can be provided as input. PrInCE will construct features for each
replicate separately and use features from all replicates as input to the
classifier. The 'gold standard' can be either a data frame or adjacency
matrix of known interactions (and non-interactions), or a list of protein
complexes. For computational convenience, Gaussian mixture models can be
pre-fit to every profile and provided separately to the PrInCE
function. The matrix, or matrices, can be provided to PrInCE either as
numeric matrices or as MSnSet objects.
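As an illustration of the two gold-standard formats described above, the following base R sketch converts a list of complexes into an adjacency matrix. The helper name `complexes_to_adjacency` and the toy protein IDs are hypothetical and are not part of the PrInCE API; the conversion PrInCE performs internally may differ in its details.

```r
# Hypothetical helper (not a PrInCE function): turn a list of protein
# complexes into a symmetric 0/1 adjacency matrix.
complexes_to_adjacency <- function(complexes) {
  proteins <- sort(unique(unlist(complexes)))
  adj <- matrix(0, nrow = length(proteins), ncol = length(proteins),
                dimnames = list(proteins, proteins))
  for (members in complexes) {
    members <- unique(members)
    # every pair of proteins in the same complex is a positive example
    adj[members, members] <- 1
  }
  diag(adj) <- 0  # ignore self-interactions
  adj
}

# Toy complexes with made-up protein IDs
complexes <- list(c1 = c("A", "B", "C"), c2 = c("C", "D"))
adj <- complexes_to_adjacency(complexes)
print(adj)
```

In this sketch, pairs of proteins that never share a complex remain zero, which corresponds to the "true negatives" described for the gold_standard argument.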
PrInCE implements several types of classifiers to predict protein-protein interaction networks, including naive Bayes (the default), random forests, and support vector machines. The classifiers are trained on the gold standards using a ten-fold cross-validation procedure, training on 90% of the gold-standard pairs and holding out the remaining 10% in each fold. For protein pairs that are part of the training data, the held-out split is used to assign a classifier score, whereas for the remaining protein pairs, the median of all ten folds is used. Furthermore, to ensure the results are not sensitive to the precise cross-validation split used, multiple classifiers (ten, by default) are trained, each with a different split, and the classifier score is subsequently averaged across them. PrInCE can also ensemble across a set of classifier types (classifier = "ensemble").
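The cross-validation scoring scheme described above can be sketched in base R. This is only an illustration under stated assumptions: a logistic regression (`glm`) stands in for the naive Bayes default, the data are simulated with a single made-up feature, and the names `pair_data` and `cv_scores` are hypothetical rather than taken from PrInCE.

```r
set.seed(42)

# Simulated protein pairs with one feature; label is 1/0 for gold-standard
# pairs and NA for pairs outside the gold standard (all values made up).
n <- 200
pair_data <- data.frame(feature = rnorm(n))
pair_data$label <- rbinom(n, 1, plogis(2 * pair_data$feature))
pair_data$label[sample(n, 80)] <- NA

# One round of k-fold cross-validation (hypothetical helper).
cv_scores <- function(pair_data, cv_folds = 10) {
  labelled <- which(!is.na(pair_data$label))
  # randomly assign each labelled pair to one fold
  fold <- sample(rep(seq_len(cv_folds), length.out = length(labelled)))
  preds <- matrix(NA_real_, nrow = nrow(pair_data), ncol = cv_folds)
  for (k in seq_len(cv_folds)) {
    train <- labelled[fold != k]
    fit <- glm(label ~ feature, data = pair_data[train, ],
               family = binomial())
    preds[, k] <- predict(fit, newdata = pair_data, type = "response")
  }
  # unlabelled pairs: median score over all folds
  score <- apply(preds, 1, median)
  # labelled pairs: score from the fold in which the pair was held out
  score[labelled] <- preds[cbind(labelled, fold)]
  score
}

# Average over several models, each with a different random split
models <- 3
scores <- rowMeans(replicate(models, cv_scores(pair_data)))
head(scores)
```

Scoring labelled pairs only with the model that did not see them avoids rewarding the classifier for memorizing its own training examples.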
By default, PrInCE calculates six features from each pair of co-elution profiles as input to the classifier, including conventional similarity metrics but also several features specifically adapted to co-elution proteomics. For example, one such feature is derived from fitting a Gaussian mixture model to each elution profile, then calculating the smallest Euclidean distance between any pair of fitted Gaussians. The complete set of features includes:
the Pearson correlation between raw co-elution profiles;
the p-value of the Pearson correlation between raw co-elution profiles;
the Pearson correlation between cleaned profiles, which are generated
by imputing single missing values with the mean of their neighbors,
replacing remaining missing values with random near-zero noise, and
smoothing the profiles using a moving average filter (see clean_profile);
the Euclidean distance between cleaned profiles;
the 'co-peak' score, defined as the distance, in fractions, between the maximum values of each profile; and
the 'co-apex' score, defined as the minimum Euclidean distance between any pair of fit Gaussians.
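To make the feature definitions above concrete, here is a minimal base R sketch that computes five of the six features for one pair of toy profiles. It is not the package's implementation: `clean_sketch` and `pair_features` are hypothetical helpers, the centred smoothing window (`half_width`) and the noise range (`noise_floor`) are assumptions, and the co-peak score is computed on the raw profiles, which the text above does not specify. The co-apex score is omitted because it requires fitted Gaussian mixture models.

```r
set.seed(1)

# Hypothetical cleaning step: impute isolated NAs, fill the remaining NAs
# with near-zero noise, then smooth with a centred moving average.
clean_sketch <- function(x, half_width = 2, noise_floor = 0.05) {
  n <- length(x)
  missing <- is.na(x)
  # single missing values flanked by two observed neighbours
  single <- which(missing)
  single <- single[single > 1 & single < n]
  single <- single[!missing[single - 1] & !missing[single + 1]]
  x[single] <- (x[single - 1] + x[single + 1]) / 2
  # remaining missing values become random near-zero noise
  still <- is.na(x)
  x[still] <- runif(sum(still), 0, noise_floor)
  # moving average, with the window truncated at the profile edges
  vapply(seq_len(n), function(i) {
    mean(x[max(1, i - half_width):min(n, i + half_width)])
  }, numeric(1))
}

# Hypothetical feature calculation for one pair of profiles.
pair_features <- function(a, b) {
  raw_test <- cor.test(a, b)  # drops fractions where either value is NA
  ca <- clean_sketch(a)
  cb <- clean_sketch(b)
  c(pearson_R_raw = unname(raw_test$estimate),
    pearson_P = raw_test$p.value,
    pearson_R_cleaned = cor(ca, cb),
    euclidean_distance = sqrt(sum((ca - cb)^2)),
    co_peak = abs(which.max(a) - which.max(b)))
}

# Two made-up elution profiles across ten fractions
a <- c(0, 1, 4, 9, NA, 3, 1, 0, NA, NA)
b <- c(0, 0, 2, 5, 8, 4, NA, 1, 0, 0)
features <- pair_features(a, b)
print(features)
```

In the full method, a feature vector like this would be computed for every protein pair in each replicate and then passed to the classifier.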
The output of PrInCE is a ranked data frame, containing the classifier score for every possible protein pair. PrInCE also calculates the precision at every point in this ranked list, using the 'gold standard' set of protein complexes or binary interactions. Our recommendation is to select a threshold for the precision and use this to construct an unweighted protein interaction network.
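The precision calculation and thresholding described above can be illustrated with a small base R sketch. The column names (`protein_A`, `protein_B`, `score`, `label`, `precision`) and the toy values are assumptions made for illustration, and the rule of keeping every pair up to the last rank at which precision meets the threshold is one possible convention, not necessarily the one PrInCE applies.

```r
# Toy ranked list: label is 1 for gold-standard interactions, 0 for
# gold-standard non-interactions, and NA for pairs outside the gold standard.
ranked <- data.frame(
  protein_A = c("A", "A", "B", "C", "D", "E"),
  protein_B = c("B", "C", "C", "D", "E", "F"),
  score = c(0.95, 0.90, 0.80, 0.60, 0.40, 0.20),
  label = c(1, 1, NA, 0, 1, 0)
)
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]

# Cumulative true and false positives, counting only labelled pairs
tp <- cumsum(!is.na(ranked$label) & ranked$label == 1)
fp <- cumsum(!is.na(ranked$label) & ranked$label == 0)
ranked$precision <- tp / (tp + fp)

# Keep all pairs up to the last rank where precision is at least 0.7
threshold <- 0.7
last <- max(which(ranked$precision >= threshold))
network <- ranked[seq_len(last), c("protein_A", "protein_B")]
print(network)
```

The resulting two-column edge list can be treated as an unweighted network, since the classifier scores are dropped once the threshold has been applied.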
a ranked data frame of interacting proteins, with the precision at each point in the list
library(PrInCE)
# load the example dataset, pre-fit Gaussians, and gold-standard complexes
data(scott)
data(scott_gaussians)
data(gold_standard)
# analyze only the first 500 profiles
subset <- scott[seq_len(500), ]
gauss <- scott_gaussians[names(scott_gaussians) %in% rownames(subset)]
# a single model with three folds keeps the example fast
ppi <- PrInCE(subset, gold_standard, gaussians = gauss, models = 1,
              cv_folds = 3)