knitr::opts_chunk$set(warning = FALSE, fig.width = 8, fig.height = 6)
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::include_graphics("pmartR_logo_final.jpg")

library(pmartR)
library(pmartRdata)
library(ggplot2)

This vignette describes a standard processing pipeline for labeled proteomics data with the pmartR package, including normalization to the reference pool, quality control, global normalization, and protein rollup. After normalization to the reference pool, the pipeline is identical to a typical processing pipeline for unlabeled or global proteomics data, and very similar to processing for metabolomics and lipidomics data. A separate vignette, "RNAseq Data Processing", describes the typical processing steps for seqData objects. For more details about the statistical analyses available within pmartR see the "Statistical Analysis with pmartR" vignette.

Labeled Proteomics Data

Below we load in 3 example data frames from the pmartRdata package in order to create an isobaricpepData object. For more details on omicsData object creation, see the "Data Object Creation" vignette. These data consist of samples from three different strains of a virus, plus pooled reference samples. Since this is only a subset of data from this study, the number of samples is not identical across TMT plexes, and the TMT plex names are not sequential.

# load example e_data, f_data, e_meta for the labeled peptide dataset
edata <- isobaric_edata
fdata <- isobaric_fdata
emeta <- isobaric_emeta

For these data, the column named "Peptide" denotes the peptide identifier in e_data.

head(edata)

The f_data data frame contains the sample identifier, "SampleID", mapping to the column names in e_data other than "Peptide". The TMT plex on which each sample was run is also included in f_data, as this will be required for normalization to the reference pool. The run order, which was spread across multiple days in this case, is also included in f_data although this information will not be used in our analyses. The column for "Virus" takes on 1 of 3 values, and will be used as our main effect, since comparisons of interest in this experiment are between virus strains. Finally, f_data contains information about the Donor and Replicate, which describe the provenance of the samples and biological replicate information.

head(fdata)

The e_meta data frame contains just 2 columns, "Protein" and "Peptide". The "Peptide" column contains the same peptides found in e_data, and "Protein" provides the mapping of the peptides to the protein level.

head(emeta)

Since these are isobaric labeled (TMT) peptide data, we will create an isobaricpepData object in pmartR. Next, we log2 transform the data.

Log transforming the data prior to analysis is highly recommended, and pmartR supports log2, log10, and natural logarithm transformations. The edata_transform() function provides this capability. We can also use edata_transform() to transform the data back to the abundance scale if needed. Note that the scale of the data is stored and automatically updated in the data_info$data_scale attribute of the omicsData object.

mypep <- as.isobaricpepData(
  e_data = edata,
  f_data = fdata,
  e_meta = emeta,
  edata_cname = "Peptide",
  fdata_cname = "SampleID",
  emeta_cname = "Protein",
  data_scale = "abundance"
)

# log2 transform the data
mypep <- edata_transform(omicsData = mypep, data_scale = "log2")
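If we want to confirm the current scale at any point, we can inspect the data_info attribute mentioned above. The lines below are a quick sketch; the commented call shows how the data could be transformed back to the abundance scale, though we keep the log2 scale for the rest of this vignette.

# check the current scale of the data, stored in the data_info attribute
attributes(mypep)$data_info$data_scale

# not run: transform back to the abundance scale if ever needed
# mypep <- edata_transform(omicsData = mypep, data_scale = "abundance")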

We can use the summary function to get basic summary information for our isobaricpepData object.

summary(mypep)

Normalize to Reference Pool

One of the first steps in analyzing labeled proteomics data is to normalize to the reference pool samples. We use the pmartR normalize_isobaric function with apply_norm = FALSE to first create an isobaricnormRes object without applying the normalization. This allows us to examine the reference pool samples prior to using them for this normalization step.

# Don't apply the normalization quite yet; can use summary() and plot() to view reference pool samples
mypep_refpools <- normalize_isobaric(
  omicsData = mypep,
  exp_cname = "Plex",
  apply_norm = FALSE,
  refpool_cname = "Virus",
  refpool_notation = "Pool"
)

We can use the summary and plot functions on our isobaricnormRes object to make sure that there isn't anything concerning about these samples.

summary(mypep_refpools)

plot(mypep_refpools)

When we're ready to actually perform the reference pool normalization, we do so by setting the apply_norm argument to TRUE.

# Now apply the normalization; can use plot() to view the study samples after reference pool normalization
mypep <- normalize_isobaric(
  omicsData = mypep,
  exp_cname = "Plex",
  apply_norm = TRUE,
  refpool_cname = "Virus",
  refpool_notation = "Pool"
)

We can plot the samples after normalization to the reference pool.

plot(mypep)

Workflow

Once the omicsData object is created, and after normalization to the reference pool samples for labeled peptide data, a typical QC workflow follows the figure below.

knitr::include_graphics("pmartR_graphic-final.jpg")

Main Effects

We are preparing these data for statistical analysis in which we will compare the samples belonging to one group to those belonging to another, so we must specify the group membership of the samples. We do this using the group_designation() function, which modifies our pepData object and returns an updated version of it. Up to two main effects and up to two covariates may be specified, with at least one main effect required (a covariate example is sketched at the end of this subsection). For these example data, we specify the main effect to be "Virus" so that we can make comparisons between the different strains. Certain functions used below require that groups have been designated via group_designation(), and doing so makes the results of plot and summary more meaningful.

mypep <- group_designation(omicsData = mypep, main_effects = "Virus")
summary(mypep)
plot(mypep, order_by = "Virus", color_by = "Virus")

The group_designation() function creates an attribute of the dataset as follows:

attributes(mypep)$group_DF
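Purely for illustration (not run), a covariate could be supplied in the same call; Donor is shown here only because it exists in f_data, and, as noted in the Statistical Comparisons section, pmartR's tests do not model repeated measurements on Donors.

# not run, illustration only: supply Donor as a covariate alongside the main effect
# mypep_cov <- group_designation(omicsData = mypep, main_effects = "Virus", covariates = "Donor")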

Filter Biomolecules

It is often good practice to filter out biomolecules that do not meet certain criteria, and we offer up to 5 different filters to do this. Each of the filter functions calculates metric(s) that can be used to filter out biomolecules and returns an S3 object. Using the summary() function produces a summary of the metric(s) and using the plot() function produces a graph. Filters that require a user-specified threshold in order to actually filter out biomolecules have corresponding summary and plot methods that take optional argument(s) to specify that threshold. Once one of the filter functions has been called, the results of that function can be used in conjunction with the applyFilt() function to actually filter out peptides based on the metric(s) and user-specified threshold(s) and create a new, filtered omicsData object.

Proteomics Filter

The proteomics filter is applicable only to peptide level data (isobaricpepData or pepData objects) that contain the e_meta component, as this filter counts the number of peptides that map to each protein and/or the number of proteins to which each individual peptide maps.

Using the argument degen_peps = TRUE, we see from the summary that each peptide maps to a single protein, so there are no redundant peptides to remove.

myfilter <- proteomics_filter(mypep)
summary(myfilter, degen_peps = TRUE)
plot(myfilter, bw_theme = TRUE)

Using the argument min_num_peps, we can restrict our dataset to just those proteins with a minimum number of peptides mapping to them. A common threshold to use is 2, as this ensures that only proteins with at least 2 peptides mapping to them are retained. We can apply the proteomics filter using the applyFilt function.

summary(myfilter, min_num_peps = 2)

# apply the filter, requiring proteins to have at least 2 peptides that map to them
mypep <- applyFilt(filter_object = myfilter, omicsData = mypep, min_num_peps = 2)

Molecule Filter

The molecule filter allows the user to remove from the dataset any biomolecule not observed in at least min_num samples; the default threshold is 2. See the "Filter Functionality" vignette for more detail about this filter, including how to account for main effects and/or batches (a brief sketch appears at the end of this subsection).

myfilter <- molecule_filter(mypep)
summary(myfilter, min_num = 3)
plot(myfilter, bw_theme = TRUE)

Setting the threshold to 3, we would filter 60,349 peptides out of the dataset. If we'd like to make the filter less stringent, we could use a threshold of 2. We use the applyFilt function to apply the molecule filter.

summary(myfilter, min_num = 2)

mypep <- applyFilt(filter_object = myfilter, mypep, min_num = 2)
summary(mypep)
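As mentioned above, observation counts can also be made within main effect groups or batches. The sketch below is not run, and the use_groups argument name is taken from our reading of the Filter Functionality vignette; consult that vignette before relying on it.

# not run, sketch only: count observations within the Virus groups rather than
# across all samples (requires group_designation() to have been run first)
# myfilter_grp <- molecule_filter(omicsData = mypep, use_groups = TRUE)
# summary(myfilter_grp, min_num = 2)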

Coefficient of Variation Filter

The coefficient of variation (CV) filter calculates the pooled CV values as in [@ahmed1995pooling].

The user can then specify a CV threshold, above which peptides are removed.

myfilter <- cv_filter(mypep)
summary(myfilter, cv_threshold = 140)
plot(myfilter, cv_threshold = 140)

mypep <- applyFilt(filter_object = myfilter, omicsData = mypep, cv_threshold = 140)
summary(mypep)

Custom Filter

Sometimes it is known a priori that certain peptides or samples should be filtered out of the dataset prior to statistical analysis. Perhaps there are known contaminant proteins, and so peptides mapping to them should be removed, or perhaps something went wrong during sample preparation for a particular sample. On the other hand, it may be preferred to specify peptides or samples to keep (removing those not explicitly specified), and this can also be accomplished; a 'keep'-style sketch appears at the end of this subsection. Keep in mind that 'remove' and 'keep' arguments cannot be mixed: a single call to custom_filter() may specify only 'remove' arguments or only 'keep' arguments.

Here, we demonstrate the removal of contaminant proteins and their associated peptides; the contaminant protein names contain "XXX".

remove_prots <- mypep$e_meta$Protein[grep("XXX", mypep$e_meta$Protein)]
myfilter <- custom_filter(mypep, e_meta_remove = remove_prots)
summary(myfilter)

mypep <- applyFilt(filter_object = myfilter, omicsData = mypep)

Note that there is a summary() method for objects of type custom_filt, but no plot() method.
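For completeness, the equivalent 'keep'-style call is sketched below (not run); it lists the proteins to retain rather than those to remove, using the e_meta_keep argument.

# not run, hypothetical example: keep the non-contaminant proteins by listing them explicitly
# keep_prots <- setdiff(unique(emeta$Protein), remove_prots)
# myfilter_keep <- custom_filter(mypep, e_meta_keep = keep_prots)
# summary(myfilter_keep)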

IMD-ANOVA Filter

The IMD-ANOVA filter removes biomolecules that do not have sufficient data for the statistical tests available in pmartR; these are ANOVA (quantitative test) and an independence of missing data (IMD) using a g-test (qualitative test). When using the summary(), plot(), and applyFilt() functions, you can specify just one filter (ANOVA or IMD) or both, depending on the tests you'd like to perform later.

First we create the filter object.

myfilter <- imdanova_filter(mypep)

Here we consider what the filter would look like for both ANOVA and IMD.

summary(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)

Here we consider what the filter would look like for just ANOVA.

summary(myfilter, min_nonmiss_anova = 2)

Here we consider what the filter would look like for just IMD.

summary(myfilter, min_nonmiss_gtest = 3)

Now we apply the filter for both ANOVA and IMD.

mypep <- applyFilt(filter_object = myfilter, omicsData = mypep, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)

summary(mypep)

Filter Samples

To identify any samples that are potential outliers or anomalies (due to sample quality, preparation, or processing circumstances), we use a robust Mahalanobis distance (rMd) [@matzke2011improved] score based on 2-5 metrics. The possible metrics are Correlation, Proportion_Missing, MAD (median absolute deviation), Skewness, and Kurtosis.

The rMd score can be mapped to a p-value, and a p-value threshold used to identify potentially outlying samples. In general, for proteomics data we recommend using all 5 metrics (or sometimes leaving out Kurtosis). A plot of the rMd values for each sample is generated, and specifying a value for 'pvalue_threshold' results in a horizontal line on the plot, with samples above the line slated for filtering at the given threshold.

myfilter <- rmd_filter(omicsData = mypep, metrics = c("Correlation", "Proportion_Missing", "MAD", "Skewness"))
plot(myfilter)

summary(myfilter, pvalue_threshold = 0.0001)
plot(myfilter, pvalue_threshold = 0.0001, bw_theme = TRUE)

Using the output from the summary() function, we can explore the potential outliers identified to see which metrics are driving their "outlier-ness". Box plots for each metric are graphed, with the specified sample marked with an X.

# get vector of potential outliers
potential_outliers <- summary(myfilter, pvalue_threshold = 0.0001)$filtered_samples

# loop over potential outliers to generate plots
if (length(potential_outliers) > 0) {
  for (outlier in potential_outliers) {
    print(plot(myfilter, sampleID = outlier))
  }
}

StrainB_D3_R4 appears to have been flagged as a potential outlier primarily on the basis of its MAD relative to the other samples. StrainA_D3_R2 and StrainB_D3_R2 have similar profiles across the four metrics, with high amounts of missing data, low correlations, and high MAD and skew.

We can use this information to help determine whether to remove any of these samples. It is often useful to look at additional data summaries to inform outlier removal, such as a correlation heat map or principal components analysis (PCA) plot.

In the correlation heatmap we are looking to see if any samples stand out as having low correlation with a majority of the others; any such samples would appear as dark stripes. We use the interactive = TRUE argument so we can hover over the plot and see which samples correspond to the cells in the heatmap. None of the samples stands out as concerning in the correlation heatmap for these data.

mycor <- cor_result(omicsData = mypep)
plot(mycor, interactive = TRUE)

The PCA function in pmartR utilizes probabilistic PCA [@stacklies2007pcamethods], which allows for missing data. In the PCA plot we are interested to see whether any of the potential outliers identified by the rMd filter do not cluster with their main effect group. Note that for large proteomics datasets, such as this example, generating the PCA results may take up to a minute or longer.

Using the interactive = TRUE argument to the plot function, we can see that the three potential outliers identified by the rMd filter are the same samples that do not cluster with the others (paying attention to the R^2 values on the axes).

mypca <- dim_reduction(omicsData = mypep)

plot(mypca, interactive = TRUE)

Given the above exploratory analyses, we'd recommend removing all three samples identified by the rMd filter. Note that if we had only wanted to remove a subset of those samples, we could do so with a custom filter, as opposed to applying the rMd filter.
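For instance, removing only one of the flagged samples might look like the sketch below (not run; we proceed with the rMd filter instead), with f_data_remove taking the sample identifier to drop.

# not run, sketch only: remove a single flagged sample rather than all three
# subset_filter <- custom_filter(mypep, f_data_remove = "StrainB_D3_R4")
# mypep <- applyFilt(filter_object = subset_filter, omicsData = mypep)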

mypep <- applyFilt(filter_object = myfilter, omicsData = mypep, pvalue_threshold = 0.0001)
summary(mypep)

Summarize Data

In addition to the correlation heatmap and PCA plot that we utilized above, the pmartR package contains additional methods for data summarization and exploration that can be used as part of the QC process: numeric summaries and associated plots (via the edata_summary function), missing data summarization (via missingval_result, plot_missingval, missingval_scatterplot and missingval_heatmap).

Numeric Summaries

We can generate numeric summaries of our data by either sample or molecule. The edata_summary function computes the mean, standard deviation, median, percentage of non-missing observations, minimum value, and maximum value.

edata_summary(mypep, by = "sample", groupvar = NULL)

# edata_summary(mypep, by = "molecule", groupvar = "Virus")
# edata_summary(mypep, by = "molecule", groupvar = NULL)

Missing Data

Patterns of missing data can be explored using the missingval_result function together with its associated plot method.

results <- missingval_result(omicsData = mypep)

plot(results, omicsData = mypep, plot_type = "bar")
plot(results, omicsData = mypep, plot_type = "scatter")

Normalize Data

After quality control, filtering, and exploring our data, the next step in a typical workflow is to normalize the data. See the "Data Normalization" vignette for more details about the normalization approaches included in pmartR.

Since we are working with a large peptide-level dataset (after reference pool normalization, labeled peptide data are processed just like global proteomics data), we will use the SPANS algorithm to guide our selection of a normalization approach and then normalize the data. Datasets with only a couple hundred biomolecules or fewer are not candidates for SPANS. Note that running SPANS can take up to a couple of minutes or more.

# returns a data frame arranged by descending SPANS score
# not run due to long runtime; SPANS plot generated and saved to include in vignette
spans_result <- spans_procedure(mypep)

plot(spans_result)

Plotting the SPANS object returns a heatmap where each cell in the grid corresponds to one combination of subset and normalization function. A dot in a cell indicates that the combination had the highest SPANS score; ties are possible, as we see below.

knitr::include_graphics("SPANS_isobaricpep.png")

Particularly when there are ties in the SPANS scores, it can be useful to see how many biomolecules are part of the subset(s) being used. Basing a normalization on too few biomolecules, say fewer than 10%, is not advised. The column in the SPANSRes object called "mols_used_in_norm" provides the number of biomolecules in the subset. In our case, the subset RIP (0.25) contains 20753 peptides, so we will select this subset. Since all normalization functions resulted in the same SPANS score for this subset, we will choose median.
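A quick way to check these values is to look at the top of the SPANSRes data frame, as in the minimal sketch below; only the mols_used_in_norm column name is taken from the text above, and other column names may differ.

# the data frame is arranged by descending SPANS score, so the top rows show the
# best-scoring subset/normalization combinations and their biomolecule counts
head(spans_result)
head(spans_result$mols_used_in_norm)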

Similar to normalizing to the reference pool, we will first create a normRes object so we can see the effect of applying the specified normalization ahead of actually applying it to our data.


mynorm <- normalize_global(
  omicsData = mypep,
  subset_fn = "rip", params = list(rip = 0.25),
  norm_fn = "median",
  apply_norm = FALSE
)

# plot(mynorm) # currently errors out

Now that we are satisfied with our selections, we can apply the normalization.

# apply the normalization
mynorm <- normalize_global(omicsData = mypep, 
                           subset_fn = "rip", params = list(rip = 0.25), 
                           norm_fn = "median", 
                           apply_norm = TRUE)

At this point it may be of interest to re-evaluate some of the earlier exploratory analyses or plots.
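For example, the boxplots and dimension reduction introduced earlier can simply be re-run on the normalized object; this is a brief sketch using only functions already shown above.

# boxplots and PCA on the normalized data
plot(mynorm, order_by = "Virus", color_by = "Virus")
mypca_norm <- dim_reduction(omicsData = mynorm)
plot(mypca_norm)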

Protein Quantification Methods

Protein quantification can be done using the protein_quant function, either with or without accounting for different isoforms of the proteins, which are also called 'proteoforms'. Regardless of whether the user is accounting for protein isoforms, they must specify a method for rolling peptides up to proteins (one of "rollup", "rrollup", "qrollup", or "zrollup") and a function to use for combining the peptide-level data (either "mean" or "median"). When the user does wish to account for protein isoforms, they may utilize Bayesian Proteoform Quantification (BP-Quant) [@webb2014bayesian] via the bpquant function, and input those results as one of the arguments to the protein_quant function.

The protein_quant function takes in an object of class pepData (or isobaricpepData) and returns an object of class proData.

Rollup Methods

The rollup method takes either the mean or median of all peptides mapping to a given protein and sets that value as the relative protein abundance; no scaling is involved.

In the rrollup method, all peptides that map to a single protein are scaled based on a chosen reference peptide, which is the peptide with the most observations. The mean or median of the scaled peptides is then set as the relative protein abundance [@matzke2013comparative].

The qrollup method starts with all peptides that map to a single protein. Peptides are then retained according to an abundance cutoff value, and the mean or median of the retained peptides is set as the relative protein abundance.

In the zrollup method, scaling is done similarly to the z-score formula (except with medians instead of means), as described in the DAnTE article [@polpitiya2008dante]. The scaling formula is applied to the peptides that map to a single protein, and then the mean or median of the scaled peptides is set as the relative protein abundance.

Here we utilize rrollup.

mypro <- protein_quant(
  pepData = mypep,
  method = "rrollup",
  combine_fn = "median"
)

plot(mypro, order_by = "Virus", color_by = "Virus")
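If qrollup had been preferred, the call might look like the sketch below (not run). The abundance-cutoff argument name shown, qrollup_thresh, is our assumption; check ?protein_quant before use.

# not run, sketch only: qrollup with an assumed cutoff argument name
# mypro_q <- protein_quant(
#   pepData = mypep,
#   method = "qrollup",
#   qrollup_thresh = 0.75,  # assumed argument name and value; see ?protein_quant
#   combine_fn = "median"
# )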

Statistical Comparisons

Now that we have data at the protein level, we will perform the comparisons of interest for these data, namely all pairwise comparisons between the virus strains. Note that this dataset is included in the pmartRdata package and used as an example only. Information about Donor is included for exemplar usage of a second main effect, if desired. A valid statistical analysis for an experiment that involves Donors, or other units with repeated measurements, cannot be performed within pmartR. To account for random or mixed effects, please consult with a statistician as needed; we commonly use the lme4 package for this type of statistical analysis [@lme4].

To make all pairwise comparisons between our 3 virus strains using both ANOVA and the IMD-test (g-test), we can use the imd_anova function as follows, with its default settings (the most important default to note is that no p-value adjustment is applied). The summary function provides an overview of the results of these tests.

mystats <- imd_anova(omicsData = mypro, test_method = "combined")

summary(mystats)

We can also use plot to obtain graphical results.

plot(mystats, plot_type = "volcano")
plot(mystats, plot_type = "bar")
plot(mystats, plot_type = "gheatmap") # exclude this one?
plot(mystats, plot_type = "fcheatmap")

See ?imd_anova for more details and options.

References


