knitr::opts_chunk$set(collapse = T, comment = "#", cache = T, cache.path = '.cache/', fig.path = '.figure/')










1. Introduction

What is labelpepmatch?

Labelpepmatch (lpm) is a package for the analysis and visualisation of labelled peptide mass spectrometry data. The labels should be stable isotopic tags, and the data should be peak lists of chromatography-MS, with peaks matched between runs (replicates). Since peptidomics datasets are complex and large, calculation times can be long; to counter this, labelpepmatch is equipped with powerful parallel processing features.


After installation, labelpepmatch can be loaded:

library(labelpepmatch)
set.seed(7)

Labelpepmatch contains a number of functions that will take you from a list of peaks (mass/charge, retention time, quantity) to a list of deconvoluted peak pairs, with the possibility of identification, visualisation and statistical analysis. Since different mass spectrometers produce different output, and since numerous good software packages for peak picking and chromatogram alignment already exist, this package offers a pipeline that starts where peak picking software stops. The input for this package is a list of peaks with their mass/charge ratio, their retention time and some quantitative measure (peak intensity or abundance). For easy data import, we provide two adaptors for common data formats (standard lpm and Progenesis LC-MS).
This vignette shows how the package works, based on a real life example of neuropeptides extracted from the brain of the desert locust (Schistocerca gregaria), using the stable isotopic tag TMAB (trimethyl ammonium butyric acid). However, any source of peptides, and any stable isotopic tag can be used.

Some agreements on nomenclature

Every detected analyte within a mass spectrum will be called a feature. When two features appear to originate from the same peptide and only differ by their label, they will be called a peak pair.
When looking for differences between two peptidomes, we compare a measure of the quantity of peptides between conditions. Different mass spectrometers measure this quantity in different ways (e.g. centroid vs profile). Sometimes the best measure is the height of peaks (intensity), sometimes the area under peaks is used (abundance). In this document, this measure will always be called quantity, be it intensity or abundance. Output of peak picking software will often contain both; here, the preferred quantity measure can be given as an input parameter to the read functions. Throughout the entire pipeline, samples where the first condition ("sol" in our example) is labelled with the light label are called forward, and samples where the first condition is labelled with the heavy label are called reverse (see the design vector below). When we use limma, the light labelled peptides are called "red" (internally coded as labelled with cy5) and the heavy labelled peptides "green" (internally coded as labelled with cy3).

2. Reading in data

Data from different sources

Within the package, the data class lpm_input is defined as an object that contains all the data, plus metadata such as the number of runs, the sample names and the experiment design. The lpm_input object is the starting point of the peak pair detection pipeline. It can be generated directly using the function read.labelpepmatch, or, for standard output of Progenesis LC-MS, with the adaptor read.progenesis. It is also possible to generate an lpm_input object from matched peak lists (an xcmsSet object) of the R package xcms. Users of other peak picking software are kindly invited to contribute and extend this package with extra adaptors for other data formats.
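For Progenesis exports, the call would look along these lines (a hypothetical sketch, not executed; the exact arguments of read.progenesis are an assumption here and should be checked in its documentation):

# Hypothetical sketch (not executed): arguments assumed to mirror
# read.labelpepmatch; see ?read.progenesis for the real signature.
# indata<-"/filepath/to/progenesis_export.csv"
# mydata<-read.progenesis(indata,designvector=designvector,samplenames=samplenames)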

The design vector

An important input parameter for the read functions is the design vector. This is a vector that contains in this exact order: name of first condition, name of second condition, labelling order of first sample, labelling order of second sample, ... labelling order of last sample. The labelling order can either be "F" (forward, first condition is light), or "R" (reverse, first condition is heavy).
For our locust dataset, the design vector looks like c("sol","greg","F","R","F","R","F","R","F","R"). This means that in the first sample the "sol" condition is labelled with the light label, in the second sample the "greg" condition is labelled with the light label, and so on. We would like to emphasize the importance of good experimental design with a balanced label swap. This is critical, since there is no way to tell label and condition effects apart if they are collinear.
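In code, the design vector for this dataset is simply:

# Design vector: condition names, then the labelling order per sample
designvector<-c("sol","greg","F","R","F","R","F","R","F","R")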

Reading in an example dataset

For now, we will simply use the example dataset^[Neuropeptides extracted from the corpora cardiaca of fifth instar nymphs of the desert locust Schistocerca gregaria under two rearing conditions: solitarious and gregarious, analyzed with nano-LC-ESI-Q-TOF. For details, see Verdonck et al.] that is available within the package. If you want to read in your own data, use one of the read functions mentioned above.

schistocerca_tmab<-locustdata

# Of course, when reading in data on your own computer,
# reading in data would look something like this:
# (code not executed)

# indata      <-"/filepath/to/dataset.csv/"

# samplenames <-c("A","B","C","D","E","F","G","H")

# designvector<-c("sol","greg","F","F","R","R","F","F","R","R")

# schistocerca_tmab<-read.labelpepmatch(indata,designvector=designvector,samplenames=samplenames)

Now, let's see what an lpm_input object looks like. As you will notice, this data is of class lpm_input, and it contains a data frame schistocerca_tmab$frame holding all the data, accompanied by a namekey and the design vector.

class(schistocerca_tmab)
summary(schistocerca_tmab)
colnames(schistocerca_tmab$frame)

Inspecting the lpm_input object

Labelpepmatch also has a built-in summary function that summarizes the data in a somewhat more informative way. This lpm_summary function is your best friend throughout the pipeline. It can be applied to any labelpepmatch-specific class and will always give you an informative overview of the object. For an lpm_input object, for example, you get specific information about the structure of the data, the mass/charge ratios and charges, the retention times, etc. There is also a graphical output that helps you quickly visualize the data in the m/z and retention time dimensions. The parameter "run" determines which run is used for the graphical representation of the data. This can serve as a quick quality check per run before actually getting started with peak pair matching. If run is set to 0, a boxplot overview of all runs is given.

lpm_summary(schistocerca_tmab, run=0, graphics=T)

Notice that the output is rather extensive and consists of basic information on the experiment design, the origin of the data and some summary statistics. The figures give a more graphical overview of the specific run.^[If one wants to compare multiple runs graphically without printed output in the terminal, the lpm_summary function can be run with printoutput=F]

lpm_summary(schistocerca_tmab, run=2, graphics=T, printoutput=F)

Some first observations about the data can already be made from these summary plots before we proceed to peak pair matching.

3. Finding peak pairs

Using the pepmatch function

The core function of the labelpepmatch package is pepmatch. This function interprets an lpm_input object, and will search for peak pairs within the data. The label that we used in this example (TMAB) is hard-coded in the package, but any label mass can be given as an input parameter. Also, users are invited to extend the list of hard-coded labels in further versions of this package.

matched<-pepmatch(
  lpm_input=schistocerca_tmab,
  elutionthresh=0.3,  # max retention time difference within a peak pair
  labelthresh=0.05,   # max deviation from the theoretical label mass difference
  labelcountmax=5,    # max number of labels per peptide
  label="TMAB",       # label name (hard-coded); a custom label mass can also be given
  minmolweight=132,   # minimal molecular weight to consider
  quantmin=0,         # minimal quantity for at least one feature of a pair
  FDR=T,              # estimate a false discovery rate on mock data
  iterations=10,      # number of FDR iterations
  cores=1             # number of cores for parallel processing
)

A rather high number of parameters for peak pair detection can be tweaked. The two most important ones are the thresholds for mass and retention time. elutionthresh is the maximal allowed difference between the retention times of two features for them to be considered a peak pair. Since the labels we use are supposed to be chemically equivalent, peak pairs should in theory co-elute. Note, however, that peak picking algorithms make mistakes, and some pre-processing software rounds elution times to discrete values.
labelthresh is the maximal allowed deviation from the theoretical mass difference between two labelled peptides. So if the mass difference between two labels is 10.00, two co-eluting peaks with masses 100.00 and 110.05 will be considered a peak pair if labelthresh >= 0.05. For multiple labels, this threshold is not multiplied: in the same example, two peaks of 100.00 and 120.09 will not be considered a peak pair if labelthresh==0.05. Note that this threshold applies to the difference between two molecular weights rather than to absolute masses. This allows the threshold to be used very strictly, because the error on values within a small interval is small.
Further, we set some restrictions on the molecular weight (at least the smallest possible dipeptide, glycine-glycine), the number of labels (5 positions for a label to bind is a lot for a neuropeptide), and the minimal quantity of signal for at least one feature within a peak pair.
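The matching criterion from the example above can be written out explicitly (a minimal sketch of the logic as an absolute deviation check; illustrative, not the package's internal code):

# Sketch of the labelthresh logic described above
label_delta <- 10.00   # mass added by one label (example value)
labelthresh <- 0.05
# one label: a 0.04 deviation from the theoretical difference -> accepted
abs((110.04 - 100.00) - 1*label_delta) <= labelthresh   # TRUE
# two labels: the threshold is not multiplied, so a 0.09 deviation -> rejected
abs((120.09 - 100.00) - 2*label_delta) <= labelthresh   # FALSE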

There will always be an inevitable trade-off between losing true positives and keeping false positives. Hence, the stringency of the peak pair detection parameters should be meticulously selected based on the specifications of the device used. Other cut-offs (quantity, mass, ...) can be based on the numbers and kinds of peptides that are expected, and on prior information like the distribution of quantities. It is noteworthy that carrying a large fraction of "uninteresting" peak pairs along throughout the analysis leads to multiple testing issues, and in the end dilutes the power to detect real differences between conditions.

False discovery rate estimation

The pepmatch function also contains a false discovery rate estimation. This should give an idea of how well you can trust the peak pairs that you do find. A mock dataset is created, and exactly the same peak pair search procedure with the same detection parameters is then executed on it. The mock data are generated according to a restricted randomization procedure that produces mock data with a structure very comparable to the original data (see the lpm_mockdata documentation for the structure elements that are retained).

Note that estimating a false discovery rate in highly structured data is difficult. We suspect that our FDR estimates are on the conservative side, but we will only gain full understanding of this once multiple real datasets with varying degrees of structure have been subjected to these analyses. The FDR estimation can be run in multiple iterations in order to get better estimates of the false discovery rate. Note, however, that for larger datasets multiple iterations may dramatically slow down your analyses and blow up the size of the objects that are carried along the pipeline. For more information on mock data generation, we refer to the lpm_mockdata function.

The pepmatched object

The pepmatch function outputs an object of class pepmatched. This object contains a list of data frames with peak pairs for each separate run, the parameters used for the peak pair detection, information about FDR (if FDR=T) and some other metadata. A quick look into the structure of the object:

# A summary shows which slots there are in our object
# Notice that the first eight slots correspond to the runs
summary(matched)

# A closer look at one of these matchlists:
# Notice that since a 'pepmatched' object is a list,
# you need double brackets for indexing. 
knitr::kable(head(matched[[1]],3),digits=3)

A quick look at the first run within our pepmatched object.

Again, you can use the lpm_summary function to summarize all the information from this new object in a sensible and easily surveyable way.

lpm_summary(matched, run=1, graphics=T)

Note that the observed mass precision for peak pair detection is a symmetric distribution with its maximum around 0, bilaterally bounded by labelthresh. Systematic error, if any, turns out to be minor. Notice that the distribution is not uniform for the false-discovery-rate-tested peak pairs either. This is expected, since mock features in the mock database look like real labelled peptides and hence their masses are non-random.

The next panel shows a table of the number of charges versus the number of labels, bounded by labelcountmax. No peak pairs with more than 4 charges were found. Most peak pairs have two or three charges and one or two labels. Note that a minor proportion of detected peak pairs have more labels than charges (indicated in red), which for TMAB is physically impossible. No doubt these are false positives. We will remove them in a next step.

Next, there is an MA plot where you see that there is a tendency for the light labelled channel to have a higher signal. See lpm_MAplot for more information.

The last panel shows a plot of the retention time precision (bounded by elutionthresh) versus the mass difference precision (bounded by labelthresh). Correct interpretation of this figure can lead to better choices of parameter stringency. For example: here we see that most peak pairs seem to have co-eluted perfectly, while the deviation from the expected mass difference is distributed more evenly over the x-axis. However, the peak pairs with a larger retention time difference show a trend towards a larger deviation from the theoretical mass difference. So perhaps we have chosen our elutionthresh on the high side. If, after inspection of the matched object, one wants to adjust some of the thresholds of the pepmatch function, the function does not have to be re-run. Instead, the much quicker function lpm_refine can be used. Note that a single value for the quantmin parameter means that both peaks in a pair have to be above this quantity threshold. If we want to make our analysis more sensitive to extreme up- or downregulations, we can set quantmin=c(0,minimal value); in that case, all peak pairs with at least one peak above the minimal value are retained.
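The latter, more sensitive variant would look like this (a sketch, not executed; the value 2^8 is just an example):

# Sketch (not executed): keep a pair as long as at least one of its
# two peaks exceeds 2^8, regardless of the other peak's quantity.
# matched_sensitive<-lpm_refine(matched,quantmin=c(0,2^8))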

Cleaning your data

Finding peak pairs can be time-consuming, and hence it might be convenient to do it only once, with rather lenient parameters, and refine in a next step. This is where the lpm_refine function comes in. Here, for example, we reset the threshold for the elution time difference to 0.15, and the allowed mass difference within a peak pair to 0.03 Da. We also set a minimal quantity. Note that the quantmin parameter is either a vector of two values (setting minima for the least and most abundant feature in a peak pair) or a single value that both features within the peak pair have to exceed. Note that it is impossible to refine parameters to values less strict than the ones used to generate the pepmatched object. If you want to do that, you will have to re-run the pepmatch function.

matched<-lpm_refine(matched,elutionthresh=0.15,labelthresh=0.03,quantmin=2^8)

And have a look at the output again (this time without printing in the terminal).

lpm_summary(matched, printoutput = F, run=1, graphics=T)

Note that, compared with the first version of our pepmatched object, 32 out of 133 features were discarded. Among them, 4 out of the 7 features with more labels than charges disappeared, which is another confirmation that we were mainly discarding false positives. Since we expect all of these to be false positives anyway, we can remove the rest as well:

matched<-lpm_refine(matched,remove.more.labels.than.charges =T)

4. Mass matching peak pairs to a database

Accessing databases of known peptides

This package contains a couple of functions designed to work with databases of known peptides. The main identification function is pep.id, which takes an object of class pepmatched, but let's first begin with a couple of examples of smaller functions. For example, we can calculate the molecular weight of a given peptide. Notice this is the monoisotopic mass, not the average molar mass.

calculate_peptide_mass("SVPFKPRLa")

There is also a function for downloading databases from our website. As an example we download the desert locust peptide database.

db<-download_lpm_db("desertlocust")
# And show what one line of the database looks like,
# for example Schistocerca sulfakinin.
# Note that we introduced the $ sign for a sulfotyrosine residue.
db[38,]

There is also a function to quality check this database. It recalculates all molecular weights from the sequences and compares them to the molecular weights stored in the database. It outputs a graph and a vector with all deviations from the calculated molecular weights. In our database, there are two clear mismatches, which on closer investigation appear to be sequences where the MW is known from mass spectrometry, but the sequence is not entirely known.

QC_database(db)

## And have a look at the mismatches:
db[31:32,]

Generating a mock peptide database

It is also possible to generate a mock database of peptides with a restricted randomization procedure that shuffles the amino acids of the input database over its peptide length distribution. It also respects the N-terminal pyroglutamination and C-terminal amidation frequencies. If we plot the sorted molecular weights of the mock database against the sorted molecular weights of the input database, we expect them to lie approximately on y=x. The generate_random_db function is also used in the false discovery rate estimation of pep.id.

randomdb<-generate_random_db(db,plot=T)

Mass matching peptides to a database

Sometimes one just wants to quickly mass match a putative peptide with a given molecular weight to a database of known peptides. This is where the pep.massmatch function comes in handy. Remember we have a database of desert locust peptides. We will now check whether two analytes with molecular weights 1080.25 and 1500.60 can be matched to known peptides within a 5 ppm interval, which in this case corresponds to roughly 0.005 to 0.008 Da. Note that 5 ppm is probably on the high side for most modern mass spectrometers, but this data was generated on a slightly older Q-TOF machine.
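The ppm-to-Dalton conversion is simple arithmetic:

# 5 ppm expressed in Da for the two query masses
5e-6*c(1080.25,1500.60)
# approximately 0.0054 and 0.0075 Da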

mm<-pep.massmatch(input=c(1080.25, 1500.60),db=db,ID_thresh=5)
mm
knitr::kable(mm$matchlist)

And we see that no match is found for the first mass, but the second one is matched to sulfakinin. This function also has an FDR estimation based on the generate_random_db function, and an optional correction for systematic bias based on the identifications. Caution should be paid here, however, since this will only work with a sufficiently high number of real matches. This rather sketchy feature should be considered something interesting to have a look at, rather than a must-do.

Mass matching all features of a pepmatched object: the pep.id function

As mentioned earlier, the identification function that takes a pepmatched object as input, and hence fits right in the pipeline, is pep.id. The parameter ID_thresh is again the maximal allowed difference between two masses for a mass match, this time not in absolute values but in ppm. The database can be an object from the environment, a preset database, or a local file. We here choose 10 iterations of false discovery rate estimation. This means that for each run, 10 different databases will be generated with the generate_random_db function, and in each of them the identification frequency is recorded. This yields a good estimate of the false discovery rate in a random peptide database.

set.seed(7)
matched_id<-pep.id(pepmatched=matched,ID_thresh=5,db=db,cores=1,FDR=T,iterations=10)

Visualising a pepmatched object with peptide identifications

The output of this function will still be of the same pepmatched class, but will now contain identification information. This also implies that identifications are not necessary for further downstream statistical analysis. More details about identifications can be found by having a closer look at the object:

lpm_summary(matched_id, printoutput = T, run=1, graphics=T)

A total of 11 different neuropeptides have been mass matched to peak pairs over all runs. The estimated false discovery rate is on the order of 10%. If identifications are considered a strong signal for genuine labelled peptides, it is interesting to note that the identified peptides have a slightly higher quantity, both for light and heavy labelled peptides. However, this can also be due to the fact that the most abundant peptides are usually the best studied, generating a bias in identification. Especially for a non-model organism like the desert locust, we should not expect to be able to match all peak pairs to known peptides. Notice that the bias of the light labelled peptides towards higher quantities is also clearly visible. We also observe that most of the identified peptides have only minor differences in retention time within the peak pair. Finally, it is interesting to note that the false discovery estimates for peak pair detection are lower than in the first version of this object; this is still due to the refinements that we made with the lpm_refine function.

Identified peptides have charges between one and three, with a preference for double charges, and a single label. Looking deeper into the data, we notice that the same peptides are found in single, double and triple charged form. Notice the presence of two peptides that are pyroglutaminated, which prevents TMAB from binding to the N-terminal amine group. Nevertheless, these two peptides can be labelled thanks to the epsilon amine moiety of their lysine residues.

# The first element of a 'pepmatched' object is 
# a dataframe with all the peak pairs of run 1.
# We select only the rows that contain
# identified peak pairs. 

knitr::kable(matched_id[[1]][matched_id[[1]]$isID==T,],digits=3)

Adding extra identifications to a pepmatched object

In this example, we assume that the only information available for identification is the mass match. Of course, extra identification information can be available, e.g. from MS/MS fragmentation. As a matter of fact, we did some MS/MS on the most interesting peaks in this dataset. One of them, a feature with a deconvoluted monoisotopic mass of 2909.5 Da, could be clustered with other insect pyrokinins with the bonanza algorithm^[Falkner et al. A spectral clustering approach to MS/MS identification of post-translational modifications. J. Prot. Res. 2008 Nov;7(11):4614-22] and we presume it is an N-terminally elongated version of one of the better known Schistocerca pyrokinins. Just for illustration's sake, we will here call it "precursorX". This precursor occurs 3 times: in the double, triple and quadruple charge states.

matched_id<-add_id(matched_id,c("313_973","544_1384","65_187"),rep("precursorX",3))

5. The view_spectra function

Preprocessing software usually offers a "top view" on mass spectra where features are visible in the m/z - retention time space. Labelpepmatch offers this feature too, which will greatly help in feeding back to the preprocessing step if necessary. We here apply the view_spectra function to the three consecutive objects generated in our pipeline. Note that the first input parameter is always the lpm_input object.

## The "lpm_input" object
view_spectra(schistocerca_tmab,run=1)
## The "pepmatched" object without identifications
view_spectra(schistocerca_tmab,matched,run=1)
## The "pepmatched" object with identifications
view_spectra(schistocerca_tmab,matched_id,run=1)

6. Statistical analysis

Generating a statlist object

Now that we have a list of peak pairs for each run, we want to know which peak pairs are significantly different between conditions. To this end, we transform our pepmatched object into a statlist object, a format very suitable for statistical analysis. The make.statlist function will also log2-transform the data (other transformations can be chosen) and discard peak pairs that were not found in at least a given proportion of runs (by default, all runs). All parameters for trimming the object can be given just like in lpm_refine. Notice that the quantmin and quantmax parameters are on the original scale. It is however easy to give them in log2 scale (see example).

statlist<-make.statlist(
                pepmatched_object=matched_id,
                cutoff=1,     # proportion of runs in which a feature must be found to be retained. Default is 1.
                logtransform=T,
                quantmin=2^8  # minimum quantity for the most abundant peak of a peak pair, i.e. a cutoff of 8 on the log2 scale
                )

Looking at the statlist object

The statlist object contains four matrices with a column for every run and a row for every feature. The first two matrices are organized according to label. The last two matrices are organized according to condition. The metadata and design are placed apart from the real data.

lpm_summary(statlist,graphics=T)

Here again, we observe a clear difference in quantity between light and heavy labelled peptides, but no overall trends between the conditions. This is a typical picture seen in many high-throughput technologies: most features show no difference, and there may be a "channel" effect (e.g. a dye effect in microarray studies). As long as the experimental design is balanced (i.e. with the labelling directions equally distributed over conditions), this should be no problem: either the label effect can be normalized away, or it can be added as a covariate in a linear model.

MA-plots

The statlist object can also be visualized with MA-plots. These are plots of the difference in quantity versus the average quantity per feature. The lpm_MAplot function also draws a loess fit on the figures.

lpm_MAplot(statlist,loess_span=1)

When we compare all runs, we consistently observe the same overall label bias, but the structure of the bias along the A-axis seems more or less random. This suggests that the label effect is not dependent on quantity.

Linear models for comparing peptidomes

The lpm_linearmodel function contains two methods. The "vanilla" method runs a separate mixed model on each feature, using the label effect as a covariate and run as a random effect. This method is quick, and powerful given enough replicates. It is also the preferred method if normalizations have previously been applied to account for global or run-specific label effects. The "complexmixed" method is a mixed model run on all features at once, with the label effect nested in run as a covariate. This method is extremely powerful, but calculation times rise quickly, and hence it can only be used on a limited number of features (e.g. only mass matched features, only the highest quantities, etc.). The time complexity is estimated to be quasipolynomial, $n^{\log(n)}$, and it is advised not to use this method for more than 50 features. We will here focus on the use of the "vanilla" method. Don't forget to correct for multiple testing; here we use the Benjamini-Hochberg procedure, but other multiple testing correction methods are also available.

model<-lpm_linearmodel(statlist,method="vanilla",p.adjust.method="BH",cores=1)
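For comparison, the "complexmixed" method described above would be invoked analogously on a heavily trimmed statlist (a sketch, not executed here, given the 50-feature advice above):

# Sketch (not executed): "complexmixed" is only advisable for well under ~50 features
# model_cm<-lpm_linearmodel(statlist,method="complexmixed",p.adjust.method="BH",cores=1)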

The lpm_linearmodel function outputs an object of class lpm_linearmodel that contains a data frame with all the model estimates and p-values per feature.

summary(model)
knitr::kable(head(model$model,3),digits=3)

Visualizing the statistics after analysis

The standard way of visualising this kind of data is a volcano plot. Here, the negative base 10 logarithm of the p-value is plotted against the log2 fold change. The lpm_volcanoplot function has an interactive mode (plotlocator=T) in which feature IDs are printed in the terminal when you click on them in the plot. This is a quick way of getting a grasp on which features to investigate in more detail. For now, we will just print the volcano plot without the locator function.

lpm_volcanoplot(model,plotlocator=F)
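The interactive variant mentioned above would simply be (not executed in a static document):

# Interactive mode (not executed here): click points to print feature IDs
# lpm_volcanoplot(model,plotlocator=T)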


Features in the blue zone have a corrected p-value lower than 0.05, features in the green zone have a corrected p-value lower than 0.01. Red dots indicate features with identifications, either through mass match or added later. For more information on the interpretation of this specific volcano plot, we refer to Verdonck et al.


Another nice way to visualize these data is through a heatmap. Here, we make a heatmap of all features that pass the standard cutoffs of the function: p-value lower than 0.05, adjusted p-value lower than 0.05, and a (non-transformed) fold change higher than 1.25 (25% up- or downregulation).


Remember that we sometimes have to cope with label effects. Hence, a heatmap of just the raw quantities (or their contrasts) would probably not tell you much (remember the boxplots above). On that account, it is much more interesting to look either at normalized data or at the contrasts of a linear model. The Bioconductor package limma, to which this package links through the lpm_make.RGList function, offers numerous features for normalization. Within the labelpepmatch package, however, we concentrate on linear models to cope with this issue. Accordingly, the lpm_heatmap function uses the residuals of the "null model" of lpm_linearmodel. This is a linear model with "label" as a main effect, without accounting for "treatment". The lpm_heatmap function can be applied both to the raw residuals and to the contrasts within one peak pair.

lpm_heatmap(model,FCcutoff=1.25,main="HM on residuals")
lpm_heatmap(model,FCcutoff=1.25,contrasts=T,main="HM on res. contrasts")

Looking at either of these heatmaps, one would not be able to infer the labelling direction of the different replicates. This would be different if we made a heatmap of the raw quantities. In other words: the label effect has effectively been cancelled out by using the residuals of the linear model. It is also clear that, with the above mentioned cutoffs, we observe 3 times more upregulations in solitarious than in gregarious locusts. This is an observation we already made looking at the volcano plot, but here we get a more detailed view of the most interesting features.

For example, when we look at the contrast heatmap, we clearly see that not all differences are equally homogeneous. But then, nothing in biology is ever completely black and white (or red and green, for that matter). Even more interestingly, the clustering dendrogram to the left of the heatmap shows that some features have remarkably similar patterns over the different runs. For example, there is one "clade" where we see pyrokinin 5 popping up twice, together with some other features showing the same pattern. To take a closer look at this cluster, we take only the subset of our linear model that contains these features.

knitr::kable(model$model[rownames(model$model) %in% c("262_614", "65_187", "544_1384", "313_973") | model$model$pepID %in% c("Scg-PK-4","Scg-PK-5"), c(18, 22, 23, 26, 27)],digits=3)

This closer look reveals that we are actually looking at only 3 different peptides, with charges ranging between 2 and 4. Most probably, since the expression patterns of the 1644 Da and the 2909 Da peptides align so well, we are looking at a longer isoform or precursor of the Scg-PK-5 peptide (which we already knew from MS/MS, see add_id). The 790 Da peptide remains unknown, but it may well be a fragment that is also co-encoded in the larger precursor.

7. Further analysis with Bioconductor limma

Bioconductor limma

Labelled peptide data look a lot like microarray data: comparison of two samples within one run, dye/label swap, matched features between runs etc. This is the reason why labelpepmatch links to Bioconductor limma: a package for data analysis, linear models and differential expression for microarray data. If one wants to take the analysis further than the linear models offered by labelpepmatch (e.g. Bayesian estimation of variance, different kinds of normalisations etc.), limma is a very good extension to our pipeline. Also, it allows for a broad range of experimental designs.

library(limma)

Linking labelpepmatch to limma

Since limma is a very elaborate package, a detailed overview of every function is beyond the scope of this vignette. However, we here give a basic pipeline that can serve as a starting point for further exploration. Many limma functions can be applied directly to our data once it is in the right format. We start by converting our statlist object to a limma RGList object.

RG1<-lpm_make.RGList(statlist)
summary(RG1)
knitr::kable(head(RG1$R))
knitr::kable(head(RG1$Rb))
knitr::kable(RG1$targets)

Note that in the limma syntax, the two channels are called "red" (corresponding to cy5) and "green" (corresponding to cy3). For consistency, we always call the light channel "red" and the heavy channel "green".

Since our mass spectra do not really have a background signal, the background matrices are zero matrices.
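This is easy to verify; Rb and Gb are the standard background slots of a limma RGList:

# Sanity check: both background matrices are all zeros
all(RG1$Rb == 0) && all(RG1$Gb == 0)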

Next, we turn this object into an MAList object and plot it:

MA1=MA.RG(RG1,bc.method="subtract",offset=0) 
summary(MA1)
limma::plotMA(MA1, array = 1)

Within run normalisation

In the MA plots, there should be no systematic trends as a function of quantity: most features should lie around zero. If we notice a trend along the A-axis, or a systematic bias, we can apply a couple of within-run normalisations.

MA2=normalizeWithinArrays(RG1,method="median",bc.method="none")
# method="loess" performs loess normalisation, for when label effects are quantity dependent;
# if that is not the case, use "median"; "robustspline" can be an alternative to "loess"
RG2=RG.MA(MA2)
# these are the corresponding normalised R and G values, backtransformed onto the original scale
limma::plotMA(MA2, array = 1)

Between run normalisation

To make the different runs as comparable as possible, we can also apply between-array normalisations. Two common ones are quantile and Aquantile normalisation. Quantile normalisation ensures that the quantities have the same empirical distribution across all arrays and channels, while Aquantile normalisation ensures that the A-values (average quantities) have the same empirical distribution across runs while leaving the M-values (log-ratios) unchanged. Let's do the quantile normalisation here:

MA3=normalizeBetweenArrays(MA2,method="quantile")
# The most drastic normalisation is "quantile"; "Aquantile" would be less drastic.
RG3=RG.MA(MA3)
# these are the corresponding normalised R and G values, backtransformed onto the original scale

# DIAGNOSTIC PLOTS: let's see how the quantile normalisation affects the distribution of the R and G intensities:
par(mfrow = c(1, 3)) # a grid with 1 row and 3 columns
plotDensities(RG1)   # raw R and G intensities
plotDensities(RG2)   # after median within-array normalisation
plotDensities(RG3)   # after median within-array and quantile between-array normalisation

Notice how the two "channels" are normalized towards each other. This is a very powerful normalization, and it should be emphasized that it is probably not necessary in every situation.
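For reference, the less drastic Aquantile alternative mentioned above would be (a sketch, not executed):

# Aquantile: same A-value distribution across runs, M-values untouched
# MA3a=normalizeBetweenArrays(MA2,method="Aquantile")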

Further statistical inference with limma

Now that we know how to create an RGList object and normalize it, there are ample possibilities for linear models on the data. In what follows, we give some pointers without printing all the output. Here, we use a simple linear model after the median within-array normalisation from above. The lmFit function offers ample possibilities, for example to account for multiple observations of the same peptide in different charge states (ndups, normally used for multiple probes per gene), and different regression methods can be explored. The eBayes function is another very interesting function that should be explored with this kind of data. It was originally designed for inference on gene expression, where it has proven to be a good ally in the battle against overdispersion.
For more information, the reader is referred to the limma manual and the example workflow in the labelpepmatch package.

design=model.matrix(as.formula("~ Cy3"),data=RG2$targets,contrasts = list(Cy3 = "contr.sum"))[,2]
fit = lmFit(MA2,design,ndups=1,block=NULL)
fit.eb=eBayes(fit,proportion=0.1)
stats=topTable(fit.eb,coef=1,number=100,adjust.method="BH",genelist=fit$genes,sort.by="p")
knitr::kable(head(stats),digits=3)


















