Abstract This workshop will give delegates the opportunity to discover and try some of the recent R / Bioconductor developments for proteomics. Topics covered will including support for open community-driven formats for raw data and identification results, packages for peptide-spectrum matching, quantitative proteomics, mass spectrometry (MS) and quantitation data processing, and visualisation. The workshop material will be a self-contained vignette/workflow including example data.
This short tutorial is part of the ProteomicsBioc2014Workshop
package (version r packageVersion("ProteomicsBioc2014Workshop")
),
available at
https://github.com/ComputationalProteomicsUnit/ProteomicsBioc2014Workshop.
library("knitr") opts_knit$set(error = FALSE) library("RforProteomics") library("mzR") library("mzID") library("MSnbase") library("rpx") library("MLInterfaces") library("pRoloc") library("pRolocdata") library("BiocInstaller") library("rTANDEM") library("shinyTANDEM")
biocv <- as.character(biocVersion()) pp <- proteomicsPackages(biocv) msp <- massSpectrometryPackages(biocv) msdp <- massSpectrometryDataPackages(biocv)
In Bioconductor version r biocv
, there are respectively r nrow(pp)
proteomics and r nrow(msp)
mass spectrometry software packages and
r nrow(msdp)
mass spectrometry experiment packages. These respective
packages can be extracted with the proteomicsPackages()
,
massSpectrometryPackages()
and massSpectrometryDataPackages()
and
explored interactively.
library("RforProteomics") pp <- proteomicsPackages(biocv) display(pp)
datatab <- data.frame(Type = c("raw", "identification", "quantitation", "peak lists", "other"), Format = c("mzML, mzXML, netCDF, mzData", "mzIdentML", "mzQuantML", "mgf", "mzTab"), Package = c( "[`mzR`](http://bioconductor.org/packages/release/bioc/html/mzR.html) (read)", "[`mzID`](http://bioconductor.org/packages/release/bioc/html/mzID.html) (read)", "", "[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write)", "[`MSnbase`](http://bioconductor.org/packages/release/bioc/html/MSnbase.html) (read/write)")) kable(datatab)
Contemporary MS-based proteomics data is disseminated through the
ProteomeXchange infrastructure,
which centrally coordinates submission, storage and dissemination
through multiple data repositories, such as the
PRIDE data base at the EBI for
MS/MS experiments, PASSEL at
the ISB for SRM data and the
MassIVE
resource. The
rpx
is an interface to ProteomeXchange and provides a basic and unified
access to PX data.
library("rpx") pxannounced()
px <- PXDataset("PXD000001") px pxfiles(px)
Other metadata for the px
dataset:
pxtax(px) pxurl(px) pxref(px)
Data files can then be downloaded with the pxget
function as
illustrated below. Alternatively, the file is available on the
workshop's Amazon virtual machine in /data/Proteomics/data/
.
mzf <- "TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzXML" mzf <- file.path("/data/Proteomics/data", mzf) if (!file.exists(mzf)) mzf <- pxget(px, pxfiles(px)[6]) mzf
The mzR
package provides an interface to the
proteowizard code base,
the legacy RAMP is a non-sequential parser and other C/C++ code to
access various raw data files, such as mzML
, mzXML
, netCDF
, and
mzData
. The data is accessed on-disk, i.e it does not get loaded
entirely in memory by default. The three main functions are
openMSfile
to create a file handle to a raw data file, header
to
extract metadata about the spectra contained in the file and peaks
to extract one or multiple spectra of interest. Other functions such
as instrumentInfo
, or runInfo
can be used to gather general
information about a run.
library("mzR") ms <- openMSfile(mzf) ms
hd <- header(ms) dim(hd) names(hd)
Extract the index of the MS2 spectrum with the highest base peak intensity and plot its spectrum. Is the data centroided or in profile mode?
hd2 <- hd[hd$msLevel == 2, ] i <- which.max(hd2$basePeakIntensity) hd2[i, ] pi <- peaks(ms, hd2[i, 1]) plot(pi, type = "h") mz <- hd2[i, "basePeakMZ"] plot(pi, type = "h", xlim = c(mz-0.5, mz+0.5)) pj <- peaks(ms, 100) plot(pj, type = "l") plot(pj, type = "l", xlim = c(536,540))
The RforProteomics
package distributes a small identification result
file (see
?TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzid
) that we
load and parse using infrastructure from the
mzID
package.
library("mzID") (f <- dir(system.file("extdata", package = "ProteomicsBioc2014Workshop"), pattern = "mzid", full.names=TRUE)) id <- mzID(f) id
Various data can be extracted from the mzID
object, using one the
accessor functions such as database
, scans
, peptides
, ... The
object can also be converted into a data.frame
using the flatten
function.
Is there a relation between the length of a protein and the number of identified peptides, conditioned by the (average) e-value of the identifications?
fid <- flatten(id) x <- by(fid, fid$accession, function(x) c(unique(x$length), length(unique(x$pepseq)), mean(x$'ms-gf:specevalue'))) x <- data.frame(do.call(rbind, x)) colnames(x) <- c("plength", "npep", "eval") x$bins <- cut(x$eval, summary(x$eval)) library("lattice") xyplot(plength ~ npep | bins, data = x)
While searches are generally performed using third-party software
independently of R or can be started from R using a system
call, the
rTANDEM
package allows one to execute such searches using the X!Tandem
engine. The
shinyTANDEM
provides a interactive interface to explore the search results.
library("rTANDEM") ?rtandem library("shinyTANDEM") ?shinyTANDEM
The above sections introduced low-level interfaces to raw and
identification results. The
MSnbase
package provides abstractions for raw data through the MSnExp
class
and containers for quantification data via the MSnSet
class. Both
store the actual assay data (spectra or quantitation matrix) and
sample and feature metadata, accessed with spectra
(or the
[
, [[
operators) or exprs
, pData
and fData
.
The figure below give a schematics of an MSnSet
instance and the
relation between the assay data and the respective feature and sample
metadata.
plot(NA, xlim = c(0, 5), ylim = c(0, 10), axes=FALSE, xlab = NA, ylab = NA) rect(0, 0, 3, 1.9) rect(0, 2, 3, 10) rect(3.05, 2, 5, 10) segments(seq(0, 3, length.out = 7), rep(0, 7), seq(0, 3, length.out = 7), rep(10, 7), lty = "dotted") segments(rep(0, 50), seq(2, 10, length.out = 50), rep(5, 100), seq(2, 10, length.out = 50), lty = "dotted") text(1.5, 1, "sample metadata", cex = 1.5) text(1.5, 6, "assay data", cex = 1.5) text(4, 6, "feature\nmetadata", cex = 1.5)
Another useful slot is processingData
, accessed with processingData(.)
, that records all the processing that objects have undergone since their creation (see examples below).
The readMSData
will parse the raw data, extract the MS2 spectra and
construct an MS experiment file.
library("MSnbase") quantFile <- dir(system.file(package = "MSnbase", dir = "extdata"), full.name = TRUE, pattern = "mzXML$") quantFile msexp <- readMSData(quantFile, verbose=FALSE) msexp
The identification results stemming from the same raw data file can then be used to add PSM matches.
## find path to a mzIdentML file identFile <- dir(system.file(package = "MSnbase", dir = "extdata"), full.name = TRUE, pattern = "mzid$") identFile msexp <- addIdentificationData(msexp, identFile) fData(msexp)
msexp[[1]] plot(msexp[[1]], full=TRUE) as(msexp[[1]], "data.frame")[100:105, ]
There are a wide range of proteomics quantitation techniques that can broadly be classified as labelled vs. label-free, depending whether the features are labelled prior the MS acquisition and the MS level at which quantitation is inferred, namely MS1 or MS2.
qtb <- matrix(c("XIC", "Counting", "SILAC, 15N", "iTRAQ, TMT"), nrow = 2, ncol = 2) dimnames(qtb) <- list( 'MS level' = c("MS1", "MS2"), 'Quantitation' = c("Label-free", "Labelled")) kable(qtb)
In terms of raw data quantitation, most efforts have been devoted to
MS2-level quantitation. Label-free XIC quantitation has however been
addressed in the frame of metabolomics data processing by the
xcms
infrastructure.
An MSnExp
is converted to an MSnSet
by the quantitation
method. Below, we use the iTRAQ 4-plex isobaric tagging strategy
(defined by the iTRAQ4
parameter; other tags are available).
plot(msexp[[1]], full=TRUE, reporters = iTRAQ4)
msset <- quantify(msexp, method = "trap", reporters = iTRAQ4, verbose=FALSE) exprs(msset) processingData(msset)
Other MS2 quantitation methods available in quantify
include the
(normalised) spectral index SI
and (normalised) spectral abundance
factor SAF
or simply a simple count method.
exprs(si <- quantify(msexp, method = "SIn")) exprs(saf <- quantify(msexp, method = "NSAF"))
Note that spectra that have not been assigned any peptide (NA
) or
that match non-unique peptides (npsm > 1
) are discarded in the
counting process.
See also The
isobar
package supports quantitation from centroided mgf
peak lists or its
own tab-separated files that can be generated from Mascot and Phenyx
vendor files.
The PSI mzTab
file format is aimed at providing a simpler (than XML
formats) and more accessible file format to the wider community. It is
composed of a key-value metadata section and peptide/protein/small
molecule tabular sections.
mztf <- pxget(px, pxfiles(px)[2]) (mzt <- readMzTabData(mztf, what = "PEP"))
It is also possible to import arbitrary spreadsheets as MSnSet
objects into R with the readMSnSet2
function. The main 2 arguments
of the function are (1) a text-based spreadsheet and (2) column names
of indices that identify the quantitation data.
csv <- dir(system.file ("extdata" , package = "pRolocdata"), full.names = TRUE, pattern = "pr800866n_si_004-rep1.csv") getEcols(csv, split = ",") ecols <- 7:10 res <- readMSnSet2(csv, ecols) head(exprs(res)) head(fData(res))
Each different types of quantitative data will require their own
pre-processing and normalisation steps. Both isobar
and MSnbase
allow to correct for isobaric tag impurities normalise the
quantitative data.
data(itraqdata) qnt <- quantify(itraqdata, method = "trap", reporters = iTRAQ4, verbose = FALSE) impurities <- matrix(c(0.929,0.059,0.002,0.000, 0.020,0.923,0.056,0.001, 0.000,0.030,0.924,0.045, 0.000,0.001,0.040,0.923), nrow=4, byrow = TRUE) ## or, using makeImpuritiesMatrix() ## impurities <- makeImpuritiesMatrix(4) qnt.crct <- purityCorrect(qnt, impurities) processingData(qnt.crct)
plot0 <- function(x, y, main = "") { old.par <- par(no.readonly = TRUE) on.exit(par(old.par)) par(mar = c(4, 4, 1, 1)) par(mfrow = c(2, 2)) sx <- sampleNames(x) sy <- sampleNames(y) for (i in seq_len(ncol(x))) { plot(exprs(x)[, i], exprs(y)[, i], log = "xy", xlab = sx[i], ylab = sy[i]) grid() } } plot0(qnt, qnt.crct)
Various normalisation methods can be applied the MSnSet
instances
using the normalise
method: variance stabilisation (vsn
), quantile
(quantiles
), median or mean centring (center.media
or
center.mean
), ...
qnt.crct.nrm <- normalise(qnt.crct,"quantiles") plot0(qnt, qnt.crct.nrm)
The combineFeatures
method combines spectra/peptides quantitation
values into protein data. The grouping is defined by the groupBy
parameter, which is generally taken from the feature metadata (protein
accessions, for example).
## arbitraty grouping g <- factor(c(rep(1, 25), rep(2, 15), rep(3, 15))) prt <- combineFeatures(qnt.crct.nrm, groupBy = g, fun = "sum") processingData(prt)
Finally, proteomics data analysis is generally hampered by missing
values. Missing data imputation is a sensitive operation whose success
will be guided by many factors, such as degree and (non-)random nature
of the missingness. Missing value in MSnSet
instances can be
filtered out and imputed using the filterNA
and impute
functions.
set.seed(1) qnt0 <- qnt exprs(qnt0)[sample(prod(dim(qnt0)), 10)] <- NA table(is.na(qnt0)) qnt00 <- filterNA(qnt0) dim(qnt00) qnt.imp <- impute(qnt0) plot0(qnt, qnt.imp)
The mzt
instance created from the mzTab
file has the following is
a TMT 6-plex with the following design:
In this TMT 6-plex experiment, four exogenous proteins were spiked into an equimolar Erwinia carotovora lysate with varying proportions in each channel of quantitation; yeast enolase (ENO) at 10:5:2.5:1:2.5:10, bovine serum albumin (BSA) at 1:2.5:5:10:5:1, rabbit glycogen phosphorylase (PHO) at 2:2:2:2:1:1 and bovin cytochrome C (CYT) at 1:1:1:1:1:2. Proteins were then digested, differentially labelled with TMT reagents, fractionated by reverse phase nanoflow UPLC (nanoACQUITY, Waters), and analysed on an LTQ Orbitrap Velos mass spectrometer (Thermo Scientic).
Explore the mzt
data using some of the illustrated functions. The
heatmap and MAplot (see MAplot
function), taken from the
RforProteomics
vignette, have been produced using the same data.
R in general and Bioconductor in particular are well suited for the statistical analysis of data. Several packages provide dedicated resources for proteomics data:
MSstats
:
A set of tools for statistical relative protein significance
analysis in DDA, SRM and DIA experiments.
msmsTest
:
Statistical tests for label-free LC-MS/MS data by spectral counts,
to discover differentially expressed proteins between two biological
conditions. Three tests are available: Poisson GLM regression,
quasi-likelihood GLM regression, and the negative binomial of the
edgeR package.
isobar
also provides dedicated infrastructure for the statistical analysis of isobaric data. The
MLInterfaces
package provides a unified interface to a wide range of machine
learning algorithms. Initially developed for microarray and
ExpressionSet
instances, the
pRoloc
package enables application of these algorithms to MSnSet
data.
library("MLInterfaces") library("pRoloc") library("pRolocdata") data(dunkley2006) traininds <- which(fData(dunkley2006)$markers != "unknown") ans <- MLearn(markers ~ ., data = t(dunkley2006), knnI(k = 5), traininds) ans
kcl <- MLearn( ~ ., data = dunkley2006, kmeansI, centers = 12) kcl plot(kcl, exprs(dunkley2006))
hcl <- MLearn( ~ ., data = t(dunkley2006), hclustI(distFun = dist, cutParm = list(k = 4))) hcl plot(hcl, exprs(t(dunkley2006)))
All the
Bioconductor annotation infrastructure,
such as
biomaRt
,
GO.db
,
organism specific annotations, .. are directly relevant to the
analysis of proteomics data. Some proteomics-centred annotations such
as the PSI Mass Spectrometry Ontology, Molecular Interaction (PSI MI
2.5) or Protein Modifications are available through the
rols
. Data
from the Human Protein Atlas is
available via the
hpar
package.
isobar
.synapter
.pRoloc
.MALDIquant
package.PSICQUIC
package.Additional relevant packages are described in the
RforProteomics
vignette.
print(sessionInfo(), local = FALSE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.