knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(MSnbase)
library(camprotR)
library(dplyr)
What is an MSnSet?
To quote from `MSnbase`:
The `MSnSet` class is derived from the `Biobase::eSet` class and mimics the `Biobase::ExpressionSet` class classically used for microarray data.
This class description is a bit dense and unintelligible to the uninitiated. There is already a vignette in the `MSnbase` package describing MSnSets, but it may be hard for beginners to follow.
Here, I will describe an MSnSet in my own words.
An MSnSet is a container, similar in spirit to a list (strictly speaking it is an S4 object), that holds information about an MS experiment.
To better understand MSnSets we first need to define some terminology. In a quantitative proteomics experiment we are analysing 'samples' from different experimental conditions via MS, e.g. comparing 'samples' from cells treated with a drug versus a control. The quantitative data we eventually obtain consists of measurements of 'features'. In this context, features can be PSMs, peptides, or proteins.
MSnSets contain multiple objects of different types:
- `assayData`: the quantitative data (numeric matrix)
- `featureData`: the feature metadata (`Biobase::AnnotatedDataFrame`)
- `phenoData`: the sample metadata (`Biobase::AnnotatedDataFrame`)

These underlying objects must have a specific structure:
- The number of rows in `assayData` must match the number of rows in `featureData` and the row names must match exactly.
- The number of columns in `assayData` must match the number of rows in `phenoData` and the column/row names must match exactly.

msnset_cap <- "Dimension requirements for the assayData (aka. expression data), featureData and phenoData (aka. sample data), slots. Adapted from [this MSnbase vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/v02-MSnbase-io.html)."
knitr::include_graphics("figures/msnset.png")
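The dimension and naming requirements can be illustrated with plain base R objects before any MSnSet is involved. The sketch below uses made-up toy data (`toy_exprs`, `toy_fdata` and `toy_pdata` are hypothetical names, not part of `MSnbase`) and only checks that the pieces line up.

```r
# Toy stand-ins for the three core components (all names are hypothetical)
toy_exprs <- matrix(
  c(10, 20, 30, 40, 50, 60),
  nrow = 3,
  dimnames = list(c("psm1", "psm2", "psm3"), c("sampleA", "sampleB"))
)
toy_fdata <- data.frame(
  protein = c("P1", "P1", "P2"),
  row.names = c("psm1", "psm2", "psm3")
)
toy_pdata <- data.frame(
  treatment = c("ctrl", "drug"),
  row.names = c("sampleA", "sampleB")
)

# Rows of the quantitative matrix must line up with rows of the feature metadata
features_ok <- identical(rownames(toy_exprs), rownames(toy_fdata))

# Columns of the quantitative matrix must line up with rows of the sample metadata
samples_ok <- identical(colnames(toy_exprs), rownames(toy_pdata))

features_ok
samples_ok
```

Both checks should be `TRUE` for a consistent set of inputs. If either is `FALSE`, the names or their order differ somewhere.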
Conveniently, the `MSnbase` package comes with some example MSnSets. In this vignette we will explore the `msnset` MSnSet. This data set is from an iTRAQ 4-plex experiment wherein BSA and Enolase have been spiked into a background of Erwinia proteins. See `?msnset` for more information.
My favourite way to have a look at anything in R is to use the `str()` function to explore an object's structure. Here we look at `msnset`, which is an MSnSet with 9 'slots' that each contain some sort of object.
str(msnset, max.level = 2)
Each slot in `msnset` is described in detail in the 'MSnSet slots' section below. For now we will only concern ourselves with the `assayData`, `featureData`, and `phenoData` slots.
The `assayData` slot contains the quantitative data from the experiment, i.e. how much of each feature (spectrum/PSM, peptide, or protein) was detected in each sample. This is the essential part of an MSnSet. All other slots are optional.
We can extract this information from the MSnSet into a numeric matrix with the `exprs()` function.
msnset_exprs <- exprs(msnset)
Let's look at its structure.
str(msnset_exprs)
It is a numeric matrix with `r nrow(msnset_exprs)` rows and `r ncol(msnset_exprs)` columns. The data it contains are reporter ion intensities from `r ncol(msnset_exprs)` iTRAQ tags across `r nrow(msnset_exprs)` different PSMs.
Each column of this matrix refers to an iTRAQ tag which corresponds to an individual 'sample'. Each row of this matrix corresponds to a 'feature' which in this case is a PSM. The numbers indicate the intensity of the reporter ion from a particular tag (i.e. sample) in a particular PSM.
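To make the row/column layout concrete, here is a small sketch with a made-up intensity matrix (the names `toy_exprs`, `psm1` and `tag114` are hypothetical). It shows how one sample, one feature, or a single intensity can be pulled out by name.

```r
# Toy reporter ion intensities: 3 PSMs (rows) by 2 tags (columns)
toy_exprs <- matrix(
  c(100, 200, 300, 110, 210, 310),
  nrow = 3,
  dimnames = list(c("psm1", "psm2", "psm3"), c("tag114", "tag115"))
)

# All intensities measured for one sample (one column)
tag114_intensities <- toy_exprs[, "tag114"]

# All intensities measured for one feature (one row)
psm2_intensities <- toy_exprs["psm2", ]

# The intensity of one tag in one PSM
one_value <- toy_exprs["psm2", "tag115"]
one_value
```

The same indexing works on the matrix returned by `exprs()`, since it is an ordinary numeric matrix.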
The `featureData` slot contains metadata about the 'features' (e.g. PSMs, peptides, proteins).
We can extract this information from the MSnSet into a data.frame with the `fData()` function.
msnset_fdata <- fData(msnset)
Let's look at its structure.
str(msnset_fdata)
It is a data.frame with `r nrow(msnset_fdata)` rows and `r ncol(msnset_fdata)` columns. The data it contains is metadata about each 'feature', which are PSMs in this case. The type of metadata included is entirely arbitrary and there can be as many or as few columns as you want.
Each column of this data.frame refers to a particular type of metadata. Each row corresponds to a 'feature', which in this case is a PSM. Thus, the number of rows in `featureData` is the same as the number of rows in `assayData`. Also note that the row names of `featureData` exactly match the row names of `assayData`.
rownames(exprs(msnset)) == rownames(fData(msnset))
The `phenoData` slot contains metadata about the 'samples'.
We can extract this information from the MSnSet into a data.frame with the `pData()` function.
msnset_pdata <- pData(msnset)
Let's look at its structure.
str(msnset_pdata)
It is a data.frame with `r nrow(msnset_pdata)` rows and `r ncol(msnset_pdata)` columns. The data it contains is metadata about each 'sample', which are iTRAQ tags in this case. The type of metadata included is entirely arbitrary and there can be as many or as few columns as you want.
Each column of this data.frame refers to a particular type of metadata. Each row corresponds to a 'sample', which in this case is an iTRAQ tag. Thus, the number of rows in `phenoData` is the same as the number of columns in `assayData`. Also note that the row names of `phenoData` exactly match the column names of `assayData`.
colnames(exprs(msnset)) == rownames(pData(msnset))
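An element-wise `==` comparison returns one `TRUE`/`FALSE` per name, which is tedious to read for larger data sets. A minimal sketch of collapsing the comparison into a single answer, using made-up name vectors (`assay_cols`, `pheno_rows` and `shuffled_rows` are hypothetical):

```r
# Made-up sample names standing in for colnames(exprs(...)) and rownames(pData(...))
assay_cols <- c("s1", "s2", "s3")
pheno_rows <- c("s1", "s2", "s3")

# One logical value per position
elementwise <- assay_cols == pheno_rows

# Collapse to a single logical value
all_match <- all(elementwise)

# identical() also checks length and order in one call
same_names <- identical(assay_cols, pheno_rows)

# The same names in a different order do not count as a match
shuffled_rows <- c("s2", "s1", "s3")
shuffled_match <- identical(assay_cols, shuffled_rows)

all_match
same_names
shuffled_match
```

`identical()` is the stricter of the two: it avoids recycling surprises when the two vectors differ in length.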
In the previous section we explored a small example MSnSet supplied with `MSnbase`. Here we will construct our own MSnSet. A small `PSMs.txt` Proteome Discoverer (PD) table from a TMT 10-plex experiment is provided with the `camprotR` package, which we will turn into an MSnSet.
Let's have a look at our PSM data from PD. It is a data.frame.
str(psm_tmt_total)
This data.frame contains `r nrow(psm_tmt_total)` PSMs. We have quantitative data for each PSM (the `Abundance` columns) and metadata for each PSM (all the other columns).
As before, the single essential part of an MSnSet is the `assayData` slot, which contains the quantitative data from your experiment.
In this case, it should contain a numeric matrix with `r nrow(psm_tmt_total)` rows corresponding to the `r nrow(psm_tmt_total)` PSMs and `r sum(grepl("Abundance\\.", colnames(psm_tmt_total)))` columns corresponding to the `r sum(grepl("Abundance\\.", colnames(psm_tmt_total)))` TMT tags.
First we extract the columns with the quantitative data and convert them to a numeric matrix.
# abundance columns for TMT PD output start with Abundance
abundance_cols <- colnames(psm_tmt_total)[grepl('Abundance\\.', colnames(psm_tmt_total))]
tmt_exprs <- as.matrix(psm_tmt_total[, abundance_cols])
Then we remove the 'Abundance.' prefix from the column names to make them more concise.
# update the column names to remove the 'Abundance.' prefix
colnames(tmt_exprs) <- gsub('Abundance\\.', '', colnames(tmt_exprs))
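In a regular expression an unescaped dot matches any character, so it can be worth escaping it when matching the 'Abundance.' prefix. The sketch below uses made-up column names to show the difference (`toy_cols` is hypothetical and `AbundanceXratio` is an invented name, not a real PD column).

```r
# Made-up column names; "AbundanceXratio" is invented for illustration
toy_cols <- c("Abundance.126", "Abundance.127N", "AbundanceXratio", "Sequence")

# Unescaped dot: "." matches any character, so "AbundanceX..." also matches
loose_match <- grepl("Abundance.", toy_cols)

# Anchored pattern with an escaped dot: only a literal "Abundance." prefix matches
strict_match <- grepl("^Abundance\\.", toy_cols)

# Strip the prefix from the strictly matched columns
tag_names <- sub("^Abundance\\.", "", toy_cols[strict_match])

loose_match
strict_match
tag_names
```

Whether the stricter pattern matters depends on which other columns are present in your own PD export.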
Lastly, we use the unique `PSMs.Peptide.ID` column to define unique row names. This is important for extracting and combining data down the line. Row names must be unique!
# use PSMs.Peptide.ID, which are unique, to define rownames
rownames(tmt_exprs) <- psm_tmt_total$PSMs.Peptide.ID
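Because a matrix accepts duplicated row names without complaint, it may be worth checking an ID column for duplicates before using it. A minimal base R sketch with made-up IDs (`toy_ids`, `toy_matrix` and `toy_df` are hypothetical names):

```r
# Made-up IDs containing one duplicate
toy_ids <- c(101, 102, 102, 104)

# anyDuplicated() returns the index of the first duplicate, or 0 if there is none
first_dup <- anyDuplicated(toy_ids)
has_duplicates <- first_dup > 0

# A matrix silently accepts duplicated row names
toy_matrix <- matrix(1:4, nrow = 4)
rownames(toy_matrix) <- as.character(toy_ids)

# A data.frame refuses them with an error, caught here for illustration
df_result <- tryCatch({
  toy_df <- data.frame(x = 1:4)
  rownames(toy_df) <- toy_ids
  "accepted"
}, error = function(e) "rejected")

has_duplicates
df_result
```

Running a check like `anyDuplicated()` on the ID column first catches the problem in one place, rather than when the matrix and the data.frame are later combined.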
Our quantitative data are now ready.
Now we construct a data.frame with metadata for each PSM to go into the `featureData` slot of our MSnSet.
In this case, it should be a data.frame with `r nrow(psm_tmt_total)` rows corresponding to the `r nrow(psm_tmt_total)` PSMs and any number of columns.
First we extract the columns with the metadata of interest. Here we want everything but the `Abundance` columns and the unique IDs.
# get all columns except Abundance columns identified earlier
metadata_cols <- setdiff(colnames(psm_tmt_total), c(abundance_cols, "PSMs.Peptide.ID"))
tmt_fdata <- psm_tmt_total[, metadata_cols]
Again, we use the unique `PSMs.Peptide.ID` column to define unique row names. These must match the row names of `tmt_exprs`!
# use PSMs.Peptide.ID, which are unique, to define rownames
rownames(tmt_fdata) <- psm_tmt_total$PSMs.Peptide.ID
Our metadata are now ready.
Lastly, we construct a data.frame with metadata for each TMT 10-plex tag, to go into the `phenoData` slot of our MSnSet.
In this case, it should be a data.frame with `r ncol(tmt_exprs)` rows corresponding to the `r ncol(tmt_exprs)` TMT tags and any number of columns.
First we construct an empty data.frame with `r ncol(tmt_exprs)` rows.
tmt_pdata <- data.frame(matrix(nrow = 10, ncol = 0))
Then we can add some metadata. In this example we will just add some fake sample names and fake treatment conditions.
tmt_pdata$sample <- paste0("sample", 1:10)
tmt_pdata$treatment <- rep(c("trt", "ctrl"), each = 5)
The row names must be identical to the column names of `tmt_exprs`.
rownames(tmt_pdata) <- colnames(tmt_exprs)
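As a variation, the sample metadata can be sized from the quantitative matrix rather than from a hard-coded number of tags. The sketch below assumes an even number of samples split equally between two conditions, and uses a made-up 4-column matrix (`toy_exprs`, `n_samples` and `toy_pdata` are hypothetical names).

```r
# Made-up quantitative matrix with 4 tags
toy_exprs <- matrix(
  0,
  nrow = 2,
  ncol = 4,
  dimnames = list(c("psm1", "psm2"), c("126", "127N", "127C", "128N"))
)

# Derive the number of samples from the matrix itself
n_samples <- ncol(toy_exprs)

# Assumes two equally sized groups; adjust for your own design
toy_pdata <- data.frame(
  sample = paste0("sample", seq_len(n_samples)),
  treatment = rep(c("trt", "ctrl"), each = n_samples / 2),
  row.names = colnames(toy_exprs)
)

toy_pdata
```

Setting `row.names` from the column names at construction keeps the two objects aligned by design.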
Now we construct the MSnSet. As long as we have set up the underlying data properly, this step is the easiest!
tmt_msnset <- MSnSet(exprs = tmt_exprs, fData = tmt_fdata, pData = tmt_pdata)
Let's have a look at its structure.
str(tmt_msnset, max.level = 2)
As before we can access the different slots as follows.
# access the quantitative data
head(exprs(tmt_msnset))
# access the PSM metadata
head(fData(tmt_msnset))
# access the sample metadata
head(pData(tmt_msnset))
The code below shows briefly how to save/export the data within an MSnSet.
Using `write.exprs()` from `MSnbase` is the easiest way. Use the `fDataCols` argument to specify which `featureData` columns to add to the right of the quantitative data (specify as column names, column numbers, or a logical vector). The other arguments are the same as `write.table()`.
MSnbase::write.exprs(
  tmt_msnset,
  file = "results.csv",
  fDataCols = c("Percolator.q.Value", "Master.Protein.Accessions"),
  sep = ",",
  row.names = FALSE,
  col.names = TRUE
)
Alternatively, you can combine the results manually.
results <- merge(
  exprs(tmt_msnset), # extract PSM quantitative data
  fData(tmt_msnset), # extract PSM metadata
  by = 0 # join by rownames
)
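Joining by row names with `by = 0` moves the row names into a new first column called `Row.names`, which is worth knowing before exporting. A small sketch with made-up inputs (`toy_exprs`, `toy_fdata` and `toy_results` are hypothetical names):

```r
# Made-up quantitative matrix and feature metadata sharing the same row names
toy_exprs <- matrix(
  c(1, 2, 3, 4),
  nrow = 2,
  dimnames = list(c("psm1", "psm2"), c("s1", "s2"))
)
toy_fdata <- data.frame(
  protein = c("P1", "P2"),
  row.names = c("psm1", "psm2")
)

# by = 0 joins on row names; the matrix is converted to a data.frame internally
toy_results <- merge(toy_exprs, toy_fdata, by = 0)

# The row names now live in a "Row.names" column
colnames(toy_results)
```

By default `merge()` also sorts the result on the join key, so the row order may differ from the original matrix.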
And then use the `writexl` package to save to Excel.
writexl::write_xlsx(results, path = "results.xlsx")
This section contains a detailed description of each MSnSet slot.
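`assayData`: Contains the quantitative data from the experiment, i.e. how much of each feature (e.g. PSM, peptide, protein) was detected in each sample. This is the essential part of an MSnSet.
- Access with `exprs(MSnSet)`
- Get the dimensions with `dim(MSnSet)`
- Row names are the features (e.g. `Q12345;Q98765`), access with `featureNames(MSnSet)`
- Column names are the samples, access with `sampleNames(MSnSet)`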
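`featureData`: Optional. Contains metadata about the features (e.g. proteins, peptides, PSMs). For example, for protein features this object might contain the protein names, their lengths, isoelectric points, number of transmembrane domains, associated GO terms, etc.
- Access the whole object with `featureData(MSnSet)`
- Access its two parts with `fData(MSnSet)` and `fvarMetadata(MSnSet)`.
- It is a `Biobase::AnnotatedDataFrame`, which is comprised of 2 data.frames:
  - The data: row names are the features (e.g. `Q12345;Q98765`) and column names are the feature variables (e.g. `transmem`), access with `fvarLabels(MSnSet)`
  - The variable metadata: row names are the feature variables (e.g. `transmem`) and the `labelDescription` column describes each one (e.g. Number of transmembrane domains)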
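`phenoData`: Optional. Contains metadata about each sample, usually relating to the experimental design, e.g. replicates, tissues, animals, treatments, etc.
- Access the whole object with `phenoData(MSnSet)`
- Access its two parts with `pData(MSnSet)` and `varMetadata(MSnSet)`.
- It is a `Biobase::AnnotatedDataFrame`, which is comprised of 2 data.frames:
  - The data: row names are the samples and column names are the sample variables (e.g. `trt`), access with `MSnSet$`
  - The variable metadata: row names are the sample variables (e.g. `trt`) and the `labelDescription` column describes each one (e.g. Drug treatment)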
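`protocolData`: Optional. Contains equipment-generated information about the protocols used for each sample. The number of rows and the row names must match the number of columns and column names of `assayData`.
- Access the whole object with `protocolData(MSnSet)`
- Access its two parts with `pData(protocolData(MSnSet))` and `varMetadata(protocolData(MSnSet))`.
- It is a `Biobase::AnnotatedDataFrame`, which is comprised of 2 data.frames:
  - The data: row names are the samples and column names are the protocol variables (e.g. `ms_model`)
  - The variable metadata: row names are the protocol variables (e.g. `ms_model`) and the `labelDescription` column describes each one (e.g. MS Model)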
`experimentData`: Optional. Contains descriptive information about the experiment and the experimenter.
- Access the whole object with `experimentData(MSnSet)`
- It is a `Biobase::MIAME` object, which is essentially a list of several characters and lists.
- Access the general experiment information with `expinfo(MSnSet)`
- Access the abstract with `abstract(MSnSet)`
- See `Biobase::MIAME` for info about other (probably unnecessary) sub-objects.

`processingData`: Contains the version of `MSnbase` used to construct the MSnSet and also a log of what processes have been applied to the MSnSet.
- Access the whole object with `processingData(MSnSet)`
- It is an `MSnProcess` object, which contains several sub-objects that can be accessed using `processingData(MSnSet)@`