knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(MSnbase)
library(camprotR)
library(dplyr)

Introduction

What is an MSnSet? To quote from MSnbase:

The MSnSet class is derived from the Biobase::eSet class and mimics the Biobase::ExpressionSet class classically used for microarray data.

This class description is rather dense for the uninitiated. There is already a vignette in the MSnbase package describing MSnSets, but it may be hard for beginners to follow.

Here, I will describe an MSnSet in my own words.

An MSnSet is a container, a bit like a list (specifically it is an S4 object), that holds information about an MS experiment.

To better understand MSnSets we first need to define some terminology. In a quantitative proteomics experiment we are analysing 'samples' from different experimental conditions via MS, e.g. comparing 'samples' from cells treated with a drug versus a control. The quantitative data we eventually obtain consists of measurements of 'features'. In this context, features can be PSMs, peptides, or proteins.

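MSnSets contain multiple objects of different types. The three that matter most here are the assayData (the quantitative data), the featureData (metadata about the features), and the phenoData (metadata about the samples).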

These underlying objects must have a specific structure:

msnset_cap <- "Dimension requirements for the assayData (aka expression data), 
featureData and phenoData (aka sample data) slots. Adapted from 
[this MSnbase vignette](https://www.bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/v02-MSnbase-io.html)."
knitr::include_graphics("figures/msnset.png")
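
The dimension requirements shown in the figure can be sketched with toy data in base R. All object names and values below are made up for illustration; only the relationships between the dimensions and the names matter.

```r
# Toy data (made-up values): 3 features (PSMs) measured in 2 samples
toy_exprs <- matrix(
  c(10.5, 20.1, 30.2, 11.0, 19.8, 29.5),
  nrow = 3,
  dimnames = list(c("psm1", "psm2", "psm3"), c("tag1", "tag2"))
)

# One row of feature metadata per row of the expression matrix
toy_fdata <- data.frame(
  sequence = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC"),
  row.names = rownames(toy_exprs)
)

# One row of sample metadata per column of the expression matrix
toy_pdata <- data.frame(
  treatment = c("trt", "ctrl"),
  row.names = colnames(toy_exprs)
)

# The relationships that the three objects must satisfy
dims_ok <- nrow(toy_fdata) == nrow(toy_exprs) &&
  nrow(toy_pdata) == ncol(toy_exprs)
names_ok <- identical(rownames(toy_fdata), rownames(toy_exprs)) &&
  identical(rownames(toy_pdata), colnames(toy_exprs))

dims_ok
names_ok
```

Both checks should come back TRUE when the three objects line up.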

Conveniently, the MSnbase package comes with some example MSnSets. In this vignette we will explore the msnset MSnSet. This data set is from an iTRAQ 4-plex experiment wherein BSA and Enolase have been spiked into a background of Erwinia proteins. See ?msnset for more information.

Exploring an MSnSet

My favourite way to have a look at anything in R is to use the str() function to explore an object's structure. Here we look at msnset, which is an MSnSet with 9 'slots' that each contain some sort of object.

str(msnset, max.level = 2)

Each slot in msnset is described in detail in the 'MSnSet slots' section below. For now we will only concern ourselves with the assayData, featureData, and phenoData slots.

assayData

The assayData slot contains the quantitative data from the experiment, i.e. how much of each feature (spectra/PSM, peptide, or protein) was detected in each sample. This is the essential part of an MSnSet. All other slots are optional.

We can extract this information from the MSnSet into a numeric matrix with the exprs() function.

msnset_exprs <- exprs(msnset)

Let's look at its structure.

str(msnset_exprs)

It is a `r nrow(msnset_exprs)` by `r ncol(msnset_exprs)` numeric matrix. It contains reporter ion intensities from `r ncol(msnset_exprs)` iTRAQ tags across `r nrow(msnset_exprs)` different PSMs.

Each column of this matrix refers to an iTRAQ tag which corresponds to an individual 'sample'. Each row of this matrix corresponds to a 'feature' which in this case is a PSM. The numbers indicate the intensity of the reporter ion from a particular tag (i.e. sample) in a particular PSM.
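
As a small illustration of this layout, here is a sketch with a made-up matrix (the row names, column names and values are invented and are not taken from msnset). Rows are features, columns are samples, and a single cell is one intensity.

```r
# Toy expression matrix: 3 PSMs (rows) by 2 tags (columns)
toy_exprs <- matrix(
  c(100, 200, 300, 110, 210, 310),
  nrow = 3,
  dimnames = list(c("psm1", "psm2", "psm3"), c("tag1", "tag2"))
)

# intensity of the reporter ion from tag2 in psm3
single_value <- toy_exprs["psm3", "tag2"]

# all intensities for one feature (a row) and for one sample (a column)
one_feature <- toy_exprs["psm1", ]
one_sample <- toy_exprs[, "tag1"]
```

The same indexing works on the matrix returned by exprs().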

featureData

The featureData slot contains metadata about the 'features' (e.g. PSMs, peptides, proteins).

We can extract this information from the MSnSet into a data.frame with the fData() function.

msnset_fdata <- fData(msnset)

Let's look at its structure.

str(msnset_fdata)

It is a `r nrow(msnset_fdata)` by `r ncol(msnset_fdata)` data.frame. It contains metadata about each 'feature', which are PSMs in this case. The type of metadata included is entirely arbitrary and there can be as many or as few columns as you want.

Each column of this data.frame refers to a particular type of metadata. Each row corresponds to a 'feature', which in this case is a PSM. Thus, the number of rows in featureData is the same as the number of rows in assayData. Also note that the row names of featureData exactly match the row names of assayData.

# TRUE only if every row name matches, in the same order
all(rownames(exprs(msnset)) == rownames(fData(msnset)))

phenoData

The phenoData slot contains metadata about the 'samples'.

We can extract this information from the MSnSet into a data.frame with the pData() function.

msnset_pdata <- pData(msnset)

Let's look at its structure.

str(msnset_pdata)

It is a `r nrow(msnset_pdata)` by `r ncol(msnset_pdata)` data.frame. It contains metadata about each 'sample', which are iTRAQ tags in this case. The type of metadata included is entirely arbitrary and there can be as many or as few columns as you want.

Each column of this data.frame refers to a particular type of metadata. Each row corresponds to a 'sample', which in this case is an iTRAQ tag. Thus, the number of rows in phenoData is the same as the number of columns in assayData. Also note that the row names of phenoData exactly match the column names of assayData.

# TRUE only if every sample name matches, in the same order
all(colnames(exprs(msnset)) == rownames(pData(msnset)))
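
Because the three slots share row and column names, subsetting an MSnSet with `[` should subset all of them together. The following is a minimal sketch, assuming the example data can be loaded with data(msnset); the name msnset_small is made up for illustration.

```r
library(MSnbase)

# load the example MSnSet that ships with MSnbase
data(msnset)

# keep the first 5 features and the first 2 samples
msnset_small <- msnset[1:5, 1:2]

dim(exprs(msnset_small)) # quantitative data
dim(fData(msnset_small)) # feature metadata: rows follow the features
dim(pData(msnset_small)) # sample metadata: rows follow the samples
```

This is one reason to keep data in an MSnSet rather than in three separate objects: the quantitative data and its metadata cannot drift out of sync when you filter.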

Making an MSnSet

In the previous section we explored a small example MSnSet supplied with MSnbase. Here we will construct our own MSnSet from a small PSMs.txt table exported from Proteome Discoverer (PD). This table comes from a TMT 10-plex experiment and is provided with the camprotR package.

The input data

Let's have a look at our PSM data from PD. It is a data.frame.

str(psm_tmt_total)

This data.frame contains `r nrow(psm_tmt_total)` PSMs. We have quantitative data for each PSM (the Abundance columns) and metadata for each PSM (all the other columns).

assayData

As before, the single essential part of an MSnSet is the assayData slot which contains the quantitative data from your experiment.

In this case, it should contain a numeric matrix with `r nrow(psm_tmt_total)` rows corresponding to the `r nrow(psm_tmt_total)` PSMs and `r sum(grepl("Abundance\\.", colnames(psm_tmt_total)))` columns corresponding to the `r sum(grepl("Abundance\\.", colnames(psm_tmt_total)))` TMT tags.

First we extract the columns with the quantitative data and convert them to a numeric matrix.

# abundance columns for TMT PD output start with 'Abundance.'
# (anchor with ^ and escape the dot so it is matched literally)
abundance_cols <- grep("^Abundance\\.", colnames(psm_tmt_total), value = TRUE)

tmt_exprs <- as.matrix(psm_tmt_total[, abundance_cols])
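
When selecting columns by pattern, keep in mind that an unescaped `.` in a regular expression matches any single character. The sketch below uses made-up column names (including a hypothetical 'AbundanceRatio' column) to show the difference between a loose and a strict pattern.

```r
# made-up column names: the last one is not a per-tag abundance column
cols <- c("Abundance.126", "Abundance.127N", "Sequence", "AbundanceRatio")

# unescaped: '.' matches any single character
loose <- grepl("Abundance.", cols)

# escaped and anchored: a literal dot, only at the start of the name
strict <- grepl("^Abundance\\.", cols)

cols[loose]
cols[strict]
```

The loose pattern also picks up 'AbundanceRatio', whereas the strict pattern keeps only the two per-tag columns.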

Then we remove the 'Abundance.' prefix from the column names to make them more concise.

# update the column names to remove the 'Abundance.' prefix
colnames(tmt_exprs) <- sub("^Abundance\\.", "", colnames(tmt_exprs))

Lastly, we use the unique PSMs.Peptide.ID column to define unique row names. This is important for extracting and combining data down the line. Row names must be unique!

# use PSMs.Peptide.ID, which are unique, to define rownames
rownames(tmt_exprs) <- psm_tmt_total$PSMs.Peptide.ID
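
R will not stop you from giving a matrix duplicated row names, so it can be worth checking uniqueness explicitly before assigning them. Here is a minimal sketch with made-up IDs; make.unique() is shown as one possible fallback, not as something the original workflow requires.

```r
# made-up IDs containing one duplicate
ids <- c(101, 102, 102, 103)

# anyDuplicated() returns 0 when all values are unique,
# otherwise the index of the first duplicated value
first_dup <- anyDuplicated(ids)

# one possible fallback: make.unique() appends a suffix to repeats
unique_ids <- make.unique(as.character(ids))

first_dup
unique_ids
```

For real data, a check such as stopifnot(anyDuplicated(psm_tmt_total$PSMs.Peptide.ID) == 0) would fail early if the IDs were not unique.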

Our quantitative data are now ready.

featureData

Now we construct a data.frame with metadata for each PSM to go into the featureData slot of our MSnSet.

In this case, it should be a data.frame with `r nrow(psm_tmt_total)` rows corresponding to the `r nrow(psm_tmt_total)` PSMs and any number of columns.

First we extract the columns with the metadata of interest. Here we want everything but the Abundance columns and the unique IDs.

# get all columns except the Abundance columns identified earlier and the unique IDs
metadata_cols <- setdiff(colnames(psm_tmt_total), c(abundance_cols, "PSMs.Peptide.ID"))

tmt_fdata <- psm_tmt_total[, metadata_cols]

Again, we use the unique PSMs.Peptide.ID column to define unique row names. These must match the row names of tmt_exprs!

# use PSMs.Peptide.ID, which are unique, to define rownames
rownames(tmt_fdata) <- psm_tmt_total$PSMs.Peptide.ID

Our metadata are now ready.

phenoData

Lastly, we construct a data.frame with metadata for each TMT 10-plex tag, to go into the phenoData slot of our MSnSet.

In this case, it should be a data.frame with `r ncol(tmt_exprs)` rows corresponding to the `r ncol(tmt_exprs)` TMT tags and any number of columns.

First we construct an empty data.frame with `r ncol(tmt_exprs)` rows.

# one row per TMT tag (10 for this TMT 10-plex data), no columns yet
tmt_pdata <- data.frame(matrix(nrow = ncol(tmt_exprs), ncol = 0))

Then we can add some metadata. In this example we will just add some fake sample names and fake treatment conditions.

tmt_pdata$sample <- paste0("sample", 1:10)
tmt_pdata$treatment <- rep(c("trt", "ctrl"), each = 5)

The rownames must be identical to the column names of tmt_exprs.

rownames(tmt_pdata) <- colnames(tmt_exprs)

Make the MSnSet

Now we construct the MSnSet. As long as we have set up the underlying data properly, this step is the easiest!

tmt_msnset <- MSnSet(exprs = tmt_exprs, fData = tmt_fdata, pData = tmt_pdata)

Let's have a look at its structure.

str(tmt_msnset, max.level = 2)

As before we can access the different slots as follows.

# access the quantitative data
head(exprs(tmt_msnset))

# access the PSM metadata
head(fData(tmt_msnset))

# access the sample metadata
head(pData(tmt_msnset))

Extracting results from an MSnSet

The code below shows briefly how to save/export the data within an MSnSet.

Using write.exprs() from MSnbase is the easiest way. Use the fDataCols argument to specify which featureData columns to add to the right of the quantitative data (specify as column names, column numbers, or a logical vector). The other arguments are the same as write.table().

MSnbase::write.exprs(
  tmt_msnset, 
  file = "results.csv",
  fDataCols = c("Percolator.q.Value", "Master.Protein.Accessions"),
  sep = ",", row.names = FALSE, col.names = TRUE
)

Alternatively, you can combine the results manually.

results <- merge(
  exprs(tmt_msnset), # extract PSM quantitative data
  fData(tmt_msnset), # extract PSM metadata
  by = 0 # join by rownames
)
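
To see what joining by row names produces, here is a small sketch with made-up data (the names toy_exprs, toy_fdata and toy_results are invented for illustration). With by = 0, merge() matches on row names and stores them in a column called 'Row.names'.

```r
# Toy quantitative data: 2 PSMs by 2 tags
toy_exprs <- matrix(
  c(100, 200, 110, 210),
  nrow = 2,
  dimnames = list(c("psm1", "psm2"), c("tag1", "tag2"))
)

# Toy feature metadata with matching row names
toy_fdata <- data.frame(
  sequence = c("PEPTIDEA", "PEPTIDEB"),
  row.names = c("psm1", "psm2")
)

# join by row names
toy_results <- merge(toy_exprs, toy_fdata, by = 0)

# the shared row names end up in the first column
colnames(toy_results)
```

Note that the merged data.frame no longer has the IDs as row names, so keep the 'Row.names' column when exporting if you need to trace rows back to their PSMs.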

And then use the writexl package to save to Excel.

writexl::write_xlsx(results, path = "results.xlsx")

MSnSet slots

This section describes the main MSnSet slots in more detail.

assayData

Contains the quantitative data from the experiment, i.e. how much of each feature (e.g. PSM, peptide, protein) was detected in each sample. This is the essential part of an MSnSet.

featureData

Optional. Contains metadata about the features (e.g. proteins, peptides, PSMs). For example for protein features this object might contain the protein names, their lengths, isoelectric points, number of transmembrane domains, associated GO terms, etc.

phenoData

Optional. Contains metadata about each sample, usually relating to the experimental design, e.g. replicates, tissues, animals, treatments, etc.

protocolData

Optional. Contains equipment-generated information about the protocols used for each sample. The number of rows and the row names must match the number of columns and column names of assayData.

experimentData

Optional. Contains descriptive information about the experiment and the experimenter.

processingData

Contains the version of MSnbase used to construct the MSnSet and also a log of what processes have been applied to the MSnSet.



CambridgeCentreForProteomics/camprotR documentation built on Jan. 27, 2023, 8:36 p.m.