Introduction

The msdata2 package mainly contains a set of publicly accessible label-free proteomics datasets. The purpose of this data package is to provide cleaned, formatted, compressed and extensible proteomics datasets to the research community.
Moreover, this package is also intended to serve as a benchmarking data repository, providing standard and curated datasets to facilitate the benchmarking of different quantification workflows [@Rost2014vh].

Datasets

The package contains the following datasets:
I. Raw MS datasets:
II. Expression Results:
  - DIA/SWATH-MS:
    * SWATH-MS Gold Standard Dataset
    * OpenSWATH Spyogenes Dataset
  - Label-free quantification:
    * Yeast + UPS1 spiked datasets X8

To list all the datasets in the package, use msdata2().

Experimental Description

In this vignette, I use one of the datasets, Rost2014sgsHuman, as an example to demonstrate the package.

The Rost2014sgsHuman dataset was imported from the SWATH-MS Gold Standard (SGS) dataset and covers the human background. The SGS dataset, which has a known composition, was originally created to validate and benchmark SWATH-MS data analysis algorithms [@Rost2014vh]. It consists of 90 SWATH-MS runs of 422 synthetic stable isotope-labeled standard (SIS) peptides in ten different dilution steps, spiked into three protein backgrounds of varying complexity (water, yeast and human), acquired in three technical replicates. The SGS dataset was manually annotated, resulting in 342 identified and quantified peptides with three or four transitions each. In total, 30,780 chromatograms were inspected and 18,785 were annotated with one true peak group, whereas in 11,995 cases no peak was detected. See also http://www.openswath.org/openswath_data.html for details.
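
The counts quoted above can be cross-checked with a few lines of base R. This is only a consistency sketch: the variable names are made up for illustration, and reading the 30,780 chromatograms as one per annotated peptide per run is an assumption that happens to match the numbers.

```r
# Numbers taken from the dataset description; variable names are illustrative
n_runs      <- 10 * 3 * 3   # dilution steps x backgrounds x technical replicates
n_peptides  <- 342          # manually annotated peptides
n_annotated <- 18785        # chromatograms with one true peak group
n_no_peak   <- 11995        # chromatograms in which no peak was detected

# Assumption: one inspected chromatogram per peptide per run
n_chromatograms <- n_peptides * n_runs

# Both ways of counting should agree
n_chromatograms == n_annotated + n_no_peak

# Fraction of inspected chromatograms that carry a true peak group
round(n_annotated / n_chromatograms, 3)
```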

Sample preparation

422 stable isotope-labeled standard (SIS) peptides (AAA-quantified, Sigma or Thermo) were pooled in a master sample of equal concentration. 10 two-fold dilution steps of the master sample were conducted, resulting in a 512-fold concentration range from 50 fmol/µL to 0.097 fmol/µL. 15 µL of each sample were spiked into 7.5 µL of HeLa cell lysate (human background), Saccharomyces cerevisiae BY4741 (yeast background) or water (no background). All samples were finally supplemented with 2.5 µL of reference peptides (iRT-Kit, Biognosys AG) for retention time re-alignment, yielding a final sample volume of 25 µL.
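
The dilution series and the final volume can be reproduced with simple arithmetic. The sketch below assumes that the ten dilution steps comprise the undiluted master sample plus nine successive two-fold dilutions, which is the reading consistent with the stated 512-fold range; the variable names are illustrative only.

```r
# Assumption: step 0 is the undiluted master sample, followed by nine halvings
master_conc <- 50                     # fmol/uL
conc <- master_conc / 2^(0:9)         # concentration at each of the ten steps

length(conc)                          # number of concentration levels
max(conc) / min(conc)                 # fold range covered by the series
round(min(conc), 3)                   # lowest concentration in fmol/uL

# Volumes combined per sample, in uL
vol <- c(peptides = 15, background = 7.5, iRT = 2.5)
sum(vol)                              # final sample volume
```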

Importing data

Load data

library(msdata2)

Exploring the data

Datasets in the package:

msdata2()

Proteins in the experiment:

unique(fData(Rost2014sgsHuman)$ProteinName)

Filenames:

library(msdata2)
pander::pandoc.table(pData(Rost2014sgsHuman)[1:30, 2:4], row.names = FALSE)

Data Visualization:

library(ggplot2)
library(reshape2)
library(ggforce)

Missingness plot in the expression data:

naplot(Rost2014sgsHuman)

Identification Accuracy

To compute the identification accuracy, all results reported by mProphet above a certain cutoff were taken and it was counted how many of these results were false positive (no manual annotation present) or mis-identified (manual annotation present but at a different retention time, in our case further than 30 seconds away).
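
The classification rule described above can be sketched on toy data. Everything in this example is hypothetical: the peptide names, retention times and object names are invented for illustration and are not mProphet output or part of the package.

```r
# Hypothetical results reported above a score cutoff (retention times in seconds)
reported <- data.frame(
  peptide = c("pepA", "pepB", "pepC", "pepD"),
  rt      = c(1200, 1510, 1800, 2400),
  stringsAsFactors = FALSE
)

# Hypothetical manual annotation; NA means no peak was annotated
manual_rt <- c(pepA = 1195, pepB = 1600, pepC = NA, pepD = 2425)

tol <- 30                                  # retention time tolerance in seconds
ann <- manual_rt[reported$peptide]         # annotation matching each result

# No annotation -> false positive; annotation further than tol away -> mis-identified
status <- ifelse(is.na(ann), "false positive",
          ifelse(abs(reported$rt - ann) > tol, "mis-identified", "correct"))

table(status)
```

The proportion of results in each class, computed over a range of cutoffs, would then give an accuracy curve.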

Missingness in Expression data:

table(rowSums(is.na(exprs(Rost2014sgsHuman))))
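
To make the output of this call easier to read, here is the same idiom on a small made-up matrix (peptides in rows, runs in columns). The resulting table counts how many peptides have 0, 1, 2, ... missing values across runs.

```r
# Toy expression matrix; values and names are invented for illustration
x <- matrix(c( 1, NA,  3,
              NA, NA,  6,
               7,  8,  9,
              NA, NA, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("pep", 1:4), paste0("run", 1:3)))

# Number of missing values per peptide (row)
na_per_peptide <- rowSums(is.na(x))
na_per_peptide

# How many peptides have each number of missing values
table(na_per_peptide)
```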

Peptide pattern among 3 BioReplicates:

plot.e <- exprs(Rost2014sgsHuman)
colnames(plot.e) <- pData(Rost2014sgsHuman)$Run
plot.e <- as.data.frame(t(plot.e))
plot.e$Run <- as.numeric(pData(Rost2014sgsHuman)$Run)

plot.run <- melt(plot.e, id.vars = 'Run', variable.name = 'Peptide')
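## Add a "BioReplicate" column: runs 1-10, 11-20 and 21-30 map to replicates 1, 2 and 3
plot.run$BioReplicate <- cut(plot.run$Run,
                             breaks = c(0, 10, 20, 30),
                             labels = c(1,2,3))
## Add a "Condition" column (index 1-10 within each replicate);
## rep(1:10, 3) is recycled and relies on runs being ordered 1-30 within each peptide
plot.run$Condition <- rep(1:10,3)

## Boxplots of intensities per run, faceted by replicate, with a spline trend line
Rep.labs <- c(`1`="BioReplicate 1", `2` = "BioReplicate 2", `3` = "BioReplicate 3")
p <- ggplot(plot.run, aes(x = Condition, y = value, fill = BioReplicate)) +
                scale_x_continuous(breaks = 1:10) +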
                geom_smooth(method = lm, formula = y ~ splines::bs(x, 3), se = FALSE, show.legend = FALSE, color="darkred") +
                geom_boxplot(aes(group = Run), alpha = 0.5) +
                facet_grid(.~BioReplicate, labeller = as_labeller(Rep.labs)) +
                theme(strip.text.x = element_text(size=9, color="black", face="bold"))
p
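
The mapping from run number to replicate and condition used in the plot can be checked in isolation. This sketch assumes, as the code above implies, that runs are numbered 1 to 30 in three consecutive blocks of ten; the object names are illustrative.

```r
run <- 1:30

# Replicate: intervals (0,10], (10,20], (20,30] become levels 1, 2, 3
replicate <- cut(run, breaks = c(0, 10, 20, 30), labels = c(1, 2, 3))

# Condition: position of the run within its block of ten
condition <- (run - 1) %% 10 + 1

# Ten runs per replicate, and the same sequence as rep(1:10, 3)
table(replicate)
all(condition == rep(1:10, 3))
```

Computing the condition from the run number, rather than relying on recycling, keeps the mapping correct even if the rows are reordered.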

Correlation between different runs:

pairs(exprs(Rost2014sgsHuman)[, c(1, 10, 11, 20, 21, 30)])

Density plots

library(limma)
limma::plotDensities(exprs(Rost2014sgsHuman)[, c(1, 10, 11, 20, 21, 30)], legend = "topright")

Peptide Count for each Protein

df.pro <- as.data.frame(table(fData(Rost2014sgsHuman)$ProteinName))
colnames(df.pro) <- c('Protein', 'Freq')
df.pro$Protein <- gsub(".*_", "", df.pro$Protein)
p.pro <- ggplot(df.pro, aes(x=Protein, y = Freq)) + 
                  geom_bar(stat="identity")
p.pro + coord_flip()

PCA plot

Perform a principal component analysis on the protein-level expression data, obtained by removing features with missing values and combining peptides per protein.

library(factoextra)
combineFeatures(filterNA(Rost2014sgsHuman), fcol = "ProteinName", method = "robust")
fData(Rost2014sgsHuman)$prot <- fData(Rost2014sgsHuman)$ProteinName
prot <- combineFeatures(filterNA(Rost2014sgsHuman), fcol = "prot", method = "robust")
p2 <- prcomp(t(exprs(prot)), scale. = TRUE, center = TRUE)
fviz_pca_ind(p2, habillage = Rost2014sgsHuman$BioReplicate)
fviz_pca_ind(p2, habillage = Rost2014sgsHuman$BioReplicate, geom="point",
             addEllipses=TRUE, ellipse.level=0.95)

fviz_pca_ind(p2, habillage = Rost2014sgsHuman$Condition, geom="point",
             addEllipses=TRUE, ellipse.level=0.95, palette =  "ucscgb")
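
For readers unfamiliar with prcomp(), the sketch below runs the same kind of analysis on simulated data, with samples in rows and proteins in columns (which is why the expression matrix is transposed above). The data and object names are invented for illustration and are unrelated to the package datasets.

```r
set.seed(1)

# Simulated matrix: 6 runs (rows) x 20 proteins (columns)
x <- matrix(rnorm(6 * 20), nrow = 6,
            dimnames = list(paste0("run", 1:6), paste0("prot", 1:20)))

# Centre and scale each protein before the decomposition
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Coordinates of each run on the first two components (what a PCA plot displays)
scores <- pca$x[, 1:2]
dim(scores)
```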

Contribution guidelines

References



UCLouvain-CBIO/msdata2 documentation built on July 4, 2020, 10:15 p.m.