Spatial proteomics datasets

The pRolocdata package

pRolocdata is a Bioconductor experiment package (release and devel pages) that collects published (mainly, although some unpublished datasets are also available) mass spectrometry-based spatial/organelle and protein-complex dataset. The data are distributed as MSnSet instances (see the MSnbase for details) and are used throughout the pRoloc and pRolocGUI software for spatial proteomics data analysis and visualisation.

Current build status:

Build Status

Installation

library("knitr")
library("pRolocdata")
x <- data.frame(pRolocdata()$results[, -(1:2)])
colnames(x) <- c("Data", "Description")
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("pRolocdata")

Once installed, the package needs to be loaded

library("pRolocdata")

Available datasets

Currently, there are r nrow(x) datasets available in pRolocdata. Use the pRolocdata() function to obtain a list of data names and their description.

pRolocdata()
kable(x, format = "markdown")

Loading data

Data is loaded into the R session using the load function; for instance, to get the data from Dunkley et al (2006), one would type

data(dunkley2006)

To get more information about a given dataset, see its manual page

?dunkley2006
## or
help("dunkley2006")

The datasets

Each data object in pRolocdata is available as an MSnSet instance. The instances contain the actual quantitative data, sample and features annotations (see pData and fData respectively). Additional MIAPE data [1, 2] experimental data is available in the experimentData slot, as described in section Required metadata below.

The source of these data is generally one or several text-based spreadsheet (csv, tsv) produced by a third-party application. These original files are often distributed as supplementary material to the research paper and used to generate the R objects. These source spreadsheets are available in the package's inst/extdata directory. The R script files, that read the spreadsheets and create the R data is distributed in the inst/scripts directory.

Suggested metadata

Additional metadata is available with the pRolocmetadata() function as detailed below.

Species

Documented in experimentData(.)@samples$species

Tissue

Documented in experimentData(.)@samples$tissue. If tissue is Cell, then details about the cell line are available in experimentData(.)@samples$cellLine.

PMID

Documented in pubMedIds(.).

Spatial proteomics experiment annotation

Documented in experimentData(.)@other: - MS ($MS) type of mass spectrometry experiment: iTRAQ8, iTRAQ4, TMT6, LF, SC, ... - Experiment ($spatexp) type of spatial proteomics experiment: LOPIT, LOPIMS, subtractive, PCP, other, PCP-SILAC, ... - MarkerCol ($markers.fcol) name of the markers feature data. Default is markers. - PredictionCol ($prediction.fcol) name of the localisation prediction feature data.

Example

experimentData(dunkley2006)@samples
pubMedIds(dunkley2006)
otherInfo(experimentData(dunkley2006))

## all at once
pRolocmetadata(dunkley2006)

Adding new data

The procedure to data in pRolocdata is as follows. Here, we assume that 3 new data files are available from the manuscript of Smith et al. 2017, and these files will be added to pRolocdata as three MSnSet objects.

  1. the original data (often from supplementary material) are added to inst/extdata, say Smith_expA.csv, Smith_expB.csv and Smith_expC.csv (the name should ideally be the same as the original files), and the files and provenance is documented in inst/extdata/README. If the data files are really big, then they should be compressed. If they are too big (for example don't fit on github or would substantially increase the size of the package), then we might decide not to added them, but they should still be documented in the README file and the script (see point 2) should still assume they are there.

  2. A script, typically called Smith2017.R, is added to inst/scripts/. That script reads the files above and saves the corresponding (compressed) MSnSet objects directly in data, typically called Smith2016a.rda, Smith2016a.rda, ..., and the objects themselves would be named Smith2016a, Smith2016b, ...

  3. Write a man/Smith2016.Rd documentation file documenting all relevant data objects, providing some information about the experiment and data provenance, and a reference to the original paper.

  4. Build and check the package and, if successful, send a github pull request.

If you do not have the R expertise to prepare the data, please open an issue in the pRolocdata Github repo or send me an email at laurent.gatto<AT>uclouvain<dot>be with the source csv files and appropriate metadata and I will add it for you.



lgatto/pRolocdata documentation built on Oct. 17, 2024, 10:16 p.m.