Proteins-class: The 'Proteins' Class for Proteomics Data And Meta-Data
In Pbase: Manipulating and exploring protein and proteomics data

Description Objects from the Class Details Developement notes Slots Extends Methods Functions Author(s) References See Also Examples

The Proteins class encapsulates data and meta-data for proteomics experiments. The class stores the protein sequences as well as specific subsets of interest, typically peptides, as ranges. The Proteins instances, the sequence and peptide slots are described by their respective metadata attributes.

Objects can be created using its constructor Proteins. The constructor either takes a fasta file name as first argument, an EnsDb object or a named uniprotIds argument with valid UniProt accession numbers (not yet implemented).

The Proteins constructor with the EnsDb loads protein data directly from the EnsDb object. The additional arguments filter, loadProteinDomains, columns and fetchLRG allow to additionally specify if only proteins matching a certain criteria should be fetched, whether all protein domains should be added as pranges, optional additional annotation columns that should be retrieved and whether proteins from Locus Reference Genes (LRG) should also be retrieved from the database.

An instance of class Proteins is characterised by one or multiple protein sequences that can be accessed as AAStringSet with the aa accessor. Sequence-specific annotation, such as accession numbers, protein and gene names, ... is available with the acols method. General metadata such as the data of creation of the instance are stored as a list returned by the metadata accessor, which would typically contain a created character that documents when the object was created, a reference genome descriptor, a UniProtRelease with the release data of the UniProt database and possibly others.

Each sequence of a Proteins instance can also be characterised by a set of specific ranges describing peptides of interest. These peptide features can be extracted as an AAStringSetList, where each protein sequence contains 0 or more peptide features. These peptides features are encode as ranges along the original proteins sequences (a list of IRanges) that can be extracted with the pranges function. These peptide features have their own metadata describing for example peptide identification scores, number of missed cleavages, ... available with the pcols methods.

See also the Pbase-data vignette.

The Proteins constructor with argument file being an EnsDb object allows to retrieve protein sequences along with all their related protein domains from an EnsDb annotation database. The optional filter argument can be used to fetch only proteins matching the defined filtering criteria from the database. The filter argument takes an object extending the AnnotationFilter class, an AnnotationFilterList combining such objects or a filter expression in form of a formula. See the AnnotationFilter and proteins documentation for more details.

Additional annotation columns from the database that should be retrieved from the database and included into acols can be specified with the columns argument. The listColumns can be used to list all available annotation columns from the database. Ensembl protein IDs will be used as the names of the returned Proteins object. See the vignette from the ensembldb package for an overview of supported filters or below for some examples.

Since version 0.2.0, addIdentificationData supports multiple identification file names to be added to a Proteins instance (argument renamed filenames) using either mzID or mzR. Added new Pparams parametrisation infrastructure.

See news(package = "Pbase") for a description of all changes.

Other possible metadata fields: Uniprot.sw, biomaRt instances.

metadata:: Object of class "list" containing global metadata, accessed with metadata.
aa:: Object of class "AAStringSet" storing the protein sequences, accessed with aa.

.__classVersion__:: Object of class "Versions" documenting the class verions. Intended for developer use and debugging.

Class "Versioned", directly.

aa: signature(x = "Proteins"): Returns an AAStringSet instance representing the sequences of the proteins.
pfeatures: signature(x = "Proteins"): ...
pranges: signature(x = "Proteins"): ...
metadata: signature(x = "Proteins"): Returns a list of global metadata of the instance x, including data of instance creation or, if created from a set of UnitProt identifiers (see constructors above), the UniProt version and UnitProt.WS version number.
acols: signature(x = "Proteins"): Returns a DataFrame of protein metadata.
pcols: signature(x = "Proteins"): Returns a list of feature metadata.
avarLabels: signature(x = "Proteins"): Returns the names of the sequences metadata.
pvarLabels: signature(x = "Proteins"): Returns the names of the peptide feature metadata.
seqnames: signature(x = "Proteins"): Returns the protein sequence names defined as UniProt accession numbers.
names: signature(x = "Proteins"): Returns the protein sequence names defined as UniProt accession numbers. It is just a synonym for seqnames.
length: signature(x = "Proteins"): Returns the number of proteins.
[: signature(x = "Proteins", i = "ANY", j = "missing"): Creates a subset of the Proteins instance.
[[: signature(x = "Proteins", i = "ANY", j = "missing"): Returns an AAString instance representing the sequence of the selected protein.
pfilter: signature(x = "Proteins", mass = "numeric", len = "numeric", ...): ...
cleave: signature(x = "Proteins", enzym = "character", missedCleavages = "numeric"): Cleaves all proteins using the enzym rule while allowing missedCleavages missing cleavages. Please see cleave for details.
addIdentificationData: signature(object = "Proteins", id = "character", rmEmptyRanges = "logical", par = "Pparams"): Adds identification data from an IdentMzMl file (id) to the Proteins object. If rmEmptyRanges is TRUE proteins without any identification data are removed. See Pparams for further settings.
addPeptideFragments: signature(object = "Proteins", filenames = "character", rmEmptyRanges = "logical", par = "Pparams"): Adds identification data from a fasta file (filenames) to the Proteins object. Please note that both fasta files (the origin of the Proteins object and the ones given in filenames) must share the same Uniprot accession numbers. If rmEmptyRanges is TRUE proteins without any identification data are removed. See Pparams for further settings.
plot: signature(x = "Proteins", y = "missing"): Plots all proteins and associated peptides using the Gviz/Pviz infrastructure.
show: signature(object = "Proteins"): Displays object summary as text.

rmEmptyRanges: signature(x = "Proteins")

: removes proteins with empty peptide ranges.

proteotypic: signature(x = "Proteins"): returns a modified Proteins object. pcols(x) gains a "Proteotypic" logical column, indicating of the peptide is proteotypic or now.
proteinCoverage: signature(pattern = "Proteins"): calulates the coverage of proteins. pcols(x) gains a "Coverage" numeric column.
isCleaver: signature(x = "Proteins", missedCleavages = "numeric"): Tests whether a Protein object was cleaved already.

Laurent Gatto <lg390@cam.ac.uk>, Sebastian Gibb <mail@sebastiangibb.de> and Johannes Rainer <johannes.rainer@eurac.edu>

Definition of the UniProt fasta comment format: http://www.uniprot.org/help/fasta-headers

calculateHeavyLabels

## Create a Protein object reading all proteins from a fasta file.
fastaFiles <- list.files(system.file("extdata", package = "Pbase"),
                         pattern = "fasta", full.names = TRUE)
p <- Proteins(fastaFiles)
p
metadata(p)

## Adding custom metadata
metadata(p, "Comment") <- "I love R"
metadata(p)

## Plotting
plot(p[1:5], from = 1, to = 30)

## Cleaving
pp <- cleave(p[1:100])
pp <- proteotypic(pp)
pp
pcols(pp[1:2])

plot(pp[1:2], from = 20, to = 30)

## Protein coverage
pp <- proteinCoverage(pp)
avarLabels(pp)
acols(pp)$Coverage
pp


## Add indentification data
idfile <- system.file("extdata/Thermo_Hela_PRTC_selected.mzid",
                      package = "Pbase")
p <- addIdentificationData(p, idfile)
pranges(p)
pfeatures(p)

plot(p[1])
plot(p[1], # the first protein has 36 peptides
     fill = c(rep("orange", 13), rep("steelblue", 13)))

## Retrieve a Proteins object from an EnsDb object: first load the annotation
## database for all human genes defined in Ensembl version 86.
library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Define a filter to retrieve all genes from chromosome Y
sqnf <- SeqNameFilter("Y")
## Retrieve the proteins without protein domains but specify to retrieve in
## addition the transcript biotype for the encoding transcripts and the gene
## names
prts <- Proteins(edb, filter = sqnf, loadProteinDomains = FALSE,
                 columns = c("tx_biotype", "gene_name"))
aa(prts)
acols(prts)

## The listColumns method lists all available columns from the database.
listColumns(edb)

## Load all proteins from the gene ZBTB16 including all protein domains from
## the database. Here we pass the filter criteria as a formula to the method
prts <- Proteins(edb, filter = ~ gene_name == "ZBTB16")

## List available pranges
pcols(prts)

## Access the protein domains
pcols(prts)$ProteinDomains