knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(Biostrings)
library(camprotR)
library(httr)
Peptide-spectrum matching requires two main inputs: a collection of MS2 spectra and a FASTA database of protein/peptide sequences. Search algorithms then match the supplied MS2 spectra to theoretical spectra generated from an in-silico digest of the supplied FASTA database (typically we use the SwissProt protein database for the organism of interest). The FASTA database should be chosen carefully, as a very large database can reduce the number of PSMs obtained [@Jeong2012]. Using an unnecessarily large database may also affect protein quantification.
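To make the idea of an in-silico digest concrete, here is a minimal sketch in base R. It assumes the simplest common trypsin rule (cleave C-terminal to K or R unless the next residue is P) and ignores missed cleavages, peptide length limits, and modifications. The function name digest_trypsin() and the toy sequence are made up for illustration; real search engines do considerably more than this.

```r
# Minimal in-silico tryptic digest (illustrative only, no missed cleavages).
# Assumed rule: cleave C-terminal to K or R, except when followed by P.
digest_trypsin <- function(protein_seq) {
  # Mark each cleavage site with a comma, then split on the commas
  marked <- gsub("([KR])(?!P)", "\\1,", protein_seq, perl = TRUE)
  strsplit(marked, ",", fixed = TRUE)[[1]]
}

# Toy sequence, not a real protein
peptides <- digest_trypsin("MKRPAKLLR")
print(peptides)
```

Every extra protein in the FASTA database adds peptides like these to the search space, which is one way to see why database size matters.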
Certain proteinaceous contaminants are commonly introduced by accident during sample preparation for proteomics. To avoid misidentifying MS2 spectra that come from these contaminants, we use an additional (much smaller) FASTA database containing their sequences. This is generally called a contaminants database, and is colloquially known as a cRAP database in our lab.
It is very important that you know what proteins are in your contaminants database, as PSMs/peptides from these proteins are deliberately filtered out in many of the functions in camprotR. See this blog post for some experimental situations where you may be interested in proteins that are usually deemed contaminants.
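The kind of filtering described above can be sketched in base R as follows. This is not how camprotR implements it; the table, its column names, and the peptide strings are hypothetical and exist only to show the principle. P02769 (bovine serum albumin) and P00761 (porcine trypsin) are used as example contaminant accessions.

```r
# Hypothetical PSM table; column names and peptides are made up for this sketch
psms <- data.frame(
  peptide = c("PEPTIDEK", "SAMPLEK", "ANTHERK", "TESTPEPR"),
  accession = c("P02769", "TOY001", "P02769", "TOY002"),
  stringsAsFactors = FALSE
)

# Accessions present in the contaminants database
crap_accessions <- c("P02769", "P00761")

# Keep only PSMs whose protein is not a known contaminant
is_contaminant <- psms$accession %in% crap_accessions
psms_clean <- psms[!is_contaminant, ]

# Report what was removed so the filtering is never silent
message(sum(is_contaminant), " of ", nrow(psms), " PSMs matched a contaminant")
```

If a protein of interest is also in the contaminants list, it would disappear at this step, which is why it pays to know exactly what the list contains.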
There are two widely used FASTA databases containing common contaminants: the GPM cRAP database, and the contaminants.fasta file which is distributed with every installation of MaxQuant.

In addition, in the CCP we use our own cRAP database which is similar but not identical to the GPM cRAP. I highly recommend you use the CCP cRAP database, as the other two are quite out of date. The download_ccp_crap() function in camprotR provides a quick and easy way to download the CCP cRAP database (see the section below).
This list of proteins was collated by the Global Proteome Machine (GPM) organisation. To quote the cRAP website, "cRAP is an attempt to create a list of proteins commonly found in proteomics experiments either by accident or through unavoidable contamination of protein samples."
The cRAP website groups these proteins into 5 main categories.
According to the cRAP website, it has not been updated since 2012. Therefore, I would not recommend using this FASTA for your peptide searches.
Nevertheless, we can download this FASTA database through R. The following chunk is not run by default as the URL is not always stable.
# download cRAP
crap_path <- tempfile(fileext = ".fasta")
download.file("ftp://ftp.thegpm.org/fasta/cRAP/crap.fasta", destfile = crap_path)
The GPM cRAP database is also installed with camprotR. You can access it with the following command.
crap_path <- system.file("extdata", "cRAP_GPM.fasta.gz", package = "camprotR")
Then we can take a look at it using the useful Biostrings package. With Biostrings we can load FASTA files into R and easily compare and manipulate them.
# load cRAP FASTA
crap <- Biostrings::readAAStringSet(crap_path)
crap
We can see that there are `r length(crap)` sequences in cRAP.
head(names(crap))
We can see that the headers for each sequence in the FASTA file contain very little information. This is another reason not to use this cRAP database as your contaminants database.
The cRAP database we use in the CCP is largely based on the GPM cRAP database, with a few extra sequences added at the end. It also has much more informative headers, and the sequences for the commercial proteases Endoproteinase GluC (NEB, P8100S) and recombinant Lys-C (Promega, V167A) have been added.
You can download the latest version of the CCP cRAP using the download_ccp_crap() function. It is important to note the date on which you download the CCP cRAP database and what the current UniProt release is, as the sequences and accessions can change (slightly) over time. Generally I like to include the UniProt release in the file name, e.g. 2021-01_CCP_cRAP.fasta, but for this example we'll just use a temporary file.
ccp_tmp <- tempfile(fileext = ".fasta")
download_ccp_crap(ccp_tmp, is_crap = TRUE, verbose = TRUE)
We can load this FASTA file into R and take a look.
ccp_crap <- Biostrings::readAAStringSet(ccp_tmp)
ccp_crap
We can see there are `r length(ccp_crap)` sequences. Now let's have a look at some of the sequence headers.
head(names(ccp_crap))
These sequence headers are actually complete, and follow the standard structure of UniProt FASTA headers. Importantly they contain both the UniProt accession e.g. P00330 and the sequence version e.g. SV=5. Each time the amino acid sequence of a UniProt entry is modified, the sequence version is incremented by 1. With the UniProt accession and the SV number, one can know exactly which sequence is being referred to in the FASTA file.
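As a rough illustration, the accession and sequence version could be pulled out of such headers with base R string functions. The example header below follows the standard UniProt layout; its leading field ("sp") is an assumption, and the CCP cRAP headers may carry a different first field. The helper name parse_uniprot_header() is made up for this sketch and is not part of camprotR.

```r
# Example header in the standard UniProt layout (db|accession|entry name ... SV=n).
# The leading "sp" field is assumed; CCP cRAP headers may differ in that field.
headers <- "sp|P00330|ADH1_YEAST Alcohol dehydrogenase 1 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) OX=559292 GN=ADH1 PE=1 SV=5"

parse_uniprot_header <- function(headers) {
  # Accession is the second pipe-delimited field
  accession <- vapply(strsplit(headers, "|", fixed = TRUE), `[`, character(1), 2)
  # Sequence version is the integer after "SV="
  seq_version <- as.integer(sub(".*SV=([0-9]+).*", "\\1", headers))
  data.frame(accession = accession, seq_version = seq_version,
             stringsAsFactors = FALSE)
}

parsed <- parse_uniprot_header(headers)
print(parsed)
```

The same function could be applied to names(ccp_crap) to record accession and version pairs alongside an analysis, provided the headers really do follow this layout.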
This isn't exactly important for everyday use, but may prove important when reanalysing a proteomics experiment after a long time has passed.
Included in every installation of MaxQuant there is a file called contaminants.fasta. This is the contaminants database that MaxQuant will use by default for peptide searches; however, it is possible to provide your own contaminants FASTA in MaxQuant if you want.
This is conjecture, but I'm fairly sure this FASTA database hasn't changed since the MaxQuant paper was released in 2008, making it possibly even older than the GPM cRAP database. Therefore, I also would not recommend using this FASTA for your peptide searches.
Nevertheless, we can download this FASTA database through R and explore it like the GPM cRAP FASTA above. The following chunk is not run by default as the URL is not always stable.
# download MaxQuant contaminants FASTA
mq_path <- tempfile(fileext = ".fasta")
download.file("http://lotus1.gwdg.de/mpg/mmbc/maxquant_input.nsf/7994124a4298328fc125748d0048fee2/$FILE/contaminants.fasta", destfile = mq_path)
This FASTA is also installed with camprotR and you can access it in a similar way to the GPM cRAP FASTA.
# get file path
mq_path <- system.file("extdata", "contaminants.fasta.gz", package = "camprotR")

# load MaxQuant contaminants FASTA
mq <- Biostrings::readAAStringSet(mq_path)
mq
head(names(mq))
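Since the headers differ between contaminants databases, one way to compare two of them is by sequence rather than by name. The sketch below uses toy named character vectors in place of real data; calling as.character() on an AAStringSet gives a vector of this shape, with the FASTA headers as names. The object names and sequences here are invented for illustration.

```r
# Toy stand-ins for two databases converted with as.character():
# named character vectors of sequences (names are the FASTA headers)
db_a <- c(prot1 = "MKWVTFISLL", prot2 = "IVGGYTCGAN", prot3 = "MSTAGKVIKC")
db_b <- c(entryX = "IVGGYTCGAN", entryY = "MKWVTFISLL", entryZ = "GSHMASMTGG")

# Compare by sequence rather than by header, since headers differ between databases
shared <- intersect(db_a, db_b)
only_in_a <- setdiff(db_a, db_b)
only_in_b <- setdiff(db_b, db_a)

cat("shared:", length(shared), " only in a:", length(only_in_a),
    " only in b:", length(only_in_b), "\n")
```

Exact string matching like this treats sequences that differ by a single residue as distinct, so it is only a first pass.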
You can add sequences easily enough to a cRAP FASTA database by just copying and pasting. However, for reproducibility purposes it is better to use an R script to generate your own custom cRAP database.
Here we will generate a cRAP database based on the CCP cRAP, but with some extra protease sequences added to the end.
First you would download the CCP cRAP FASTA. The following chunk is not run by default as we have already downloaded it above. In this chunk we show the "best practice" of naming the CCP cRAP FASTA file with the current UniProt release.
# download CCP cRAP
download_ccp_crap(paste0(check_uniprot_release(), "_CCP_cRAP.fasta"))
Then you download the sequences you want to add and save them into a FASTA file. Here we'll get some protease sequences from the bacterium Streptomyces griseus.
We can download sequences from UniProt automatically using the make_fasta() function. This function takes a character vector of UniProt accessions, downloads their sequences, and then saves the sequences into a FASTA file. We don't want to add the cRAP00X numbering to the sequence headers just yet so we set is_crap = FALSE.
griseus_tmp <- tempfile(fileext = ".fasta")
make_fasta(accessions = c("P00776", "P00777", "P80561"), file = griseus_tmp, is_crap = FALSE)
Before we add these to our CCP cRAP FASTA, we need to note which cRAP number the sequence headers stop at.
tail(names(ccp_crap))
The final number is 127, so we want the cRAP numbering of the additional sequences to start at 128. Finally, we use the append_fasta() function from camprotR to add our S. griseus sequences to our CCP cRAP FASTA. This function is used to add the sequences from one FASTA (file1) onto the end of another FASTA (file2).
append_fasta( file1 = griseus_tmp, file2 = ccp_tmp, is_crap = TRUE, crap_start = 128 )
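Rather than reading the last cRAP number off by eye, the starting number could be derived from the headers. The sketch below uses toy headers in place of names(ccp_crap) and assumes each header begins with "cRAP" followed by digits; that prefix layout is an assumption, so check it against your own file before relying on it.

```r
# Toy headers standing in for names(ccp_crap); the "cRAP###|" prefix is an
# assumption about the header layout, and the accessions are placeholders
headers <- c(
  "cRAP125|TOY001|TOY1_HUMAN Toy protein 1",
  "cRAP126|TOY002|TOY2_HUMAN Toy protein 2",
  "cRAP127|TOY003|TOY3_HUMAN Toy protein 3"
)

# Pull out the digits that follow "cRAP" at the start of each header
crap_numbers <- as.integer(sub("^cRAP([0-9]+).*", "\\1", headers))

# The next free number is one past the current maximum
next_crap <- max(crap_numbers) + 1L
print(next_crap)
```

The result could then be passed as crap_start = next_crap in the append_fasta() call, which keeps the script correct if the CCP cRAP grows in a later release.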
Let's just check that this worked properly. There should be 130 sequences now.
Biostrings::readAAStringSet(ccp_tmp)