knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(Biostrings)
library(camprotR)
library(httr)

Introduction

Peptide-spectrum matching requires 2 main inputs: a collection of MS2 spectra and a FASTA database of protein/peptide sequences. Search algorithms are then used to match the supplied MS2 spectra to theoretical spectra generated from an in-silico digest of the supplied FASTA database (typically we use the SwissProt protein database for the organism of interest). The selection of the FASTA database should be done carefully as having a very large database can reduce the number of PSMs obtained [@Jeong2012]. Supposedly, using an unecessarily large database can also affect protein quantification.

There are certain proteinaceous contaminants that are commonly, accidentally introduced during sample preparation for proteomics. In order to avoid misidentification of MS2 spectra that have come from these contaminants we use an additional (much smaller) FASTA database containing the sequences of these common contaminants. This is generally called a contaminants database, or is colloquially known as a cRAP database in our lab.

It is very important that you know what proteins are in your contaminants database as PSMs/peptides from these proteins are deliberately filtered out in many of the functions in camprotR. See this blog post some experimental situations where you may be interested in proteins that are usually deemed contaminants.

The 3 cRAP databases

There are two widely used FASTA databases containing common contaminants:

  1. The Global Proteome Machine (GPM) common Repository of Adventitious Proteins (cRAP)
  2. The contaminants.fasta file which is distributed in every single installation of MaxQuant.

In addition, in the CCP we use our own cRAP database which is similar but not identical to the GPM cRAP. I highly recommend you use the CCP cRAP database, as the other two are quite out of date. The download_crap() function in camprotR provides an easy and quick way to download the CCP cRAP database (see the section below).

common Repository of Adventitious Proteins (cRAP)

This list of proteins was collated by the Global Proteome Machine (GPM) organisation. To quote the cRAP website,

cRAP is an attempt to create a list of proteins commonly found in proteomics experiments either by accident or through unavoidable contamination of protein samples.

cRAP includes 5 main categories of proteins:

  1. Laboratory proteins used somewhere in the sample preparation process.
  2. Proteins added from dust or accidental contact with the sample
  3. Proteins used as molecular weight markers
  4. Proteins in the Sigma Universal Protein Standard (UPS)
  5. Common viral contaminants

According to the cRAP website, it has not been updated since 2012. Therefore, I would not recommend using this FASTA for your peptide searches.

Nevertheless, we can download this FASTA database through R. The following chunk is not run by default as the URL is not always stable.

# download cRAP
crap_path <- tempfile(fileext = ".fasta")
download.file("ftp://ftp.thegpm.org/fasta/cRAP/crap.fasta",
              destfile = crap_path)

The GPM cRAP database is also installed with camprotR. You can access it with the following command.

crap_path <- system.file("extdata", "cRAP_GPM.fasta.gz", package = "camprotR")

Then we can take a look at it using the useful Biostrings package. With Biostrings we can load FASTA files into R and easily compare and manipulate them.

# load cRAP FASTA
crap <- Biostrings::readAAStringSet(crap_path)
crap

We can see that there are r length(crap) sequences in cRAP.

head(names(crap))

We can see that the headers for each sequence in the FASTA file have very little information in them. Another reason not to use this cRAP database as your contaminants database.

Cambridge Centre for Proteomics cRAP (CCP cRAP)

The cRAP database we use in the CCP is largely based off of the GPM cRAP database, with a few extra sequences added on the end. It also has much more informative headers and the sequences for the commercial proteases Endoproteinase GluC (NEB, P8100S) and recombinant Lys-C (Promega, V167A) have been added.

You can download the latest version of the CCP cRAP using the download_ccp_crap() function. It is important to take a note of what date you download the CCP cRAP database and what the current UniProt release is, as the sequences and accessions can change (slightly) over time. Generally I like to include the UniProt release in the file name e.g. 2021-01_CCP_cRAP.fasta, but hfor this example we'll just use a temporary file.

ccp_tmp <- tempfile(fileext = ".fasta")
download_ccp_crap(ccp_tmp, is_crap = TRUE, verbose = TRUE)

We can load this FASTA file into R and take a look.

ccp_crap <- Biostrings::readAAStringSet(ccp_tmp)
ccp_crap

We can see there are r length(ccp_crap) sequences. Now lets have a look at some of the sequence headers.

head(names(ccp_crap))

These sequence headers are actually complete, and follow the standard structure of UniProt FASTA headers. Importantly they contain both the UniProt accession e.g. P00330 and the sequence version e.g. SV=5. Each time the amino acid sequence of a UniProt entry is modified, the sequence version is incremented by 1. With the UniProt accession and the SV number, one can know exactly which sequence is being referred to in the FASTA file.

This isn't exactly important for everyday use, but may prove important when reanalysing a proteomics experiment after a long time has passed.

MaxQuant contaminants.fasta

Included in every installation of MaxQuant there is a file called contaminants.fasta. This is the contaminants database that MaxQuant will use be default for peptide searches, however it is possible to provide your own contaminants FASTA in MaxQuant if you want.

This is conjecture, but I'm fairly sure this FASTA database hasn't changed since the MaxQuant paper was released in 2008, making it possibly even older than the GPM cRAP database. Therefore, I also would not recommend using this FASTA for your peptide searches.

Nevertheless, we can download this FASTA database through R and explore it like the GPM cRAP FASTA above. The following chunk is not run by default as the URL is not always stable.

# download cRAP
mq_path <- tempfile(fileext = ".fasta")
download.file("http://lotus1.gwdg.de/mpg/mmbc/maxquant_input.nsf/7994124a4298328fc125748d0048fee2/$FILE/contaminants.fasta",
              destfile = mq_path)

This FASTA is also installed with camprotR and you can access it in a similar way to the GPM cRAP FASTA.

# get file path
mq_path <- system.file("extdata", "contaminants.fasta.gz", package = "camprotR")

# load cRAP FASTA
mq <- Biostrings::readAAStringSet(mq_path)
mq

head(names(mq))

Make your own cRAP database

You can add sequences easily enough to a cRAP FASTA database by just copying and pasting. However, for reproducibility purposes it is better to use an R script to generate your own custom cRAP database.

Here we will generate a cRAP database based off of the CCP cRAP, but with some extra protease sequences added to the end.

First you would download the CCP cRAP FASTA. The following chunk is not run by default as we have already downloaded it above. In this chunk we show the "best practice" of naming the CCP cRAP FASTA file with the current UniProt release.

# download CCP cRAP
download_ccp_crap(paste0(check_uniprot_release(), "_CCP_cRAP.fasta"))

Then you download the sequences you want to add and save them into a FASTA file. Here we'll get some protease sequences from the bacteria Streptomyces griseus.

We can download sequences from UniProt automatically using the make_fasta() function. This function takes a character vector of UniProt accessions, downloads their sequences, and then saves the sequences into a FASTA file. We don't want to add the cRAP00X numbering to the sequence headers just yet so we set is_crap = FALSE.

griseus_tmp <- tempfile(fileext = ".fasta")
make_fasta(accessions = c("P00776", "P00777", "P80561"),
                file = griseus_tmp,
                is_crap = FALSE)

Before we add these to our CCP cRAP FASTA, we just need to take note of what cRAP number the sequences headers stop at.

tail(names(ccp_crap))

The final number is 127, so we want the cRAP numbering of the additional sequences to start at 128. Finally, we use the append_fasta() function from camprotR to add our S. griseus sequences to our CCP cRAP FASTA. This function is used to add the sequences from one FASTA (file1) onto the end of another FASTA (file2).

append_fasta(
  file1 = griseus_tmp,
  file2 = ccp_tmp,
  is_crap = TRUE,
  crap_start = 128
)

Lets just check that this worked properly. There should be 130 sequences now.

Biostrings::readAAStringSet(ccp_tmp)


CambridgeCentreForProteomics/camprotR documentation built on July 7, 2024, 2:13 a.m.