knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette will give you an overview of how you can prepare the quantitative protein/peptide matrix output from common search engines and software such as Spectronaut, MaxQuant, Proteome Discoverer and Skyline for the analysis with protti. Due to its modular and flexible structure protti can be used on the output of common bottom-up proteomics search engines irrespective of the measurement mode (DDA, DIA, targeted-MS).
Furthermore, you are not only restricted to reports from the above mentioned search engines. As long as your data has a tabular format (data frame) and a specific minimal number of data columns you can analyse it with protti. The columns minimally required contain information on sample, condition, intensity, protein ID and the level the intensity is based on (fragment, precursor, peptide) if different from protein intensity. Depending on the analysis many more columns can be useful, but they are not required. Ultimately, your data should have a structure similar to this:
| Sample | Protein ID | Peptide Sequence | Condition | Intensity | |:-------:|:-----------:|:-----------------:|:----------:|:---------:| | sample1 | P62942 | PEPTIDER | treated | 14000 | | sample2 | P62942 | PEPTI | treated | 15000 | | sample3 | P62942 | PEPTIDE | treated | 14500 | | sample4 | P62942 | PEPTIDER | control | 18000 | | sample5 | P62942 | PEPTI | control | 21000 | | sample6 | P62942 | PEPTIDE | control | 19000 |
It is very important, that each unit of the level you perform your analysis on (e.g. peptide) has a single unique intensity associated with it. If, for example, a peptide has two different intensities, protti would not know how to deal with this and many functions will likely fail.
Data should always be organised in a format called tidy data. That means data should be contained in a long format (e.g. all sample names in one column) rather than a wide format (e.g. each sample name in its own column with intensity as the content of the columns). You can easily achieve this by using the
pivot_longer() function from the
tidyr package. The output of many search engines already contains tidy data and working with it is very easy because you can refer to information with only one variable. protti is designed to work together well with the
tidyverse package family that is build around the concept of tidy data.
Many search engines provide the user with protein intensities. However, it is also possible to calculate protein intensities directly from precursor intensities with the protti function
calculate_protein_abundance(). Protti implements the
"iq" method, previously implemented in the R package
iq which performs protein quantification based on the maximal peptide ratio extraction algorithm adapted from the MaxLFQ algorithm (Cox, J. 2013).
One advantage of calculating the protein abundance with protti is the possibility to median normalise run intensities on the precursor level. This is closer to the actually acquired intensities and thus sample concentrations than if normalisation is performed on the protein level. Some search engines provide the option for automatic median normalisation but not all. Furthermore, some search engines calculate protein intensities by summation of precursor intensities irrespective of missingness of peptides in certain samples. In these cases the maximal peptide ratio implemented in extraction algorithm provides a more robust calculation of protein intensities.
If you prefer to use protein intensities provided by the seach engine of your choice this is not a problem and we will show how some of this information can be converted into the right format.
We will demonstrate how most outputs can be converted with functions from the R packages
stringr. You can load packages after you installed them with the
library(magrittr) library(dplyr) library(tidyr) library(stringr)
Note that we are using the R package
magrittr because of its pipe operator
%>%. It takes the output of the preceding function and supplies it as the first argument of the following function. Using
%>% makes code easier to read and follow.
Spectronaut reports already contain data in the tidy data format. Therefore nothing needs to be changed in order to use them with protti. However, the columns we would recommend (not all columns are required) to export from Spectronaut are:
Please make sure that the report is a .csv file. You can use the
read_protti() function in order to load the spectronaut report into R. This function is a wrapper around the fast
fread() function from the
data.table package and the
clean_names() function from the
janitor package. This will allow you to not only load your data into R very fast, but also to clean up the column names into lower snake case. This will make it easier to remember them and to use them in your data analysis. For the Spectronaut columns
R.FileName will change for example into
# To read in your own data you can use read_protti() spectronaut_data <- read_protti(filename = "mydata/spectronaut.csv")
Depending on which analysis you are performing you will have to use different outputs. For peptide-centric analyses we would recommend to use the
evidence.txt file. If you want to perform a protein-centric analysis and you want to use protein quantities calculated by MaxQuant, you need the
proteinGroups.txt file. However, you can also apply the maximal peptide ratio extraction algorithm from the
iq R package implemented in the
protein_abundance_calculation() function of protti. This allows you to only use the
evidence.txt file. The resulting protein intensities are identical since they were calculated with the same algorithm.
In case you are interested in performing a peptide-centric analysis (necessary for LiP-MS), you should use the
evidence.txt file provided in the search output of MaxQuant.
evidence.txt file basically contains all the information we need to run protti. It is also contained in a long format which makes it easy to read in and use directly. One thing to take into consideration is the lack of a column for information on proteotypicity of peptides. However, this information can be inferred from the
Proteins column if it contains more than one protein ID. You can extract this information and create a new column called
is_proteotypic containing logicals that will be
TRUE if the
Proteins column does not contain a semicolon and
FALSE if it does (this indicates that the peptide belongs to more than one protein). As mentioned in the data analysis vignettes this information is necessary for the analysis of LiP-MS data but it could be also considered for the correct calculation of protein abundances.
Another column that is required for the analysis of your data is a column indicating conditions to which certain samples belong. This can be easily added to the evidence file by joining a data frame containing the specific annotations. You can create such a data frame in Excel and import it into R for a large number of samples or just create it directly in R.
MaxQuant output provides information on decoy hits contained in the column
reverse and also has information on whether your hit is a contaminant
potential_contaminant. You should filter these out before the analysis. However, the contaminant column can be used for quality control.
One important thing for MaxQuant data is to make sure that you only have one intensity assigned to each peptide or precursor. You can do this by summing up all intensities that MaxQuant exports (these can be MULTI-MSMS, MSMS, ISO-MSMS, MULTI-MATCH, ISO-SECPEP) or you can filter for example for precursors with MULTI-MSMS quantification and only use these.
In this section we will show you how to read in the file with
read_protti() and how to create the
is_proteotypic column and the
condition column (minimally required) with the help of the
dplyr packages. How to filter your data best is described in the data analysis vignettes.
# To read in your own data you can use read_protti() evidence <- read_protti(filename = "yourpath/evidence.txt") evidence_proteotypic <- evidence %>% # adds new column with logicals that are TRUE if the peptide can be assigned # to only one protein and FALSE if it can be assigned to multiple mutate(is_proteotypic = str_detect( string = proteins, pattern = ";", negate = TRUE )) %>% # adds new column with logicals indicating if peptide is coming from a potential contaminant mutate(is_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE)) # Make an annotation data frame and merge it with your data frame to obtain conditions # We are annotating sample 1-3 as controls and samples 4-6 as treated conditions file_name <- c( # make sure that the names are the same name as in your report "sample1", "sample2", "sample3", "sample4", "sample5", "sample6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation evidence_annotated <- evidence_proteotypic %>% left_join(y = annotation, by = "file_name")
For protein-centric analyses you can use the
proteinGroups.txt file provided by MaxQuant. This file contains information in a wide format where each sample has its own column containing intensity values. Therefore, we need to transform this data into a long format to meet the conditions of tidy data.
We will filter the data and use
pivot_longer() to change the format to long format. Furthermore, we produce an annotation data frame to create a
conditions column. The filtering is only done in order to remove proteins with potentially low quality. Further filtering for decoys and potential contaminants should be performed based on the data analysis vignettes.
# To read in your own data you can use read_protti() protein_groups <- read_protti(filename = "yourpath/proteinGroups.txt") %>% # adds new column with logicals indicating if protein is a potential contaminant, # you can filter these out later on. You should also consider filtering out proteins # that were "only identified by site" and reverse hits, as well as proteins with only # one identified peptide mutate(is_potential_contaminant = ifelse(potential_contaminant == "+", TRUE, FALSE)) # Change wide format to long format and create new columns called `r_file_name`and `intensity` protein_groups_long <- protein_groups %>% pivot_longer( cols = starts_with("intensity_"), names_to = "file_name", values_to = "intensity" ) # Make an annotation data frame and merge it with your data frame to obtain conditions # We are annotating sample 1-3 as controls and samples 4-6 as treated conditions file_name <- c( # make sure that the names are the same name as in your report "intensity_sample1", "intensity_sample2", "intensity_sample3", "intensity_sample4", "intensity_sample5", "intensity_sample6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation protein_groups_annotated <- protein_groups_long %>% left_join(y = annotation, by = "file_name")
The Skyline output is already in long format, however, to process it you need to sum up the transition intensities to obtain the intensity of one precursor. If you prefer to analyse your data on the fragment level, you should create a column that uniquely identifies each fragment of each precursor. You could do that by pasting together the peptide sequence with the charge and the product m/z.
The required Skyline output columns include:
You can add replicate and condition annotations in Skyline directly. However, we will explain in this section how you can also do it in R. If you want to analyse your data on the protein abundance level you will have to combine the precursor intensities to obtain one value for protein abundance. This could be done using the
calculate_protein_abundance() function from protti.
# Load data skyline_data <- read_protti(filename = "yourpath/skyline.csv") skyline_data_int <- skyline_data %>% # create a column with precursor information mutate(precursor = paste0(peptide_sequence, "_", charge)) %>% group_by(replicate_name, precursor) %>% # making a new column containing the summed up intensities of all transitions of one precursor mutate(sum_intensity = sum(area)) %>% select(-c(product_mz, area)) %>% # removing the columns we don't need distinct() # removing duplicated rows from the data frame # Add annotation # make sure that the names are the same name as in your report replicate_name <- c( "sample_1", "sample_2", "sample_3", "sample_1", "sample_2", "sample_3" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(replicate_name, condition) # Combine your long data frame with the annotation skyline_annotated <- skyline_data_int %>% left_join(y = annotation, by = "replicate_name")
The Proteome Discoverer output contains data in wide format (one column for each sample). Similar to MaxQuant there is also the option for a peptide or a protein-centric export. We will discuss both cases in this segment.
For a peptide-centric or a LiP-MS analysis please export the "Peptide Groups" report. Before preparing your export you can add the column "sequence" to your table otherwise Proteome Discoverer will only export the "annotated sequence" column which includes the preceding and following amino acids in the protein sequence.
The required columns include:
After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.
We will read in the file using
read_protti() and then select the columns we are interested in. You can use the
contaminant column for qualitiy control. The
number_proteins column contains information on the proteotypicity. If this is 1 then the peptide is proteotypic. If you want to analyse your data qualitatively only with quality control functions of protti you can keep peptides without quantifications. Before you start your quantitative analysis remove observations that are labeled
"No Quan Values" in the
quan_info column. In the below example they are filtered out at this step, but you can keep them and only filter them out later.
# Load data pd_pep_data <- read_protti("yourpath/PDpeptides.csv") # Select relevant columns pd_pep_selected <- pd_pep_data %>% select( sequence, modifications, number_proteins, contaminant, master_protein_accessions, starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped" quan_info ) # Filter data frame pd_pep_filtered <- pd_pep_selected %>% filter(contaminant == FALSE) %>% # remove annotated contaminants filter(number_proteins == 1) %>% # select proteotypic peptides filter(quan_info != "No Quan Values") # remove peptides that have no quantification values # Convert into long format pd_pep_long <- pd_pep_filtered %>% pivot_longer( cols = starts_with("abundances"), names_to = "file_name", values_to = "intensity" ) %>% # combine peptide sequence and modifications to make a precursor column mutate(precursor = paste(sequence, modifications)) # Make annotation data frame file_name <- c( # make sure that the names are the same name as in your report "abundances_grouped_f1", "abundances_grouped_f2", "abundances_grouped_f3", "abundances_grouped_f4", "abundances_grouped_f5", "abundances_grouped_f6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation pd_pep_long_annotated <- pd_pep_long %>% left_join(y = annotation, by = "file_name")
For a protein-centric or analysis please export the "Proteins" report.
The required columns include:
After saving the report as an Excel file please convert it to a .csv file, simply by opening it and saving it as such.
We will read in the file using
read_protti() and then select the columns we are interested in. Similar to above you can either filter the
number_peptides columns now or later.
# Load data pd_prot_data <- read_protti("yourpath/PDproteins.csv") # Select relevant columns pd_prot_selected <- pd_prot_data %>% select( accession, description, contaminant, number_peptides, starts_with("abundances_grouped"), # select all columns that start with "abundances_grouped" ) # Filter data frame pd_prot_data_filtered <- pd_prot_selected %>% filter(contaminant == FALSE) %>% # remove annotated contaminants filter(number_peptides > 1) # select proteins with more than one identified peptide # Convert into long format pd_prot_long <- pd_prot_data_filtered %>% pivot_longer( cols = starts_with("abundances"), names_to = "file_name", values_to = "intensity" ) # Make annotation data frame file_name <- c( # make sure that the names are the same name as in your report "abundances_grouped_f1", "abundances_grouped_f2", "abundances_grouped_f3", "abundances_grouped_f4", "abundances_grouped_f5", "abundances_grouped_f6" ) condition <- c( "control", "control", "control", "treated", "treated", "treated" ) annotation <- data.frame(file_name, condition) # Combine your long data frame with the annotation pd_prot_long_annotated <- pd_prot_long %>% left_join(y = annotation, by = "file_name")
As mentioned in the beginning of this vignette you can use the output of any search engine as long as it contains the minimally required columns. If it is not in the right format you can see if some of the above transformations can be applied to your data. It is also always useful to check if you can find additional columns that help you in your analysis and that you can export from your search engine. Always make sure that all of your observations are ones you are interested in. Check if there are decoys, contaminants or non-proteotypic peptides in your data. For protein-centric analysis, potentially remove quantifications that rely on only a few peptides.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.