Data Description

The DepLabData package uses data from experiments done by a proteomics lab at Weill Cornell. MaxQuant outputs 276 variables for 5,208 proteins identified in the PCP experiments, with 2 replicates for each of three conditions: EV (empty vector); WT (overexpressed ERK2, which recapitulates the epithelial phenotype); and DN (mutant ERK2, which accelerates the epithelial-to-mesenchymal transition).

Data Cleaning

We provide the cleaned data as rdata objects in /data. Here we outline how the raw data are processed by the function cleaning_MQ.

1. Initial check: The function issues a warning if the input data frame is empty or if no cleaning steps are given.

2. By default, contaminants from the experiment are removed. In proteomics experiments, pig trypsin is used as a control for assessing the efficiency of the mass spec experiment. Decoys, which are artificial peptide sequences generated by MaxQuant and used to estimate the rate of false identifications, are also removed.

3. The organism name given to poi is used to extract proteins whose identifiers are specific to that organism and to no other.

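4. Spike-ins, which by default are none, are extracted from the data set: when spikeIn is given, only the entries matching the spike-in IDs are kept. This option cannot be combined with poi.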

# Not to be run; shown only to explain the cleaning function
remove.contaminants = TRUE
remove.decoys = TRUE
poi = NULL
spikeIn = NULL

  # Currently, this function will subset the data.frame more and more, thus
  # multiple filtering options may clash. E.g., if the data.frame is already
  # filtered to only contain trypsin-related entries, it will most likely not
  # find anything related to a Uniprot search for non-trypsin proteins.

  if(nrow(mq.df) == 0){warning("The input to cleaning_MQ is empty.")}

 mq.out <- mq.df

 if(!remove.contaminants && !remove.decoys && is.null(poi) && is.null(spikeIn)){
    warning("Note that none of the offered filtering options is set.
            The in-going data frame should be the same as the out-going one.")
    }

 if(remove.contaminants){
    mq.out <- subset(mq.out, !grepl("CON", mq.out$Protein.IDs))
  }

 if(remove.decoys){
    mq.out <- subset(mq.out, !grepl("REV", mq.out$Protein.IDs))
  }

 if(!is.null(poi)){

   if(poi == "yeast"){

     mq.out <- subset(mq.out, grepl("^[YQ]+", Protein.IDs))

   }else if(poi == "human"){

     # the massive regex in the middle is from TrEMBL (http://www.uniprot.org/help/accession_numbers)
      mq.out <- subset(mq.out, grepl("([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})", Protein.IDs))

     }else{stop("If you would like to retrieve the proteins of interest for the
                 organism for which the experiment was done, specify one of the
                 available options: ‘yeast’ or ‘human’.")
        }
  }

 # extracting spiked-in proteins
  if(!is.null(spikeIn)){

   if(!is.null(poi)){
      stop("The option to retrieve spike-in entries is not compatible with
           retrieving yeast or human entries. Set `poi = NULL`.")
    }

   if(remove.contaminants){
      warning("Extracting the results for spike-ins while at the same time
              removing contaminants will probably not yield the desired results
              (if anything). Recommended settings: remove.contaminants = FALSE,
              remove.decoys = TRUE, poi = NULL")
    }

   if(all(spikeIn == "trypsin")){

     mq.out <- subset(mq.out, grepl("P00761$|P00761[^a-zA-Z0-9]", Protein.IDs) &
                           grepl("CON", Protein.IDs))
      }else{
        ID.check <- check_nomenclature(spikeIn)
        if(!all(ID.check)){stop("The ID(s) you supplied to `spikeIn =` do(es)
            not meet the UniProt or yeast gene nomenclature criteria.")}
        reg.1 <- paste(paste(spikeIn, "$", sep = ""), collapse="|")
        reg.2 <- paste(paste(spikeIn, "[^a-zA-Z0-9]", sep = ""), collapse="|")
        reg.combi <- paste(reg.1, reg.2, sep = "|", collapse = "")
        mq.out <- subset(mq.out, grepl( reg.combi, Protein.IDs))
        }
    }

 # done cleaning
  if(dim(mq.out)[1] == 0){
    warning("None of the entries in the MaxQuant output survived the cleaning.
            Check that you selected the correct organism for the data that you uploaded.")
    }

 dim(mq.out)
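
To make the filtering logic concrete, here is a minimal sketch that applies the same grepl patterns to a small made-up data frame. The protein IDs are chosen for illustration only, and toy is a hypothetical stand-in for mq.df; real MaxQuant output has many more columns.

```r
# Toy stand-in for a MaxQuant protein table (entries are illustrative only)
toy <- data.frame(
  Protein.IDs = c("P28482", "CON__P00761", "REV__Q9Y6K9", "Q13164;P27361"),
  Intensity   = c(1e6, 5e5, 2e4, 3e5),
  stringsAsFactors = FALSE
)

# Drop contaminants (tagged CON), then decoys (tagged REV)
no.con  <- subset(toy, !grepl("CON", Protein.IDs))
cleaned <- subset(no.con, !grepl("REV", Protein.IDs))

# Spike-in pattern as built in the function: the ID must be followed by the
# end of the string or by a non-alphanumeric character
spikeIn   <- "P00761"
reg.1     <- paste(paste(spikeIn, "$", sep = ""), collapse = "|")
reg.2     <- paste(paste(spikeIn, "[^a-zA-Z0-9]", sep = ""), collapse = "|")
reg.combi <- paste(reg.1, reg.2, sep = "|")
spikes    <- subset(toy, grepl(reg.combi, Protein.IDs))

cleaned$Protein.IDs   # the two entries that are neither contaminant nor decoy
spikes$Protein.IDs    # the trypsin entry only
```

Because the spike-in pattern requires a delimiter or the end of the string after the ID, a longer identifier that merely starts with the same characters is not matched. Note also that the spike-in search is run on the unfiltered table here, which mirrors the recommendation to set remove.contaminants = FALSE when extracting spike-ins.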


#code for cleaning data

#code for reading:
reading_MQ <- function(filename){

 mq.in <- read.table(filename, header=TRUE, sep="\t",
                   strip.white = TRUE, fill = TRUE, stringsAsFactors = FALSE,
                   comment.char = "") # important to also capture cases with a # in the Fasta header

dim(mq.in)
}
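
The comment.char = "" setting matters because read.table treats # as a comment character by default and discards the rest of the line. A small sketch with a temporary file of made-up contents shows the difference; the file contents and the object names mq.keep and mq.drop are assumptions for illustration.

```r
# Write a tiny tab-separated file whose second column contains a '#'
tmp <- tempfile(fileext = ".txt")
writeLines(c("Protein.IDs\tFasta.headers\tIntensity",
             "P28482\tMAPK1 #isoform 1\t1000",
             "P27361\tMAPK3\t2000"), tmp)

# With comment.char = "" the '#' is kept as ordinary text
mq.keep <- read.table(tmp, header = TRUE, sep = "\t", strip.white = TRUE,
                      fill = TRUE, stringsAsFactors = FALSE, comment.char = "")

# With the default comment.char = "#" everything after the '#' is discarded,
# and fill = TRUE pads the shortened row instead of raising an error
mq.drop <- read.table(tmp, header = TRUE, sep = "\t", strip.white = TRUE,
                      fill = TRUE, stringsAsFactors = FALSE)

mq.keep$Fasta.headers[1]   # full header text, including the '#'
mq.drop$Fasta.headers[1]   # truncated header text
unlink(tmp)
```

The combination of fill = TRUE and the default comment character is the risky one: the truncated row is padded silently, so the loss would only show up later as missing values.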

A short description of each processed dataset is available on the help page of the corresponding cleaned object. For example (you'll need to run this in the console/RStudio to view the help page):

path = "/Users/nickgiangreco/GitHub/DepLabData/data/"

objfile<-list.files(path=path)[1]

load( paste0( path, objfile) )

dim(DN_trial1)

require(devtools)
#DepLabData package location in local
#install( "/Users/nickgiangreco/GitHub/DepLabData" )
library(DepLabData)

?DN_trial1

MaxQuant outputs processed MS/MS data with seven descriptive variables. These variables and other relevant information are described below.

1) Raw Intensity: The total ion current of all ions associated with the given peptide. This may provide a rough idea of the abundance of a specific peptide but cannot be used as a reliable estimate of exact quantities. Many factors influence peptide abundance regardless of the amount of starting material (protein isolated from experiment) provided as input in the MS instrument.

2) LFQ: MQ uses a method called "label-free quantification" to estimate the quantity of a given protein in the experimental sample relative to the control samples. It serves as an alternative to label-based quantification methods such as SILAC (stable isotope labeling by amino acids in cell culture) and works by comparing the amounts of a given sequenced peptide between two or more samples.

3) MS/MS Count: The number of sequencing events that have been recorded for the given peptide by tandem mass spectrometry. This is determined by the number of MS/MS spectra that are determined to match the given peptide. A low MS/MS count may indicate an inadequate amount of starting material or low quality MS/MS spectra.

4) Peptide Count: The total number of peptides associated with the given protein that have been identified by the MS/MS spectra.

5) Unique Peptides only: Of the peptides associated with an identified protein, unique peptides are those that have been assigned to only one protein group.

6) Razor and Unique Peptides: A "razor" peptide is a peptide shared between several protein groups that MaxQuant assigns, following the principle of Occam's razor, to the single group that has the most other identified peptides. A shared peptide therefore counts as a razor peptide in only one of the groups it could belong to. The "razor and unique peptides" count for a protein group is the number of its unique peptides plus the razor peptides assigned to it.

7) Sequence Coverage: For a given protein identified by MS/MS sequencing, the sequence coverage is the percentage of that protein's full amino acid sequence that is recapitulated by the MS/MS peptide sequences used to identify the presence of the protein. 100% sequence coverage would indicate that every amino acid in the protein sequence is represented at least once in the peptides that identify that protein.
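
As a worked illustration of sequence coverage (item 7), the sketch below marks every residue covered by at least one identified peptide and reports the covered percentage. The protein sequence and peptides are invented, and sequence_coverage is a hypothetical helper, not a function from DepLabData or MaxQuant.

```r
# Hypothetical helper: percentage of residues in `protein` that fall inside
# at least one literal occurrence of one of the `peptides`
sequence_coverage <- function(protein, peptides) {
  covered <- logical(nchar(protein))
  for (pep in peptides) {
    # gregexpr with fixed = TRUE returns start positions of literal matches,
    # or -1 when the peptide does not occur in the protein
    starts <- gregexpr(pep, protein, fixed = TRUE)[[1]]
    for (s in starts[starts > 0]) {
      covered[s:(s + nchar(pep) - 1)] <- TRUE
    }
  }
  100 * mean(covered)
}

# Made-up protein built from three made-up tryptic-style peptides
peptides <- c("MAAAAAAGAGPEMVR", "GQVFDVGPR", "YTNLSYIGEGAYG")
protein  <- paste0(peptides, collapse = "")

sequence_coverage(protein, peptides)        # every residue is covered
sequence_coverage(protein, peptides[2:3])   # N-terminal peptide not identified
```

Tracking coverage per residue, rather than summing peptide lengths, keeps overlapping peptides from being counted twice.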

Notes on QC

Spike-ins: Spike-ins are a form of positive control in MS experiments that allow you to validate that the spectrometer is functioning properly. A small amount of a known protein (typically pig trypsin) is added to the experimental sample before the MS analysis. We then look for the expected spectra of the spike-in protein in the processed data of the experiment. Abnormalities in the spectra matched to the spike-in may indicate that the spectrometer is not functioning properly.

Contaminants: Common contaminants include keratin (from human hair and nails) and proteins present in BSA (bovine serum albumin). Peptides identified from species other than the target species of the experiment indicate outside sources of contamination (for a human experiment, bacterial or mouse proteins may be common contaminants). Contaminant peptides are automatically identified by MaxQuant.

Decoy peptides: MS experiments that identify a large number of proteins in a single run (possibly several thousand) must separate correctly interpreted spectra from false positives, in which a peptide or protein is incorrectly inferred. The proportion of such errors is the false discovery rate. The target-decoy method estimates this error rate in large-scale MS experiments by including manufactured protein sequences that do not exist in nature in the database search space. Decoy sequences are considered alongside real sequences derived from the species of interest while analyzing the MS spectra, and will be matched at some non-zero rate. The rate and specifics of matched decoy sequences can then be used to estimate error rates and to refine the search parameters for better sensitivity and more accurate hits. Decoy peptides are typically generated as reversed sequences from the organism of interest.
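
The target-decoy idea can be sketched in a few lines. Assuming the decoy search space is the same size as the target search space, the number of decoy hits above a score threshold estimates the number of false target hits above that threshold, so the FDR is approximately decoys / targets. The scores below are invented and estimate_fdr is a hypothetical helper; MaxQuant's own procedure is more involved.

```r
# Made-up search results: a score per match and whether the match is a decoy
hits <- data.frame(
  score    = c(95, 90, 88, 80, 75, 70, 62, 60, 55, 50),
  is.decoy = c(FALSE, FALSE, FALSE, FALSE, TRUE,
               FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical helper: estimated FDR among target hits scoring >= threshold
estimate_fdr <- function(hits, threshold) {
  kept     <- hits[hits$score >= threshold, ]
  n.decoy  <- sum(kept$is.decoy)
  n.target <- sum(!kept$is.decoy)
  if (n.target == 0) return(NA_real_)   # nothing passes, FDR is undefined
  n.decoy / n.target
}

estimate_fdr(hits, 80)   # no decoys at or above this score
estimate_fdr(hits, 60)   # 2 decoys against 6 targets
```

Lowering the threshold admits more target hits but also more decoys, which is the trade-off between sensitivity and error rate described above.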



julia-wrobel/DepLabData documentation built on May 24, 2019, 4:07 a.m.