preprocess_MSnSet: Preprocess MSnSet objects

View source: R/preprocess_MaxQuant.R

preprocess_MSnSet    R Documentation

Preprocess MSnSet objects

Description

This function performs a standard preprocessing pipeline on MSnSet objects (Gatto and Lilley, 2012). By default, intensity values are log2-transformed and then quantile normalized. Next, the smallestUniqueGroups function is applied, which removes protein groups of which any member protein is also present in a smaller protein group. Then, peptides that need to be filtered out are removed. Next, irrelevant columns are dropped. Then, peptide sequences that are identified only once in a single mass spec run are removed, because with only one identification the model will be perfectly confounded. Finally, experimental annotations, if provided, are added to the data frame.

Usage

preprocess_MSnSet(MSnSet, accession, exp_annotation = NULL,
  type_annot = NULL, aggr_by = NULL, aggr_function = "sum",
  logtransform = TRUE, base = 2, normalisation = "quantiles",
  weights = NULL, smallestUniqueGroups = TRUE, split = NULL,
  useful_properties = NULL, filter = NULL, filter_symbol = NULL,
  minIdentified = 2, external_filter_file = NULL,
  external_filter_accession = NULL, external_filter_column = NULL,
  colClasses = "keep", droplevels = TRUE, printProgress = FALSE,
  shiny = FALSE, message = NULL, details = NULL)

Arguments

MSnSet

An MSnSet object.

accession

A character indicating the column that contains the protein identifiers. This is only used if smallestUniqueGroups is TRUE and/or an external_filter_file is provided. Thus, the accession parameter can safely be specified even when you are not interested in comparing proteins later on.

exp_annotation

Either the path to the file which contains the experiment annotation or a data frame containing the experiment annotation. Exactly one column in the experiment annotation should contain the mass spec run names. An annotation file can be either a tab-delimited text document or an Excel file. For more details, see read.table and read.xlsx. As an error protection measure, leading and trailing spaces in each column are trimmed off. The default, NULL, indicates that there is no annotation to be added (in contrast to the default of the preprocess_generic function!).
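
The trimming of leading and trailing spaces can be pictured with a small base R sketch. The data frame and its column names (run, treatment) are hypothetical, and this is not necessarily how the function implements the step internally.

```r
# Hypothetical annotation with stray spaces; column names are illustrative only.
annotation <- data.frame(
  run = c(" run1", "run2 ", " run3 "),
  treatment = c("ctrl ", " ctrl", "drug"),
  stringsAsFactors = FALSE
)

# Trim leading and trailing whitespace in every character column.
is_chr <- vapply(annotation, is.character, logical(1))
annotation[is_chr] <- lapply(annotation[is_chr], trimws)

annotation$run
```

Without trimming, " run1" and "run1" would be treated as different run names and the annotation would fail to match the intensity columns.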

type_annot

If exp_annotation is a path to a file, the type of file. type_annot is mostly obsolete as supported files will be automatically recognized by their extension. Currently only "tab-delim" (tab-delimited file), "xlsx" (Office Open XML Spreadsheet file) and NULL (file type decided based on the extension) are supported. If the extension is not recognized, the file will be assumed to be a tab-delimited file. Defaults to NULL.

aggr_by

A character indicating the column by which the data should be aggregated. We advise aggregating the data by peptide sequence (thus aggregating over the different charge states and modification statuses of the same peptide). If you only want to aggregate over charge states, set aggr_by to the column corresponding to the modified sequences. The default, NULL, indicates that no aggregation will be performed.

aggr_function

Only used when aggr_by is not NULL. The function used to aggregate intensity data. Defaults to "sum".
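
As a rough illustration of aggregating by peptide sequence with the default "sum", the following base R sketch adds up the intensities of two charge states of the same peptide. The matrix, the sequences and the use of rowsum are assumptions made for the example, not the package's own code.

```r
# Toy intensities: rows are peptide features (e.g. charge states), columns are runs.
intensities <- matrix(
  c(10, 20,
    30, 40,
     5,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("run1", "run2"))
)

# Hypothetical aggr_by column: the first two rows share one peptide sequence.
sequence <- c("PEPTIDEK", "PEPTIDEK", "ANOTHERR")

# rowsum() adds up rows that share the same group label, column by column.
aggregated <- rowsum(intensities, group = sequence)
aggregated
```

The result has one row per distinct peptide sequence instead of one row per feature.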

logtransform

A logical value indicating whether the intensities should be log-transformed. Defaults to TRUE.

base

A positive or complex number: the base with respect to which logarithms are computed. Defaults to 2.

normalisation

A character vector of length one that describes how to normalise the MSnSet object. See normalise for details. Defaults to "quantiles". If no normalisation is wanted, set normalisation="none".
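
To make the two default steps concrete, here is a minimal base R sketch of a log2 transformation followed by a textbook quantile normalisation on a toy matrix without missing values. It is meant to show the idea only; the routine actually invoked through normalise may differ, for instance in how it handles ties and missing values.

```r
# Toy raw intensities: 3 features (rows) x 2 runs (columns), filled column-wise.
raw <- matrix(c(2, 8, 4,
                16, 4, 64),
              nrow = 3,
              dimnames = list(paste0("pep", 1:3), c("run1", "run2")))

# Log-transform with base 2 (the default values of logtransform and base).
logged <- log(raw, base = 2)

# Textbook quantile normalisation:
# 1. sort each column, 2. average across columns at each rank,
# 3. give every value the average that belongs to its rank.
sorted <- apply(unname(logged), 2, sort)
reference <- rowMeans(sorted)
normalised <- apply(logged, 2, function(x) reference[rank(x, ties.method = "first")])
dimnames(normalised) <- dimnames(logged)

normalised
```

After this step every run has the same set of values, only assigned to different features.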

weights

Only used when normalisation is set to "rlr", "loess.fast", "loess.affy" or "loess.pairs". A numeric vector of weights for each row in the MSnSet object to be used for the fitting during the normalisation step. Defaults to NULL.

smallestUniqueGroups

A logical indicating whether protein groups of which any member protein is also present in a smaller protein group should be removed from the dataset. Defaults to TRUE.

split

A character string that indicates the separator between protein groups. Only used when smallestUniqueGroups is set to TRUE.
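
The rule behind smallestUniqueGroups can be sketched in base R as follows. This is a literal reading of the description above, with ";" assumed as the value of split and made-up protein identifiers; the package's own function may resolve overlaps differently (for example iteratively).

```r
# Hypothetical protein groups; ";" is an assumed separator (the split argument).
groups <- c("A", "A;B", "B;C", "C;D;E", "F;G")

members <- strsplit(groups, split = ";", fixed = TRUE)
sizes <- lengths(members)

# Keep a group only if none of its proteins occurs in a strictly smaller group.
keep <- vapply(seq_along(members), function(i) {
  smaller <- unlist(members[sizes < sizes[i]])
  !any(members[[i]] %in% smaller)
}, logical(1))

groups[keep]
```

Here "A;B" is dropped because A already forms a smaller group on its own, and "C;D;E" is dropped because C occurs in the smaller group "B;C".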

useful_properties

The columns of the featureData slot that are useful in the further analysis and/or inspection of the data and should be retained. Defaults to NULL, in which case no additional columns will be retained.

filter

A vector of names corresponding to the columns in the featureData slot of the MSnSet object that contain a filter symbol indicating which rows should be removed from the data. Typical examples are contaminants or reversed sequences. Defaults to NULL, in which case no filtering will be performed.

filter_symbol

Only used when filter is not NULL. A character indicating the symbol in the columns corresponding to the filter argument that is used to indicate rows that should be removed from the data. Defaults to NULL, which will throw an error if filter is not NULL to alert the user to specify a filter symbol.
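
How filter and filter_symbol work together can be illustrated with a small base R sketch. The feature table, the column names Contaminant and Reverse, and the "+" symbol are assumptions for the example; in practice they depend on the software that produced the data.

```r
# Hypothetical feature data with two filter columns.
features <- data.frame(
  sequence = c("AAAK", "BBBR", "CCCK", "DDDR"),
  Contaminant = c("", "+", "", ""),
  Reverse = c("", "", "+", ""),
  stringsAsFactors = FALSE
)

filter <- c("Contaminant", "Reverse")
filter_symbol <- "+"

# A row is flagged when any of the filter columns contains the filter symbol.
flagged <- rowSums(features[, filter, drop = FALSE] == filter_symbol) > 0
kept <- features[!flagged, ]

kept$sequence
```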

minIdentified

A numeric value indicating the minimal number of times a peptide sequence should be identified in the dataset in order not to be removed. Defaults to 2.
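
A minimal sketch of this filter in base R, under the assumption that a missing intensity (NA) means the peptide was not identified in that run; the matrix is invented and the package may count identifications differently.

```r
# Toy log-intensities: NA is assumed to mean "not identified in this run".
intensities <- matrix(
  c(20.1, NA,   19.8,
    NA,   18.2, NA,
    21.0, 21.3, 20.9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pepA", "pepB", "pepC"), c("run1", "run2", "run3"))
)

minIdentified <- 2

# Count the identifications per peptide and keep those that reach the threshold.
n_identified <- rowSums(!is.na(intensities))
filtered <- intensities[n_identified >= minIdentified, , drop = FALSE]

rownames(filtered)
```

pepB is observed only once, so it is removed.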

external_filter_file

The name of an external protein filtering file. Sometimes, users want to filter out proteins based on a separate protein file. This file should contain at least a column with name equal to the value in external_filter_accession containing proteins, and one or more columns on which to filter, with names equal to the input in external_filter_column. Proteins that need to be filtered out should have the filter_symbol in their external_filter_column. Defaults to NULL, in which case no filtering based on an external protein file will be done.

external_filter_accession

Only used when external_filter_file is not NULL. A character indicating the column that contains the protein identifiers in the external_filter_file. Defaults to NULL, which will throw an error if external_filter_file is not NULL to alert the user to specify an accession column.

external_filter_column

Only used when external_filter_file is not NULL. A vector of names containing the column name(s) on which to filter in the external_filter_file. Defaults to NULL, which will throw an error if external_filter_file is not NULL to alert the user to specify a filter column.
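
The interplay of the three external filter arguments with accession and filter_symbol can be sketched as follows. Both tables, all column names and the exact-match rule are assumptions for illustration; in particular, how entries with several protein identifiers are matched is not specified here.

```r
# Hypothetical contents of an external protein file, already read into R.
external <- data.frame(
  Protein = c("P1", "P2", "P3"),
  Remove = c("", "+", ""),
  stringsAsFactors = FALSE
)

# Hypothetical peptide-level feature data.
peptides <- data.frame(
  sequence = c("AAAK", "BBBR", "CCCK"),
  Proteins = c("P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

accession <- "Proteins"
external_filter_accession <- "Protein"
external_filter_column <- "Remove"
filter_symbol <- "+"

# Proteins flagged in any of the external filter columns.
flagged_rows <- rowSums(
  external[, external_filter_column, drop = FALSE] == filter_symbol
) > 0
to_remove <- external[[external_filter_accession]][flagged_rows]

# Drop the peptides whose accession exactly matches a flagged protein.
kept <- peptides[!(peptides[[accession]] %in% to_remove), ]

kept$sequence
```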

colClasses

Only used when the exp_annotation argument is a filepath. A vector of classes to be assumed for the columns of the experimental annotation data frame. Recycled if necessary. If named and shorter than required, names are matched to the column names, with unspecified values taken to be NA. Possible values are "keep" (the default, when the colClasses are unchanged for data frames and type.convert is used for files), NA (when type.convert is always used), NULL (when the column is skipped), one of the atomic vector classes ("logical", "integer", "numeric", "complex", "character", "raw"), or "factor", "Date" or "POSIXct". Otherwise there needs to be an as method (from package methods) for conversion from "character" to the specified formal class.

droplevels

A logical indicating if levels of factors that disappeared during preprocessing should be removed from the data. Defaults to TRUE.

printProgress

A logical indicating whether R should print a message before performing each preprocessing step. Defaults to FALSE.

shiny

A logical indicating whether this function is being used by a Shiny app. Setting this to TRUE only works when using this function in a Shiny app and allows for dynamic progress bars. Defaults to FALSE.

message

Only used when shiny=TRUE. A single-element character vector: the message to be displayed to the user, or NULL to hide the current message (if any).

details

Only used when shiny=TRUE or printProgress=TRUE. A character vector containing the detail messages to be displayed to the user, or NULL to hide the current detail messages (if any). The detail messages will be shown with a de-emphasized appearance relative to the message.

Value

A preprocessed MSnSet object that is ready to be converted into a protdata object.

References

Gatto L, Lilley KS. MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. https://doi.org/10.1093/bioinformatics/btr645. PubMed PMID:22113085.


statOmics/MSqRob documentation built on Dec. 8, 2022, 6 a.m.