preprocess_long: Preprocess data in "long" format

View source: R/preprocess_long.R

preprocess_long R Documentation

Preprocess data in "long" format

Description

Performs a standard preprocessing pipeline on data frames in "long" format, i.e. data frames with one row per measurement and thus multiple rows per subject. If aggr_by is specified, the data are first aggregated by the aggr_by column (typically the peptides column) using a prespecified aggregation function. Next, by default, intensity values are log2-transformed and then quantile-normalized. Next, the smallestUniqueGroups function is applied, which removes protein groups for which any member protein is also present in a smaller protein group. Then, unwanted sequences (such as reversed sequences or contaminants) are filtered out and irrelevant columns are dropped. Peptide sequences that are identified only once, in a single mass spec run, are then removed because with only one identification the model would be perfectly confounded. Finally, any experimental annotations are added to the data frame.
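To make the "long" layout concrete, here is a minimal base R sketch of such a data frame together with the default log2 transformation. All column names and values are hypothetical toy data, and the code only illustrates the idea; it is not the package's implementation.

```r
# Toy "long"-format data: one row per measurement, several rows per run.
# All column names and values are hypothetical.
long_df <- data.frame(
  protein     = c("P1", "P1", "P1", "P1", "P2", "P2"),
  peptide     = c("AAK", "AAK", "CDR", "CDR", "EFK", "EFK"),
  run         = c("run1", "run2", "run1", "run2", "run1", "run2"),
  quant_value = c(1024, 2048, 512, 256, 4096, 8192),
  stringsAsFactors = FALSE
)

# Log-transform the intensities with base 2 (the documented default base).
long_df$log_quant <- log(long_df$quant_value, base = 2)

# Number of measurement rows per mass spec run.
rows_per_run <- table(long_df$run)
rows_per_run
```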

Usage

preprocess_long(df, accession, split, exp_annotation = NULL,
  type_annot = NULL, quant_col = "quant_value", run_col, aggr_by = NULL,
  aggr_function = "sum", logtransform = TRUE, base = 2,
  normalisation = "quantiles", smallestUniqueGroups = TRUE,
  useful_properties = NULL, filter = NULL, filter_symbol = NULL,
  minIdentified = 2, colClasses_df = NA, colClasses_exp = NA,
  printProgress = FALSE, shiny = FALSE, message = NULL, ...)

Arguments

df

A data frame that contains data in "long" format.

accession

A character indicating the column that contains the unit on which you want to do inference (typically the protein identifiers).

split

A character indicating which string is used to separate accession groups.

exp_annotation

Either the path to the file that contains the experiment annotation or a data frame containing the experiment annotation. Exactly one column in the experiment annotation should contain the mass spec run names. An annotation file can be either a tab-delimited text document or an Excel file. For more details, see read.table and read.xlsx. As an error protection measure, leading and trailing spaces in each column are trimmed off. The default, NULL, indicates that there is no (extra) annotation to be added.
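The following base R sketch illustrates what trimming the annotation and attaching it by run name might look like. The data, the column names and the use of merge are assumptions made for illustration; the package may implement this step differently.

```r
# Toy measurement data and annotation; all names and values are hypothetical.
df <- data.frame(
  run         = c("run1", "run1", "run2"),
  peptide     = c("AAK", "CDR", "AAK"),
  quant_value = c(10, 9, 11),
  stringsAsFactors = FALSE
)
annot <- data.frame(
  run       = c(" run1", "run2 "),   # stray spaces, as might occur in a file
  condition = c("A ", " B"),
  stringsAsFactors = FALSE
)

# Trim leading and trailing spaces in every character column.
annot[] <- lapply(annot, function(col) {
  if (is.character(col)) trimws(col) else col
})

# Attach the annotation to each measurement row via the run names.
annotated <- merge(df, annot, by = "run", all.x = TRUE)
annotated
```

Without the trimming step, " run1" would not match "run1" and the corresponding rows would receive NA annotations.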

type_annot

If exp_annotation is a path to a file, the type of file. type_annot is mostly obsolete as supported files will be automatically recognized by their extension. Currently only "tab-delim" (tab-delimited file), "xlsx" (Office Open XML Spreadsheet file) and NULL (file type decided based on the extension) are supported. If the extension is not recognized, the file will be assumed to be a tab-delimited file. Defaults to NULL.

quant_col

A character indicating the column that contains the quantitative values of interest (mostly peptide intensities or peptide areas under the curve). Defaults to "quant_value".

run_col

A character indicating the column in data frame df that contains the mass spec run names.

aggr_by

A character indicating the column by which the data should be aggregated. We advise aggregating the data by peptide sequence (thus aggregating over the different charge states and modification statuses of the same peptide). If you only want to aggregate over charge states, set aggr_by to the column corresponding to the modified sequences. If no aggregation at all is desired, leave aggr_by at NULL (default). Data will never be aggregated across different values of run_col.

aggr_function

Only used when aggr_by is not NULL. The function used to aggregate intensity data. Defaults to "sum".
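As a rough illustration of aggregating by peptide within each run, the base R sketch below sums intensities over charge states. The column names are hypothetical and aggregate is used here only as a stand-in; it is not necessarily how the package performs this step.

```r
# Toy data: the same peptide observed in two charge states per run.
# Column names and values are hypothetical.
df <- data.frame(
  peptide     = c("AAK", "AAK", "AAK", "AAK", "CDR"),
  charge      = c(2, 3, 2, 3, 2),
  run         = c("run1", "run1", "run2", "run2", "run1"),
  quant_value = c(100, 50, 80, 20, 10),
  stringsAsFactors = FALSE
)

# Sum intensities over charge states, separately for each peptide and run,
# so that values from different runs are never combined.
aggregated <- aggregate(quant_value ~ peptide + run, data = df, FUN = sum)
aggregated
```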

logtransform

A logical value indicating whether the intensities should be log-transformed. Defaults to TRUE.

base

Only used when logtransform is TRUE. A positive or complex number: the base with respect to which logarithms are computed. Defaults to 2.

normalisation

A character vector of length one that describes how to normalise the data frame df. See normalise for details. Defaults to "quantiles". If no normalisation is wanted, set normalisation="none".

smallestUniqueGroups

A logical indicating whether protein groups for which any member protein is also present in a smaller protein group should be removed from the dataset. Defaults to TRUE.
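The rule can be sketched in base R as follows. This is one reading of the description above, applied to hypothetical protein groups separated by ";", and should not be taken as the package's actual implementation.

```r
# Hypothetical protein groups; member accessions are separated by ";".
groups  <- c("P1", "P1;P2", "P3;P4", "P3;P4;P5", "P6;P7")
members <- strsplit(groups, split = ";", fixed = TRUE)
sizes   <- lengths(members)

# Keep a group only if none of its members occurs in a strictly smaller group.
keep <- vapply(seq_along(members), function(i) {
  smaller <- unlist(members[sizes < sizes[i]])
  !any(members[[i]] %in% smaller)
}, logical(1))

kept_groups <- groups[keep]
kept_groups
```

Here "P1;P2" is dropped because P1 also forms a group on its own, and "P3;P4;P5" is dropped because P3 and P4 already occur in the smaller group "P3;P4".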

useful_properties

Character vector of column names of the data frame df that are useful in the further analysis and/or inspection of the data and should be retained. All columns that are not in useful_properties, accession, quant_col, run_col or aggr_by will be dropped. Defaults to NULL, in which case only accession, quant_col, run_col and aggr_by will be retained.

filter

A vector of names corresponding to the columns in the data frame df that contain a filter symbol indicating which rows should be removed from the data. Typical examples are contaminants or reversed sequences. Defaults to NULL, indicating that no filtering should be applied.

filter_symbol

Only used when filter is not NULL. A character indicating the symbol in the columns corresponding to the filter argument that is used to indicate rows that should be removed from the data. Defaults to NULL.
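A minimal base R sketch of this kind of filtering is shown below. The column names and the "+" symbol are assumptions chosen for illustration, and the code is not the package's implementation.

```r
# Toy data with two hypothetical filter columns; "+" marks rows to remove.
df <- data.frame(
  peptide     = c("AAK", "CDR", "EFK", "GHR"),
  reverse     = c("", "+", "", ""),
  contaminant = c("", "", "+", ""),
  stringsAsFactors = FALSE
)
filter_cols <- c("reverse", "contaminant")
symbol      <- "+"

# Flag every row in which at least one filter column holds the symbol.
flagged  <- rowSums(df[, filter_cols, drop = FALSE] == symbol) > 0
filtered <- df[!flagged, ]
filtered
```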

minIdentified

A numeric value indicating the minimal number of times a peptide sequence should be identified in the dataset in order not to be removed. Defaults to 2.
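The sketch below shows the idea in base R, assuming that each row of the data frame counts as one identification of its peptide. The column names, the values and the min_identified variable are hypothetical.

```r
# Toy data: one row per identification of a peptide in a run.
df <- data.frame(
  peptide     = c("AAK", "AAK", "CDR", "EFK", "EFK", "EFK"),
  run         = c("run1", "run2", "run1", "run1", "run2", "run3"),
  quant_value = c(10, 11, 9, 12, 13, 12.5),
  stringsAsFactors = FALSE
)
min_identified <- 2

# Count identifications per peptide and keep those that reach the threshold.
counts        <- table(df$peptide)
keep_peptides <- names(counts)[counts >= min_identified]
filtered      <- df[df$peptide %in% keep_peptides, ]
filtered
```

The peptide "CDR" is removed because it was identified only once.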

colClasses_df

character. A vector of classes to be assumed for the columns of the data frame df. Recycled if necessary. If named and shorter than required, names are matched to the column names and unspecified values are taken to be NA. Possible values are NA (the default, when type.convert is used), NULL (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or factor, Date or POSIXct. Otherwise there needs to be an as method (from package methods) for conversion from character to the specified formal class.

colClasses_exp

character. Only used when the exp_annotation argument is a file path. A vector of classes to be assumed for the columns of the experimental annotation data frame. Recycled if necessary. If named and shorter than required, names are matched to the column names and unspecified values are taken to be NA. Possible values are NA (the default, when type.convert is used), NULL (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or factor, Date or POSIXct. Otherwise there needs to be an as method (from package methods) for conversion from character to the specified formal class.
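The colClasses semantics described here mirror those of base R's read.table. The sketch below shows a named, shorter-than-required colClasses vector on a small tab-delimited annotation supplied as inline text; the column names are hypothetical, and whether preprocess_long forwards colClasses_exp to read.table in exactly this way is an assumption.

```r
# A small tab-delimited annotation, supplied as text instead of a file.
txt <- "run\tcondition\treplicate\nrun1\tA\t1\nrun2\tB\t2\n"

# Named colClasses: "run" and "condition" are set explicitly, while the
# unspecified "replicate" column is treated as NA and converted automatically.
annot <- read.table(
  text = txt, header = TRUE, sep = "\t",
  colClasses = c(run = "character", condition = "factor")
)

column_classes <- sapply(annot, class)
column_classes
```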

printProgress

A logical indicating whether R should print a message before performing each preprocessing step. Defaults to FALSE.

shiny

A logical indicating whether this function is being used by a Shiny app. Setting this to TRUE only works when using this function in a Shiny app and allows for dynamic progress bars. Defaults to FALSE.

message

Only used when printProgress=TRUE and shiny=TRUE. A single-element character vector: the message to be displayed to the user, or NULL to hide the current message (if any).

...

Optional arguments to be passed to the normalisation methods.

Value

A preprocessed data frame that is ready to be converted into a protdata object.


statOmics/MSqRob documentation built on Dec. 8, 2022, 6 a.m.