View source: R/preprocess_wide.R
preprocess_wide | R Documentation |
Performs a standard preprocessing pipeline on data frames in "wide" format (i.e. the data frame has one observation row per subject with each measurement present as a different variable).
By default, intensity values are log2 transformed and then quantile normalized. Next, the smallestUniqueGroups function is applied, which removes protein groups for which any member protein is also present in a smaller protein group. Then, unwanted sequences (such as reverse sequences) are filtered out.
Next, irrelevant columns are dropped. Then, peptide sequences that are identified only once in a single mass spec run are removed, because with only one identification the model will be perfectly confounded. Finally, any experimental annotations are added to the data frame.
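The default transformation described above (log2 followed by quantile normalisation) can be sketched in base R. This is a minimal illustration only, not the package's own implementation: the toy matrix, the helper name quantile_normalise and the choice to break ties by order of appearance are all assumptions, and the sketch does not handle missing values.

```r
# Hypothetical toy data: 4 peptides (rows) x 3 runs (columns), no missing values
intensities <- matrix(
  c(1024, 2048,  512, 4096,
    2048, 1024,  256, 8192,
     512, 4096, 1024, 2048),
  nrow = 4,
  dimnames = list(paste0("pep", 1:4), paste0("run", 1:3))
)

# Step 1: log-transform (corresponds to logtransform = TRUE, base = 2)
log_int <- log(intensities, base = 2)

# Step 2: quantile normalisation (illustrative helper, not a package function).
# Every run is forced onto the same distribution: the mean of the sorted runs.
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each run
  sorted <- apply(x, 2, sort)                         # sort each run
  target <- rowMeans(sorted)                          # mean of k-th smallest values
  normalised <- apply(ranks, 2, function(r) target[r])
  dimnames(normalised) <- dimnames(x)
  normalised
}

norm_int <- quantile_normalise(log_int)
```

After this step every column of norm_int contains the same set of values, only in a different order, which is the defining property of quantile normalisation.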
preprocess_wide(df, accession, split, exp_annotation = NULL, type_annot = NULL, quant_cols, aggr_by = NULL, aggr_function = "sum", logtransform = TRUE, base = 2, normalisation = "quantiles", smallestUniqueGroups = TRUE, useful_properties = NULL, filter = NULL, filter_symbol = NULL, minIdentified = 2, colClasses = NA, printProgress = FALSE, shiny = FALSE, message = NULL, ...)
df |
A data frame that contains data in "wide" format. |
accession |
A character indicating the column that contains the unit on which you want to do inference (typically the protein identifiers). |
split |
A character indicating which string is used to separate accession groups. |
exp_annotation |
Either the path to the file which contains the experiment annotation or a data frame containing the experiment annotation. Exactly one column in the experiment annotation should contain the mass spec run names. Annotation in a file can be either a tab-delimited text document or an Excel file. For more details, see |
type_annot |
If |
quant_cols |
Either a character or numeric vector indicating the columns that contain the quantitative values of interest (mostly peptide intensities or peptide areas under the curve) or a character string of length one indicating a pattern that is unique for the column names of the columns that contain the quantitative values of interest. |
aggr_by |
A character indicating the column by which the data should be aggregated. We advise to aggregate the data by peptide sequence (thus aggregate over different charge states and modification statuses of the same peptide). If you only want to aggregate over charge states, set |
aggr_function |
Only used when |
logtransform |
A logical value indicating whether the intensities should be log-transformed. Defaults to |
base |
A positive or complex number: the base with respect to which logarithms are computed. Defaults to 2. |
normalisation |
A character vector of length one that describes how to normalise the data frame |
smallestUniqueGroups |
A logical indicating whether protein groups for which any member protein is also present in a smaller protein group should be removed from the dataset. Defaults to |
useful_properties |
The columns of the data frame |
filter |
A vector of names corresponding to the columns in the data frame |
filter_symbol |
Only used when |
minIdentified |
A numeric value indicating the minimal number of times a peptide sequence should be identified in the dataset in order not to be removed. Defaults to 2. |
colClasses |
character. Only used when the |
printProgress |
A logical indicating whether R should print a message before performing each preprocessing step. Defaults to |
shiny |
A logical indicating whether this function is being used by a Shiny app. Setting this to |
message |
Only used when |
... |
Extra parameters to be passed to the normalisation functions. |
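To make the interaction between split and smallestUniqueGroups more concrete, the following base R sketch applies the rule as it is worded above: a protein group is dropped when any of its member proteins also occurs in a group with fewer members. The accession strings and the ";" separator are hypothetical, and the package's actual function may resolve edge cases differently.

```r
# Hypothetical accession groups, separated by ";" (the value passed as `split`)
accessions <- c("P1", "P1;P2", "P2;P3", "P4;P5;P6", "P5;P6", "P7")

# Split every group into its member proteins and record the group sizes
members <- strsplit(accessions, split = ";", fixed = TRUE)
sizes   <- lengths(members)

# Keep a group only if none of its members appears in a strictly smaller group
keep <- vapply(seq_along(members), function(i) {
  smaller <- unlist(members[sizes < sizes[i]])
  !any(members[[i]] %in% smaller)
}, logical(1))

kept_groups <- accessions[keep]
kept_groups
```

Here "P1;P2" is dropped because P1 also forms a group on its own, and "P4;P5;P6" is dropped because P5 and P6 occur in the smaller group "P5;P6".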
A preprocessed MSnSet object that is ready to be converted into a protdata object.
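As an illustration of two of the arguments, the base R sketch below shows how a single pattern string can select the quantitative columns (as described for quant_cols) and how a minIdentified threshold removes rarely observed peptides. The data frame, its column names and the interpretation of "identified" as "has a non-missing intensity in a run" are assumptions made for this example; the function itself may count identifications differently.

```r
# Hypothetical wide data frame: one row per peptide, one intensity column per run
df <- data.frame(
  Sequence       = c("AAK", "CCR", "DDK", "EEK"),
  Proteins       = c("P1", "P1", "P2", "P3"),
  Intensity.run1 = c(10.1, NA, 9.5, NA),
  Intensity.run2 = c(10.4, 8.2, NA, NA),
  Intensity.run3 = c(9.9, NA, 9.7, NA),
  stringsAsFactors = FALSE
)

# A pattern that is unique to the quantitative columns selects all of them
quant_cols <- grep("Intensity.", names(df), fixed = TRUE)

# Count the runs in which each peptide has a non-missing intensity
n_identified <- rowSums(!is.na(df[, quant_cols]))

# Keep only peptides observed at least `minIdentified` times
minIdentified <- 2
filtered <- df[n_identified >= minIdentified, ]
filtered$Sequence
```

With the threshold of 2, the peptides "CCR" (one observation) and "EEK" (no observations) are removed.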