dataProcess: Process MS data: clean, normalize and summarize before...

View source: R/dataProcess.R

dataProcess R Documentation

Process MS data: clean, normalize and summarize before differential analysis

Description

Process MS data: clean, normalize and summarize before differential analysis

Usage

dataProcess(
  raw,
  logTrans = 2,
  normalization = "equalizeMedians",
  nameStandards = NULL,
  featureSubset = "all",
  remove_uninformative_feature_outlier = FALSE,
  min_feature_count = 2,
  n_top_feature = 3,
  summaryMethod = "TMP",
  equalFeatureVar = TRUE,
  censoredInt = "NA",
  MBimpute = TRUE,
  remove50missing = FALSE,
  fix_missing = NULL,
  maxQuantileforCensored = 0.999,
  use_log_file = TRUE,
  append = FALSE,
  verbose = TRUE,
  log_file_path = NULL,
  numberOfCores = 1
)

Arguments

raw

name of the raw (input) data set.

logTrans

base of logarithm transformation: 2 (default) or 10.

normalization

Normalization to remove systematic bias between MS runs. Three normalizations are supported: 'equalizeMedians' (default) performs constant normalization (equalizing the medians) based on reference signals; 'quantile' performs quantile normalization based on reference signals; 'globalStandards' performs normalization with global standards. If FALSE, no normalization is performed.
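
To make the idea of equalizing medians concrete, the following is a minimal base-R sketch on a hypothetical toy data frame. It is a simplified illustration and not the package's internal code: it assumes a label-free design with a single fraction, uses every feature as the reference signal, and shifts each run so that its median log-intensity matches the median of all run medians. The names `toy`, `run_medians`, `target` and `NORMALIZED` are made up for this example.

```r
# Toy log2 intensities for two runs (hypothetical data, not MSstats output).
toy <- data.frame(
  RUN = rep(c("run1", "run2"), each = 3),
  ABUNDANCE = c(10, 11, 12, 12, 13, 14)
)

# Median log-intensity of each run, and the common target
# (assumed here to be the median of the run medians).
run_medians <- tapply(toy$ABUNDANCE, toy$RUN, median, na.rm = TRUE)
target <- median(run_medians)

# Shift every run by a constant so that its median equals the target.
shift <- unname(run_medians[as.character(toy$RUN)])
toy$NORMALIZED <- toy$ABUNDANCE - shift + target

# After the shift, both runs share the same median.
normalized_medians <- tapply(toy$NORMALIZED, toy$RUN, median, na.rm = TRUE)
print(normalized_medians)
```

Because only a constant is added within each run, differences between features inside a run are preserved; only the run-to-run offset is removed.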

nameStandards

optional vector of global standard peptide names. Required only for normalization with global standard peptides.

featureSubset

"all" (default) uses all features in the data set. "top3" uses the 3 features with the highest average log-intensity across runs. "topN" uses the N features with the highest average log-intensity across runs; N is set with the n_top_feature option. "highQuality" flags uninformative features and outliers.

remove_uninformative_feature_outlier

Optional. Only required if featureSubset = "highQuality". TRUE removes 1) noisy features (flagged as "Uninformative" in the column feature_quality) and 2) outliers (flagged as TRUE in the column is_outlier) before run-level summarization. FALSE (default) uses all features and intensities for run-level summarization.

min_feature_count

optional. Only required if featureSubset = "highQuality". Defines a minimum number of informative features a protein needs to be considered in the feature selection algorithm.

n_top_feature

Optional. Only required if featureSubset = 'topN'. In that case, it specifies the number of top features that will be used. Default is 3, which means the top 3 features are used.
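
The "topN" selection rule described above can be sketched in base R as follows. This is an illustrative approximation under stated assumptions, not the package's implementation: the toy data frame holds the features of a single hypothetical protein, and `n_top`, `feature_means` and `top_features` are names invented for this example.

```r
# Toy log-intensities for four features of one protein in two runs
# (hypothetical data).
toy <- data.frame(
  FEATURE = rep(c("f1", "f2", "f3", "f4"), times = 2),
  RUN = rep(c("run1", "run2"), each = 4),
  ABUNDANCE = c(10, 14, 12, 8, 11, 15, 13, 9)
)

# Hypothetical stand-in for the n_top_feature argument.
n_top <- 2

# Average log-intensity of each feature across runs.
feature_means <- tapply(toy$ABUNDANCE, toy$FEATURE, mean, na.rm = TRUE)

# Keep the n_top features with the highest average.
top_features <- names(sort(feature_means, decreasing = TRUE))[seq_len(n_top)]
toy_top <- toy[toy$FEATURE %in% top_features, ]
print(top_features)
```

With real data the ranking would be done separately within each protein; the single-protein toy case keeps the sketch short.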

summaryMethod

"TMP" (default) means Tukey's median polish, which is a robust estimation method. "linear" uses a linear mixed model.
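
As a rough illustration of what a median-polish summary looks like, the sketch below applies `stats::medpolish` to a hypothetical feature-by-run matrix for one protein and reads the run-level summary as the overall effect plus the run (column) effect. The matrix `x` and the name `run_summary` are assumptions made for this example; the sketch only shows the general technique and does not reproduce the full processing pipeline.

```r
# Rows are features, columns are runs (hypothetical log2 intensities).
# One value is missing to show that the fit tolerates NAs.
x <- matrix(
  c(10, 11, 12,
    12, 13, 14,
    11, 12, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("f1", "f2", "f3"), c("run1", "run2", "run3"))
)

# Tukey's median polish: x is decomposed into
# overall + row effect + column effect + residual.
fit <- medpolish(x, na.rm = TRUE, trace.iter = FALSE)

# Run-level summary for the protein: overall effect plus run effect.
run_summary <- fit$overall + fit$col
print(run_summary)
```

Because medians rather than means are used at every step, a single aberrant feature intensity has limited influence on the run-level summary.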

equalFeatureVar

Only for summaryMethod = "linear". Logical variable for whether the model should account for heterogeneous variation among intensities from different features. TRUE (default) assumes equal variance among intensities from features. FALSE does not assume equal variance, so the model accounts for heterogeneous variation among features.

censoredInt

Specifies whether missing values are censored or missing at random. 'NA' (default) assumes that all NAs in the 'Intensity' column are censored. '0' treats zero intensities as censored; in this case, NA intensities are missing at random. Output from Skyline should use '0'. NULL assumes that all NA intensities are missing at random.

MBimpute

Only for summaryMethod = "TMP" and censoredInt = 'NA' or '0'. TRUE (default) imputes 'NA' or '0' (depending on the censoredInt option) using an accelerated failure model. FALSE uses the values assigned by cutoffCensored.

remove50missing

only for summaryMethod = "TMP". TRUE removes the proteins where every run has at least 50% missing values for each peptide. FALSE is default.

fix_missing

Optional, same as the 'fix_missing' parameter in MSstatsConvert::MSstatsBalancedDesign function

maxQuantileforCensored

Maximum quantile for deciding censored missing values, default is 0.999

use_log_file

logical. If TRUE, information about data processing will be saved to a file.

append

logical. If TRUE, information about data processing will be added to an existing log file.

verbose

logical. If TRUE, information about data processing will be printed to the console.

log_file_path

character. Path to a file to which information about data processing will be saved. If not provided, such a file will be created automatically. If 'append = TRUE', it has to be a valid path to an existing file.

numberOfCores

Number of cores for parallel processing. When > 1, a logfile named 'MSstats_dataProcess_log_progress.log' is created to track progress. Only works for Linux & Mac OS. Default is 1.

Value

A list containing:

FeatureLevelData

A data frame with feature-level information after processing. Columns include:

PROTEIN

Identifier for the protein associated with the feature.

PEPTIDE

Identifier for the peptide sequence.

TRANSITION

Identifier for the transition, typically representing a specific ion pair.

FEATURE

Unique identifier for the feature, which could be a combination of peptide and transition.

LABEL

Specifies the isotopic labeling of peptides, notably for SRM-based experiments. "L" indicates light-labeled peptides while "H" denotes heavy-labeled peptides.

GROUP

Experimental group identifier.

RUN

Identifier for the specific MS run.

SUBJECT

Subject identifier within the experimental group.

FRACTION

Fraction identifier if fractionation was performed.

originalRUN

Original run identifier before any processing.

censored

Logical indicator of whether the intensity value is considered missing or below limit of detection.

INTENSITY

Original intensity measurement of the feature in the given run.

ABUNDANCE

Processed abundance or intensity value after log-transformation and normalization.

newABUNDANCE

The ABUNDANCE column with imputed values in place of missing values. This is the column used for protein summarization.

predicted

Predicted intensity values for censored data, typically derived from a statistical model.

ProteinLevelData

A data frame with run-level summarized information for each protein. Columns include:

RUN

Identifier for the specific MS run.

Protein

Identifier for the protein.

LogIntensities

Log-transformed intensities for the protein in each run.

originalRUN

Original run identifier before any processing.

GROUP

Experimental group identifier.

SUBJECT

Subject identifier within the experimental group.

TotalGroupMeasurements

Total number of feature measurements for the protein in the given group.

NumMeasuredFeatures

Number of features measured for the protein in the given run.

MissingPercentage

Percentage of missing feature values for the protein in the given run.

more50missing

Logical indicator of whether more than 50 percent of the feature values are missing for the protein in the given run.

NumImputedFeature

Number of features for which values were imputed due to missing or censored data for the protein in the given run.

Examples

# Consider a raw data set (SRMRawData) from a label-based SRM experiment in a yeast study
# with ten time points (T1-T10) of interest and three biological replicates.
# It is a time course experiment. The goal is to detect protein abundance changes
# across time points.
head(SRMRawData)
# Log2 transformation and normalization are applied (default)
QuantData <- dataProcess(SRMRawData, use_log_file = FALSE)
head(QuantData$FeatureLevelData)
# Log10 transformation and normalization are applied
QuantData1 <- dataProcess(SRMRawData, logTrans = 10, use_log_file = FALSE)
head(QuantData1$FeatureLevelData)
# Log2 transformation is applied without normalization
QuantData2 <- dataProcess(SRMRawData, normalization = FALSE, use_log_file = FALSE)
head(QuantData2$FeatureLevelData)


Vitek-Lab/MSstats documentation built on Nov. 29, 2024, 8:38 a.m.