proBatch R Documentation

proBatch: A package for diagnostics and correction of batch effects, primarily in proteomics

Description

The proBatch package contains functions for analyzing and correcting batch effects (unwanted technical variation) in high-throughput experiments. Although the package has primarily been developed for mass spectrometry proteomics (DIA/SWATH), it has been designed to be applicable to most omic data with minor adaptations. It addresses the following needs:

  • Prepare the data for analysis;

  • Visualize batch effects at the sample-wide and feature level;

  • Normalize and correct for batch effects.

Arguments

df_long

data frame where each row is a single feature in a single sample. It minimally has a sample_id_col, a feature_id_col and a measure_col, but usually also an m_score (in OpenSWATH output result file). See help("example_proteome") for more details.

data_matrix

features (in rows) vs samples (in columns) matrix, with feature IDs as rownames and file/sample names as colnames. See help("example_proteome_matrix") for more details.
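The relationship between the long and wide representations can be illustrated with a small base-R sketch. The column names (FullRunName, peptide_group_label, Intensity) and the toy values are assumptions chosen for illustration, and the pivot below uses only base R rather than any package helper.

```r
# Toy long-format table: one row per feature per sample.
# Column names and values are illustrative assumptions.
df_long <- data.frame(
  FullRunName = rep(c("run_1", "run_2", "run_3"), each = 2),
  peptide_group_label = rep(c("pep_A", "pep_B"), times = 3),
  Intensity = c(10.1, 8.3, 10.4, 8.0, 9.9, 8.6),
  stringsAsFactors = FALSE
)

# Pivot to a features (rows) x samples (columns) matrix.
# Each feature/sample pair occurs once, so taking the first value is safe here.
data_matrix <- tapply(
  df_long$Intensity,
  list(df_long$peptide_group_label, df_long$FullRunName),
  FUN = function(x) x[1]
)

# Feature IDs end up in the rownames, sample names in the colnames.
print(data_matrix)
```

A feature/sample pair that is absent from the long table would appear as NA in the matrix.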

sample_annotation

data frame with:

  1. sample_id_col (this can be repeated as row names)

  2. biological covariates

  3. technical covariates (batches etc)

See help("example_sample_annotation").
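As a minimal sketch of such a table, the following builds a three-sample annotation and checks it against the column names of a matrix. All column names (FullRunName, Condition, MS_batch, order) and values are hypothetical and serve only to show the three kinds of columns listed above.

```r
# Hypothetical annotation: sample ID column, one biological and one
# technical covariate, plus a running order.
sample_annotation <- data.frame(
  FullRunName = c("run_1", "run_2", "run_3"),
  Condition = c("control", "treated", "control"),  # biological covariate
  MS_batch = c("Batch_1", "Batch_1", "Batch_2"),   # technical covariate
  order = 1:3,
  stringsAsFactors = FALSE
)
# The sample IDs may be repeated as row names.
rownames(sample_annotation) <- sample_annotation$FullRunName

# Toy matrix whose colnames must match the sample ID column.
data_matrix <- matrix(
  c(10.1, 8.3, 10.4, 8.0, 9.9, 8.6),
  nrow = 2,
  dimnames = list(c("pep_A", "pep_B"), c("run_1", "run_2", "run_3"))
)

# Samples present in the matrix but missing from the annotation.
missing_samples <- setdiff(colnames(data_matrix), sample_annotation$FullRunName)
if (length(missing_samples) > 0) {
  warning("Samples without annotation: ", paste(missing_samples, collapse = ", "))
}
```

Running a check like this before plotting helps catch mismatched file names early.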

sample_id_col

name of the column in the sample_annotation table where the filenames (colnames of the data_matrix) are found.

measure_col

if df_long is among the parameters, it is the column with expression/abundance/intensity; otherwise, it is used internally for consistency.

feature_id_col

name of the column with feature/gene/peptide/protein ID used in the long format representation df_long. In the wide formatted representation data_matrix this corresponds to the row names.

batch_col

column in sample_annotation that should be used for batch comparison (or other, non-batch factor to be mapped to color in plots).

order_col

column in sample_annotation that determines sample order. It is used in initial assessment plots (plot_sample_mean_or_boxplot) and feature-level diagnostics (feature_level_diagnostics). Can be 'NULL' if sample order is irrelevant (e.g. in genomic experiments). For more details on order definition/inference, see define_sample_order and date_to_sample_order.

facet_col

column in sample_annotation with a batch factor to separate plots into facets; usually secondary to batch_col. Most meaningful for multi-instrument MS experiments (where each instrument has its own order-associated effects, see order_col) or for simultaneous examination of two batch factors (e.g. preparation day and measurement day). For the single-instrument case, it should be set to 'NULL'.

color_by_batch

(logical) whether to color points and connecting lines by batch factor as defined by batch_col.

peptide_annotation

long format data frame with peptide ID and their corresponding protein and/or gene annotations. See help("example_peptide_annotation").

color_scheme

a named vector of colors to map to batch_col, names corresponding to the levels of the factor. For continuous variables, vector doesn't need to be named.
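For a categorical batch factor, a named color vector of this kind can be built in base R as sketched below. The batch labels and hex colors are arbitrary illustrative choices.

```r
# Hypothetical batch factor with two levels.
batches <- factor(c("Batch_1", "Batch_1", "Batch_2"))

# One color per factor level; the names must match the levels.
color_scheme <- setNames(c("#1b9e77", "#d95f02"), levels(batches))

# Looking up by name gives the color assigned to each sample.
point_colors <- color_scheme[as.character(batches)]
print(point_colors)
```

Indexing by as.character(batches) rather than by the factor itself avoids accidentally indexing by the underlying integer codes.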

color_list

list, as returned by sample_annotation_to_colors, where each item contains a color vector for each factor to be mapped to the color.

factors_to_plot

vector of technical and biological covariates to be plotted in this diagnostic plot (assumed to be present in sample_annotation)

protein_col

column where protein names are specified

no_fit_imputed

(logical) whether to use imputed (requant) values, as flagged in qual_col by qual_value for data transformation

qual_col

column used to color points by the value given in qual_value. Designed for inferred/requant values in OpenSWATH output data, in which case this argument has to be set to m_score.

qual_value

value in qual_col to color. For OpenSWATH data, this argument has to be set to 2 (the m_score value that flags imputed (requant) values).
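How qual_col and qual_value together identify imputed measurements can be sketched in base R. The toy table below is an assumption for illustration; only the convention that an m_score of 2 marks requant values comes from the description above.

```r
# Toy long-format table with an m_score column (illustrative values).
df_long <- data.frame(
  peptide_group_label = c("pep_A", "pep_B", "pep_A", "pep_B"),
  Intensity = c(10.1, 8.3, 10.4, 8.0),
  m_score = c(0.001, 2, 0.004, 2),
  stringsAsFactors = FALSE
)

qual_col <- "m_score"
qual_value <- 2

# Rows whose quality column equals the flag value are imputed (requant).
is_imputed <- df_long[[qual_col]] == qual_value

# Keep only the measured (non-imputed) rows.
measured_only <- df_long[!is_imputed, ]
print(measured_only)
```

The same logical vector could be used to color the flagged points differently in a plot instead of dropping them.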

plot_title

title of the plot (e.g., processing step + representation level (fragments, transitions, proteins) + purpose (meanplot/corrplot etc))

keep_all

when transforming the data (normalize, correct), determines which set of columns is kept. Acceptable values: all/default/minimal.

theme

ggplot theme, by default classic. Can be easily overridden.

filename

path where the results are saved. If NULL, the object is returned to the active window; otherwise, the object is saved to the file. Currently only the pdf and png formats are supported.

width

option determining the output image width

height

option determining the output image height

units

units: 'cm', 'in' or 'mm'

Details

To learn more about proBatch, start with the vignettes: browseVignettes(package = "proBatch")

Section

Common arguments to the functions.


symbioticMe/proBatch documentation built on April 9, 2023, 11:59 a.m.