mbecCorrection: Batch Effect Correction Wrapper

View source: R/mbecs_corrections.R

Description

Either corrects or accounts for (known) batch effects with one of several algorithms.

Usage

mbecCorrection(
  input.obj,
  model.vars = c("batch", "group"),
  method = c("lm", "lmm", "sva", "ruv2", "ruv4", "ruv3", "bmc", "bat", "rbe", "pn",
    "svd"),
  type = "clr",
  nc.features = NULL
)

Arguments

input.obj

An MbecData object with 'tss' and 'clr' matrices.

model.vars

Vector of covariate names. First element relates to batch.

method

Denotes the algorithm to use. One of 'lm', 'lmm', 'sva', 'ruv2', 'ruv4' for assessment methods or one of 'ruv3', 'bmc', 'bat', 'rbe', 'pn', 'svd' for correction algorithms.

type

Which abundance matrix to use; one of 'otu', 'tss', 'clr'. DEFAULT is 'clr', but percentile normalization is supposed to work on 'tss' abundances.

nc.features

(OPTIONAL) A vector of feature names to be used as negative controls in RUV-2/3/4. If not supplied, the algorithm will use a linear model to find pseudo-negative controls.

Details

ASSESSMENT METHODS The assessment methods 'lm', 'lmm', 'sva', 'ruv2' and 'ruv4' estimate the significance of the batch effect and update the attribute 'assessments' with vectors of p-values.

Linear (Mixed) Models: A simple linear (mixed) model with covariates 'treatment' and 'batch', or the respective variables in your particular data-set, is fitted to each feature and the significance of the treatment variable is extracted.
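The per-feature fitting step can be sketched in base R as follows. This is a minimal illustration of the idea, not the package's internal code; the objects 'abund' (features x samples) and 'covars' are made up for the example.

```r
# Illustrative data: a small features-x-samples matrix and its covariates.
set.seed(1)
abund  <- matrix(rnorm(5 * 12), nrow = 5,
                 dimnames = list(paste0("OTU", 1:5), paste0("S", 1:12)))
covars <- data.frame(group = factor(rep(c("ctrl", "case"), each = 6)),
                     batch = factor(rep(c("B1", "B2"), times = 6)))

# Fit 'feature ~ group + batch' to every feature and keep the p-value of
# the treatment (group) term; row 2 of the coefficient table is the group
# coefficient for this formula.
p.treat <- apply(abund, 1, function(y) {
  fit <- lm(y ~ group + batch, data = covars)
  summary(fit)$coefficients[2, "Pr(>|t|)"]
})
```

The resulting named vector of p-values corresponds to what the assessment methods store in the 'assessments' attribute.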

Surrogate Variable Analysis (SVA): A two-step approach that (1.) identifies the number of latent factors to be estimated by fitting a full-model with the effect of interest and a null-model with no effects; the function num.sv then calculates the number of latent factors. In the second step (2.), the sva function estimates the surrogate variables and adjusts for them in the full/null-model. A subsequent F-test gives significance values for each feature - these p-values and q-values account for the surrogate variables (estimated BEs).

Remove Unwanted Variation 2 (RUV-2): Estimates unknown BEs by using negative control variables that, in principle, are unaffected by the treatment/biological effect, i.e., the effect of interest in an experiment. These variables are generally determined prior to the experiment. An approach to RUV-2 without negative control variables is the estimation of pseudo-negative controls. To that end, an lm or lmm (depending on whether or not the study design is balanced) with treatment is fitted to each feature and the significance calculated. The features that are not significantly affected by treatment are considered pseudo-negative control variables. Subsequently, the actual RUV-2 function is applied to the data and returns p-values for treatment that account for the unwanted BEs.
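The pseudo-negative control selection described above can be sketched in base R. This is a simplified illustration under the assumption of a balanced design (plain lm, no lmm); the threshold of 0.05 and all object names are illustrative.

```r
# Illustrative data: 20 features, 10 samples, two treatment groups.
set.seed(2)
abund <- matrix(rnorm(20 * 10), nrow = 20,
                dimnames = list(paste0("OTU", 1:20), NULL))
group <- factor(rep(c("ctrl", "case"), each = 5))

# Fit treatment to each feature and keep the p-value of the group term.
p.vals <- apply(abund, 1, function(y)
  summary(lm(y ~ group))$coefficients[2, "Pr(>|t|)"])

# Features NOT significantly affected by treatment become pseudo-controls.
nc.features <- names(p.vals)[p.vals > 0.05]
```

The resulting 'nc.features' vector plays the role of the 'nc.features' argument that would otherwise be supplied by the user.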

Remove Unwanted Variation 4 (RUV-4): The updated version of RUV-2 also incorporates the residual matrix (w/o treatment effect) to estimate the unknown BEs. To that end, it follows the same procedure in case there are no negative control variables and computes pseudo-controls from the data via l(m)m. Like RUV-2, this algorithm uses the parameter 'k' for the number of latent factors. RUV-4 brings the function 'getK()' that estimates this factor from the data itself. The calculated values are, however, not always reliable. A value of k=0, for example, can occur and should be set to 1 instead. The output is the same as with RUV-2.

CORRECTION METHODS The correction methods 'ruv3, bmc, bat, rbe, pn, svd' attempt to mitigate the batch effect and update the attribute 'corrections' with the resulting abundance matrices of corrected counts.

Remove Unwanted Variation 3 (RUV-3): This algorithm requires negative control-features, i.e., OTUs that are known to be unaffected by the batch effect, as well as technical replicates. The algorithm will check for the existence of a replicate column in the covariate data. If the column is not present, the execution stops and a warning message will be displayed.

Batch Mean Centering (BMC): For known BEs, this method takes the batches, i.e., the subgroups of samples within a particular batch, and centers them to their mean.
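Batch mean centering is simple enough to sketch completely in base R; the objects below are illustrative, not taken from the package.

```r
# Illustrative data: 4 features, 8 samples, two known batches.
set.seed(3)
abund <- matrix(rnorm(4 * 8, mean = 5), nrow = 4)
batch <- factor(rep(c("B1", "B2"), each = 4))

# Per feature, subtract each batch subgroup's mean so that every batch
# is centered at zero.
bmc <- abund
for (b in levels(batch)) {
  idx <- batch == b
  bmc[, idx] <- abund[, idx] - rowMeans(abund[, idx, drop = FALSE])
}
```

After centering, the per-feature mean within each batch is zero, which removes additive location shifts between batches.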

Combat Batch Effects (ComBat): This method uses a non-/parametric empirical Bayes framework to correct for BEs. Described by Johnson et al. (2007), this method was initially conceived to work with gene expression data and is part of the sva-package in R.

Remove Batch Effects (RBE): As part of the limma-package, this method was designed to remove BEs from microarray data. The algorithm fits the full-model to the data, i.e., all relevant covariates whose effect should not be removed, and a model that only contains the known BEs. The difference between these models produces a residual matrix that (should) contain only the full-model effect, e.g., treatment. As of now, the mbecs-correction only uses the first input for batch-effect grouping. ToDo: think about implementing a version for more complex models.
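The underlying idea can be sketched in base R without limma: fit batch alongside the protected covariates per feature, then subtract only the fitted batch contribution. This is a simplified stand-in for limma's approach, with illustrative object names.

```r
# Illustrative data: 4 features, 8 samples, crossed batch and group.
set.seed(6)
abund <- matrix(rnorm(4 * 8), nrow = 4)
batch <- factor(rep(c("B1", "B2"), times = 4))
group <- factor(rep(c("ctrl", "case"), each = 4))

# Per feature: fit 'y ~ group + batch', then remove only the fitted batch
# effect while leaving the protected group effect in the data.
corrected <- t(apply(abund, 1, function(y) {
  fit  <- lm(y ~ group + batch)
  bhat <- coef(fit)[grep("^batch", names(coef(fit)))]
  mm   <- model.matrix(~ batch)[, -1, drop = FALSE]
  y - as.vector(mm %*% bhat)
}))
```

Refitting the same model on the corrected data yields a batch coefficient of (numerically) zero, while the group effect is untouched.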

Percentile Normalization (PN): This method was actually developed specifically to facilitate the integration of microbiome data from different studies/experimental set-ups. This problem is similar to the mitigation of BEs, i.e., when collectively analyzing two or more data-sets, every study is effectively a batch on its own (notwithstanding the probable BEs within studies). The algorithm iterates over the unique batches and converts the relative abundance of control samples into their percentiles. The relative abundance of case-samples within the respective batches is then transformed into percentiles of the associated control-distribution. Basically, the procedure assumes that the control-group is unaffected by any effect of interest, e.g., treatment or sickness, but that both groups within a batch are affected by that BE. The switch to percentiles (to some extent) flattens the effective difference in count values due to batch - as compared to the other batches. This also introduces the two limiting aspects of percentile normalization. It can only be applied to case/control designs because it requires a reference group. In addition, the transformation into percentiles removes information from the data.
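For a single feature within a single batch, the percentile transformation reduces to an empirical CDF lookup, which base R provides via ecdf(). The data below are made up for illustration; the real method repeats this per feature and per batch.

```r
# Illustrative relative abundances for one feature in one batch.
set.seed(4)
ctrl.vals <- runif(50)   # control samples
case.vals <- runif(10)   # case samples

# Build the empirical CDF from the controls, then map both groups onto
# percentiles of that control distribution.
ctrl.ecdf <- ecdf(ctrl.vals)
ctrl.pct  <- ctrl.ecdf(ctrl.vals) * 100
case.pct  <- ctrl.ecdf(case.vals) * 100
```

A case value below every control maps to 0 and one above every control maps to 100, which makes the loss of magnitude information mentioned above explicit.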

Singular Value Decomposition (SVD): Basically performs matrix factorization and computes singular eigenvectors (SEVs). It is assumed that the first SEV captures the batch effect, which is then removed from the data. The interesting thing is that this works pretty well (with the test-data anyway). But since the SEVs are latent factors that are (most likely) confounded with other effects, it is not obvious to me that this is the optimal approach to solve this issue.
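The removal of the first singular component can be sketched directly with base R's svd(); the data matrix here is illustrative.

```r
# Illustrative data: 6 features, 10 samples.
set.seed(5)
abund <- matrix(rnorm(6 * 10), nrow = 6)

# Factorize, reconstruct the rank-1 component belonging to the first
# singular vector (assumed to carry the batch effect), and subtract it.
dec <- svd(abund)
batch.component <- dec$d[1] *
  (dec$u[, 1, drop = FALSE] %*% t(dec$v[, 1, drop = FALSE]))
corrected <- abund - batch.component
```

The corrected matrix loses one rank, which reflects the caveat above: whatever biological signal is mixed into the first component is removed along with the batch effect.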

The input for this function is supposed to be an MbecData object that contains total sum-scaled and centered log-ratio transformed abundance matrices. Output will be as input, but the assessments or corrections lists will contain the result of the respective chosen method.

Value

An updated object of class MbecData.

Examples

# This call will use 'ComBat' for batch effect correction on CLR-transformed
# abundances and store the new counts in the 'corrections' attribute.
study.obj <- mbecCorrection(input.obj=dummy.mbec,
                            model.vars=c("batch","group"),
                            method="bat", type="clr")

# This call will use 'Percentile Normalization' for batch effect correction
# on TSS-transformed counts and store the new counts in the 'corrections'
# attribute.
study.obj <- mbecCorrection(dummy.mbec, model.vars=c("batch","group"),
                            method="pn", type="tss")

buschlab/MBECS documentation built on Jan. 21, 2022, 1:27 a.m.