Explore batch effect, exploBATCH, is an R package for evaluating and correcting for batch effect in genomic studies. It contains two main functions, findBATCH
, for performing formal statistical tests to detect batch effect, and correctBATCH
, for removing batch effect in the data. Although here we provide a quick tour of exploBATCH, we recommend reading Nyamundanda $et$ $al$ (2017) for more details on exploBATCH.
exploBATCH is based on the probabilistic version of principal component analysis which allows for covariates (PPCCA). The findBATCH
function (within exploBATCH) employs PPCCA to statistically test if samples are distributed according to batches in the principal subspace by applying the following test statistic on each probabilistic principal component (pPC):
$$ Δ_{bk} = \frac{\beta_{bk}}{\mbox{SE}(\beta_{bk})}, $$
where $\beta_{bk}$ is the regression coefficient that quantifies batch effect $b$ on the $k^{\mbox{th}}$ pPC, and SE is the corresponding standard error. exploBATCH correct for batch effect by calling correctBATCH
function which subtracts batch effect on each affected pPC as follows,
$$ \underline{u}{ck} = \underline{u}{ak} – \underline{x}b \beta{bk} $$
where $\underline{u}{ak}$ and $\underline{u}{ck}$ is a vector of scores of the $k^{th}$ pPC affected and corrected for batch effect, respectively, whilst $\underline{x}_b$ is the variable defining batches. The batch effect corrected data is recovered by predicting the observed data using PPCCA conditioning on the scores, u = (uc, uu), where uc and uu are scores of corrected and uncorrected pPCs, respectively.
exploBATCH is freely available on github (here) and it is implemented within the R environment. It requires:
http://www.r-
project.org/
.exploBATCH can be installed straight from our github account, syspremed, using devtools R package. Here are two lines of R code to install exploBATCH from syspremed on github.
require(devtools) install_github("syspremed/exploBATCH")
In order to run the two examples in Nyamundanda $et$ $al$ (2017) install the two R packages in syspremed with the two gene expression datasets, Breast
and Colon
.
install_github("syspremed/exploBATCHbreast") install_github("syspremed/exploBATCHcolon")
The exploBATCH package can be run using a single interface expBATCH
which calls findBATCH
and correctBATCH
to detect and correct for batch effect, respectively.
expBATCH(D, batchCL, Conf, mindim, maxdim, method, scale, SDselect)
D
is a matrix of expression data with features on the column.
batchCL
is a vector of the same sample size as D
identifying which samples belong to which batch.
Conf
is a vector of the same sample size as D
identifying samples in different biological classes of interest.
mindim
is the minimum number of pPCs to be considered. It should be at least 2
.
maxdim
is the maximum number of pPCs to be considered. it should be at least equal to mindim
.
method
is the method for batch correction, either ppcca
or combat
.
scale
is the method for scaling the data, none
no scaling, unit
unit variance scaling and pareto
square root of standard deviation scaling.
SDselect
is the standard deviation (SD) for filtering the number of genes to improve computational time.
The results of exploBATCH are saved in the current working directory. A folder exploBATCHresults
with seven sub-folders of exploBATCH output. Here is the folder structure
of the output of exploBATCH,
exploBATCHresults: + pcaBeforeBatchCorrection: results of applying PCA on data before batch correction. + ppccaBeforeBatchCorrection: results of applying PPCCA on data before batch correction. + findBATCH: containts forest plots showing which pPC has batch effect. + correctBATCH: results of batch correction using correctBATCH. + assessCorrectBATCH: assessment of results of batch correction using correctBATCH. + correComBat: results of batch correction using ComBat. + assessComBat: assessment of results of batch correction using ComBat.
To demostrate the application of exploBATCH we are applying it on the gene expression data of three batches (30, 22 and 18) of primary human breast tumors. This is the first example in Nyamundanda $et$ $al$ (2017) but for illustration purpose here we used SDselect=2
to filter genes to 346 with SD>2 to improve computational time.
Firstly, to run this example you need to load the exploBATCHbreast R package which contains the breast cancer dataset Breast
and a vector identifying the three batches batchBreast
.
require(exploBATCH) require(exploBATCHbreast) # the package with breast cancer dataset data(Breast) # load the breast cancer data data(batchBreast) # load the variable defining batches
After loading the two packages you ready to run exploBATCH. You only need the function expBATCH
to run exploBATCH,
expBATCH( D=Breast, # the breast cancer expression data matrix batchCL=batchBreast, # the variable identifying the three batches Conf=NA, # no biological variable of interest in this example mindim=2, # the minimum number of pPCs maxdim=9, # the maximum number of pPCs, we don't want this argument to be large, # otherwise it will slow down exploBATCH. method="ppcca", # use correctBATCH as well as ComBat to correct batch effect SDselect=2 # set SD at 2 to reduce computational time. )
This folder contains results of PCA on the breast cancer dataset. PCA results allow for exploring batch effect by simply visually inspecting PCA plots.
The clustering of breast tumors in the principal subspace was mainly driven by the three batches. This folder also contains plot of proportion of variation explained by each PC and pairwise plots of the first ten PCs.
Firstly, exploBATCH has to determine the number of pPCs that can be used to represent this breast cancer dataset. The PPCCA model within exploBATCH employs the Bayesian information criterion (BIC) to identify the optimal number of pPCs. The higher the BIC value the better the model. The BIC plot below shows that the optimal number of pPCs is seven. All batch evaluation and correction for this dataset is based on these first seven pPCs.
This folder ppccaBeforeBatchCorrection
contains BIC plot and other pairwise plots of pPCs from fitting PPCCA on this breast cancer dataset.
The function findBATCH
employs Jack-knifing approach to quantify uncertainity associated with batch effect. This allows to assess statistical significance of batch effect on each pPC by building 95% confidence intervals (95% CIs) around this parameter of interest.
The forest plot shows the estimated batch effect and the corresponding 95% CIs for each of the seven pPCs. There is significant batch effect on pPC1 and pPC2 since the corresponding 95% CIs don't include zero. The folder findBATCH
contains this forest plot showing which pPCs are associated with batch effect.
After identifying which of the seven pPCs is associated with batch effect correctBATCH
is employed to subtract this batch effect in the data. The two figures below show PCA plots before and after batch correction.
The folder correctBATCH
contains breast cancer dataset which has been corrected for batch effect using correctBATCH
.
After correcting for batch effect the function findBATCH
is again employed as a formal statistical test to assess if batch effect has been properly corrected. Below are findBATCH
plots before and after batch correction.
The folder assessCorrectBATCH
contains results of assessing batch effect after applying correctBATCH
.
The exploBATCH package also allows the user to correct for batch effect using ComBat. Below is a comparison of batch correction using correctBATCH (left panel) and ComBat (right panel).
The folder correctComBat
contains breast cancer dataset which has been corrected for batch effect using ComBat.
After correcting for batch effect using ComBat findBATCH
can be used to assess if batch effect has been properly corrected. Below are findBATCH
plots after batch correction using correctBATCH (left panel) and ComBat (right panel).
The folder assessComBat
contains results of assessing batch effect after applying ComBat.
This example briefly demonstrate that exploBATCH corrects for batch effect without distorting important biological structures in the data. The colon cancer dataset is a gene expression data consisting of two batches of 52 and 58 samples. Most (78%) of these samples are colorectal cancer (CRC) samples whilst the rest are normal. The question is does batch correction using exploBATCH retains the normal/tumor biological effect in this dataset.
This is the second example in Nyamundanda $et$ $al$ (2017) but for illustration purpose here we used SDselect=1.2
to filter genes to 1549 with SD>1.2. To run this example you need to load the exploBATCHcolon R package, the colon dataset Colon
, a vector identifying the three batches batchBreast
and a variable defining the biological effect of interest bioCL
.
require(exploBATCH) require(exploBATCHcolon) data(Colon) data(batchColon) data(bioCL)
After loading the two packages you can run exploBATCH on the colon cancer dataset as follows
expBATCH( D=Colon, batchCL=batchColon, Conf=bioCL, # this is the variable defining the biological effect of interest mindim=2, maxdim=11, method="ppcca", SDselect=1.2 # set SD at 2 to reduce computational time. )
By inspecting the PCA plot on the top left pannel of the figure below, the samples cluster by batch denoted by the two colours. In addition, the CRC samples (dots) seem to be clustering away from the normal samples (squares). However, we can perform formal statistical tests using findBATCH
to confirm if there is batch and normal/tumor biological effect. The BIC plot on the top right pannel shows that the first nine pPCs are enough to represent the variability in the colon cancer dataset. The two forest plots on the bottom of the
pannel show that batch and normal/tumor biological effects are intertwined in the first two pPCs. The challenge is to correct this batch effect without blurring the normal/tumor biological effect.
The figures below show results of applying correctBATCH on the colon cancer dataset. The PCA plot highlight that the samples are no longer clustering by batch but the seperation between normal and tumor samples can still be seen. The forest plot from findBATCH confirms that none of the pPCs has batch effect but normal/tumor biological effect is retained in pPC 1 & 2.
Figure A below shows the runtime profiles and memory usage of exploBATCH applied to gene expression datasets with 70-breast cancer samples but varying the number of genes from 50 to 20 000. Figure B highlights the run time profiles of exploBATCH on several simulated data varying the number of samples and genes
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.