In computational-metabolomics/sbcms: Signal batch correction for (LC)-MS data

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

Introduction

R package sbcms (Signal & Batch Correction for Mass Spectrometry) is an implementation of Quality Control-Robust Spline Correction (QC-RSC) [@kirwan2013] algorithm for signal drift and batch effect correction within/across a multi-batch direct infusion mass spectrometry (DIMS) and liquid chromatography mass spectrometry (LCMS) datasets.

Installation

You should have R version 3.4.1 or above and Rstudio installed to be able to run this notebook.

Execute following commands from the R terminal.

install.packages("ggplot2")
install.packages("gridExtra")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("sbcms")

Load the required libraries into the R environment

library(sbcms)
library(ggplot2)
library(reshape2)
library(gridExtra)
library(pmp)

Data set

We will be using a small subset of 20 features of an DIMS data set consisting of 172 samples measured across 8 batches. More detailed description of the data set is available from @kirwan2014 and MTBLS79.

# DIMS data set is included in 'sbcms' package.
# (MTBLS79) [https://www.ebi.ac.uk/metabolights/MTBLS79]

data <- sbcdata$data
class <- sbcdata$class
batch <- sbcdata$batch
sample_order <- c(1:nrow(data))

# Input data frame or matrix should have features in rows and samples in columns
data <- t(data)

# Input data structure
data[1:5, 1:10]

class[1:10]
batch[1:10]
sample_order[1:10]

Applying signal and batch correction

Function QCRSC should be used to apply signal batch correction. Input peak table should be in format where features are in rows and samples in columns.

Argument 'order' should be numeric vector containing sample injection order during analytical measurement and should be the same length as number of columns in peak table.

Argument `batch' should be numeric vector containing values of sample batch number. If all samples were measured in 1 batch, then all values in the 'batch' vector should be '1'.

Values for 'classes' should be character vector containing sample class labels. Class label for quality control sample has to be "QC".

corrected_data <- sbcms::QCRSC(df=data, order=sample_order, batch=batch, 
    classes=class, spar=0, minQC=4)

Visualising results

Function 'sbcmsPlot' is providing visual comparison of data before and after correction. Fore example we can check output for features '1', '5', and '30' in peak matrix.

plots <- sbcmsPlot (df=data, corrected_df=corrected_data, classes=class, 
    batch=batch, output=NULL, indexes=c(1, 5, 30))
plots

The scores plots of principal components analysis (PCA) before and after correction can be used to asses effects of data correction.

In this example, PQN method used to normalise data, KNN used for missing value imputation and glog for data scaling.

A more detailed overview on data pre-processing is detailed in @guida2016.

manual_color = c("#386cb0", "#ef3b2c", "#7fc97f", "#fdb462", "#984ea3", 
    "#a6cee3", "#778899", "#fb9a99", "#ffff33")

pca_data <- pmp::pqn_normalisation(data, classes=class, qc_label="QC")[[1]]
pca_data <- pmp::mv_imputation(pca_data, method="KNN", k=5, rowmax=0.5,
    colmax=0.5, maxp=NULL, check_df=FALSE)
pca_data <- pmp::glog_transformation(pca_data, classes=class, qc_label="QC")

pca_corrected_data <- pmp::pqn_normalisation(corrected_data, classes=class,
    qc_label="QC")[[1]]
pca_corrected_data <- pmp::mv_imputation(pca_corrected_data, method="KNN", k=5,
    rowmax=0.5, colmax=0.5, maxp=NULL, check_df=FALSE)
pca_corrected_data <- pmp::glog_transformation(pca_corrected_data, 
    classes=class, qc_label="QC")

pca_data <- prcomp(t(pca_data), center=TRUE, scale=FALSE)
pca_corrected_data <- prcomp(t(pca_corrected_data), center=TRUE, scale=FALSE)

# Calculate percentage of explained variance of the first two PC's
exp_var_pca <- round(((pca_data$sdev^2)/sum(pca_data$sdev^2)*100)[1:2],2)
exp_var_pca_corrected <- round(((pca_corrected_data$sdev^2) /
    sum(pca_corrected_data$sdev^2)*100)[1:2],2)

plots <- list()

plotdata <- data.frame(PC1=pca_data$x[, 1], PC2=pca_data$x[, 2], 
    batch=as.factor(batch), class=class)

plots[[1]] <- ggplot(data=plotdata, aes(x=PC1, y=PC2, col=batch)) +
    geom_point(size=2) + theme(panel.background=element_blank()) +
    scale_color_manual(values=manual_color) +
    ggtitle("PCA scores, before correction") +
    xlab(paste0("PC1 (", exp_var_pca[1] ," %)")) +
    ylab(paste0("PC2 (", exp_var_pca[2] ," %)"))

plots[[2]] <- ggplot(data=plotdata, aes(x=PC1, y=PC2, col=class)) +
    geom_point(size=2) + theme(panel.background=element_blank()) +
    scale_color_manual(values=manual_color) +
    ggtitle("PCA scores, before correction") +
    xlab(paste0("PC1 (", exp_var_pca[1] ," %)")) +
    ylab(paste0("PC2 (", exp_var_pca[2] ," %)"))

plotdata_corr <- data.frame(PC1=pca_corrected_data$x[, 1], 
    PC2=pca_corrected_data$x[, 2], batch=as.factor(batch), class=class)

plots[[3]] <- ggplot(data=plotdata_corr, aes(x=PC1, y=PC2, col=batch)) +
    geom_point(size=2) +
    theme(panel.background=element_blank()) +
    scale_color_manual(values=manual_color) +
    ggtitle("PCA scores, after correction") +
    xlab(paste0("PC1 (", exp_var_pca_corrected[1] ," %)")) +
    ylab(paste0("PC2 (", exp_var_pca_corrected[2] ," %)"))

plots[[4]] <- ggplot(data=plotdata_corr, aes(x=PC1, y=PC2, col=class)) +
    geom_point(size=2) +
    theme(panel.background=element_blank()) +
    scale_color_manual(values=manual_color) +
    ggtitle("PCA scores, after correction") +
    xlab(paste0("PC1 (", exp_var_pca_corrected[1] ," %)")) +
    ylab(paste0("PC2 (", exp_var_pca_corrected[2] ," %)"))

grid.arrange(ncol=2, plots[[1]], plots[[2]], plots[[3]], plots[[4]])