In stanfill/QC-ART: Real time quality control of mass spectrometry data

Introduction

This document shows by example how to use the package QCART to track the quality of mass spectrometry experiments in near real time. In this illustration we use publicly available data that was analyzed by Amidan et al. (2014).

Data

The Amidan et al. data is supplied with the QCART package. It can be loaded into your environment with the following.

library(QCART)
data("amidan_et_al")

library(lubridate)
amidan$Acq_Time_Start <- mdy_hm(amidan$Acq_Time_Start)
chosen_inst <- "VOrbiETD04"

Now there should be a data frame called amidan loaded into your environment. In short, the amdian data frame consists of r ncol(amidan) variables measured on r nrow(amidan) LC-MS experiments. The data were collected using r length(unique(amidan$Instument)) different instruments from r round_date(min(amidan$Acq_Time_Start),unit='day') to r round_date(max(amidan$Acq_Time_Start),unit='day').

QC-ART

The goal of QC-ART is to identify changes in the quality of data produced by MS instruments. We therefore subset the amidan data set to a single instrument before analysis using the dplyr::filter function. For this example we consider the instrument called r chosen_inst, which consists of r length(which(amidan$Instrument==chosen_inst)) instrument runs. In the final line of the following, the first six instrument runs from the analysis are removed because they occur well before the rest of the data are collected.

library(dplyr)
chosen_inst <- "VOrbiETD04"
vorb_04 <- filter(amidan,Instrument==chosen_inst)
vorb_04 <- vorb_04[-c(1:6),]

To get a feel for the r chosen_inst instrument data, two of the variables thought to be associated with data quality ("P_2C" and "MS1_Count") are plotted over time. The packages ggplot2 and lubridate are used to make the figures below. Note that lubridate is suggested by the QCART package but ggplot2 is not.

library(lubridate)
amidan$Acq_Time_Start <- mdy_hm(amidan$Acq_Time_Start)
library(ggplot2)
qplot(Acq_Time_Start,P_2C,data=vorb_04,xlab="Date")
qplot(Acq_Time_Start,MS1_Count,data=vorb_04,xlab="Date")

Computing QC-ART Scores

To apply QC-ART, the variables used to compute QC-ART scores need to be chosen. Based on the results of the initial analysis of these data (by Amidan et al. (2014)), the following variables were chosen.

chosen_vars <- c("P_2C","MS1_Count","MS2_Count","MS1_2B","MS2_1","MS2_2","MS2_3","MS2_4A","MS2_4B","RT_MS_Q1","RT_MS_Q4","RT_MSMS_Q1","RT_MSMS_Q4","XIC_WideFrac")

Next a baseline set against which all future experiments will be compared must be chosen. Assume for now that the experiments run early in the study are of good quality and we want to determine if the later runs are of the same quality. Thus the first 50 instrument runs serve as the baseline and all further experiments are compared against those 50. With that, the QC-ART scores are compueted using the qcart function them plotted over time.

vorb_04 <- arrange(vorb_04,Acq_Time_Start)
bline_ob <- 1:50
vorb_04$Baseline <- FALSE
vorb_04$Baseline[bline_ob] <- TRUE
vorb_04$QC_Art <- qcart(all_data = vorb_04, baseline = bline_ob, variables = chosen_vars)
qplot(Acq_Time_Start,QC_Art,data=vorb_04,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")

Canned Threshold Routines

Once scores are computed, threshold values need to be determined. Because this is a post-hoc analysis, a QC-ART score that is above the computed threshold should be investigated to see if the sample should be rerun. The goal of QC-ART, however, is for the scores and baselines to be computed in real-time. That way, changes in data quality could suggest an instrumentation issue. The compute_threshold function takes the QC-ART scores (computed by qcart), baseline indicators and time-stamps then returns a data.frame that contains the scores, time-stamps and threshold values.

s_threshold <- compute_threshold(scores = vorb_04$QC_Art, baseline = vorb_04$Baseline, time_stamps = vorb_04$Acq_Time_Start, type='static')
sthresh_df <- s_threshold$Results

qplot(Time_Stamp,Scores,data=sthresh_df,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")+
  geom_line(aes(Time_Stamp,Static_Threshold),colour=1)

The column Static_Threshold contains the static threshold value corresponding to each QC-ART score. Internally, the compute_threshold function fits two models to the QC-ART scores: a simple log-linear regression model (SLR) where the time since the first instrument run is used as the only covariate, and a single mean model. If the SLR model accounts for a statistically significant amount of the variability in the baseline scores (at the 0.05 level) relative to the single mean model, then threshold values from the SLR model are returned. Otherwise, the single mean model is used. The former will return threshold values that differ for each QC-ART score depending upon the time since the first instrument run, while the latter will return the same threshold value for all QC-ART scores.

To compute a dynamic threhsold, change the type option in the compute_threshold function to "dynamic". Below the dynamic threshold is compuated then plotted.

d_threshold <- compute_threshold(scores = vorb_04$QC_Art, baseline = vorb_04$Baseline, time_stamps = vorb_04$Acq_Time_Start, type = 'dynamic')
dthresh_df <- d_threshold$Results

qplot(Time_Stamp,Scores,data=dthresh_df,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")+
  geom_line(aes(Time_Stamp,Dynamic_Threshold),colour=1)

Creating a Custom Threshold

The compute_threshold function selects an appropriate static or dynamic model from a small set of possible models. It is quite possible that a more appropriate threshold model could be created for your data. This section covers what should be considered when fitting a dynamic threshold model and how to assess it's performance. For a more detailed explanation of dynamic linear models, see Petris et al. (2009).

The dynamic linear models fit by compute_threshold are described by the pair of equations

$$ \log(s_t)=\mu_t+v_t \;\;\text{ and }\;\; \mu_t=\mu_{t-1}+w_t $$ where for $i\geq 1$, $s_t$ is the QC-ART score at time $t$, $\mu_t$ is the true log-scale process mean, $v_t$ is an error term in the observation equation and $w_t$ is the error term in the process model. The error terms $v_t$ and $w_t$ are assumed to be independent and follow the log-normal distribution.

Currently compute_threshold considers the model above with $\mu_t=\beta_{0,t}+\beta_{1,t}x_t$ where $\beta_{\cdot,t}$ are time-varying regression parameters adn $x_t$ is the clock time since the cohort study began. Further it's assumed that $v_t\sim N(0,\sigma^2)$ and $w_{t}\sim N_2(0_2,W_t)$ where $\sigma$ is an unknown variance parameter to be estimated from the baseline data and $N_2(0, W_t)$ is the bivariate normal distribution with mean a $2\by 1$ mean vector of zeros and $2\times2$ variance-covariance matrix $W_t$.

The dynamic threshold is then the $95^{th}$ percentile of the so called filtering density of $\mu_t$. The following code fits this model and computes the dynamic threshold:

library(dlm)

#Convert time stamps into time since cohort study began
vorb_04 <- vorb_04%>%arrange(Acq_Time_Start)%>%
  mutate(Time=difftime(Acq_Time_Start,min(Acq_Time_Start),units='hours'))

buildCutpt <- function(theta,xvar){
  dlmModReg(X=xvar, dV=exp(theta[1]), dW=exp(theta[2:3]),addInt=TRUE) 
}

#Estimate parameters based on baseline data only
fit <- dlmMLE(log(vorb_04$QC_Art),parm=rep(0,3),buildCutpt, xvar=matrix(vorb_04$Time,ncol=1))

#Now apply the model to the full dataset
modCutpt <- buildCutpt(fit$par, xvar=vorb_04$Time)

#Filterd estiamte for full data set
filtQC <- dlmFilter(log(vorb_04$QC_Art),modCutpt)
filtPred <- filtQC$f
filtSD <- residuals(filtQC,sd=TRUE)$sd

Dynamic_Threshold <- qlnorm(0.95,meanlog=filtPred,sdlog=filtSD)

This model can be varied in several ways, depending on if the model assumptions are satisfied. In particular, are the variance compoenents roughly i.i.d. and normally distributed? The following code can be used to assess the normality of the residuals; perfect agreement implies a perfect fit, but that doesn't happen with real data.

resids <- residuals(filtQC,sd=FALSE)
qplot(qnorm(ppoints(length(resids))),sort(resids))+geom_abline(aes(intercept=0,slope=1))+
  xlab("Theoretical Quantiles")+ylab("Sample Quantiles")

The function tsdiag will produce plots that can be used to visually assess if there is any more temporal trend in the residuals and for temporal independence.

tsdiag(filtQC,gof.lag = 25)

The fact that the lag 1 and lag 3 rediduals still exhibit some dependence and the Ljung-Box p-values are vey close to zero indicates a more complex error variance is needed to capture the temporal correlation between these data. Try adding a second order moving average component (MA(2)) to the model fit above.

buildCutpt_MA2 <- function(theta,xvar){
    dlmModReg(X=xvar, dV=1e-7, dW=exp(theta[1:2]),addInt=TRUE)+
      dlmModARMA(ma=exp(theta[3:4]),sigma2=exp(theta[4]))
}

#Estimate parameters based on baseline data only
fit_MA2 <- dlmMLE(log(vorb_04$QC_Art),parm=rep(0,5),buildCutpt_MA2, xvar=matrix(vorb_04$Time,ncol=1))

#Now apply the model to the full dataset
modCutpt_MA2 <- buildCutpt_MA2(fit_MA2$par, xvar=vorb_04$Time)

#Filterd estiamte for full data set
filtQC_MA2 <- dlmFilter(log(vorb_04$QC_Art),modCutpt_MA2)

tsdiag(filtQC_MA2,gof.lag = 25)

Now the lagged residual ACV values are very near 0 and Ljung-Box p-values are very near 1 indiciateing a good fit to the data. We can now update the dynamic thresholds with this new model.

filtPred_MA2 <- filtQC_MA2$f
filtSD_MA2 <- residuals(filtQC_MA2,sd=TRUE)$sd

dthresh_df$NewThreshold <- qlnorm(0.95,meanlog=filtPred_MA2,sdlog=filtSD_MA2)

qplot(Time_Stamp,Scores,data=dthresh_df,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")+
  geom_line(aes(Time_Stamp,NewThreshold),colour=1)+ylim(c(0,250))