This document shows by example how to use the package QCART
to track the quality of mass spectrometry experiments in near real time. In this illustration we use publicly available data that was analyzed by Amidan et al. (2014).
The Amidan et al. data is supplied with the QCART
package. It can be loaded into your environment with the following.
library(QCART) data("amidan_et_al")
library(lubridate) amidan$Acq_Time_Start <- mdy_hm(amidan$Acq_Time_Start) chosen_inst <- "VOrbiETD04"
Now there should be a data frame called amidan
loaded into your environment. In short, the amdian
data frame consists of r ncol(amidan)
variables measured on r nrow(amidan)
LC-MS experiments. The data were collected using r length(unique(amidan$Instument))
different instruments from r round_date(min(amidan$Acq_Time_Start),unit='day')
to r round_date(max(amidan$Acq_Time_Start),unit='day')
.
The goal of QC-ART is to identify changes in the quality of data produced by MS instruments. We therefore subset the amidan
data set to a single instrument before analysis using the dplyr::filter
function. For this example we consider the instrument called r chosen_inst
, which consists of r length(which(amidan$Instrument==chosen_inst))
instrument runs. In the final line of the following, the first six instrument runs from the analysis are removed because they occur well before the rest of the data are collected.
library(dplyr) chosen_inst <- "VOrbiETD04" vorb_04 <- filter(amidan,Instrument==chosen_inst) vorb_04 <- vorb_04[-c(1:6),]
To get a feel for the r chosen_inst
instrument data, two of the variables thought to be associated with data quality ("P_2C" and "MS1_Count") are plotted over time. The packages ggplot2
and lubridate
are used to make the figures below. Note that lubridate
is suggested by the QCART
package but ggplot2
is not.
library(lubridate) amidan$Acq_Time_Start <- mdy_hm(amidan$Acq_Time_Start) library(ggplot2) qplot(Acq_Time_Start,P_2C,data=vorb_04,xlab="Date") qplot(Acq_Time_Start,MS1_Count,data=vorb_04,xlab="Date")
To apply QC-ART, the variables used to compute QC-ART scores need to be chosen. Based on the results of the initial analysis of these data (by Amidan et al. (2014)), the following variables were chosen.
chosen_vars <- c("P_2C","MS1_Count","MS2_Count","MS1_2B","MS2_1","MS2_2","MS2_3","MS2_4A","MS2_4B","RT_MS_Q1","RT_MS_Q4","RT_MSMS_Q1","RT_MSMS_Q4","XIC_WideFrac")
Next a baseline set against which all future experiments will be compared must be chosen. Assume for now that the experiments run early in the study are of good quality and we want to determine if the later runs are of the same quality. Thus the first 50 instrument runs serve as the baseline and all further experiments are compared against those 50. With that, the QC-ART scores are compueted using the qcart
function them plotted over time.
vorb_04 <- arrange(vorb_04,Acq_Time_Start) bline_ob <- 1:50 vorb_04$Baseline <- FALSE vorb_04$Baseline[bline_ob] <- TRUE vorb_04$QC_Art <- qcart(all_data = vorb_04, baseline = bline_ob, variables = chosen_vars) qplot(Acq_Time_Start,QC_Art,data=vorb_04,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")
Once scores are computed, threshold values need to be determined. Because this is a post-hoc analysis, a QC-ART score that is above the computed threshold should be investigated to see if the sample should be rerun. The goal of QC-ART, however, is for the scores and baselines to be computed in real-time. That way, changes in data quality could suggest an instrumentation issue. The compute_threshold
function takes the QC-ART scores (computed by qcart
), baseline indicators and time-stamps then returns a data.frame
that contains the scores, time-stamps and threshold values.
s_threshold <- compute_threshold(scores = vorb_04$QC_Art, baseline = vorb_04$Baseline, time_stamps = vorb_04$Acq_Time_Start, type='static') sthresh_df <- s_threshold$Results qplot(Time_Stamp,Scores,data=sthresh_df,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")+ geom_line(aes(Time_Stamp,Static_Threshold),colour=1)
The column Static_Threshold
contains the static threshold value corresponding to each QC-ART score. Internally, the compute_threshold
function fits two models to the QC-ART scores: a simple log-linear regression model (SLR) where the time since the first instrument run is used as the only covariate, and a single mean model. If the SLR model accounts for a statistically significant amount of the variability in the baseline scores (at the 0.05 level) relative to the single mean model, then threshold values from the SLR model are returned. Otherwise, the single mean model is used. The former will return threshold values that differ for each QC-ART score depending upon the time since the first instrument run, while the latter will return the same threshold value for all QC-ART scores.
To compute a dynamic threhsold, change the type
option in the compute_threshold
function to "dynamic"
. Below the dynamic threshold is compuated then plotted.
d_threshold <- compute_threshold(scores = vorb_04$QC_Art, baseline = vorb_04$Baseline, time_stamps = vorb_04$Acq_Time_Start, type = 'dynamic') dthresh_df <- d_threshold$Results qplot(Time_Stamp,Scores,data=dthresh_df,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")+ geom_line(aes(Time_Stamp,Dynamic_Threshold),colour=1)
The compute_threshold
function selects an appropriate static or dynamic model from a small set of possible models. It is quite possible that a more appropriate threshold model could be created for your data. This section covers what should be considered when fitting a dynamic threshold model and how to assess it's performance. For a more detailed explanation of dynamic linear models, see Petris et al. (2009).
The dynamic linear models fit by compute_threshold
are described by the pair of equations
$$ \log(s_t)=\mu_t+v_t \;\;\text{ and }\;\; \mu_t=\mu_{t-1}+w_t $$ where for $i\geq 1$, $s_t$ is the QC-ART score at time $t$, $\mu_t$ is the true log-scale process mean, $v_t$ is an error term in the observation equation and $w_t$ is the error term in the process model. The error terms $v_t$ and $w_t$ are assumed to be independent and follow the log-normal distribution.
Currently compute_threshold
considers the model above with $\mu_t=\beta_{0,t}+\beta_{1,t}x_t$ where $\beta_{\cdot,t}$ are time-varying regression parameters adn $x_t$ is the clock time since the cohort study began. Further it's assumed that $v_t\sim N(0,\sigma^2)$ and $w_{t}\sim N_2(0_2,W_t)$ where $\sigma$ is an unknown variance parameter to be estimated from the baseline data and $N_2(0, W_t)$ is the bivariate normal distribution with mean a $2\by 1$ mean vector of zeros and $2\times2$ variance-covariance matrix $W_t$.
The dynamic threshold is then the $95^{th}$ percentile of the so called filtering density of $\mu_t$. The following code fits this model and computes the dynamic threshold:
library(dlm) #Convert time stamps into time since cohort study began vorb_04 <- vorb_04%>%arrange(Acq_Time_Start)%>% mutate(Time=difftime(Acq_Time_Start,min(Acq_Time_Start),units='hours')) buildCutpt <- function(theta,xvar){ dlmModReg(X=xvar, dV=exp(theta[1]), dW=exp(theta[2:3]),addInt=TRUE) } #Estimate parameters based on baseline data only fit <- dlmMLE(log(vorb_04$QC_Art),parm=rep(0,3),buildCutpt, xvar=matrix(vorb_04$Time,ncol=1)) #Now apply the model to the full dataset modCutpt <- buildCutpt(fit$par, xvar=vorb_04$Time) #Filterd estiamte for full data set filtQC <- dlmFilter(log(vorb_04$QC_Art),modCutpt) filtPred <- filtQC$f filtSD <- residuals(filtQC,sd=TRUE)$sd Dynamic_Threshold <- qlnorm(0.95,meanlog=filtPred,sdlog=filtSD)
This model can be varied in several ways, depending on if the model assumptions are satisfied. In particular, are the variance compoenents roughly i.i.d. and normally distributed? The following code can be used to assess the normality of the residuals; perfect agreement implies a perfect fit, but that doesn't happen with real data.
resids <- residuals(filtQC,sd=FALSE) qplot(qnorm(ppoints(length(resids))),sort(resids))+geom_abline(aes(intercept=0,slope=1))+ xlab("Theoretical Quantiles")+ylab("Sample Quantiles")
The function tsdiag
will produce plots that can be used to visually assess if there is any more temporal trend in the residuals and for temporal independence.
tsdiag(filtQC,gof.lag = 25)
The fact that the lag 1 and lag 3 rediduals still exhibit some dependence and the Ljung-Box p-values are vey close to zero indicates a more complex error variance is needed to capture the temporal correlation between these data. Try adding a second order moving average component (MA(2)) to the model fit above.
buildCutpt_MA2 <- function(theta,xvar){ dlmModReg(X=xvar, dV=1e-7, dW=exp(theta[1:2]),addInt=TRUE)+ dlmModARMA(ma=exp(theta[3:4]),sigma2=exp(theta[4])) } #Estimate parameters based on baseline data only fit_MA2 <- dlmMLE(log(vorb_04$QC_Art),parm=rep(0,5),buildCutpt_MA2, xvar=matrix(vorb_04$Time,ncol=1)) #Now apply the model to the full dataset modCutpt_MA2 <- buildCutpt_MA2(fit_MA2$par, xvar=vorb_04$Time) #Filterd estiamte for full data set filtQC_MA2 <- dlmFilter(log(vorb_04$QC_Art),modCutpt_MA2) tsdiag(filtQC_MA2,gof.lag = 25)
Now the lagged residual ACV values are very near 0 and Ljung-Box p-values are very near 1 indiciateing a good fit to the data. We can now update the dynamic thresholds with this new model.
filtPred_MA2 <- filtQC_MA2$f filtSD_MA2 <- residuals(filtQC_MA2,sd=TRUE)$sd dthresh_df$NewThreshold <- qlnorm(0.95,meanlog=filtPred_MA2,sdlog=filtSD_MA2) qplot(Time_Stamp,Scores,data=dthresh_df,colour=Baseline)+xlab("Date")+ylab("QC-ART Score")+ geom_line(aes(Time_Stamp,NewThreshold),colour=1)+ylim(c(0,250))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.