Quantitative real-time PCR (qPCR) measures gene expression for a subset of genes through repeated cycles of sequence-specific DNA amplification and expression measurements. During the exponential amplification phase, each cycle results in an approximate doubling of the quanitity of each target transcript. The threshold cycle (Ct) -- the cycle at which the target gene's expression first exceeds a predetermined threshold -- is used to quantify the expression of each target gene. These Ct values typically represent the raw data from a qPCR experiment.

One challenge of qPCR data is the presence of *non-detects* those reactions failing to attain the expression threshold. While most
current software replaces these non-detects with the maximum possible
Ct value (typically 40), recent work has shown that this introduces
large biases in estimation of both absolute and differential
expression. Here, we treat the non-detects as missing data, model the
missing data mechanism, and use this model to impute Ct values for the
non-detects.

We propose the following generative model for qPCR data in which $Y_{ij}$ is the measured expression value for gene $i$ in sample $j$, some of which are missing (non-detects), $X_{ij}$ represents the fully observed expression values, and $Z_{ij}$ indicates whether a value is detected: $$ X_{ij} = f(\theta_{ij}, \eta) + \varepsilon_{ij} $$ $$ Y_{ij} = \left{ \begin{array} {rr} X_{ij} & \textrm{if $Z_{ij}=1$}\ \textrm{non-detect} & \textrm{if $Z_{ij}=0$} \end{array} \right.$$

In this model, we assume that the fully observed expression values, $X_{ij}$ are a function of the true gene expression, $\theta_{ij}$, non-biological factors, $\eta$, and random measurement error, $\varepsilon_{ij}$. The probability of an expressed value being detected is assumed to be a function of the expression value itself, $g(X_{ij})$, for values below the detection limit, $S$. The following logistic regression model is a natural choice for such a relationship:

$$ Pr(Z_{ij}=1) = \left{ \begin{array} {rr} g(X_{ij}) & \textrm{if $X_{ij} < 40$} \ 0 & \textrm{otherwise} \end{array} \right.$$

Here, $g(X_{ij})$ can be estimated via the following logistic regression: [ logit(Pr(Z_{ij}=1)) = \beta_0 + \beta_1 X_{ij} ]

Two cell types -- young adult mouse colon (YAMC) cells and mutant-p53/activated-Ras transformed YAMC cells -- in combination with three treatments -- untreated, sodium butyrate, or valproic acid. Four replicates were performed for each cell-type/treatment combination [ @sampson2012gene].

library(HTqPCR) library(mvtnorm) library(nondetects) data(oncogene2013)

Normalize to Becn1:

normCt <- normalizeCtData(oncogene2013, norm = "deltaCt", deltaCt.genes = "Becn1")

Calculate residuals for each set of replicates:

conds <- paste(pData(normCt)$sampleType,pData(normCt)$treatment,sep=":") resids <- matrix(nrow=nrow(normCt), ncol=ncol(normCt)) for(i in 1:nrow(normCt)){ for(j in 1:ncol(normCt)){ ind <- which(conds==conds[j]) resids[i,j] <- exprs(normCt)[i,j]-mean(exprs(normCt)[i,ind]) } }

Create boxplots of residuals stratified by the presence of a non-detect:

iND <- which(featureCategory(normCt)=="Undetermined", arr.ind=TRUE) iD <- which(featureCategory(normCt)!="Undetermined", arr.ind=TRUE) boxes <- list("observed"=-resids[iD], "non-detect"=-resids[iND])

boxplot(boxes, main="",ylim=c(-5,5), ylab=expression(paste("-",Delta,"Ct residuals",sep="")))

oncogene2013_1 <- qpcrImpute(oncogene2013, groupVars=c("sampleType","treatment"), outform = c("Multy"), vary_fit=FALSE, vary_model=TRUE, add_noise=TRUE, numsam=2, linkglm = c("logit"))

Normalize to Becn1:

normCt <- normalizeCtData(oncogene2013_1[[1]], norm = "deltaCt", deltaCt.genes = "Becn1")

Remove the normalization gene:

normCt <- normCt[-which(featureNames(normCt)=="Becn1"),]

Calculate residuals for each set of replicates: ``` {r, echo=TRUE} conds <- paste(pData(normCt)$sampleType, pData(normCt)$treatment,sep=":") resids <- matrix(nrow=nrow(normCt), ncol=ncol(normCt)) for(i in 1:nrow(normCt)){ for(j in 1:ncol(normCt)){ ind <- which(conds==conds[j]) resids[i,j] <- exprs(normCt)[i,j]-mean(exprs(normCt)[i,ind]) } }

Create boxplots of residuals stratified by the presence of a non-detect: ```r iI <- which(featureCategory(normCt)=="Imputed", arr.ind=TRUE) iD <- which(featureCategory(normCt)!="Imputed", arr.ind=TRUE) boxes <- list("observed"=-resids[iD], "imputed"=-resids[iI])

boxplot(boxes, main="",ylim=c(-5,5), ylab=expression(paste("-",Delta,"Ct residuals",sep="")))

Two additional example data sets are used in the paper and included in the package. These are each briefly described below.

Cells transformed to malignancy by mutant p53 and activated Ras are perturbed with the aim of restoring gene expression to levels found in non-transformed parental cells via retrovirus-mediated re-expression of corresponding cDNAs or shRNA-dependent stable knock-down. The data contain 4-6 replicates for each perturbation, and each perturbation has a corresponding control sample in which only the vector has been added [ @almudevar2011fitting].

library(nondetects) data(sagmb2011)

A study of the effect of p53 and/or Ras mutations on gene expression. The third dataset is a comparison between four cell types -- YAMC cells, mutant-p53 YAMC cells, activated-Ras YAMC cells, and p53/Ras double mutant YAMC cells. Three replicates were performed for the untransformed YAMC cells, and four replicates were performed for each of the other cell types [ @mcmurray2008synergistic].

library(nondetects) data(nature2008)

This work was supported by National Institutes of Health [grant numbers CA009363, CA138249, HG006853]; and an Edelman-Gardner Foundation Award.

```
sessionInfo()
```

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.