A Motivation on the project

The Sequencing Quality Control (SEQC) project is coordinated by the US FDA. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, authors Assess RNA sequencing (RNA-seq) performance for junction discovery and differential expression profiling. Using complementary metrics, compare it to microarray data, and compare it to quantitative PCR (qPCR) data. The paper ”A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium.” Nature Biotechnology (2014), by SEQC/MAQC-III Consortium, inspired to think about the assessment of qPCR accuracy based on RNA-seq.

”Gold” standard

Validation has been an important part in RNA-seq publication. The differentially expressed genes (at least some) identified using RNA-seq are often validated using q-PCR.

One challenge of qPCR data is the presence of non-detects (those reactions failing to attain the expression threshold). While most current software replaces these non-detects with the maximum possible Ct value, recent work has shown that this introduces large biases in estimation of both absolute and differential expression [2]. Considerable differences in expression level measurements from different PCR-based assays can be observed.

I am going to identify specific genes, where results of RNA-seq and q-PCR disagree. If the disagreement caused by low detection in q-PCR experiment, it may be possible to estimate non-detected Ct values, and repeat the validation procedure.

Getting Data

qPCR Data

Data downloaded from GEO

library(GEOquery)
library(HTqPCR)
library(nondetects)
library("seqc")
library("TaqManRNAseq")
# Get the data from GEO
gseA2 <- getGEO("GPL4097",GSEMatrix=TRUE) 
gsms <- Meta(gseA2)$sample_id
gsm1 <- getGEO(GEO=gsms[1]) 
# back calculate Ct expression value
dct <- nd <- matrix(nrow=nrow(Table(gsm1)), ncol=length(gsms))
for(k in 1:length(gsms)){
  tmp <- getGEO(GEO=gsms[k])
  dct[,k] <- as.numeric(Table(tmp)$VALUE)
  nd[,k] <- Table(tmp)$Flag_Detection
}
# gene names
gene<-as.vector(Table(gseA2)$ORF)
geneContr<-which(gene=="POLR2A")
for(i in 1:length(gene)) {
  gene[i]<-paste0(gene[i],"_",i)
}
# after the data was saved in the file
# TaqManDat<-list(dct=dct, nd=nd, gene=gene)
# safe the data
# save(TaqManDat, file = "TaqManDat.rda")

This data set can be loaded from the package dirrectly

data("TaqManDat")
dct <- TaqManDat$dct
nd  <- TaqManDat$nd
gene<- TaqManDat$gene

RNAseq Data

RNA-seq data was prosessed to CPM (counts per million), and then the mean of log2(CPM) was calculated.

RNAcounts<-apply(ILM_refseq_gene_COH[,5:132],2,sum)/10^6
RNAseqCPM<-t(t(ILM_refseq_gene_COH[,5:132])/RNAcounts)
grp<-as.numeric(as.factor(gsub("\\_.+","",colnames(RNAseqCPM))))
seqGeneSum<-t(apply(log(RNAseqCPM+1),1,by,grp,mean))
rownames(seqGeneSum)<-ILM_refseq_gene_COH$Symbol
# after we saved the data in the file
# save(seqGeneSum, file = "seqGeneSum.rda")

This data set can be loaded from the package dirrectly

data("seqGeneSum")

Create 11 datasets with all the genes

for(i in c(1, 97, 193, 289, 385, 481, 577, 673, 769, 865, 961)) {
  assign(paste0("tst_",i),value=combData(dct,nd,i,gene, delNondet = FALSE))
}
tst<-rbind(tst_1, tst_97, tst_193, tst_289, tst_385, tst_481, tst_577, tst_673, tst_769, tst_865, tst_961)
dim(tst)

Create 11 class qPCR datasets

tst1  <-qPCRdata("tst_1","POLR2A_42")
tst97 <-qPCRdata("tst_97","POLR2A_98")
tst193<-qPCRdata("tst_193","POLR2A_194")
tst289<-qPCRdata("tst_289","POLR2A_311")
tst385<-qPCRdata("tst_385","POLR2A_412")
tst481<-qPCRdata("tst_481","POLR2A_516")
tst577<-qPCRdata("tst_577","POLR2A_579")
tst673<-qPCRdata("tst_673","POLR2A_705")
tst769<-qPCRdata("tst_769","POLR2A_772")
tst865<-qPCRdata("tst_865","POLR2A_866")
tst961<-qPCRdata("tst_961","POLR2A_971")

Combine all the plates into one dataset

data1 <- qPCRexprs(tst1,tst97)
data1 <- qPCRexprs(data1,tst193)
data1 <- qPCRexprs(data1,tst289)
data1 <- qPCRexprs(data1,tst385)
data1 <- qPCRexprs(data1,tst481)
data1 <- qPCRexprs(data1,tst577)
data1 <- qPCRexprs(data1,tst673)
data1 <- qPCRexprs(data1,tst769)
data1 <- qPCRexprs(data1,tst865)
data1 <- qPCRexprs(data1,tst961)

Correlation plots

par(mfrow=c(2,2))
GMData<-CommonGenesCorPlot(data1,seqGeneSum, extrVal=F)

Now 11 data sets, removing genes, non-detected in all the samples and all the replicates.

for(i in c(1, 97, 193, 289, 385, 481, 577, 673, 769, 865, 961)) {
  assign(paste0("tst_",i),value=combData(dct,nd,i,gene, delNondet = TRUE))
}
# create 11 class qPCR-datasets 
tst1  <-qPCRdata("tst_1","POLR2A_42")
tst97 <-qPCRdata("tst_97","POLR2A_98")
tst193<-qPCRdata("tst_193","POLR2A_194")
tst289<-qPCRdata("tst_289","POLR2A_311")
tst385<-qPCRdata("tst_385","POLR2A_412")
tst481<-qPCRdata("tst_481","POLR2A_516")
tst577<-qPCRdata("tst_577","POLR2A_579")
tst673<-qPCRdata("tst_673","POLR2A_705")
tst769<-qPCRdata("tst_769","POLR2A_772")
tst865<-qPCRdata("tst_865","POLR2A_866")
tst961<-qPCRdata("tst_961","POLR2A_971")
# combine all the plates into one dataset
data1 <- qPCRexprs(tst1,tst97)
data1 <- qPCRexprs(data1,tst193)
data1 <- qPCRexprs(data1,tst289)
data1 <- qPCRexprs(data1,tst385)
data1 <- qPCRexprs(data1,tst481)
data1 <- qPCRexprs(data1,tst577)
data1 <- qPCRexprs(data1,tst673)
data1 <- qPCRexprs(data1,tst769)
data1 <- qPCRexprs(data1,tst865)
data1 <- qPCRexprs(data1,tst961)
par(mfrow=c(2,2))
GMData<-CommonGenesCorPlot(data1,seqGeneSum, extrVal=F)

Since there are non-detected values in the data, we can impute those values.

# create starting PYFIT based on all the data (borrowing info from other genes)
allData<-rbind(tst_1,tst_97,tst_193,tst_289,tst_385,tst_481,tst_577,tst_769)
gavg<-t(apply(allData,1,by,pData(tst1)$sampleType, median))
p.nd<-t(apply(allData==35,1, by,pData(tst1)$sampleType, mean))
pyfit <- glm(as.vector(p.nd)~as.vector(gavg), family=binomial(link=logit), 
             weights=rep(4,length(gavg)))
# impute nondetected values using qpcrImpute function from package "nondetects" based on PYFIT
tst1Impute2<-qpcrImpute(tst1, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst97Impute<-qpcrImpute(tst97, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst193Impute<-qpcrImpute(tst193, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst289Impute<-qpcrImpute(tst289, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst385Impute<-qpcrImpute(tst385, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst481Impute<-qpcrImpute(tst481, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst577Impute<-qpcrImpute(tst577, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst673Impute<-qpcrImpute(tst673, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst769Impute<-qpcrImpute(tst769, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst865Impute<-qpcrImpute(tst865, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
tst961Impute<-qpcrImpute(tst961, groupVars=c("sampleType"), 
                        outform="Single", pyfit=pyfit)
# combine all the plates into one dataset
data2 <- qPCRexprs(tst1Impute2,tst97Impute)
data2 <- qPCRexprs(data2,tst193Impute)
data2 <- qPCRexprs(data2,tst289Impute)
data2 <- qPCRexprs(data2,tst385Impute)
data2 <- qPCRexprs(data2,tst481Impute)
data2 <- qPCRexprs(data2,tst577Impute)
data2 <- qPCRexprs(data2,tst673Impute)
data2 <- qPCRexprs(data2,tst769Impute)
data2 <- qPCRexprs(data2,tst865Impute)
data2 <- qPCRexprs(data2,tst961Impute)
# create plots
par(mfrow=c(2,2))
GMData<-CommonGenesCorPlot(data2, seqGeneSum, extrVal=F)
par(mfrow=c(1,1))

Results

Several not matching genes were identified during this project. Correlation between RNAseq and qPCR results found to be similar to the results from the motivating paper. The R package ”TaqManRNAseq” was created. It contains total of 5 functions. This project gives a motivation for future research in this area.

Funding

There was no additional funding provided for this project.

Session Info

sessionInfo()


valery-sherina/TaqManRNAseq_project documentation built on May 3, 2019, 2:58 p.m.