Weinholdt Claus 2019-04-19
Claus Weinholdt, Henri Wichmann, Johanna Kotrba, David H. Ardell, Matthias Kappler, Alexander W. Eckert, Dirk Vordermark and Ivo Grosse Prediction of regulatory targets of alternative isoforms of the epidermal growth factor receptor in a glioblastoma cell line BMC Bioinformatics (2019)
Install and load packages containing the functions for the analyses
install.packages("devtools")
devtools::install_github("GrosseLab/BGSC")
library(BGSC)
data(ExpData)
We use the function neqc function from the limma package which was developed for normalizing Illumina BeadChips data. The neqc function performs background correction using negative control probes followed by quantile normalization using negative and positive control probes. The Illumina GenomeStudio calculates and reports a detection p-value, which represents the confidence that a given transcript is expressed above background defined by negative control probes. For further analysis, we used only those probes for which the detection p-values for all six probes was below 0.05.
normData <- normalizeExpData(set = 'SF767')
## [1] "Dataset is SF767"
## genes with detection pval <= 0.05 in 6 of 6 Samples --> 16742 of 47322
We define that:
Experimental design where the rows present the RNAi treatment -- without RNAi, RNAi against EGFR splice variant I (siRNAI), and RNAi against all EGFR splice variants (siRNAALL) -- and the columns present the EGF treatment. The six corresponding logarithmic expression values per gene are denoted by x1, …, x6.
| | no EGF | EGF | |-----------------------------------|-----------------|-----------------| | no RNAi | x1 | x2 | | RNAi by siRNAI | x3 | x4 | | RNAi by siRNAALL | x5 | x6 |
For group a we assume that all six expression levels stem from the same normal distribution. In this case, the mean μ and standard deviation σ of this normal distribution (black) is equal to μ and σ of the six expression levels. For gene groups b-d, we assume that all six expression levels stems from a mixture of two normal distributions with independent means μ0 and μ1, and one pooled standard deviation σ. For gene groups b − d, we assume that the expression levels [x1, x3, and x5], [x1, x3, x4, and x5], and [x1, x3, x4, x5, and x6] stem from the normal distribution based on μ0 (red), respectively. For gene groups b − d, we assume that the expression levels [x2, x4, and x6], [x2 and x4], and [x2] stem from the normal distribution based on μ1 (blue), respectively.
Lsets <- get.Lset() #getting a list with the schematic expression pattern
normDataLogLikData <- logLikelihoodOfnormData(normData$E) #calc loglik for each gene in each group
normDataLogLik <- normDataLogLikData[['logL']]
ALL.MUs <- normDataLogLikData[['ALL.MUs']] #table with mean for each gene in each group
ALL.VARs <- normDataLogLikData[['ALL.VARs']] #table with var for each gene in each group
As an example, we show for each group a gene having the minimum log-likelihood. For the groups a − d, the examples are ABCB7, ACSL1, TPR, and ADAR, respectively. In each figure, we plot the probability density of the normal distribution for the group a as a black curve and mark the six log2-expression values with black circles. For groups b − d, we plot with red and blue curves the probability densities of the normal distributions and mark the six log2-expression values with circles, which are colored according to classes for class 0 in red and for class 1 in blue.
GeneExample <- c('ILMN_1687840','ILMN_1684585','ILMN_1730999','ILMN_2320964')
names(GeneExample) <- c('g1','g2','g3','g4')
tmpPlot <- purrr::map2(GeneExample,c('a','b','c','d'),function(.x,.y) Density.NV.fit.plot(id = .x ,normData, ALL.MUs, ALL.VARs, useGroup = .y ,DOplot = FALSE) )
grid.arrange( tmpPlot$g1 + theme(legend.position = "none"),
tmpPlot$g2 + theme(legend.position = "none"),
tmpPlot$g3 + theme(legend.position = "none"),
tmpPlot$g4 + theme(legend.position = "none") ,ncol = 2,nrow = 2,
bottom = textGrob("The group having the minimum log-likelihood is highlighted with yellow.",gp = gpar(fontsize = 12,font = 1))
)
Performing classification through model selection based on minimum log-likelihood is problematic when the number of free model parameters is not identical among all models under comparison. Here, model a has two free model parameters, while models b, c, and d have three. Hence, a naive classification based on a minimum log-likelihood criterion would give a spurious advantage to models b, c, and d with three free model parameters over model a with only two free parameters. To eliminate that spurious advantage, we compute marginal likelihoods p(x|z) using the approximation of Schwarz et al. commonly referred to as Bayesian Information Criterion.
npar <- sapply(Lsets, function(x) sum(!sapply(x,is.null ) )) + 1 ## number parameters for log-likelihood -> mean + var
k <- sapply(Lsets, function(x) sum( sapply(x,length) )) ## number of samples
print(rbind(npar,k))
## a b c d
## npar 2 3 3 3
## k 6 6 6 6
normDataBIC <- get.IC(normDataLogLik , npar, k , IC = 'BIC')
## Number of genes assigned to group with the minimal Bayesian Information Criterion
## a b c d
## #genes assigned to group 3446 5646 5015 2635
We assume that 70% of all genes are not regulated by EGF, so we define the prior probability for group a by p(a)=0.70. Further, we assume that the remaining 30% of the genes fall equally in groups with EGF-regulation, so we define the prior probabilities for groups b, c, and d by p(b)=p(c)=p(d)=0.1. We can compute for z ∈ {a, b, c, d} the posterior p(z|x)≈p(x|z)⋅p(z) and then perform Bayesian model selection by assigning each gene to that group z with the maximum approximate posterior p(z|x). Further, we define as putative target genes for each group the subset of genes with an approximate posterior probability exceeding 0.75.
normDataPosterior <- get.Posterior( normDataBIC ,Pis = c(0.7,0.1,0.1,0.1))
POSTmaxInd <- apply(normDataPosterior, MARGIN = 1 ,FUN = maxIndex)
PostClass <- get.gene.group(data = normDataPosterior,indexing = "maximal",filter = 0.75, DoPlot = TRUE)
## [1] "Histograms of approximate posterior"
## [1] "Number of genes assigned to group with the maximal approximate posterior"
## a b c d
## #genes assigned to group 8449 3822 3143 1328
## #genes assigned to group with Filter0.75 4209 1868 1140 390
Genes of the group c are putative target genes regulated by EGFR isoforms II-IV and not by other receptors.
After calculating the log2-fold change for group c by [μc1 - μc0], we validated three up-regulated genes, namely CKAP2L, ROCK1, and TPR and three down-regulated genes, namely ALDH4A1, CLCA2, and GALNS.
### calculating mean and log2-fold change
MeanFoldChangeClass <- get.log2Mean.and.log2FC(normData = normData)
### load qPCR data
data(qPCR_SF767, envir = environment())
qgenes <- c('CKAP2L','ROCK1','TPR','ALDH4A1','CLCA2','GALNS')
comparative_qPCR <- list()
qPCR_Mean <- qPCR_log2FC <- data.frame()
for(n in qgenes){
Ref_GAPDH <- as.double(qPCR_SF767[,'GAPDH'])
tmp <- as.double(qPCR_SF767[,n])
comparative_qPCR[[n]] <- comparativeMethod_qPCR.analysis(Gene = tmp,Ref = Ref_GAPDH)
comparative_qPCR[[n]]$CellLine <- qPCR_SF767$CellLine
comparative_qPCR[[n]]$Treatment <- qPCR_SF767$Treatment
}
SF.log2FC <- comparativeMethod_qPCR.RNAi.log2FC(comparative_qPCR)
qCPRdataC <- SF.log2FC$dCT.C1C0
data.table::setkey(qCPRdataC,Gene)
### annotation of gene examples
IDs.dt <- data.table::data.table(normData$genes,keep.rownames = T,key = 'rn')
IDs.dt.c <- IDs.dt[rownames(PostClass$resFilter$c),]
data.table::setkey(IDs.dt.c,'SYMBOL')
qgenesIDs <- lapply(as.character(qCPRdataC$Gene), function(qg) as.character(IDs.dt.c[qg,][['rn']]) )
names(qgenesIDs) <- as.character(qCPRdataC$Gene)
We show the normalized expression for the six genes of group c. The normalized expression is shown in a similar way as the Schematic Expression Pattern
. For three up-regulated genes (CKAP2L, ROCK1, and TPR) the expression is higher for class c1 (dark red) and for three down-regulated genes (ALDH4A1, CLCA2, and GALNS) the expression is lower for class c1 (light red).
We show the probability density distributions of the log2-normalized expression for the six genes of group c.
We show the log2-normalized expression of the group c for the six genes.
## GeneName IlluminaID meanC1 meanC0 log2-fold change
## 1: CKAP2L ILMN_1751776 8.997306 7.786610 1.2106958
## 2: ROCK1 ILMN_1808768 5.910558 5.129696 0.7808615
## 3: TPR ILMN_1730999 8.384344 7.165260 1.2190839
## 4: ALDH4A1 ILMN_1696099 6.373942 7.608376 -1.2344339
## 5: CLCA2 ILMN_1803236 6.341448 7.590288 -1.2488407
## 6: GALNS ILMN_1737949 5.972909 6.848607 -0.8756972
We have found that the six log2-fold changes of the Illumina microarray expression levels, and those of the qPCR expression levels show a Pearson correlation coefficient of 0.99 (p-value = 0.00002). Therefore, we can suggest that the set of 1,140 genes might contain some further putative target genes of isoforms II-IV of the epidermal growth factor receptor in tumor cells.
##
## Pearson's product-moment correlation
##
## data: PlotDataFC[PlotDataFC$Set == "Microarray", "FC"] and PlotDataFC[PlotDataFC$Set == "qPCR", "FC"]
## t = 24.469, df = 4, p-value = 1.655e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9684969 0.9996537
## sample estimates:
## cor
## 0.9966761
## data of "Comparison of Illumina data and RT-qPCR"-plot
| Gene | Set | FC| stderr| |:--------|:-----------|------:|-------:| | CKAP2L | Microarray | 1.21| 0.16| | CKAP2L | qPCR | 1.09| 0.23| | ROCK1 | Microarray | 0.78| 0.07| | ROCK1 | qPCR | 0.82| 0.34| | TPR | Microarray | 1.22| 0.13| | TPR | qPCR | 1.09| 0.32| | ALDH4A1 | Microarray | -1.23| 0.16| | ALDH4A1 | qPCR | -0.84| 0.48| | CLCA2 | Microarray | -1.25| 0.20| | CLCA2 | qPCR | -0.97| 0.34| | GALNS | Microarray | -0.88| 0.11| | GALNS | qPCR | -0.43| 0.29|
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] purrr_0.2.4 gtable_0.2.0 gridExtra_2.3
## [4] ggplot2_2.2.1 BGSC_0.0.1.3 limma_3.34.8
## [7] data.table_1.10.4-3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.15 knitr_1.19 magrittr_1.5
## [4] MASS_7.3-48 munsell_0.4.3 colorspace_1.3-2
## [7] rlang_0.3.1 highr_0.6 stringr_1.2.0
## [10] plyr_1.8.4 tools_3.4.1 plotrix_3.7
## [13] htmltools_0.3.6 yaml_2.1.16 lazyeval_0.2.1
## [16] rprojroot_1.3-2 digest_0.6.15 tibble_1.4.2
## [19] RColorBrewer_1.1-2 rlist_0.4.6.1 evaluate_0.10.1
## [22] rmarkdown_1.8 labeling_0.3 pheatmap_1.0.8
## [25] stringi_1.1.6 pillar_1.1.0 compiler_3.4.1
## [28] scales_0.5.0 backports_1.1.2
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.