If many genes are perturbed in a population of cells, this can lead to diseases like cancer. The perturbations can happen in different ways, e.g. via mutations, copy number abberations or methylation. However, not all perturbations are observed in all samples.
Nested Effects Model-based perturbation inference (NEM$\pi$) uses observed perturbation profiles and gene expression data to infer unobserved perturbations and augment observed ones. The causal network of the perturbed genes (P-genes) is modelled as an adjacency matrix $\phi$ and the genes with observed gene expression (E-genes) are modelled with the attachment $\theta$ with $\theta_{ij}=1$, if E-gene $j$ is attached to S-gene $i$. If E-gene $j$ is attached to P-gene $i$, $j$ shows an effect for a perturbation of P-gene $i$. Hence, $\phi\theta$ predicts gene expression profiles, which can be compared to the real data. NEM$\pi$ iteratively infers a network $\phi$ based on gene expression profiles and a perturbation profile, and the perturbation profile based on a network $\phi$.
knitr::opts_chunk$set(message=FALSE, out.width="125%", fig.align="center", strip.white=TRUE, warning=FALSE, tidy=TRUE, #out.extra='style="display:block; margin:auto;"', fig.height = 4, fig.width = 8, error=FALSE) fig.cap0 <- "Heatmap of the simulated log odds. Effects are blue and no effects are red. Rows denote the observed E-genes and columns the samples annoted by P-genes. Each P-gene has been perturbed in many cells. The E-genes are annotated as how they are attached in the ground truth. E.g. E-genes named '1' are attached to S-gene '1' in the ground truth." fig.cap1 <- "Heatmap of the probabilsitic perturbation matrix." paltmp <- palette() paltmp[3] <- "blue" paltmp[4] <- "brown" palette(paltmp)
Use devtools to install the latest version from github or use the BiocManahger to install the package from bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("nempi")
Load the package with the library function.
library(nempi)
We look at a small example for which we first simulate data and then infer the unobserved perturbations. We draw a random network for the perturbed genes (P-genes). Then we simulate gene expression for each sample given the subset of P-genes that has been perturbed in the sample. E.g., if P-gene A is upstream of P-gene B and P-gene A has been perturbed, all E-genes attached to B also show an effect.
library(mnem) seed <- 8675309 Pgenes <- 10 Egenes <- 5 samples <- 100 edgeprob <- 0.5 uninform <- floor((Pgenes*Egenes)*0.1) Nems <- mw <- 1 noise <- 1 multi <- c(0.2, 0.1) set.seed(seed) simmini <- simData(Sgenes = Pgenes, Egenes = Egenes, Nems = Nems, mw = mw, nCells = samples, uninform = uninform, multi = multi, badCells = floor(samples*0.1), edgeprob=edgeprob) data <- simmini$data ones <- which(data == 1) zeros <- which(data == 0) data[ones] <- rnorm(length(ones), 1, noise) data[zeros] <- rnorm(length(zeros), -1, noise) epiNEM::HeatmapOP(data, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent", xrot = 0, dendrogram = "both")
The typical data input for NEM$\pi$ consists of a data matrix with samples as columns and E-genes as rows. The columns are either labeled by their perturbed gene(s) or unlabeled (default: ""). After the data simulation all samples are labeled. We unlabel $50\%$ of the sample and pretend we do not know, which P-gene has been perturbed.
lost <- sample(1:ncol(data), floor(ncol(data)*0.5)) colnames(data)[lost] <- ""
We use NEM$\pi$ and other methods to infer the perturbations. We use the area under the precision-recall curve as the two measures of accuracy. We also plot the NEM$\pi$ result.
res <- nempi(data) fit <- pifit(res, simmini, data) print(fit$auc) ressvm <- classpi(data) fit <- pifit(ressvm, simmini, data, propagate = FALSE) print(fit$auc) resnn <- classpi(data, method = "nnet") fit <- pifit(resnn, simmini, data, propagate = FALSE) print(fit$auc) resrf <- classpi(data, method = "randomForest") fit <- pifit(resrf, simmini, data, propagate = FALSE) print(fit$auc) col <- rgb(seq(0,1,length.out=10),seq(1,0,length.out=10), seq(1,0,length.out=10)) plot(res,heatlist=list(col="RdBu"),barlist=list(col=col))
Compared to support vector machines (svm), neural nets (nnet) and random forest (rf) class prediction, NEM$\pi$ achieves a higher accuracy.
Note that NEM$\pi$ is in general more powerful, if the P-genes are connected in a denser network. The other methods perform equally well for sparse or even disconnected network $\phi$. However, they usually profit from combinatorial perturbations.
Alternatively, we can also provide NEM$\pi$ with a probabilistic perturbation matrix $\Gamma$ as a prior. In the rows are the potentially perturbed genes and in the columns are the samples. The entries are between $0$ and $1$, with the sum of all entries of a sample summing to $1$.
Gamma <- matrix(0, Pgenes, ncol(data)) rownames(Gamma) <- seq_len(Pgenes) colnames(Gamma) <- colnames(data) for (i in seq_len(Pgenes)) { Gamma[i, grep(paste0("^", i, "_|_", i, "$|_", i, "_|^", i, "$"), colnames(data))] <- 1 } Gamma <- apply(Gamma, 2, function(x) return(x/sum(x))) Gamma[is.na(Gamma)] <- 0 epiNEM::HeatmapOP(Gamma, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent", xrot = 0, dendrogram = "both") colnames(data) <- sample(seq_len(Pgenes), ncol(data), replace = TRUE) res <- nempi(data, Gamma = Gamma) fit <- pifit(res, simmini, data) print(fit$auc)
The final perturbation matrix $\Omega$ over all samples is slightly different from the matrix $\Gamma$ we used as input. $\Gamma$ only denotes the most upstream P-gene perturbed. E.g., if A is upstream of B, than $\Gamma$ only denotes a perturbation of A, even though B is also perturbed in every samples in which A is perturbed. We can compute it by the matrix multiplication $\Omega = \phi^T \times \Gamma$.
Omega <- t(mnem::transitive.closure(res$res$adj))%*%res$Gamma epiNEM::HeatmapOP(Omega, col = "RdBu", cexRow = 0.75, cexCol = 0.75, bordercol = "transparent", xrot = 0, dendrogram = "both")
sessionInfo()
Markowetz, F., Bloch, J., and Spang, R. (2005). Non-transcriptional pathway features reconstructed from secondary effects of rna interference. Bioinformatics, 21(21), 4026–4032.
Markowetz, F., Kostka, D., Troyanskaya, O. G., and Spang, R. (2007). Nested effects models for high-dimensional phenotyping screens. Bioinformatics, 23(13), i305–i312.
Pirkl, M., Beerenwinkel, N.; Single cell network analysis with a mixture of Nested Effects Models, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i964–i971, https://doi.org/10.1093/bioinformatics/bty602.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.