knitr::opts_chunk$set( collapse = TRUE, comment = ">" )
In this vignette, we introduce the functionality of the fssemR
package to estimate the differential gene regulatory network by gene expression and genetic perturbation data. To meet the space and time constraints in building this vignette within the fssemR
package, we are going to simulate gene expression and genetic perturbation data instead of using a real dataset. For this purpose, we will use function randomFSSEMdata2
in fssemR
to generate simulated data, and then apply fused sparse structural equation model (FSSEM) to estimate the GRNs under two different conditions and their differential GRN. Also, please go to https://github.com/Ivis4ml/fssemR/tree/master/inst
for more large dataset analysis. In conlcusion, this vignette is composed by three sections as follow,
For user using package fssemR
, please cite the following article:
Xin Zhou and Xiaodong Cai. Inference of Differential Gene Regulatory Networks Based on Gene Expression and Genetic Perturbation Data. Bioinformatics, submitted.
We are going to simulate two GRNs and their corresponding gene expression and genetic perturbation data in the following steps:
library(fssemR) library(network) library(ggnetwork) library(Matrix)
n = c(100, 100) # number of observations in two conditions p = 20 # number of genes in our simulation k = 3 # each gene has nonzero 3 cis-eQTL effect sigma2 = 0.01 # simulated noise variance prob = 3 # average number of edges connected to each gene type = "DG" # `fssemR` also offers simulated ER and directed graph (DG) network dag = TRUE # if DG is simulated, user can select to simulate DAG or DCG ## seed = as.numeric(Sys.time()) # any seed acceptable seed = 1234 # set.seed(100) set.seed(seed) data = randomFSSEMdata2(n = n, p = p, k = p * k, sparse = prob / 2, df = 0.3, sigma2 = sigma2, type = type, dag = T)
# genes 1 to 20 are named as g1, g2, ..., g20 rownames(data$Vars$B[[1]]) = colnames(data$Vars$B[[1]]) = paste("g", seq(1, p), sep = "") rownames(data$Vars$B[[2]]) = colnames(data$Vars$B[[2]]) = paste("g", seq(1, p), sep = "") rownames(data$Data$Y[[1]]) = rownames(data$Data$Y[[2]]) = paste("g", seq(1, p), sep = "") names(data$Data$Sk) = paste("g", seq(1, p), sep = "") # qtl 1 to qtl 60 are named as rs1, rs2, ..., rs60 rownames(data$Vars$F) = paste("g", seq(1, p), sep = "") colnames(data$Vars$F) = paste("rs", seq(1, p * k), sep = "") rownames(data$Data$X[[1]]) = rownames(data$Data$X[[2]]) = paste("rs", seq(1, p * k), sep = "")
g{%d}
and eQTLs as rs{%d}
.# data$Vars$B[[1]] ## simulated GRN under condition 1 GRN_1 = network(t(data$Vars$B[[1]]) != 0, matrix.type = "adjacency", directed = TRUE) plot(GRN_1, displaylabels = TRUE, label = network.vertex.names(GRN_1), label.cex = 0.5)
# data$Vars$B[[2]] ## simulated GRN under condition 2 GRN_2 = network(t(data$Vars$B[[2]]) != 0, matrix.type = "adjacency", directed = TRUE) plot(GRN_2, displaylabels = TRUE, label = network.vertex.names(GRN_2), label.cex = 0.5)
# data$Vars$B[[2]] ## simulated GRN under condition 2 diffGRN = network(t(data$Vars$B[[2]] - data$Vars$B[[1]]) != 0, matrix.type = "adjacency", directed = TRUE) ecol = 3 - sign(t(data$Vars$B[[2]] - data$Vars$B[[1]])) plot(diffGRN, displaylabels = TRUE, label = network.vertex.names(GRN_2), label.cex = 0.5, edge.col = ecol)
library(Matrix) print(Matrix(data$Vars$F, sparse = TRUE))
Therefore, the $B$ matrices and $F$ matrix in data$Vars
are the true values in our simulated model. We then need to estimated the $\hat{B}$ and $\hat{F}$ by the FSSEM algorithm.
We need to input the gene expression and corresponding genotype data of two conditions into the FSSEM algorithm. They are stored in the data$Data
.
head(data$Data$Y[[1]]) head(data$Data$Y[[2]])
head(data$Data$X[[1]] - 1) head(data$Data$X[[2]] - 1)
data$Data$Sk
stores each gene's cis-eQTL's indices. In real data application, we recommend to use package MatrixEQTL
to search the significant cis-eQTLs for genes of interested and build Sk
for your researchhead(data$Data$Sk)
fssemR
by ridge regressionWe implement our fssemR by the observed gene expression data and genetic perturbations data that stored in data$Data
, and it is initialized by ridge regression, the $l_2$ norm penalty's hyperparameter $\gamma$ is selected by 5-fold cross-validation.
Xs = data$Data$X ## eQTL's genotype data Ys = data$Data$Y ## gene expression data Sk = data$Data$Sk ## cis-eQTL indices gamma = cv.multiRegression(Xs, Ys, Sk, ngamma = 50, nfold = 5, n = data$Vars$n, p = data$Vars$p, k = data$Vars$k) fit0 = multiRegression(data$Data$X, data$Data$Y, data$Data$Sk, gamma, trans = FALSE, n = data$Vars$n, p = data$Vars$p, k = data$Vars$k)
Then, we chose the fit0
object from ridge regression as intialization, and implement the fssemR
algorithm, BIC is used to select optimal hyperparameters $\lambda, \rho$, where nlambda
is the number of candidate lambda values for $l_1$ regularized term, and nrho
is the number
of candidate rho values for fused lasso regularized term.
fitOpt <- opt.multiFSSEMiPALM2(Xs = Xs, Ys = Ys, Bs = fit0$Bs, Fs = fit0$Fs, Sk = Sk, sigma2 = fit0$sigma2, nlambda = 10, nrho = 10, p = data$Vars$p, q = data$Vars$k, wt = TRUE) fit <- fitOpt$fit
cat("Power of two estimated GRNs = ", (TPR(fit$Bs[[1]], data$Vars$B[[1]]) + TPR(fit$Bs[[2]], data$Vars$B[[2]])) / 2) cat("FDR of two estimated GRNs = ", (FDR(fit$Bs[[1]], data$Vars$B[[1]]) + FDR(fit$Bs[[2]], data$Vars$B[[2]])) / 2) cat("Power of estimated differential GRN = ", TPR(fit$Bs[[1]] - fit$Bs[[2]], data$Vars$B[[1]] - data$Vars$B[[2]])) cat("FDR of estimated differential GRN = ", FDR(fit$Bs[[1]] - fit$Bs[[2]], data$Vars$B[[1]] - data$Vars$B[[2]]))
From these 4 metrics, we can get the performance of our fssemR
algorithm comparing to the ground truth (if we know)
# data$Vars$B[[2]] ## simulated GRN under condition 2 diffGRN = network(t(fit$Bs[[2]] - fit$Bs[[1]]) != 0, matrix.type = "adjacency", directed = TRUE) # up-regulated edges are colored by `red` and down-regulated edges are colored by `blue` ecol = 3 - sign(t(fit$Bs[[2]] - fit$Bs[[1]])) plot(diffGRN, displaylabels = TRUE, label = network.vertex.names(GRN_2), label.cex = 0.5, edge.col = ecol)
Additionally, the differeitial effect of two GRN are also estimated. Therefore, we can tell how the interactions in two GRNs change.
diffGRN = Matrix::Matrix(fit$Bs[[1]] - fit$Bs[[2]], sparse = TRUE) rownames(diffGRN) = colnames(diffGRN) = rownames(data$Vars$B[[1]]) diffGRN
From the diffGRN, we can determined how the gene-gene interactions in GRN changes across two conditions, then, we can find out the key genes for condition-specific gene regulatory network.
Additionally, for more applications and the replications of our real data analysis, please go to the https://github.com/Ivis4ml/fssemR/tree/master/inst
for more cases.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.