knitr::opts_chunk$set( collapse = TRUE, comment = ">" )
This vignette introduces the fssemR package, which estimates differential gene regulatory networks (GRNs) from gene expression and genetic perturbation data. To keep the vignette small and fast to build within the fssemR package, we simulate gene expression and genetic perturbation data instead of using a real dataset. We use the function randomFSSEMdata2 in fssemR to generate the simulated data, and then apply the fused sparse structural equation model (FSSEM) to estimate the GRNs under two different conditions and their differential GRN. Analyses of larger datasets are available at https://github.com/Ivis4ml/fssemR/tree/master/inst. In summary, this vignette has three parts: simulating the data, estimating the GRNs with FSSEM, and comparing the estimated GRNs with the simulated ground truth.
If you use the fssemR package, please cite the following article:
Xin Zhou and Xiaodong Cai. Inference of Differential Gene Regulatory Networks Based on Gene Expression and Genetic Perturbation Data. Bioinformatics, submitted.
We are going to simulate two GRNs and their corresponding gene expression and genetic perturbation data in the following steps:
library(fssemR)
library(network)
library(ggnetwork)
library(Matrix)
n = c(100, 100)   # number of observations in the two conditions
p = 20            # number of genes in the simulation
k = 3             # each gene has 3 nonzero cis-eQTL effects
sigma2 = 0.01     # simulated noise variance
prob = 3          # average number of edges connected to each gene
type = "DG"       # `fssemR` can simulate ER and directed graph (DG) networks
dag = TRUE        # if DG is simulated, choose between a DAG and a DCG
## seed = as.numeric(Sys.time())  # any seed is acceptable
seed = 1234
# set.seed(100)
set.seed(seed)
data = randomFSSEMdata2(n = n, p = p, k = p * k, sparse = prob / 2, df = 0.3,
                        sigma2 = sigma2, type = type, dag = dag)
# genes 1 to 20 are named as g1, g2, ..., g20
genes = paste("g", seq(1, p), sep = "")
rownames(data$Vars$B[[1]]) = colnames(data$Vars$B[[1]]) = genes
rownames(data$Vars$B[[2]]) = colnames(data$Vars$B[[2]]) = genes
rownames(data$Data$Y[[1]]) = rownames(data$Data$Y[[2]]) = genes
names(data$Data$Sk) = genes
# qtl 1 to qtl 60 are named as rs1, rs2, ..., rs60
qtls = paste("rs", seq(1, p * k), sep = "")
rownames(data$Vars$F) = genes
colnames(data$Vars$F) = qtls
rownames(data$Data$X[[1]]) = rownames(data$Data$X[[2]]) = qtls
Genes are named g{%d} and eQTLs are named rs{%d}.
# data$Vars$B[[1]]
## simulated GRN under condition 1
GRN_1 = network(t(data$Vars$B[[1]]) != 0, matrix.type = "adjacency", directed = TRUE)
plot(GRN_1, displaylabels = TRUE, label = network.vertex.names(GRN_1), label.cex = 0.5)
# data$Vars$B[[2]]
## simulated GRN under condition 2
GRN_2 = network(t(data$Vars$B[[2]]) != 0, matrix.type = "adjacency", directed = TRUE)
plot(GRN_2, displaylabels = TRUE, label = network.vertex.names(GRN_2), label.cex = 0.5)
# data$Vars$B[[2]] - data$Vars$B[[1]]
## simulated differential GRN (condition 2 minus condition 1)
diffGRN = network(t(data$Vars$B[[2]] - data$Vars$B[[1]]) != 0, matrix.type = "adjacency", directed = TRUE)
# edge colors: 2 where the effect increases, 4 where it decreases
ecol = 3 - sign(t(data$Vars$B[[2]] - data$Vars$B[[1]]))
plot(diffGRN, displaylabels = TRUE, label = network.vertex.names(diffGRN), label.cex = 0.5, edge.col = ecol)
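The edge colors in the differential GRN plot come from the expression 3 - sign(...). A minimal sketch on a toy matrix shows the mapping; the matrix diff_col_toy and its values are hypothetical and only serve as an illustration. Positive differences map to color index 2 and negative differences to index 4, which are red and blue shades in R's default palette.

```r
# Hypothetical 3 x 3 difference matrix (condition 2 minus condition 1).
diff_col_toy <- matrix(c( 0.0, 0.5, 0.0,
                         -0.2, 0.0, 0.0,
                          0.0, 0.0, 0.1),
                       nrow = 3, byrow = TRUE)

# sign() returns +1, -1 or 0, so the color index becomes 2, 4 or 3.
ecol_toy <- 3 - sign(diff_col_toy)
print(ecol_toy)
```

Entries equal to 3 correspond to unchanged pairs; they carry no edge in the differential network, so that color is not drawn.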
library(Matrix)
print(Matrix(data$Vars$F, sparse = TRUE))
Therefore, the $B$ matrices and the $F$ matrix in data$Vars are the true values in our simulated model. We then need to estimate $\hat{B}$ and $\hat{F}$ with the FSSEM algorithm.
The FSSEM algorithm takes as input the gene expression data and the corresponding genotype data of the two conditions, which are stored in data$Data.
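For orientation, the role of $B$ and $F$ can be written as a structural equation model. This is a sketch in the usual SEM form, not a formula quoted from the package: the symbols $y$ (expression vector), $x$ (genotype vector), $\mu$ (intercept), $\epsilon$ (noise) and the condition index $k$ are notation assumed here.

```latex
y^{(k)} = B^{(k)} y^{(k)} + F x^{(k)} + \mu^{(k)} + \epsilon^{(k)}, \qquad k = 1, 2,
\qquad
\Delta B = B^{(2)} - B^{(1)}
```

Under this reading, the nonzero entries of $B^{(k)}$ are the edges of the GRN in condition $k$, and the nonzero entries of $\Delta B$ are the edges of the differential GRN.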
head(data$Data$Y[[1]])
head(data$Data$Y[[2]])
head(data$Data$X[[1]] - 1)
head(data$Data$X[[2]] - 1)
data$Data$Sk stores the indices of each gene's cis-eQTLs. In real data applications, we recommend using the package MatrixEQTL to find the significant cis-eQTLs for the genes of interest and to build Sk for your research.
head(data$Data$Sk)
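As an illustration of the structure described above, the following sketch builds an Sk-like list from a table of significant cis-eQTL pairs. It uses only base R and assumes that Sk holds, for each gene, the 1-based row indices of its cis-eQTLs in the genotype matrix. The table cis_pairs, its column names, and all values are hypothetical; they do not reproduce the output format of MatrixEQTL.

```r
# Hypothetical table of significant cis-eQTL pairs (gene, SNP).
cis_pairs <- data.frame(
  gene = c("g1", "g1", "g2", "g3", "g3", "g3"),
  snp  = c("rs1", "rs2", "rs4", "rs7", "rs8", "rs9"),
  stringsAsFactors = FALSE
)
gene_ids <- c("g1", "g2", "g3")   # assumed row order of the expression matrix
snp_ids  <- paste0("rs", 1:9)     # assumed row order of the genotype matrix

# For each gene, look up the row indices of its cis-eQTLs in the genotype matrix.
Sk_toy <- lapply(gene_ids, function(g) {
  match(cis_pairs$snp[cis_pairs$gene == g], snp_ids)
})
names(Sk_toy) <- gene_ids
str(Sk_toy)
```

Keeping gene_ids in the same order as the rows of the expression matrix matters, because the list is matched to genes by position.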
Initializing fssemR by ridge regression
We run fssemR on the observed gene expression and genetic perturbation data stored in data$Data. The algorithm is initialized by ridge regression, and the hyperparameter $\gamma$ of the $l_2$-norm penalty is selected by 5-fold cross-validation.
Xs = data$Data$X   ## eQTL's genotype data
Ys = data$Data$Y   ## gene expression data
Sk = data$Data$Sk  ## cis-eQTL indices
gamma = cv.multiRegression(Xs, Ys, Sk, ngamma = 50, nfold = 5,
                           n = data$Vars$n, p = data$Vars$p, k = data$Vars$k)
fit0 = multiRegression(Xs, Ys, Sk, gamma, trans = FALSE,
                       n = data$Vars$n, p = data$Vars$p, k = data$Vars$k)
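To make the idea of choosing $\gamma$ by 5-fold cross-validation concrete, here is a minimal base R sketch for a single response. It is not the implementation behind cv.multiRegression; the helper ridge_fit, the toy data, and the grid of candidate values are all assumptions made for illustration.

```r
set.seed(1)
n_obs <- 60
x_toy <- matrix(rnorm(n_obs * 5), n_obs, 5)   # toy predictors
y_toy <- drop(x_toy %*% c(1, 0, -1, 0, 0.5)) + rnorm(n_obs, sd = 0.3)

# Closed-form ridge estimate: (X'X + gamma * I)^{-1} X'y
ridge_fit <- function(x, y, gamma) {
  solve(crossprod(x) + gamma * diag(ncol(x)), crossprod(x, y))
}

gammas <- 10^seq(-3, 2, length.out = 20)        # candidate penalties
folds  <- sample(rep(1:5, length.out = n_obs))  # 5-fold assignment

# Mean squared prediction error on the held-out folds for each gamma.
cv_err <- sapply(gammas, function(g) {
  mean(sapply(1:5, function(f) {
    held_out <- folds == f
    beta <- ridge_fit(x_toy[!held_out, , drop = FALSE], y_toy[!held_out], g)
    mean((y_toy[held_out] - x_toy[held_out, , drop = FALSE] %*% beta)^2)
  }))
})

gamma_best <- gammas[which.min(cv_err)]
gamma_best
```

The selected value is the candidate with the smallest average held-out error, which is the same selection principle the vignette describes for $\gamma$.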
Then, we use the fit0 object from ridge regression as the initialization and run the fssemR algorithm. BIC is used to select the optimal hyperparameters $\lambda$ and $\rho$, where nlambda is the number of candidate lambda values for the $l_1$ regularization term, and nrho is the number of candidate rho values for the fused lasso regularization term.
fitOpt <- opt.multiFSSEMiPALM2(Xs = Xs, Ys = Ys, Bs = fit0$Bs, Fs = fit0$Fs, Sk = Sk,
                               sigma2 = fit0$sigma2, nlambda = 10, nrho = 10,
                               p = data$Vars$p, q = data$Vars$k, wt = TRUE)
fit <- fitOpt$fit
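The two regularization terms tuned by $\lambda$ and $\rho$ can be summarized schematically as follows. This is a sketch inferred from the description of the $l_1$ and fused lasso terms, not the exact objective of the package: $\mathcal{L}$ denotes the negative log-likelihood of the model and is notation assumed here, and any weights on the penalty terms are omitted.

```latex
\min_{B^{(1)},\, B^{(2)},\, F} \;
  \mathcal{L}\big(B^{(1)}, B^{(2)}, F\big)
  + \lambda \sum_{k=1}^{2} \big\| B^{(k)} \big\|_{1}
  + \rho \, \big\| B^{(2)} - B^{(1)} \big\|_{1}
```

A larger $\lambda$ makes each estimated GRN sparser, while a larger $\rho$ pushes the two GRNs to share more edges, so that the estimated differential GRN becomes sparser.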
cat("Power of two estimated GRNs = ",
    (TPR(fit$Bs[[1]], data$Vars$B[[1]]) + TPR(fit$Bs[[2]], data$Vars$B[[2]])) / 2)
cat("FDR of two estimated GRNs = ",
    (FDR(fit$Bs[[1]], data$Vars$B[[1]]) + FDR(fit$Bs[[2]], data$Vars$B[[2]])) / 2)
cat("Power of estimated differential GRN = ",
    TPR(fit$Bs[[1]] - fit$Bs[[2]], data$Vars$B[[1]] - data$Vars$B[[2]]))
cat("FDR of estimated differential GRN = ",
    FDR(fit$Bs[[1]] - fit$Bs[[2]], data$Vars$B[[1]] - data$Vars$B[[2]]))
These four metrics summarize how well the fssemR estimates match the ground truth, which is known here because the data are simulated.
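To show what power and FDR measure for a network estimate, here is a minimal base R sketch that compares the nonzero patterns of an estimated and a true matrix. The helpers support_tpr and support_fdr and the toy matrices are hypothetical; the TPR and FDR functions in fssemR may be defined differently, for example with a threshold on small entries.

```r
# Power (true positive rate): fraction of true edges that are recovered.
support_tpr <- function(est, truth) {
  sum(est != 0 & truth != 0) / max(sum(truth != 0), 1)
}

# False discovery rate: fraction of reported edges that are not true edges.
support_fdr <- function(est, truth) {
  sum(est != 0 & truth == 0) / max(sum(est != 0), 1)
}

truth_toy <- matrix(c(0, 1, 0,
                      0, 0, 1,
                      1, 0, 0), nrow = 3, byrow = TRUE)
est_toy   <- matrix(c(0, 0.8, 0.1,
                      0, 0.0, 0.9,
                      0, 0.0, 0.0), nrow = 3, byrow = TRUE)

support_tpr(est_toy, truth_toy)  # 2 of the 3 true edges are recovered
support_fdr(est_toy, truth_toy)  # 1 of the 3 reported edges is false
```

The max(..., 1) guard only avoids division by zero when a matrix has no nonzero entries.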
# fit$Bs[[2]] - fit$Bs[[1]]
## estimated differential GRN (condition 2 minus condition 1)
diffGRN = network(t(fit$Bs[[2]] - fit$Bs[[1]]) != 0, matrix.type = "adjacency", directed = TRUE)
# up-regulated edges are colored `red` and down-regulated edges are colored `blue`
ecol = 3 - sign(t(fit$Bs[[2]] - fit$Bs[[1]]))
plot(diffGRN, displaylabels = TRUE, label = network.vertex.names(diffGRN), label.cex = 0.5, edge.col = ecol)
Additionally, the differential effects between the two GRNs are estimated, so we can tell how the interactions change from one GRN to the other.
# note: this matrix is condition 1 minus condition 2, the opposite sign of the plot above
diffGRN = Matrix::Matrix(fit$Bs[[1]] - fit$Bs[[2]], sparse = TRUE)
rownames(diffGRN) = colnames(diffGRN) = rownames(data$Vars$B[[1]])
diffGRN
From diffGRN, we can determine how the gene-gene interactions in the GRN change across the two conditions, and then identify the key genes of the condition-specific gene regulatory network.
For more applications and for the replication of our real data analysis, please see https://github.com/Ivis4ml/fssemR/tree/master/inst.
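One simple way to look for key genes is to count how many changed interactions touch each gene. The sketch below does this on a toy difference matrix with base R; the matrix diff_toy, the gene names, and the ranking rule are assumptions for illustration, not a procedure prescribed by fssemR.

```r
# Hypothetical estimated difference matrix; entry [i, j] is read here as the
# change in the effect of gene j on gene i (assumed orientation).
genes_toy <- paste0("g", 1:4)
diff_toy <- matrix(c( 0.0, 0.4, 0, 0,
                      0.0, 0.0, 0, 0,
                     -0.3, 0.2, 0, 0,
                      0.0, 0.5, 0, 0),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(genes_toy, genes_toy))

# Count changed edges per gene: incoming (rows) plus outgoing (columns).
changed   <- diff_toy != 0
n_changed <- rowSums(changed) + colSums(changed)
sort(n_changed, decreasing = TRUE)
```

In this toy example g2 ranks first because three of its outgoing effects differ between the conditions. With a real estimate, the same counts can be computed from as.matrix(diffGRN).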
sessionInfo()