(1) Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
(2) Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
(3) Department of Statis-tics, Allameh Tabataba'i University, Tehran, Iran
"maleknias@gmail.com"
date: "r Sys.Date()
"
This package has developed a tool for performing a novel pathway enrichment analysis based on Bayesian network (BNrich) to investigate the topology features of the pathways. This algorithm as a biologically intuitive, method, analyzes of most structural data acquired from signaling pathways such as causal relationships between genes using the property of Bayesian networks and also infer finalized networks more conveniently by simplifying networks in the early stages and using Least Absolute Shrinkage Selector Operator (LASSO). impacted pathways are ultimately prioritized the by Fisher’s Exact Test on significant parameters. Here, we provide an instance code that applies BNrich in all of the fields described above.
This document offers an introductory overview of how to use the package. The BNrich tool uses Bayesian Network (BN) in a new topology-based pathway analysis (TPA) method. The BN has been demonstrated as a beneficial technique for integrating and modeling biological data into causal relationships (1–4). The proposed method utilizes BN to model variations in downstream components (children) as a consequence of the change in upstream components (parents). For this purpose, The method employs 187 KEGG human non-metabolic pathways (5–7) which their cycles were eliminated manually by a biological intuitive, as BN structures and gene expression data to estimate its parameters (8,9). The cycles of inferred networks were eliminated on the basis of biologically intuitive rules instead of using computing algorithms (10). The inferred networks are simplified in two steps; unifying genes and LASSO. Similarly, the originally continuous gene expression data is used to BN parameters learning, rather than discretized data (8). The algorithm estimates regression coefficients by continuous data based on the parameter learning techniques in the BN (11,12). The final impacted pathways are gained by Fisher’s exact test. This method can represent effective genes and biological relations in impacted pathways based on a significant level.
install.packages("BNrich_0.1.0.tar.gz", type="source", repos=NULL) library("BNrich")
At first, we can load all the 187 preprocessed KEGG pathways which their cycles were removed, the data frame includes information about the pathways and vector of pathway ID.
destfile = tempfile("files", fileext = ".rda") files <- fetch_data_file() load(destfile)
Note that it's better to use (for example:) destfile = "./R/BNrich-start.rda"
to save essential files permanently.
The input data should be as two data frames in states disease and (healthy) control. The row names of any data frame are KEGG geneID and the number of subjects in any of them should not be less than 20, otherwise the user may encounters error in LASSO
step. Initially, we can load dataset example.
The example data extracted from a part of GSE47756
dataset, the gene expression data from colorectal cancer study (13).
Data <- system.file("extdata", "Test_DATA.RData", package = "BNrich", mustWork = TRUE) load(Data) head(dataH)
| | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | |-----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------| | hsa:1 | 3.37954 | 3.3469 | 3.78383 | 3.35186 | 3.2091 | 3.40245 | 4.06329 | 3.43424 | | hsa:100 | 3.1147 | 3.15981 | 3.37842 | 2.69868 | 3.43759 | 3.38588 | 2.95406 | 3.09631 | | hsa:10000 | 3.21876 | 2.93611 | 2.62708 | 3.13507 | 2.62864 | 2.61367 | 2.7336 | 2.70867 | | hsa:1001 | 3.4549 | 3.18683 | 3.34896 | 3.36903 | 3.49353 | 3.35175 | 3.27893 | 3.63678 | | hsa:10010 | 2.17522 | 2.59843 | 2.56868 | 2.95009 | 2.52181 | 2.24635 | 2.05092 | 2.10438 | | hsa:10013 | 2.992 | 2.94325 | 3.22677 | 2.87371 | 3.063 | 2.97679 | 3.07247 | 3.08168 |
head(dataD)
| | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | |-----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------| | hsa:1 | 3.29082 | 3.15924 | 3.45716 | 3.15391 | 3.29514 | 3.36502 | 3.63823 | 3.22192 | | hsa:100 | 3.069 | 2.97546 | 2.99117 | 2.88929 | 3.00292 | 2.94948 | 2.93906 | 3.36357 | | hsa:10000 | 2.68424 | 3.24284 | 3.57435 | 2.46992 | 4.57649 | 3.87179 | 2.94405 | 3.54207 | | hsa:1001 | 3.27815 | 2.91081 | 3.53487 | 2.95122 | 2.67742 | 2.72358 | 3.10172 | 3.07123 | | hsa:10010 | 2.68051 | 3.22719 | 3.58798 | 2.61269 | 3.72397 | 3.29004 | 2.5843 | 2.95756 | | hsa:10013 | 3.05107 | 2.86273 | 3.06863 | 3.05318 | 3.04536 | 2.92021 | 3.12596 | 3.0468 |
Initially, we need to unify gene products based on 187 imported signaling pathways (mapkG
list) in two states disease (dataD) and control (dataH). This is the first simplification step, unifying nodes in signaling pathways with genes those exist in gene expression data.
unify_results <- unify_path(dataH, dataD, mapkG, pathway.id)
The unify_path
function performs the following processes:
• Split datasets into KEGG pathways
• Delete all gene expression data are not in pathways
• Removes all gene products in pathways are not in dataset platforms
• Remove any pathways with the number of edges is less than 5
This function returns a list contain data_h
,data_d
,mapkG1
and pathway.id1
. data_h
and data_d
are lists contain data frames related to control and disease objects unified for any signaling pathways. The mapkG1
is a list contains unified signaling pathways and pathway.id1
is new pathway ID vector based on remained pathways.
In the example dataset, the number of edges in the one pathway becomes less than 5 and are removed:
length(mapkG)
187
mapkG1 <- unify_results$mapkG1 length(mapkG1)
186
As well, the number of edges reduces in the remaining pathways. In first pathway hsa:01521
the number of edges from 230 reduces to 204:
pathway.id[1]
"hsa:01521"
mapkG[[1]]
A graphNEL graph with directed edges Number of Nodes = 79 Number of Edges = 230
pathway.id1 <- unify_results$pathway.id1 pathway.id1[1]
"hsa:01521"
mapkG1[[1]]
A graphNEL graph with directed edges Number of Nodes = 71 Number of Edges = 204
Now we can construct BN structures based on unified signaling pathways and consequently need the results of unify_path
function.
BN <- BN_struct(unify_results$mapkG1)
The BN_struct
function returns a list contains BNs structures reconstructed from all mapkG1
.
Given that the data used is continuous, each node is modeled as a regression line on its parents (11,14). Thus, on some of these regression lines, the number of these independent variables is high, so in order to avoid the collinearity problem, we need to use the Lasso regression (15,16).
We perform this function for any node with more than one parent, in all BNs achieved by BN_struct
function, based on control and disease data obtained by unify_results
function.
data_h <- unify_results$data_h data_d <- unify_results$data_d LASSO_results <- LASSO_BN(BN, data_h, data_d)
The LASSO_BN function returns a list contains two lists BN_H and BN_D are simplified BNs structures based on LASSO regression related to healthy and disease objects. This function lead to reduce number of edges too:
nrow(arcs(BN[[1]]))
204
nrow(arcs(LASSO_results$BN_H[[1]]))
116
nrow(arcs(LASSO_results$BN_D[[1]]))
116
Now we can estimate (learn) parameters for any BNs based on healthy and disease data lists.
BN_H <- LASSO_results$BN_H BN_D <- LASSO_results$BN_D esti_results <- esti_par(BN_H, BN_D, data_h, data_d)
The esti_par
function returns a list contains four lists. The BN_h
, BN_d
, are lists of BNs which their parameters learned by control and disease objects data. The coef_h
and coef_d
are lists of parameters of BN_h
and BN_d
.
As you can see in below, node hsa:1978
in the first BN has one parent. The coefficient in control (healthy) data is 0.6958609
and in disease data is 1.1870730
.
esti_results$BNs_h[[1]]$` hsa:1978`
Parameters of node hsa:1978 (Gaussian distribution) Conditional density: hsa:1978 | hsa:2475 Coefficients: (Intercept) hsa:2475 2.8841264 0.6958609 Standard deviation of the residuals: 0.3489612
esti_results$BNs_d[[1]]$`hsa:1978`
Parameters of node hsa:1978 (Gaussian distribution) Conditional density: hsa:1978 | hsa:2475 Coefficients: (Intercept) hsa:2475 0.9046357 1.1870730 Standard deviation of the residuals: 0.2713789
We require the variance of the BNs parameters to perform the T-test between the corresponding parameters.
BN_h <- esti_results$BNs_h BN_d <- esti_results$BNs_d coef_h <- esti_results$coef_h coef_d <- esti_results$coef_d var_mat_results<- var_mat (data_h, coef_h, BN_h, data_d, coef_d, BN_d)
The var_mat
function returns a list contains two lists var_mat_Bh
and var_mat_Bd
which are the variance-covariance matrixes for any parameters of BN_h
and BN_d
. The variance-covariance matrixes for the fifth node,hsa:1978
, in first BN in two states control and disease is as follow:
(var_mat_results$var_mat_Bh[[1]])[5]
[,1] [,2] [1,] 10.177073 -3.630152 [2,] -3.630152 1.296990
(var_mat_results$var_mat_Bd[[1]])[5]
[,1] [,2] [1,] 3.549338 -1.0392040 [2,] -1.039204 0.3053785
T-test perfoms between any corresponding parameters between each pair of learned BNs, BN_h
and BN_d
, in disease and control states. Assumptions are unequal sample sizes and unequal variances for all samples.
var_mat_Bh <- var_mat_results $var_mat_Bh var_mat_Bd <- var_mat_results $var_mat_Bd Ttest_results <- parm_Ttest(data_h, coef_h, BN_h, data_d, coef_d, BN_d, var_mat_Bh, var_mat_Bd, pathway.id1) head(Ttest_results)
| From | To | pathway.number | pathwayID | Pval | coefficient in disease | coefficient in control | fdr | |-----------------|----------------|----------------------|-----------------|----------------|--------------------------------|------------------------------|----------------| | intercept | hsa:2065 | 1 | hsa:01521 | 0.605294 | 4.893503 | 5.535163 | 6.72E-01 | | hsa:7039 | hsa:2065 | 1 | hsa:01521 | 2.04E-05 | 1.072296 | -0.21107 | 6.95E-05 | | hsa:1950 | hsa:2065 | 1 | hsa:01521 | 0.154223 | 0.125977 | -0.21675 | 2.11E-01 | | hsa:4233 | hsa:2065 | 1 | hsa:01521 | 0.083296 | -0.63254 | -0.33154 | 1.23E-01 | | hsa:3084 | hsa:2065 | 1 | hsa:01521 | 0.135981 | -0.55586 | -0.18792 | 1.89E-01 | | hsa:9542 | hsa:2065 | 1 | hsa:01521 | 0.373051 | -0.39859 | -0.11334 | 4.49E-01 |
This function returns a data frame contains T-test results for all parameters in all final BNs. The row that is intercept in From
variable, shows significance level for gene product that is shown in To
variable. The rest of the data frame rows shows significance level for any edge of networks.
In the last step we can determine enriched pathways by own threshold on p-value
or fdr
. Hence we run the Fisher's exact test for any final pathways. As stated above, the Ttest_results
is a data frame contains T-test results for all parameters in final BNs achieved by parm_Ttest
function and fdr.value
A numeric threshold to determine significant parameters (default is 0.05).
BNrich_results <- BNrich(Ttest_results, pathway.id1, PathName_final, fdr.value = 0.05) head(BNrich_results)
| pathwayID | p.value | fdr | pathway.number | Name | |-----------------|----------------|----------------|----------------------|-----------------------------------------------| | hsa:05016 | 2.66E-17 | 2.47E-15 | 123 | Huntington disease | | hsa:05202 | 1.64E-17 | 2.47E-15 | 156 | Transcriptional misregulation in cancer | | hsa:05012 | 2.92E-16 | 1.81E-14 | 121 | Parkinson disease | | hsa:05010 | 1.55E-11 | 7.19E-10 | 120 | Alzheimer disease | | hsa:04144 | 3.25E-08 | 1.21E-06 | 22 | Endocytosis | | hsa:04714 | 2.99E-07 | 9.26E-06 | 72 | Thermogenesis |
The following package and versions were used in the production of this vignette.
R version 3.6.1 (2019-07-05) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C [5] LC_TIME=English_United Kingdom.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BNrich_0.1.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.2 codetools_0.2-16 lattice_0.20-38 corpcor_1.6.9 [5] foreach_1.4.7 glmnet_2.0-18 digest_0.6.20 grid_3.6.1 [9] stats4_3.6.1 evaluate_0.14 graph_1.63.0 Matrix_1.2-17 [13] rmarkdown_1.14 bnlearn_4.5 iterators_1.0.12 tools_3.6.1 [17] parallel_3.6.1 xfun_0.8 yaml_2.2.0 rsconnect_0.8.15 [21] compiler_3.6.1 BiocGenerics_0.31.5 htmltools_0.3.6 knitr_1.24
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.