In LucaGiudice/Esearch3D: Leveraging gene expression and chromatin architecture to predict intergenic enhancers

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Esearch3D advanced with promoter-genes connections

Welcome to the fourth vignette of our software for the prediction of the activity of integenic fragments. We suggest to start from the first vignette because introduces the software. While, this vignette proposes a real application of Esearch3D, advanced operations and requires at least 32GB of RAM for being finished properly. It goes through all the main operations of the software with real data. Precisely, this vignette replicates the results of our software proposed in the publication where genes are connected to the chromatin fragemnt where their promoter map to. This creates the CIN network called: WG-based mESC DNaseI-capture CIN. This vignette includes also the classification of enhancers with machine learning. We hope that providing this vignette can help to understand the data, the operations and the results of our software.

Enviroment set up

Let us clean the R enviroment, set the working directory, load the software package, set the random seed generator for reproducing always the same results and define the number of cores available in the computer to run the operations. Be careful: set up the number of cores based on your resources, if you are not secure how to, then just set equal to 2

#Clean workspace and memory ----
rm(list=ls())
gc()

#Set working directory ----
gps0=getwd()
gps0=paste(gps0,"/%s",sep="")
rootDir=gps0
setwd(gsub("%s","",rootDir))

#Load libraries ----
suppressWarnings(suppressMessages(
  library("Esearch3D", quietly = T)
  )
)

#Set variables ----
#Set seed to get always the same results out of this vignette
set.seed(8)
#Set number of cores to parallelize the tasks
n_cores=16

Data set up

For this vignette, We created a DNaseI-capture HiC derived CIN whereby captured regions harbour DNaseI sensible regions in mouse embryonic stem cells (mESC), enriching for interactions of chromatin accessible regions. A chromatin fragment representing a genomic locus is represented as a node; a fragment-fragment interaction as an edge. We then integrated genes as nodes within the CIN. In this case, they are connected by an edge to the node that their promoter map to; we obtained the WG-based mESC DNaseI-capture CIN.

#Load and set up the example data ----
data("wg_ann_data_l")
#gene - fragment interaction network generated from DNase_Prop1_mESC_TSS interactions data
gf_net=wg_ann_data_l$gf_net
#gene-fragment-fragment interaction network generated from mESC_DNase_Net interactions data
ff_net=wg_ann_data_l$ff_net
#sample profile with starting values for genes and fragments generated from mESC_bin_matrix_Prop1
input_m=wg_ann_data_l$input_m

Extra data for the classification step

This vignette includes a step where fragments are classified as enhancers and not. This operation requires a dataframe of metainformation related to the chromatin fragments. Specfically, a dataframe such that for each chromatin fragment there is the number of enhancer annotations associated to it. It does not require that all the fragments must be annotated.

#info dataframe containg for each node and fragment the number of enhancer annotations associated to it
info=wg_ann_data_l$info
info=info[info$Type!="Bait",]
info=info[,-3]
head(info)
#dataframe containg mmu nomenclature about genes
mouse_db=wg_ann_data_l$mouse_db;head(mouse_db);

Centrality measures

Esearch3D determines the centrality measures of the genes and fragments included in the CIN and that have a prior information about being or not being enhancer. Centrality measure define how much each node in the CIN network is important and is positioned. For example, high eigencentrality means that the fragment is crucial for the connectivity, central to the network and not a leaf node.

#Process info matrix to get fragments extra information ----
info=get_centr_info(gf_net,ff_net,info);head(info)

Nomenclature change

Esearch3D includes a database of the gene names related to the mmu specie. It allows to convert ENS genes to Symbols.

#Convert ENSG to SYMB ----
gf_net[,1]=mapvalues(gf_net[,1],from = mouse_db$ENS, to = mouse_db$Symbol, warn_missing = F)
gf_net[,2]=mapvalues(gf_net[,2],from = mouse_db$ENS, to = mouse_db$Symbol, warn_missing = F)
ens_row=rownames(input_m);sym_row=mapvalues(ens_row,from = mouse_db$ENS, to = mouse_db$Symbol, warn_missing = F);rownames(input_m)=sym_row

Tuned multi-gene two-step propagation

When Esearch3D has available metainformation about which nodes of the CIN are enhancers, it can find the values of the isolation parameter r which optimize the flow of the information from the genes to the enhancers. In the first propagation, the expression of the genes is propagated from their corresponding nodes into only the genic fragments. Be carefull: r is the isolation parameter, use low value for first step, use high value for second step In the second propagation, the gene expression is then propagated to the rest of the CIN. The genic and intergenic fragments receive an imputed activity score (IAS) reflecting the likelihood of enhancer activity.

#Tuning of propagation setting -----
res_tuning=tuning_prop_vars(gf_net,ff_net,input_m,n_cores=n_cores);head(res_tuning)

#Two step propagation -----
#Propagated for the network gene-fragment
gf_prop=rwr_OVprop(g=gf_net,input_m = input_m, no_cores=n_cores, 
                   r=res_tuning$best_comb$r1,
                   stop_step = res_tuning$best_comb$stop_iters)
#Propagated for the network fragment-fragment
ff_prop=rwr_OVprop(g=ff_net,input_m = gf_prop, no_cores=n_cores, 
                   r=res_tuning$best_comb$r2,
                   stop_step = res_tuning$best_comb$stop_iters)

Enhancer classification

Esearch3D merges the propagation results with the metainformation about the nodes of the CIN. Then it creates a random forest classifier to understand which centrality measure is predictive of intergenic enhancer and intergenic non-enhancer nodes.

#Merge propagation results with meta information about enhancer annotations
info=merge_prop_info(ff_prop,info)

#Build enhancer classifier and explainer
res_ml=enhancer_classifier(info, n_cores=n_cores)

Model explainer

Esearch3D explains the best classifier to understand how well the classifier predicted and which metainformation has been more predictive of the enhancer activity.

lrn_dalex is the dalex object containing the learner
fi_ranger is the dalex object containing the trained and tested classifier
fi_ranger_df is the dataframe indicating how much each node's feature contributed in the classification
bd_ranger_enh is the dalex object containing the classification of an enhancer
bd_ranger_no is the dalex object containing the classification of a non enhancer

#Build explainer
res_dalex=explain_classifier(res_ml, n_cores=n_cores)

#Plot results
res_dalex$fi_ranger_df

plot(res_dalex$fi_ranger)

plot(res_dalex$bd_ranger_enh)

plot(res_dalex$bd_ranger_no)

plot(res_dalex$pr_ranger)