knitr::opts_chunk$set(echo = TRUE)
Identifying causal noncoding variants remains a daunting task. Because noncoding variants exert their effects in the context of a gene regulatory network (GRN), we hypothesize that explicit use of disease-relevant GRN can significantly improve the inference accuracy of noncoding risk variants. We describe Annotation of Regulatory Variants using Integrated Networks (ARVIN), a general computational framework for predicting causal noncoding variants. For each disease, ARVIN first constructs a GRN using multi-dimensional omics data oncell/tissue-type relevant to the disease. ARVIN then uses a set of novel regulatory network-based features, combined with sequence-based features to make predictions.
This user guide explains ARVIN package. If you want to make a quick start and run ARVIN, please refer to the related document in this repository (ARVIN_Quick_Start.docx).
ARVIN uses genome annotation and motif data. You need to download http://tanlab4generegulation.org/arvin_annotation_data.tar.gz to run ARVIN. In addition, please make sure following dependency packages also installed including "caret", "randomForest", and "igraph".
As a first step of ARVIN, an integrative GRN for each disease-relevant cell/tissue type should be constructed. In this network, there are two types of nodes which are genes and snps/variants as well as two types of edges which are gene-gene interacstions and snp-gene interactions. Gene interaction network is used as the network backbone, and then this backbone is integrated with enhancer-promoter interactions(snp-gene interactions). For each type of nodes and edges, normalized scores are computed using cell/tissue speicific information.
ARVIN uses enhancer-promoter interaction for mapping SNPS to genes via enhancers. The enhancer-promoter interaction data must be in tab separated format as follows:
chr9 22124001 22126001 ENST00000452276 0.93
chr2 242792001 242794001 ENST00000485966 0.792
You can use IM-PET software for predicting enhancer-promoter interactions. For this purpose, you can download IM-PET from http://tanlab4generegulation.org/IM-PET.html . IM-PET uses enhancer predictions (computed using CSI-ANN) and gene expression data to compute enhancer-promoter interactions. You can use the output of IM-PET (which is as shown above) as input for ARVIN.
Gene interaction network can be obtained from multiple sources including protein-protein interaction networks and functional interaction networks such as BioGRID, BIND, HumantNet, STRING and etc.
To make different types of scores comparable, we used a min-max normalization to normalize scores within each category. For gene interaction network, the interaction score indicating how strong the interactions between genes can be directly used from corresponding database followed by normalization. SNP-gene interaction score can be assigned with enhancer-promoter score or scores indicating the interaction strength between enhancer and target promoters. SNPs weight can be represented by TF binding disruption score computed using our script/function. Genes can be weighted using differential expression information.
There are two types of network input files users need to prepare. One is the node attribute file and the other is network/edge attribute file. The node attribute file has 3 columns. The first column denotes snp id or gene id, and the second column denotes the score of this snp or gene. The third column specifiy if this node is a snp or gene. In the network file, there are 4 columns. The first two columns list two nodes of a given edge. The third collumn has the normalized score for this edge. The forth column specifies edge type indicating if this interaction is between snps and genes or genes.
edgeFile <- "example_1/EdgeFile.txt" nodeFile <- "example_1/NodeFile.txt" SNP_pFile <- "example_1/snp_pval.txt" edge_data <- read.table(edgeFile, sep="\t") node_data <- read.table(nodeFile, sep="\t") colnames(node_data) <- c("Node", "Node score", "Node type") colnames(edge_data) <- c("First node", "Second node", "Edge score", "Edge type") edge_data[98:103,] node_data[98:103,]
devtools::load_all(".") library(ARVIN) library(igraph) Net <- makeNet(edgeFile, nodeFile) node_data <- read.table(nodeFile) node_data <- subset(node_data, node_data$V3 == "eSNP") eSNP_seeds <- as.character(node_data$V1)#use all SNPs as seeds for search node_data <- NULL#free memory edge_data <- read.table(edgeFile) V_weight <- V(Net)$weight#put the node weight into an array to save time names(V_weight) <- V(Net)$name#set the names for the array E_adj <- as_adj(Net,attr="weight")#use adj matrix to save time for getting edge weight colnames(E_adj) <- names(V_weight) row.names(E_adj) <- names(V_weight) Adj_List <- c() batch <- 1000#split the matrix into a few parts in case of a memory error index <- floor(dim(E_adj)[1]/batch) for(i in 1:index){ ifelse(i != index, end <- i * batch, end <- dim(E_adj)[1] ) start <- (i-1) * batch + 1 cur_list <- apply(E_adj[start:end,], 1, function(x) x[x!=0]) Adj_List <- append(Adj_List, cur_list) }
For most of network features such as the centrality, we wrapped up functions from "igraph" package to calculate their values. We also implemented our module identification algorithm to find modules containing snps we are intereted in. To calculate all network based features, users can simply call NetFeature(). SNPs in the same enhancer usually have similar topological features. However, we can further distinguish them using their TF motif breaking scores.
Nodes <- as.character(edge_data[,2]) Net <- makeNet(edgeFile, nodeFile) topoFeature <-NetFeature(Net, nodeFile, edgeFile, SNP_pFile) head(topoFeature)
Betweenness is a centrality measure of a vertex within a graph. Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.
bet_vals <- BetFeature(Net, edge_data) head(bet_vals)
Closeness is a measure of the degree to which an individual is near all other individuals in a network. It is the inverse of the sum of the shortest distances between each node and every other node in the network. Closeness is the reciprocal of farness.
close_vals <- CloseFeature(Net, edge_data) head(close_vals)
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results. PageRank is a way of measuring the importance of website pages. In the biological networks, we can also use this algorithm to measure the importance of genes/nodes.
page_vals <- PageFeature(Net, edge_data) head(page_vals)
The weighted degree of a node is like the degree. It's based on the number of edge for a node, but ponderated by the weigtht of each edge. It's doing the sum of the weight of the edges.
wd_vals <- WDFeature(Adj_List, edge_data) head(wd_vals)
Gene modules downstream of an eSNP. Our overall hypothesis is that a causal eSNP contributes to disease risk by directly causing expression changes in genes of diseaserelevant pathways. Thus, in addition to the direct target gene of the eSNP, other genes in the same pathway can also provide discriminative information. With the weighted GRN, our goal is to identify “heavy” gene modules in the network that connects a given eSNP to a set of genes
mod_vals <- ModuleFeature(Adj_List, E_adj, eSNP_seeds, V_weight, Nodes) head(mod_vals)
ARVIN uses sequence features for the input SNPs generated by GWAVA. GWAVA is an open-source software developed by Sanger Institute. You can either upload the SNPs to GWAVA web page and get the output or download the source and run locally. For running GWAVA online navigate to https://www.sanger.ac.uk/sanger/StatGen_Gwava, upload the list of input SNPs and get the features in csv format, which will be input for ARVIN. If you prefer to run it locally, you need to dowload the source code from ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/src/ and annotation data from ftp://ftp.sanger.ac.uk/pub/resources/software/gwava/v1.0/source_data/ . Then you can run it local by running gwava_annotate.py and generate the features, which will be input for ARVIN.
gwava <- read.table("example_1/gwava_matrix.txt", header=T, sep="\t") gwava[1:8,168:174]
ARVIN also uses sequence features generated by FunSeq. FunSeq can also be run online or binaries can be downloaded to run locally.
For running FunSeq online, navigate to http://funseq.gersteinlab.org/analysis and upload the list of SNPs that you want to analyze. In the web page, it is noted that the input SNPs can be uploaded in bed format, SNP coordinates followed by reference and alternate alleles; but we discovered that it fails to process bed input. In order to have it run, you the first two seperators need to be two spaces and last two separators need to be tabs, as follows:
chr16··4526757··4526758 G A
chr14··52733136··52733137 C A
where each dot (·) represents a space.
Then, FunSeq will generate the features by selecting “bed” as the output format., which will be used as input by ARVIN.
If you prefer to run FunSeq locally, you can download FunSeq binaries from http://funseq.gersteinlab.org/static/funseq-0.1.tar.gz and extract it into your local. You will also need to download FunSeq annotation data from http://funseq.gersteinlab.org/static/data/data.tar.gz , extract it into directory that you saved the binaries. Then you can run FunSeq binary file by setting the output format to bed.
funseq <- read.table("example_1/funseq_matrix.txt", header=T, sep="\t") head(funseq)
Combining network, GWAVA features and FunSeq features, a random forest model can be trained to predict risk SNPs.
#combine 3 types of features group <- as.character(read.table("example_1/snp_labels.txt")[,1]) features <- data.frame(topoFeature, gwava, funseq, group) RFmodel <- trainMod(features)#train a random forest classifier
By providing the trained random forest model with feature values, users can compute the prediction score for a list of snps or candidate variants for being risk snps or non-risk snps.
prob <- predMod(features[,-dim(features)[2]], RFmodel)#estimate the probablity score using trained model after removing the label/group informatin head(prob)#display the probablity score as positive or negative snps
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.