The purpose of this package is to allow the user to find pairs of annotations that are enriched in a network. For instance, the network might be a gene regulatory network, where the nodes represent genes and the edges represent regulatory interactions between pairs of genes. Each gene might have one or more functional annotation labels (such as GO terms) associated with them. The user might be interested in learning whether a specific GO term is enriched upstream of a second GO term. This function works with directed networks, with and without edge weights. The results can be depicted as either a network or a heatmap. More complete information is available here: http://eprints.whiterose.ac.uk/154294/
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) knitr::opts_chunk$set(fig.width=8, fig.height=8)
When no edge weights are considered, enrichment between pairs of genes is determined by the binomial test.
When a gene network contains edge weights, we calculate the sum of the edge weights of each edge type, and we would like to know whether this value is higher than would be expected by chance. For two functional annotations $a$ and $b$, let us define $z_{a,b}$ as the sum of the edge weights of edge type $(a,b)$ in the network. Let us say that $c_{a,b}$ is the count of the number of edges of that type. $P(c_{a,b}=i)$ is the probability of observing exactly $i$ edges of type $(a,b)$ and $P(x \geq z_{a,b} | c_{a,b}=i)$ is the probability of observing a sum of edge weights greater than $z_{a.b}$ given that $c_{a,b}=i$. The probability of observing at least $z_{a,b}$ is: % a weighted average of the probability of observing at least $z_{a,b}$ if there were exactly $c_{a,b}$ edges of edge type $(a,b)$ for all $c_{a,b}$ from $1$ to $N$ (where $N$ is the number of edges of the network). \begin{equation} \label{eq:summation} P(x \geq z_{a,b})=\sum_{i=1}^N P(c_{a,b}=i) P(x \geq z_{a,b} | c_{a,b}=i) \end{equation}
where $N$ is the number of edges in the network. Note that $P(x \geq z_{a,b})$ is the p-value.
First, let us construct a random network with 300 nodes, 1000 edges and 14 GO terms.
library(PAFway) set.seed(123) #Make 300 nodes nodes=paste("node", c(1:300)) #First 3 node names: print(nodes[1:3]) #Assign them random GO terms randomGO=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N")[sample(c(1:14), 300, replace=T)] names(randomGO)=nodes #First 3 nodes and associated GO terms: print(randomGO[1:3]) #Make 1000 edges edgesRandom=t(sapply(c(1:1000), function(i){ nodes[sample(300, 2)] })) #First 3 edges: print(edgesRandom[1:3,])
We can also consider a random network, where each gene can have more than one functional annotation:
#Assign each node a second GO term, separated by a '_' symbol randomGO2=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N")[sample(c(1:14), 300, replace=T)] randomGO_multiple=sapply(c(1:300), function(i){paste(randomGO[i], randomGO2[i], sep="_")}) names(randomGO_multiple)=nodes #print first 5 elements, as a demo print(randomGO_multiple[1:5])
We can also select a sub-set of GO terms which we consider 'interesting':
#Interesting GO terms: GO_interesting=c("B", "C", "D", "F")
PAFway works with and without edge weights. Here we generate random edge weights:
random_edge_weights=rnorm(length(edgesRandom[,1]), 1, 0.001) print(random_edge_weights[1:5]) print(length(random_edge_weights))
Now that we've set up some networks, we can see whether there are any enriched pairwise relationships between edges.
First, we will use the main pafway function to find p-values for each function annotation pair. We will start off with the simplest example, which doesn't use any optional parameters:
#This will run pafway, with no edge weights, for all the GO terms a=pafway(randomGO, edgesRandom, unique(randomGO)) print(a[1:5, 1:5])
This can be displayed as either a heatmap or a network:
draw_network(a) draw_heatmap(a)
However, in this example, we do not correct for multiple hypothesis testing, so we get quite a few false positives, even though we know that we have started with a random network. Let us re-draw these, after adjustment for multiple hypotheses. We now see that no edges have a p-value<0.05.
draw_network(a, adjMethod = "bonferroni") draw_heatmap(a, adjMethod = "bonferroni")
Next, we'll repeat the previous analysis, but with edge weights. We will start off with the simplest example, which doesn't use any optional parameters:
#This will run pafway, with no edge weights, for all the GO terms b=pafway_edge_weights(randomGO, cbind(edgesRandom, random_edge_weights), unique(randomGO)) print(b[1:5, 1:5])
This can be displayed as either a heatmap or a network:
draw_network(b) draw_heatmap(b)
However, in this example, we do not correct for multiple hypothesis testing, so we get quite a few false positives, even though we know that we have started with a random network. Let us re-draw these, after adjustment for multiple hypotheses. We now see that no edges have a p-value<0.05.
draw_network(b, adjMethod = "bonferroni") draw_heatmap(b, adjMethod = "bonferroni")
If you use a larger database of GO terms, there may be more than one annotation per gene. These can be appended together, separated by a "_" symbol:
#Multiple GO terms are in: randomGO_multiple b=pafway(randomGO_multiple, edgesRandom, unique(randomGO), exact=F) print(b[1:5], b[1:5])
It is usually recommended that you only perform this analysis on GO terms that are of particular interest to you, because otherwise you will lose a lot of statistical power by comparing pairs of GO terms that you don't care about.
#GO terms of interest are in GO_interesting b=pafway(randomGO, edgesRandom, GO_interesting) print(b[1:5], b[1:5])
If you are running pafway (specifically with edge weights) on a really big network, it can take a while. It might be worth decreasing accuracy and increasing speed. One way to do this is to change the threshold at which a value in a pdf is rounded to zero. Another way is to change the step size.
#This will run pafway, with no edge weights, for all the GO terms b=pafway_edge_weights(randomGO, cbind(edgesRandom, random_edge_weights), unique(randomGO), step=0.001, thresholdZero=0.0001) print(b[1:5, 1:5])
Genes may have multiple functional annotations associated with them. For instance, it is often the case that \emph{all} genes that have one GO term will also have a different specific GO term, because GO terms have hierarchical arrangements (e.g. all genes involved in `blue light sensing' are also 'light sensing').
This does not directly impact the calculation of p-values in PAFway, but it would potentially impact how we interpret the results from a biological perspective. The aim of PAFway is to evaluate whether an edge between two functional annotations occurs more frequently than would be expected by a null model in which edges are randomly distributed in the network. In this scenario, it does not matter how much overlap there is between the functional annotations. Intuitively this makes sense: Let’s say you randomly select two words from a book and you want to calculate the probability that the first word you select contains the letter a’ and the second word you select contains the letter
b’. The expected probability of this happening is $p_a * p_b$, and it does not depend on whether the letters a’ and
b’ tend to appear together in the same word. This means that the calculation in PAFway is correct, as is.
However, this also means that the frequencies of pairs of edges are correlated with one another. For instance, if GO term a' is enriched upstream of
b', and many genes that have GO term a' are also labelled with GO term
c', then there may also be an enrichment for edges between genes containing the GO terms c' and
b'. If there is an overlap between GO terms b' and
d', then there may also be an enrichment for edges between c' and
d'. Biologists may be less interested in an enrichment between GO terms c' and
d' if this enrichment is completely explained by the enrichment between GO terms a' and
b'.
For this reason, we may want to know whether we observe more edges between genes containing the GO terms c' and
d', given (i) the co-occurrence of functional annotations a' and
c' (ii) the co-occurrence of functional annotations b' and
d' (iii) the frequency of edges containing a' and
b'. In particular, let $f_{i,j}$ be the number of times we observe an edge between genes with functional annotations $i$ and $j$. Let $p(j|i)$ be the probability of a gene having the functional annotation $j$ given that it has the functional annotation $i$. We use the notation $!i$ to signify \emph{not} i. We can calculate the expected value of $f_{c,d}$ using the following formula:
\begin{align} \mathbb{E}(f_{c,d})=& f_{a,b}p(c|a)p(d|b) + \ & f_{!a,b}p(c|!a)p(d|b) + \ & f_{a,!b}p(c|a)p(d|!b) + \ & f_{!a,!b}p(c|!a)p(d|!b) \end{align}
Then, we can compare the observed count with the expected count: $f_{c,d}/\mathbb{E}(f_{c,d})$.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.