In robertness/signalgraph: Signal Processing with Graphs

library(signalgraph)

Cell signaling is how cells sense and respond to their environment

Cell signaling refers to how cells perceive and respond to events in their environment. These environmental events, including detection of hormones, pathogens, and growth factors, provide an input signal. Within the cell, the signal is passed along a series of signaling events, meaning physical interactions between proteins. This culminates in a signaling response, such as activation of a gene.

Computational biologists discovering novel signaling events by comparing known signaling mechanisms to mechanisms identified by machine learning

The cascade of signaling events can be represented as a directed network called a signaling network, where nodes are proteins and edges reflect signaling events. The disregulation of cell signaling, or the "rewiring" of the signaling network, has been implicated in many processes of disease, including cancer.

Many experimentalists across time and many labs have discovered various signaling events across different species, tissues, and contexts. Biocurators assemble these into models of signaling networks that provide compact and computable models of current human knowledge about real signaling networks. These reference network models are available in public sources such as KEGG. One example from KEGG is the MAPK signaling pathway, a signaling network that turns on cell growth and proliferation:

MapK

Experimentalists study signaling by measuring the activity of signaling proteins. Computational biologists apply machine learning methods for learning graphical structures to this data. The objective is to discover novel signaling events, meaning signaling events occuring in the samples/patients that are unexpected given prior knowledge about the signaling mechanism. In essence, they look for edges in the learned network that are not present in the reference network, providing hypotheses for how the signaling network has been rewired in the subject.

signaling event

The experimentalist validates these hypotheses in follow up experiments.

In summary, the workflow is as follows;

Collect data
Use machine learning to build a network structure from the data
Identify edges not in the reference network
Validate the presence of such edges in follow up experiments

The T-cell dataset is a benchmark dataset for learning signaling network structure from data

The T-cell signaling dataset consists of simultaneous fluorescence flow cytometry measurements of 11 phosphorylated proteins and phospholypids derived from thousands of individual primary immune system cells, specifically T-cells.

Before measurement, the cells were exposed to stimuli that introduced signal into the signaling network. Each row of the dataset is a cell, each column a protein, and each value corresponds to the signaling activity of a given protein that was measured for a given cell. The dataset was published in a 2005 study published in Nature, Sachs et al.

In the publication the authors validated the learned network edges against the following validation network, comprised of signaling events well-described in the biomedical literature.

data(tcell_examples, package = "bninfo")
bnlearn::graphviz.plot(tcell_examples$net, main = "T-cell Signaling Validation Network")

My package bninfo walks through an R workflow for repeating the network inference. The algorithm is as follows:

Generate 500 random networks
For the ith network
Resample the data with replacement
Score the ith network against the resampled data using a likelihood-based metric that incorporates the intervention data.
Using this as the starting network, conduct a Tabu search that compares candidate network structures.
For each edge that appeared in the 500 searches, weight each edge by the proportion of times in 500 that it appeared in a search result.

The following demonstrates the workflow on the dataset the 2005 publication authors used to fit their original publication. This data was discretized into 3 levels; 0 (low), .5 (medium), and 1 (high). While this loses information on the range of the values for each protein, the discretization was conducted such that cross-protein relationships between the quantiles of protein values was preserved.

library(bninfo)
data(tcells, package = "bninfo")
tcell_data <- tcells$processed$.data
interventions <- tcells$processed$interventions

averaging_tabu <- causal_learning(tcell_data, interventions, "tabu", iss = 10, tabu = 50, resample = TRUE)

data(tcell_examples, package = "bninfo")
averaging_tabu <- tcell_examples$averaging_tabu

Example of edge weights:

class(averaging_tabu)
sample_n(averaging_tabu, 10) %>%
  mutate(weight = strength * direction)

This is an object of the class "bn.strength" from the bnlearn package. The product "strength" and "direction" columns is the proportion of times the directed edge appeared. I manually calculate this product in the "weight" column.

The strength_plot function visualizes the results of model averaging. A "consensus network" is created from all edges with weight greater than a threshold. Below (figure left) edges in the averaged network are plotted against the reference network, green edges correspond to detected edges (true positives), blue edges to undetected edges (false negatives). The thickness of the edge corresponds to the weight -- the thicker the edge, the more evidence for its presence in the data. By visualizing against the consensus network, we can see incorrectly detected edges (false positives) in red (figure, right).

par(mfrow=c(1, 2))
strength_plot(averaging_tabu, tcell_examples$net, plot_reference = TRUE, threshold = .5)
strength_plot(averaging_tabu, tcell_examples$net, plot_reference = FALSE, threshold = .5)
par(mfrow = c(1, 1))

As is typical in machine learning, we want a balance between minimizing false positives and maximizing true positives. In bninfo we can visualize this tradeoff with an ROC curve, with help of the ROCR package.

library(ROCR)
averaging_tabu %>% # Grab averaging results
  getPerformance(tcell_examples$net) %$% # Get performance against reference network
  prediction(prediction, label) %>% # Convert to ROCR 'prediction' object
  performance("tpr", "fpr") %>% # Calculate true positive rate and false positive rate statistics
  plot(main = "ROC curve on signaling event detection")

Selection of proteins for measurement in causal inference

To understand how signal flows through this pathway, I introduce nodes representing the introduction of signal. I model the specific activations by adding two binary (0, 1) nodes "PKC signal" and "PKA signal". There is no 'inactivated pathway' baseline, so I create a "general signal" dummy node that acts as a baseline and set its values to 0.

data(tcells, package = "signalgraph")
validated_net <- tcells$validated_net
sg_viz(validated_net, show_biases = FALSE)

Using sg_viz to view, the signal is introduced in the orange nodes, and the green nodes are proteins that are observed in the data. All the nodes are green because this validation network was constructed to include all the proteins they measured.

Often however, data is available on a smaller set of proteins than are included in the network. Consider the case of proteins Raf, Erk, Mek, and PIP3 missing in the pathway, as indicated in blue.

sg_viz(tcells$subset_example, show_biases = FALSE)

These are downstream of where signal is introduced, but upstream of at least one other protein which might be used to quantify signaling response. Each of them has one outgoing edge.

Since these proteins both recieve and exhert causal influence, it is important that they be included in the input data for the causal inference workflow. Suppose however, that under constrained resources, one can select only one or two but not all of these proteins.

A naive approach would be to select proteins based on the graph topology only, by ranking them with a graph-theoric method such as betweenness centrality where all of the edge weights are 1.

validated_net %>%
  betweenness(V(.), weight = rep(1, ecount(.))) %>%
  .[c("Erk", "Mek", "PIP3", "Raf")] %>%
  sort(decreasing = TRUE)

So according to this metric, Mek should be prioritized followed by Erk, Raf, then PIP3.

Signalgraph can incorporate both graph topology and available data for a subset of the proteins.

First I fit the signalgraph model on the truncated network multiple times and acquire a set of fits. I then recalculate the betweenness score using the average of the weights across fits.

net_list <- tcells$net_list
lapply(tcells$net_list, function(g){
  betweenness(g, weight = abs(E(g)$weight)) %>%
    .[c("Erk", "Mek", "PIP3", "Raf")] %>%
    sort(decreasing = TRUE)
})

mek_averaging <- tcells$subset_averaging$mek_averaging
mek_erk_averaging <- tcells$subset_averaging$mek_erk_averaging
erk_averaging <- tcells$subset_averaging$erk_averaging
erk_raf_averaging <- tcells$subset_averaging$erk_raf_averaging

base_proteins <- c("Akt" , "Jnk", "P38", "PIP2", "PKA", "PKC", "Plcg")
mek_averaging <- tcell_data[, c(base_proteins, "Mek")] %>%
  causal_learning(interventions, "tabu", iss = 10, tabu = 50, resample = TRUE)
erk_averaging <- tcell_data[, c(base_proteins, "Erk")] %>%
  causal_learning(interventions, "tabu", iss = 10, tabu = 50, resample = TRUE)

mek_averaging %>%
  getPerformance(tcell_examples$net) %$% # Get performance against reference network
  prediction(prediction, label) %>% # Convert to ROCR 'prediction' object
  performance("tpr", "fpr") %>% # Calculate true positive rate and false positive rate statistics
  plot(main = "Signaling event detection with top ranked protein", col = "blue")
erk_averaging %>%
  getPerformance(tcell_examples$net) %$% # Get performance against reference network
  prediction(prediction, label) %>% # Convert to ROCR 'prediction' object
  performance("tpr", "fpr") %>% # Calculate true positive rate and false positive rate statistics
  plot(main = "ROC curve on signaling event detection", add = TRUE, col = "green")
legend("bottomright", c("naive approach", "proposed approach"), lwd = c(1, 1), col = c("blue", "green"))

erk_raf_averaging <- tcell_data[, c(base_proteins, "Erk", "Raf")] %>%
  causal_learning(interventions, "tabu", iss = 10, tabu = 50, resample = TRUE)
mek_erk_averaging <- tcell_data[, c(base_proteins, "Mek", "Erk")] %>%
  causal_learning(interventions, "tabu", iss = 10, tabu = 50, resample = TRUE)

mek_erk_averaging %>%
  getPerformance(tcell_examples$net) %$% # Get performance against reference network
  prediction(prediction, label) %>% # Convert to ROCR 'prediction' object
  performance("tpr", "fpr") %>% # Calculate true positive rate and false positive rate statistics
  plot(main = "Signaling event detection with top two proteins", col = "blue")
erk_raf_averaging %>%
  getPerformance(tcell_examples$net) %$% # Get performance against reference network
  prediction(prediction, label) %>% # Convert to ROCR 'prediction' object
  performance("tpr", "fpr") %>% # Calculate true positive rate and false positive rate statistics
  plot(main = "ROC curve on signaling event detection", add = TRUE, col = "green")
legend("bottomright", c("naive approach", "proposed approach"), lwd = c(1, 1), col = c("blue", "green"))