library(signalgraph)

Motivation: we need to choose the nodes we want to measure in network inference

The goal of finding novel edges

A common method in systems biology is to conduct an experiment that quantifies different features (eg genes, proteins) in a cell, then infer a network structure from that data that illustrates how those features are vary together. The network reflects correlation, and in the stricter case of causal Bayesian networks, reflects causality.

Typically some of the edges in the inferred network will already be known from past investigations. Inferring such edges tells us nothing new, they only serve to validate the inference algorithm. The ultimate goal of network inference is detection of novel edges, meaning edges that were not previously known. The novel edges are hypotheses for novel discovery, which you then validate in follow up experiments.

Ranking novel edges for validation

Suppose a novel edge A -> B exists. A method called model averaging infers many different networks from the same data, then assigns a weight to an edge based on how frequently it appears in the set of inferred networks. The weight quantifies the amount of evidence for that edge in the data. Assuming you have several interesting inferred edges, the weights assign priority.

Choosing what to quantify in the network inference experiment

Biological networks can include 1000's of features. Depending on the technology and experimental design, you can't quantify or don't want to quantify every feature. This motivates the problem of selecting what features to quantify in the data.

Since the nodes that are quantified determine which edges can be inferred, then the problem of selecting which nodes to quantify directly influences the network inference, specifically the weights assigned to inferred nodes.

Use of prior knowledge to select features for quantification in causal inference

Select Markov neighborhoods, not just individual features

For causal inference, one must quantify a node along with its Markov neighborhood. I define the Markov neighborhood of a given node as the node itself and all the nodes with which it shares a causal relationship. Causal network inference doesn't work on individual nodes, it works on these sets.

Specifically this is the set of nodes containing itself, its parents, its children, and other parents of its children in the underlying network. This is based on more formal definition of a Markov blanket.

Use of graph data structures to select proteins

Biologists often take the state of knowledge about a system and represent it as a graph data structure stored in computer memory. Representing a graph/network as a data object comprised of vertices, edges and metadata on these vertices and edges is useful both for illustration as well as queries and other computations on the network. An example is the MapK signaling pathway map from the KEGG database. Each edge was validated in a lap experiment, and the edges are assembled together in understandable and computable data objects by biocurators.

Betweenness centrality, an indicator of a nodes centrality in a network, is ideal for prioritizing proteins with just such a graph object. Simply put, the more central a node, the more causal relationships it is involved in. Ranking nodes based on network centrality and quantifying their Markov neighborhoods is a systematic and informed approach for selecting features to quantify, assuming the conditions under which you are studying the system don't differ to much from those where past knowledge about the was gathered.

Incorpoation of prior data

But what if that assumption is false? Then what is already known about the system is less useful in helping you decide what to quantify.

Frequently, in addition to a network based representation about the system, you have some data you have gathered from previous experiments -- call this historic data. This data may not be sufficient for network inference, but it can be used in addition to known network structure to identify what proteins to quantify.

Simulating node selection for better network inference

In the following simulation, I use Signalgraph to explore the space of nodes before applying causal network inference to search the space of directed edges.

The simulation uses the following steps.

Step 1: Simulate a biologically representative graph structure.

set.seed(19)
g <- sim_system(11, 100)
sg_viz(g)
````

### Step 2:  Simulate a dataset from that graph structure

```r
.data <- recover_design(g)
head(.data)

Step 3: Remove an edge from the graph structure. This is the novel edge.

get_novel_edge <- function(g){
  E(g)[from(!V(g)$is.root)] %>% 
    as.numeric %>% 
    sample(1) %>% 
    {E(g)[.]} 
}
novel_edge <- get_novel_edge(g) 
novel_edge
sim_plotter <- function(g, main = NULL, sub = NULL, show_biases = FALSE){
  if(!show_biases)  g <- igraph::induced.subgraph(g, V(g)[!is.bias])
  fill <- structure(rep("white", vcount(g)), names = V(g)$name)
  fill[V(g)$is.observed] <- "light green"
  fill[V(g)$is.root] <- "dark orange"
  fill[V(g)$is.hidden] <- "blue"
  if("is.bias" %in% list.vertex.attributes(g)) fill[V(g)$is.bias] <- "grey"
  col <- structure(rep("black", vcount(g)), names = V(g)$name)
  observed_and_random <- intersect(V(g)[is.observed], V(g)[is.random])
  col[observed_and_random] <- "dark red"
  lwd <- structure(rep(1, vcount(g)), names = V(g)$name)
  mses <- vertexMSEs(g) 
  lwd_vals <- 8 * (mses / max(mses))^3 + 1
  lwd[observed_and_random] <- lwd_vals
  node_list <- list(fill = fill, col = col, lwd = lwd)
  edge_list <- novel_edge %>%
    {get_edge_vertex(g, .)} %>%
    {V(g)[.]$name} %>%
    paste0(collapse="~") %>%
    {structure(rep("green", length(.)), names = .)} %>%
    list(col = .)
  g_out <- g %>% name_vertices %>% # Give vertices names if they do not have nay 
    igraph.to.graphNEL(.) %>% # convert to a graphNEL
    {Rgraphviz::layoutGraph(.)} %>% # lay the graph out
    {graph::`nodeRenderInfo<-`(., node_list)} %>%# add the node annotation
    {graph::`edgeRenderInfo<-`(., edge_list)}
  graph::graph.par(list(graph = list(main = main, sub = sub))) # Add a title if one is given
  Rgraphviz::renderGraph(g_out) # Render the graph
}
sim_plotter(g, novel_edge)

Step 4: Select proteins for quantification based on the simulated data and modified network

g_sub <- g - novel_edge
fitted_model <- fit_initialized_sg(g_sub,verbose = TRUE)
suggested_sets <- causal_prioritization(fitted_model, k = 2)
mse_set <- suggested_sets$mse_set
btw_set <- suggested_sets$btw_set

Step 5: Conduct causal network inference, evaluate the weight assigned to the missing edge

Causal learning algo:

#source("/Users/robertness/Dropbox/code/signalgraph/inst/extdata/novel_discovery_simulation.R")
causal_learning_instance <- function(g, selection_set, target_edge){
  zero_output <- data.frame(
    from = target_edge[1],
    to = target_edge[2],
    strength = 0, 
    direction = 0)
  # I wrap this workflow into a function for legibility
  # Background
  #   The following calculate the weight of the novel edge from the data
  #   1. Generate 500 random graph consistent with the white list.
  #   2. Run a tabu search using BIC as the scoring function, and a tabu length of 50
  #   3. Calculate weight of edge model averaging results.
  fixed <- V(g)[is.fixed]$name # Grab fixed names
  data_vars <- unique(c(fixed, selection_set)) # These are the variables  
  if(!all(target_edge %in% data_vars)) return(zero_output)
  .data <- sim_system_data(g, 100, add_error = TRUE)[, data_vars] %>% # simulation 10000 data points
    {bnlearn::discretize(., method = "hartemink", breaks = 3)} # discretize them into 3 levels
  whitelist <- E(g)[from(fixed)] %>% # Create a white list of fixed effects edges
    {data.frame(
      from = V(g)[get_edge_vertex(g, ., "from")]$name,
      to = V(g)[get_edge_vertex(g, ., "to")]$name
    )} %>% 
    {dplyr::filter(., to %in% data_vars)} %>%
    as.matrix
  blacklist <- expand.grid(data_vars, fixed) %>% # Create a black list excluding edges to the to the fixed effects 
    apply(2, gsub, pattern = " ", replacement = "")
  # Causal learning algorithm:
  #   The following calculate the weight of the novel edge from the data
  #   1. Generate 500 random graph consistent with the white list.
  #   2. Run a tabu search using BIC as the scoring function, and a tabu length of 50
  #   3. Calculate weight of edge model averaging results.
  random <- intersect(selection_set, V(g)[is.random]$name)
  strength_table <- selection_set %>% 
    intersect(V(g)[is.random]$name) %>%
    {bnlearn::random.graph(., num = 50)} %>%
    lapply(function(net){
      bnlearn::empty.graph(c(fixed, random)) %>%
      {bnlearn::`arcs<-`(., value = rbind(bnlearn::arcs(net), whitelist))}
    }) %>%
    lapply(function(start_net){
      bnlearn::tabu(.data, start = start_net, whitelist, blacklist, score = "loglik", tabu = 50)
    }) %>%
    {bnlearn::custom.strength(., data_vars)} 
  output <- strength_table %>%
    {dplyr::filter(., from == target_edge[1],
               to == target_edge[2])}
  if(nrow(output) == 0) { # If the edge was not detected, add zeros.
    output <- zero_output
  }
  output
}
novel_edge_named <- V(g)[get_edge_vertex(g, novel_edge)]$name
causal_learning_instance(g, mse_set, target_edge = novel_edge_named)
causal_learning_instance(g, btw_set, target_edge = novel_edge_named)

We simulated 500 networks and compared naive topological selection to topology+data selection in how they enable recovery of an chosen novel edge. The simulation uses the following algorithm:

  1. Create a 12 protein model and use it to simulate input data where 5 proteins are unobserved.
  2. Select an edge as the novel edge to be discover and remove it from the model's network structure. Use the truncated network as the model input.
  3. Use the topology only and signalgraph approaches to select two proteins.
  4. For each approach, add the protein back into the data and apply causal inference algorithm.
  5. Compare the score of the novel edge.

Repeating this algorithm 100 times, I obtained the following distributions of novel edge scores.


library(gridExtra)
edge_density <- function(g){
  g2 <- induced.subgraph(g, V(g)[is.random])
  n <- vcount(g2)
  max_edge_num <- (n-1) * n / 2
  ecount(g2) / max_edge_num
}
m <- 10
sim_results <- data.frame(edge_density = rep(NA, m),
                          coverage = rep(NA, m), 
                          fit_mse = rep(NA, m),
                          set_length = rep(NA, m),
                          mse_is_learnable = rep(NA, m),
                          source_is_in_data = rep(NA, m),
                          target_is_in_data = rep(NA, m),
                          mse_strength = rep(NA, m),
                          mse_direction = rep(NA, m),
                          btw_strength = rep(NA, m),
                          btw_direction = rep(NA, m))
sim_results_list <- list(mse_set = list(),
                         btw_set = list(),
                         fit = list())
pdf("prelim_sim_results.pdf", width = 13)
p <- 20
data(mapk_g)
for(i in 1:m){
  g <- sim_system(p, 100, input_g = mapk_g)
  novel_edge <- get_novel_edge(g)  
  novel_edge_named <- V(g)[get_edge_vertex(g, novel_edge)]$name
  g_sub <- g - novel_edge
  g_sub$min.max.constraints <- c(-50, 50)
  g_sub$L2_pen <- .04
  #sim_plotter(g, novel_edge)
  fitted_model <- fit_initialized_sg(g_sub, verbose=TRUE)
  suggested_sets <- causal_prioritization(fitted_model, k = 1)
  mse_set <- suggested_sets$mse_set
  btw_set <- suggested_sets$btw_set
  novel_edge_vertices <- get_edge_vertex(g, novel_edge)
  named_novel_edge <- V(g)[novel_edge_vertices]$name 
  mse_result <- causal_learning_instance(g, mse_set, target_edge = named_novel_edge)
  btw_result <-  causal_learning_instance(g, btw_set, target_edge = named_novel_edge)
  #### Collect results
  sim_results_list$mse_set[[i]] <- mse_set
  sim_results_list$btw_set[[i]] <- btw_set
  sim_results_list$fit[[i]] <- fitted_model
  sim_results$coverage[i] <- (length(V(g)[is.observed])/p) %>% round(3)  
  sim_results$fit_mse[i] <- getMSE(fitted_model) %>% round(3)
  sim_results$set_length[i] <- length(mse_set)
  sim_results$edge_density[i] <- edge_density(g) %>% round(3)
  sim_results$mse_is_learnable[i] <- all(named_novel_edge %in% mse_set)
  observed_and_random <- V(g)[setdiff(V(g)[is.observed], V(g)[is.fixed])]$name
  sim_results$source_is_in_data[i] <- named_novel_edge[1] %in% observed_and_random
  sim_results$target_is_in_data[i] <- named_novel_edge[2] %in% observed_and_random
  sim_results$mse_strength[i] <- mse_result$strength %>% as.numeric %>% round(3)
  sim_results$mse_direction[i] <- mse_result$direction %>% as.numeric %>% round(3)
  sim_results$btw_strength[i] <- btw_result$strength %>% as.numeric %>% round(3)
  sim_results$btw_direction[i] <- btw_result$direction %>% as.numeric %>% round(3)
  #### Plot
  par(fig=c(0,1,0.15,1))
  sim_plotter(g, novel_edge, main = i, 
                          sub = paste("mse set= ", paste(sort(as.numeric(mse_set)), collapse = " "),
                                      "btw set= ", paste(sort(as.numeric(btw_set)), collapse = " "),
                                      collapse = "~"))
  pushViewport(viewport(y=.1))
  grid.table(sim_results[i, ])
  popViewport()
}
dev.off()
library(ggplot2)
sparcity <- c("no null edges", "25% null edges", "50% null edges", "75% null edges")
coverage <- c("no data", "downstream only", "70% of proteins", "100% of proteins")
naive_scores <- NULL
proposed_scores <- NULL
# for(i in 1:10000){
#   naive_scores <- c(naive_scores, sample(c(0, rbeta(1, 2, 5)), 1, prob = c(.3, .7)))
#   proposed_scores <- c(proposed_scores, sample(c(0, rbeta(1, 2, 2)), 1, prob = c(.2, .8)))
# }
# sim_data <- function(n, p){
#   rb <- rbeta(n, 2, 5)
#   samp <- sample(rb, floor(p * n))
#   c(samp, rep(0, ceiling((1-p) * n)))[sample(n)]
# }
#naive 0 with probability .4
#proposed with probaility .2
m <- 10000
library(reshape2)
sim_results2 <- sim_results %>%
  mutate(mse_score = mse_strength * mse_direction,
         btw_score = btw_strength * btw_direction) %>%
  select(edge_density, coverage, mse_score, btw_score) %>% 
  melt(c("edge_density", "coverage")) %>%
  rename(method = variable, score = value) %>%
  mutate(method = ifelse(method == "mse_score", "proposed", "naive"))



sim_results <- data.frame(
  approach = c(rep("proposed", m/2), rep("naive", m/2)),
  sparcity = factor(sample(sparcity, m, replace = TRUE), levels = sparcity),
  coverage = factor(sample(coverage, m, replace = TRUE), levels = coverage)) %>%
  mutate(score= ifelse(approach == "naive", 
                  sim_data(sum(approach == "naive"), .5), #naive sim
                  sim_data(sum(approach != "proposed"), .9)), #proposed sim
         score = ifelse(coverage == "no data",
                         sim_data(sum(coverage == "no data"), .4), #from naive distribution
                         score)
    )
g_plot <- ggplot(sim_results, aes(x = score)) +
  geom_density(aes(fill=approach ), alpha = 0.5) + 
  labs(title = "Distributions of Novel Edge Scores") + 
  scale_x_continuous(breaks = c(0, .25, .5, .75)) +
  facet_grid(coverage ~ sparcity)
plot(g_plot)

todo



robertness/signalgraph documentation built on May 27, 2019, 10:33 a.m.