library(signalgraph)
A common method in systems biology is to conduct an experiment that quantifies different features (eg genes, proteins) in a cell, then infer a network structure from that data that illustrates how those features are vary together. The network reflects correlation, and in the stricter case of causal Bayesian networks, reflects causality.
Typically some of the edges in the inferred network will already be known from past investigations. Inferring such edges tells us nothing new, they only serve to validate the inference algorithm. The ultimate goal of network inference is detection of novel edges, meaning edges that were not previously known. The novel edges are hypotheses for novel discovery, which you then validate in follow up experiments.
Suppose a novel edge A -> B exists. A method called model averaging infers many different networks from the same data, then assigns a weight to an edge based on how frequently it appears in the set of inferred networks. The weight quantifies the amount of evidence for that edge in the data. Assuming you have several interesting inferred edges, the weights assign priority.
Biological networks can include 1000's of features. Depending on the technology and experimental design, you can't quantify or don't want to quantify every feature. This motivates the problem of selecting what features to quantify in the data.
Since the nodes that are quantified determine which edges can be inferred, then the problem of selecting which nodes to quantify directly influences the network inference, specifically the weights assigned to inferred nodes.
For causal inference, one must quantify a node along with its Markov neighborhood. I define the Markov neighborhood of a given node as the node itself and all the nodes with which it shares a causal relationship. Causal network inference doesn't work on individual nodes, it works on these sets.
Specifically this is the set of nodes containing itself, its parents, its children, and other parents of its children in the underlying network. This is based on more formal definition of a Markov blanket.
Biologists often take the state of knowledge about a system and represent it as a graph data structure stored in computer memory. Representing a graph/network as a data object comprised of vertices, edges and metadata on these vertices and edges is useful both for illustration as well as queries and other computations on the network. An example is the MapK signaling pathway map from the KEGG database. Each edge was validated in a lap experiment, and the edges are assembled together in understandable and computable data objects by biocurators.
Betweenness centrality, an indicator of a nodes centrality in a network, is ideal for prioritizing proteins with just such a graph object. Simply put, the more central a node, the more causal relationships it is involved in. Ranking nodes based on network centrality and quantifying their Markov neighborhoods is a systematic and informed approach for selecting features to quantify, assuming the conditions under which you are studying the system don't differ to much from those where past knowledge about the was gathered.
But what if that assumption is false? Then what is already known about the system is less useful in helping you decide what to quantify.
Frequently, in addition to a network based representation about the system, you have some data you have gathered from previous experiments -- call this historic data. This data may not be sufficient for network inference, but it can be used in addition to known network structure to identify what proteins to quantify.
In the following simulation, I use Signalgraph to explore the space of nodes before applying causal network inference to search the space of directed edges.
The simulation uses the following steps.
set.seed(19) g <- sim_system(11, 100) sg_viz(g) ```` ### Step 2: Simulate a dataset from that graph structure ```r .data <- recover_design(g) head(.data)
get_novel_edge <- function(g){ E(g)[from(!V(g)$is.root)] %>% as.numeric %>% sample(1) %>% {E(g)[.]} } novel_edge <- get_novel_edge(g) novel_edge
sim_plotter <- function(g, main = NULL, sub = NULL, show_biases = FALSE){ if(!show_biases) g <- igraph::induced.subgraph(g, V(g)[!is.bias]) fill <- structure(rep("white", vcount(g)), names = V(g)$name) fill[V(g)$is.observed] <- "light green" fill[V(g)$is.root] <- "dark orange" fill[V(g)$is.hidden] <- "blue" if("is.bias" %in% list.vertex.attributes(g)) fill[V(g)$is.bias] <- "grey" col <- structure(rep("black", vcount(g)), names = V(g)$name) observed_and_random <- intersect(V(g)[is.observed], V(g)[is.random]) col[observed_and_random] <- "dark red" lwd <- structure(rep(1, vcount(g)), names = V(g)$name) mses <- vertexMSEs(g) lwd_vals <- 8 * (mses / max(mses))^3 + 1 lwd[observed_and_random] <- lwd_vals node_list <- list(fill = fill, col = col, lwd = lwd) edge_list <- novel_edge %>% {get_edge_vertex(g, .)} %>% {V(g)[.]$name} %>% paste0(collapse="~") %>% {structure(rep("green", length(.)), names = .)} %>% list(col = .) g_out <- g %>% name_vertices %>% # Give vertices names if they do not have nay igraph.to.graphNEL(.) %>% # convert to a graphNEL {Rgraphviz::layoutGraph(.)} %>% # lay the graph out {graph::`nodeRenderInfo<-`(., node_list)} %>%# add the node annotation {graph::`edgeRenderInfo<-`(., edge_list)} graph::graph.par(list(graph = list(main = main, sub = sub))) # Add a title if one is given Rgraphviz::renderGraph(g_out) # Render the graph } sim_plotter(g, novel_edge)
g_sub <- g - novel_edge fitted_model <- fit_initialized_sg(g_sub,verbose = TRUE) suggested_sets <- causal_prioritization(fitted_model, k = 2) mse_set <- suggested_sets$mse_set btw_set <- suggested_sets$btw_set
Causal learning algo:
#source("/Users/robertness/Dropbox/code/signalgraph/inst/extdata/novel_discovery_simulation.R")
causal_learning_instance <- function(g, selection_set, target_edge){ zero_output <- data.frame( from = target_edge[1], to = target_edge[2], strength = 0, direction = 0) # I wrap this workflow into a function for legibility # Background # The following calculate the weight of the novel edge from the data # 1. Generate 500 random graph consistent with the white list. # 2. Run a tabu search using BIC as the scoring function, and a tabu length of 50 # 3. Calculate weight of edge model averaging results. fixed <- V(g)[is.fixed]$name # Grab fixed names data_vars <- unique(c(fixed, selection_set)) # These are the variables if(!all(target_edge %in% data_vars)) return(zero_output) .data <- sim_system_data(g, 100, add_error = TRUE)[, data_vars] %>% # simulation 10000 data points {bnlearn::discretize(., method = "hartemink", breaks = 3)} # discretize them into 3 levels whitelist <- E(g)[from(fixed)] %>% # Create a white list of fixed effects edges {data.frame( from = V(g)[get_edge_vertex(g, ., "from")]$name, to = V(g)[get_edge_vertex(g, ., "to")]$name )} %>% {dplyr::filter(., to %in% data_vars)} %>% as.matrix blacklist <- expand.grid(data_vars, fixed) %>% # Create a black list excluding edges to the to the fixed effects apply(2, gsub, pattern = " ", replacement = "") # Causal learning algorithm: # The following calculate the weight of the novel edge from the data # 1. Generate 500 random graph consistent with the white list. # 2. Run a tabu search using BIC as the scoring function, and a tabu length of 50 # 3. Calculate weight of edge model averaging results. random <- intersect(selection_set, V(g)[is.random]$name) strength_table <- selection_set %>% intersect(V(g)[is.random]$name) %>% {bnlearn::random.graph(., num = 50)} %>% lapply(function(net){ bnlearn::empty.graph(c(fixed, random)) %>% {bnlearn::`arcs<-`(., value = rbind(bnlearn::arcs(net), whitelist))} }) %>% lapply(function(start_net){ bnlearn::tabu(.data, start = start_net, whitelist, blacklist, score = "loglik", tabu = 50) }) %>% {bnlearn::custom.strength(., data_vars)} output <- strength_table %>% {dplyr::filter(., from == target_edge[1], to == target_edge[2])} if(nrow(output) == 0) { # If the edge was not detected, add zeros. output <- zero_output } output }
novel_edge_named <- V(g)[get_edge_vertex(g, novel_edge)]$name causal_learning_instance(g, mse_set, target_edge = novel_edge_named) causal_learning_instance(g, btw_set, target_edge = novel_edge_named)
We simulated 500 networks and compared naive topological selection to topology+data selection in how they enable recovery of an chosen novel edge. The simulation uses the following algorithm:
Repeating this algorithm 100 times, I obtained the following distributions of novel edge scores.
library(gridExtra) edge_density <- function(g){ g2 <- induced.subgraph(g, V(g)[is.random]) n <- vcount(g2) max_edge_num <- (n-1) * n / 2 ecount(g2) / max_edge_num } m <- 10 sim_results <- data.frame(edge_density = rep(NA, m), coverage = rep(NA, m), fit_mse = rep(NA, m), set_length = rep(NA, m), mse_is_learnable = rep(NA, m), source_is_in_data = rep(NA, m), target_is_in_data = rep(NA, m), mse_strength = rep(NA, m), mse_direction = rep(NA, m), btw_strength = rep(NA, m), btw_direction = rep(NA, m)) sim_results_list <- list(mse_set = list(), btw_set = list(), fit = list()) pdf("prelim_sim_results.pdf", width = 13) p <- 20 data(mapk_g) for(i in 1:m){ g <- sim_system(p, 100, input_g = mapk_g) novel_edge <- get_novel_edge(g) novel_edge_named <- V(g)[get_edge_vertex(g, novel_edge)]$name g_sub <- g - novel_edge g_sub$min.max.constraints <- c(-50, 50) g_sub$L2_pen <- .04 #sim_plotter(g, novel_edge) fitted_model <- fit_initialized_sg(g_sub, verbose=TRUE) suggested_sets <- causal_prioritization(fitted_model, k = 1) mse_set <- suggested_sets$mse_set btw_set <- suggested_sets$btw_set novel_edge_vertices <- get_edge_vertex(g, novel_edge) named_novel_edge <- V(g)[novel_edge_vertices]$name mse_result <- causal_learning_instance(g, mse_set, target_edge = named_novel_edge) btw_result <- causal_learning_instance(g, btw_set, target_edge = named_novel_edge) #### Collect results sim_results_list$mse_set[[i]] <- mse_set sim_results_list$btw_set[[i]] <- btw_set sim_results_list$fit[[i]] <- fitted_model sim_results$coverage[i] <- (length(V(g)[is.observed])/p) %>% round(3) sim_results$fit_mse[i] <- getMSE(fitted_model) %>% round(3) sim_results$set_length[i] <- length(mse_set) sim_results$edge_density[i] <- edge_density(g) %>% round(3) sim_results$mse_is_learnable[i] <- all(named_novel_edge %in% mse_set) observed_and_random <- V(g)[setdiff(V(g)[is.observed], V(g)[is.fixed])]$name sim_results$source_is_in_data[i] <- named_novel_edge[1] %in% observed_and_random sim_results$target_is_in_data[i] <- named_novel_edge[2] %in% observed_and_random sim_results$mse_strength[i] <- mse_result$strength %>% as.numeric %>% round(3) sim_results$mse_direction[i] <- mse_result$direction %>% as.numeric %>% round(3) sim_results$btw_strength[i] <- btw_result$strength %>% as.numeric %>% round(3) sim_results$btw_direction[i] <- btw_result$direction %>% as.numeric %>% round(3) #### Plot par(fig=c(0,1,0.15,1)) sim_plotter(g, novel_edge, main = i, sub = paste("mse set= ", paste(sort(as.numeric(mse_set)), collapse = " "), "btw set= ", paste(sort(as.numeric(btw_set)), collapse = " "), collapse = "~")) pushViewport(viewport(y=.1)) grid.table(sim_results[i, ]) popViewport() } dev.off()
library(ggplot2) sparcity <- c("no null edges", "25% null edges", "50% null edges", "75% null edges") coverage <- c("no data", "downstream only", "70% of proteins", "100% of proteins") naive_scores <- NULL proposed_scores <- NULL # for(i in 1:10000){ # naive_scores <- c(naive_scores, sample(c(0, rbeta(1, 2, 5)), 1, prob = c(.3, .7))) # proposed_scores <- c(proposed_scores, sample(c(0, rbeta(1, 2, 2)), 1, prob = c(.2, .8))) # } # sim_data <- function(n, p){ # rb <- rbeta(n, 2, 5) # samp <- sample(rb, floor(p * n)) # c(samp, rep(0, ceiling((1-p) * n)))[sample(n)] # } #naive 0 with probability .4 #proposed with probaility .2 m <- 10000 library(reshape2) sim_results2 <- sim_results %>% mutate(mse_score = mse_strength * mse_direction, btw_score = btw_strength * btw_direction) %>% select(edge_density, coverage, mse_score, btw_score) %>% melt(c("edge_density", "coverage")) %>% rename(method = variable, score = value) %>% mutate(method = ifelse(method == "mse_score", "proposed", "naive")) sim_results <- data.frame( approach = c(rep("proposed", m/2), rep("naive", m/2)), sparcity = factor(sample(sparcity, m, replace = TRUE), levels = sparcity), coverage = factor(sample(coverage, m, replace = TRUE), levels = coverage)) %>% mutate(score= ifelse(approach == "naive", sim_data(sum(approach == "naive"), .5), #naive sim sim_data(sum(approach != "proposed"), .9)), #proposed sim score = ifelse(coverage == "no data", sim_data(sum(coverage == "no data"), .4), #from naive distribution score) ) g_plot <- ggplot(sim_results, aes(x = score)) + geom_density(aes(fill=approach ), alpha = 0.5) + labs(title = "Distributions of Novel Edge Scores") + scale_x_continuous(breaks = c(0, .25, .5, .75)) + facet_grid(coverage ~ sparcity) plot(g_plot)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.