The steps of causal inference are: (1) infer conditional independence relationships (the undirected edges), and (2) orient the edges. What is the role of dimensionality in step 1?

library(bninfo)
library(bnlearn)   # mb(), bn.fit(), gs(), inter.iamb(), rbn(), graphviz.plot(), ...
library(magrittr)
library(dplyr)     # mutate(), filter()
library(tidyr)     # gather()

Influence of sample size on the number of tests and performance

How important is sample size in the inference of network structure? If the sample size is large enough, will network inference be perfect? If not, what other factors affect the inference?

Constraint-based algorithms perform a sequence of conditional independence tests to determine the Markov blanket of each node in the network. Each test makes a decision based on a p-value, and later tests depend on the results of earlier tests. Since p-values depend on sample size, the trajectory of the test sequence and the number of tests performed also depend on sample size.
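To see concretely how sample size changes the decision a single test makes, here is a minimal sketch (a toy Gaussian example of my own, not part of the ALARM analysis): x has a weak direct effect on y in addition to their common cause z, so with a small sample the conditional independence test will usually fail to reject the null, while with a large sample it should reject it.

set.seed(1)
p_value_at_n <- function(n) {
  z <- rnorm(n)
  x <- z + rnorm(n)
  y <- z + 0.1 * x + rnorm(n)  # weak direct effect of x on y
  ci.test("x", "y", "z", data = data.frame(x = x, y = y, z = z))$p.value
}
sapply(c(100, 1000, 10000), p_value_at_n)  # p-values generally shrink as n grows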

To gain some intuition, the most demanding conditional independence tests would be those that involve a large Markov blanket. For example, INT has a Markov blanket of 8 nodes:

INT_mb <- mb(ground_truth, "INT")
INT_mb

Suppose we wanted to use a Poisson regression model to test the conditional independence of INT and its grandchild node ACO2 given INT's Markov blanket. The number of level combinations across the nodes involved is:

prod(sapply(c("INT", "VALV", INT_mb), function(node) length(levels(alarm[[node]]))))

That is a total of 98304 combinations, and you'd want at least one observation of each combination in the data to get maximum likelihood estimates for all the parameters of the model. So you'd need at least 98304 observations for this test, but in reality, since some combinations have low probability, many more observations would be needed to get a complete sampling. This ties graph theory to network inference -- nodes with larger Markov blankets have higher connectivity in graph theoretic terms, and thus the neighborhoods of highly connected nodes place greater demands on inference.
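To get a feel for how quickly such a contingency table outruns any realistic sample, here is a small sketch (it assumes fit_net, which is created in the next chunk): simulate 100,000 records from the fitted network and compare the number of possible level combinations over INT and its Markov blanket with the number of combinations actually observed.

nodes <- c("INT", INT_mb)
prod(sapply(alarm[, nodes], nlevels))            # possible level combinations
sim <- rbn(fit_net, 100000)
nlevels(interaction(sim[, nodes], drop = TRUE))  # combinations actually observed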

The constraint-based algorithms instead perform a series of conditional independence tests of smaller size than the one just proposed and aggregate the results. With these algorithms, the larger the sample size, the more tests they perform. This might provide a metric for how much sample size is "enough" -- at some point, an increase in sample size results in only a slight increase in the number of tests performed. This is simulated in the following experiment. Data are simulated from the ALARM model fit to the ALARM data.

data(alarm)
ground_truth <- empty.graph(names(alarm))
modelstring(ground_truth) = paste("[HIST|LVF][CVP|LVV][PCWP|LVV][HYP][LVV|HYP:LVF]",
  "[LVF][STKV|HYP:LVF][ERLO][HRBP|ERLO:HR][HREK|ERCA:HR][ERCA]",
  "[HRSA|ERCA:HR][ANES][APL][TPR|APL][ECO2|ACO2:VLNG][KINK]",
  "[MINV|INT:VLNG][FIO2][PVS|FIO2:VALV][SAO2|PVS:SHNT][PAP|PMB][PMB]",
  "[SHNT|INT:PMB][INT][PRSS|INT:KINK:VTUB][DISC][MVS][VMCH|MVS]",
  "[VTUB|DISC:VMCH][VLNG|INT:KINK:VTUB][VALV|INT:VLNG][ACO2|VALV]",
  "[CCHL|ACO2:ANES:SAO2:TPR][HR|CCHL][CO|HR:STKV][BP|CO:TPR]", sep = "")
fit_net <- bn.fit(ground_truth, alarm, method = "bayes")
N <- c(20000, 30000, 40000, 80000, 100000, 250000, 500000, 1000000, 2000000)
test_num1 <-  sample_size_sim(fit_net, N, algo = gs, m = 30, verbose = T)
test_num2 <-  sample_size_sim(fit_net, N, algo = inter.iamb, m = 30, verbose = T)
sample_size_sim_results <- list(test_gs = test_num1, test_interiamb = test_num2)
data(sample_size_sim_results)
plot(N, log(test_num1[[1]]), main = "sample size vs avg number of tests used by
     algorithm", type = "l", ylim = c(7.5, 10), 
     xlab = "sample size", ylab = "log median number of tests performed",  col = "blue")
lines(N, log(test_num2[[1]]), col = "red")
legend("topleft", c("gs", "inter.iamb"), lty = c(1, 1),  
       col = c("blue", "red"))

The inter.iamb algorithm performs more tests, and makes more use of added data, than the grow-shrink algorithm.

plot(N, test_num1[[2]], main = "# of false positives", ylim=c(0, 13), 
     xlab = "sample size", ylab = "median false positive count ",  col = "blue")
points(N, test_num2[[2]], col = "red")
legend("topleft", c("gs", "inter.iamb"), pch = c(1, 1),  
       col = c("blue", "red"))
plot(N, test_num1[[3]], main = "# of true positives", ylim=c(15, 68), 
     xlab = "sample size", ylab = "median true positive count ",  col = "blue")
points(N, test_num2[[3]], col = "red")
abline(h = 65)
legend("topright", c("gs", "inter.iamb"), pch = c(1, 1),  
       col = c("blue", "red"))

Subjectively speaking, beyond a sample size of 500K the effect of adding more observations is minimal. The number of tests climbs, but at a decreasing rate, as sample size grows. The same is true of the number of true positives, which steadily slows in its approach to the number of true edges (65), marked above by the solid line. Of course, this is specific to this network's structure and parameters, as well as to the algorithms that were applied -- clearly in this case the inter.iamb algorithm is superior to gs, and the number of true positives could increase if the nominal type I error rate of the conditional independence tests (the algorithms' default of .05 in this simulation) were relaxed. But this shows that even with an extremely large sample size, inference is imperfect, and thus other factors need to be considered.
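As a quick sketch of that last point (reusing fit_net from above; narcs() is bnlearn's arc count), relaxing the nominal type I error rate of the conditional independence tests generally leaves more edges in the learned skeleton, trading false positives for true positives.

sim_data <- rbn(fit_net, 50000)
skel_strict  <- gs(sim_data, undirected = TRUE)               # default alpha = 0.05
skel_relaxed <- gs(sim_data, undirected = TRUE, alpha = 0.2)  # more lenient tests
c(strict = narcs(skel_strict), relaxed = narcs(skel_relaxed))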

The effect of dimensionality and sample size on the number of tests and performance

The following simulation is intended to show how the dimensionality of the data affects the performance of Bayesian network structure inference algorithms. Specifically, I focus on the ability of constraint-based algorithms to learn the skeletons of Bayesian networks, using the grow-shrink, IAMB, and Inter-IAMB algorithms implemented in the "bnlearn" package (see bnlearn's description of the algorithms).

The simulation focuses on the challenge of discovery. A network of p nodes has p * (p - 1) / 2 possible dependence relationships. The null hypothesis of a test is that two nodes are conditionally independent given others; in a biological context, the alternative hypothesis implies some interesting biological relationship. More than anything else, we want to minimize the count of false positives, and the constraint-based algorithms aim to control the false positive rate. So in this simulation I vary the number of replicates (n) and the number of variables/nodes (p) and evaluate the effect on the total counts of true positives and false positives.
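For reference, the pair counts used throughout this section follow directly from that formula:

n_pairs <- function(p) p * (p - 1) / 2
n_pairs(c(60, 99, 6000))  # 1770, 4851, and 17997000 possible pairs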

Simulating a simple "triplet" conditional independence structure and data

Constraint-based network inference algorithms infer network structure through sequential tests for conditional independence. For this simple simulation, I build a network composed of multiple variable triplets -- two variables that are conditionally independent given a third. I then simulate multivariate Gaussian data with this dependence structure and have the network inference algorithms try to recover the original network, using conditional independence tests for Gaussian variables.
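For concreteness, here is a hand-built sketch of what a single triplet could look like as a linear Gaussian Bayesian network, using bnlearn's model2network() and custom.fit(). This is my own illustration, not the internals of bninfo's triplet_network(); the coefficient 0.8 is an assumption mirroring the second argument passed to triplet_network() below.

# C1 is a common cause of A1 and B1: they are marginally dependent but
# conditionally independent given C1
trip_dag <- model2network("[C1][A1|C1][B1|C1]")
trip_fit <- custom.fit(trip_dag, list(
  C1 = list(coef = c("(Intercept)" = 0), sd = 1),
  A1 = list(coef = c("(Intercept)" = 0, C1 = 0.8), sd = 1),  # 0.8 is an assumed coefficient
  B1 = list(coef = c("(Intercept)" = 0, C1 = 0.8), sd = 1)
))
graphviz.plot(trip_dag)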

To illustrate, here is a network of 3 triplets containing 9 nodes and 6 directed edges.

gbn <- triplet_network(3, .8)
graphviz.plot(gbn)

Below, I modify this plot to show the undirected edges, representing the dependence relationships that persist after conditioning on all other nodes:

graphviz.plot(moral(fit2net(gbn)))

Confirming that A1 and B1 come out as conditionally independent given C1 in a conditional independence test:

rbn(gbn, 100000) %>%
  {ci.test("A1", "B1", "C1", data = .)}

The simulation examines what happens when scaling up the number of variables from a small number to a large one. We examine two ways of scaling up:

In the first case, we add completely independent single variables to an existing net. For example, in the following plot I add 6 independent nodes to a set of 3 triplets, for a total of 15 variables.

noisy_net <- add_singletons(gbn, 6)
graphviz.plot(noisy_net)

In the second case, we just build a net with more triplets.

gbn2 <- triplet_network(5, .8)
graphviz.plot(gbn2)

Grow-shrink algorithm

The simulation uses the bnlearn package's implementation of a constraint-based network inference approach called the grow-shrink algorithm. Since we are only interested in conditional independence in this analysis, I set the function's "undirected" argument to TRUE, so it looks only for undirected edges, ignoring direction. I simulate 15 replicates and visualize the learned skeleton:

set.seed(346)
gbn %>%
  rbn(15) %>%
  gs(undirected = TRUE) %>% 
  graphviz.plot

Edges between members of different triplets are false positives. For example, the edge between B3 and C1 is a false positive. Edges missing from within a triplet are false negatives. Here, the missing B1-C1 and B2-C2 edges are false negatives.
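To make this accounting concrete, here is a small scoring helper of my own (separate from whatever scoring the simulation functions use internally): it reduces both the learned and the true graph to undirected edge sets and counts true positives, false positives, and false negatives. Applied to a fresh simulated sample, it gives counts analogous to the ones reported below.

undirected_edges <- function(net) {
  # each undirected edge appears in both directions in arcs(); sort and deduplicate
  unique(apply(arcs(skeleton(net)), 1, function(e) paste(sort(e), collapse = "~")))
}
score_skeleton <- function(learned, truth) {
  le <- undirected_edges(learned)
  te <- undirected_edges(truth)
  c(tp = sum(le %in% te), fp = sum(!le %in% te), fn = sum(!te %in% le))
}
score_skeleton(gs(rbn(gbn, 15), undirected = TRUE), fit2net(gbn))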

Running the simulation

To start, for the "small" case, I simulate 20 triplets, which is 60 variables, or 60 * 59 / 2 = 1770 possible variable pairs for the structure learning algorithm to examine. For the large case I simulate 6000 variables, or 17,997,000 pairs. Within this large case, I first simulate the scenario of starting with 20 triplets and adding 5940 independent variables (more variables but the same amount of information), then the scenario of adding 1980 triplets to the original 20 (information increases at the same rate as the number of variables). As the code below shows, the 6000-variable networks are used only for the correlation metrics; for the structure learning metrics the "big" cases are scaled down to 99 variables (20 triplets plus 39 singletons, or 33 triplets).

triplet_net <- triplet_network(20, .3)

big_cor_net <- triplet_network(2000, .3)
noisy_cor_net <- add_singletons(triplet_net, 5940)

big_cond_net <- triplet_network(33, .3)
noisy_cond_net <- add_singletons(triplet_net, 39)
library(tidyr)
sim_instance <- function(n, p){
  # n and p are "small"/"big"/"big_scaled" labels mapped to a sample size and a network
  n_val <- ifelse(n == "small", 60, 100)
  cor_net <- switch(as.character(p), 
         small = triplet_net,
         big = noisy_cor_net,
         big_scaled = big_cor_net) 
  cond_net <- switch(as.character(p), 
         small = triplet_net,
         big = noisy_cond_net,
         big_scaled = big_cond_net) 
  # Only the median-correlation metric is computed in this trimmed-down version;
  # the cached `results` dataset loaded below also contains the other metrics
  # (max_cor_ind, tp, fp), which correspond to the commented-out calls.
  list(median_cor_dep = sim_median_cor(cor_net, n_val))  
#  c(list(max_cor_ind = sim_cond_ind_cor(cor_net, n_val),
#         max_cor_dep = sim_median_cor(cor_net, n_val)), 
#    test_markov_dependence_learning(cond_net, n = n_val))
}
num <- 3
results <- lapply(1:num, function(i){
  expand.grid(p = c("small", "big", "big_scaled"),
            n = c("small", "big")) 
  }) %>%
  {do.call("rbind", .)} %>%
  mutate(max_cor_ind = NA, median_cor_dep = NA, tp = NA, fp = NA) %>%
  within({
    for(i in seq_along(n)){
      if(i %% 6 == 0) print(i / 6)  # coarse progress indicator
      output <- sim_instance(n[i], p[i])
      #max_cor_ind[i] <- output$max_cor_ind
      median_cor_dep[i] <- output$median_cor_dep
      #tp[i] <- output$tp
      #fp[i] <- output$fp
    }
    rm(output); rm(i)
  }) %>%
  gather(metric, value, -p, -n)
devtools::use_data(results, overwrite = TRUE)
data(results)

Simulation results

plot_results <- function(results, type, n_size, scale_p = FALSE){
  if(type == "max_cor_ind"){
    ylim <- c(0, 30)
    xlim <- c(.3, .75)
    main <- paste0("Max correlation: n is ", n_size)
  }else if(type == "tp"){
    ylim <- c(0, .22)
    xlim <- c(0, 70)
    main <-paste0("#TP: n is ", n_size)
  } else if(type == "fp"){
    ylim <- c(0, .12)
    xlim <- c(20 , 120)
    main <- paste0("#FP: n is ", n_size)
  } else { # type == "max_cor" 
    ylim <- c(0, 30)
    xlim <- c(0, 1)
    main <- paste0("max correlation: n is ", n_size)
  }
  #if(scale_p) main <- c(main, "\nAdded informative variables")
  results %>%
    filter(n == n_size,
           metric == type) %T>%
    {
      filter(., p == "small") %$%
        hist(value, main = " ",  # main = main,
             freq = F, col = rgb(0.1, 0.1, 0.1, 0.5),  # dark = small_p
             ylim = ylim, xlim = xlim)
    } %T>%
    {
      filter(., p == ifelse(scale_p, "big_scaled", "big")) %$%
        hist(value, freq = F, col = rgb(0.8, 0.8, 0.8, 0.5), add = TRUE)  # light = big_p
      if(type == "tp"){
        # abline(v = 40, lty = "dashed", lwd = 2)
        # if(scale_p) abline(v = 66, lty = "twodash", lwd = 2)
        abline(v = 40, lwd = 4, col = rgb(0.1, 0.1, 0.1, 0.5))
        if(scale_p) abline(v = 66, lwd = 4, col = rgb(0.8, 0.8, 0.8, 0.5))
      }
    }
  invisible()
}

Max correlation

The maximum pairwise correlation increases as the number of variables increases. In the case where only independent variables are added, this increase is clearly spurious correlation.
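The growth of the maximum spurious correlation can be illustrated without the cached simulation results. In this standalone sketch, all variables are completely independent, yet the maximum absolute pairwise correlation climbs as the number of variables increases at a fixed sample size of 60.

set.seed(123)
max_abs_cor <- function(n, p) {
  x <- matrix(rnorm(n * p), n, p)  # p completely independent variables
  cm <- abs(cor(x))
  diag(cm) <- 0
  max(cm)
}
sapply(c(10, 100, 1000), function(p) max_abs_cor(n = 60, p = p))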

plot_results(results, "max_cor_ind", n = "small")
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "6000 proteins, 20 triplets",
                           "60 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))
plot_results(results, "max_cor_ind", n = "big")
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "6000 proteins, 20 triplets",
                           "100 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))
plot_results(results, "max_cor_ind", n = "small", scale_p = TRUE)
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "6000 proteins, 2000 triplets",
                           "60 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))
plot_results(results, "max_cor_ind", n = "big", scale_p = TRUE)
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "6000 proteins, 2000 triplets",
                           "100 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))

False Positives

Adding uninformative variables increases the number of false positives.

plot_results(results, "fp", n = "small")
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 20 triplets",
                           "60 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))
plot_results(results, "fp", n = "big")
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 20 triplets",
                           "100 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))

Adding more replicates actually increases the false positives: the increased degrees of freedom give the algorithm more room to explore, and thus more opportunities to make mistakes. Both sample sizes are quite low considering the number of possible relationships between variable pairs.

Whether the added variables are informative or uninformative doesn't seem to affect the count of false positives.

plot_results(results, "fp", n = "small", scale_p = TRUE)
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 33 triplets",
                           "60 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))
plot_results(results, "fp", n = "big", scale_p = TRUE)
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 33 triplets",
                           "100 replicates"), cex = 1,  
       pch=c(15,15, 0), col=c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5), NA))

True positives

Increased replicates and added relationships become important when it comes to true positives. The vertical lines mark the total number of true relationships.

plot_results(results, "tp", n_size = "small")
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 20 triplets", 
                           "60 replicates"), 
       cex = 1,  pch=c(15, 15, 0),  
       col = c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5),  NA))

Increasing n increases the number of true relationships detected. In this case and the last, adding independent variables neither improves nor hurts detection.

plot_results(results = results, "tp", n_size = "big")
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 20 triplets", 
                           "100 replicates"), 
       cex = 1,  pch=c(15, 15, 0), 
       col = c(rgb(0.1,0.1,0.1,0.5), rgb(0.8,0.8,0.8,0.5),  NA))

In this case, we again add new variables, increasing from 60 to 99. The difference this time is that the new variables bring new relationships (i.e., I added more triplets). The light gray vertical line marks the new total number of discoverable relationships after the new variables are added; the dark line marks the original total.

Here, the low sample size provides insufficient power for detecting the added edges.

plot_results(results, "tp", n_size = "small", scale_p = TRUE)
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 33 triplets",
                           "60 replicates"), cex = 1,  
       pch=c(15, 15, 0), 
       col=c(rgb(0.1,0.1,0.1,0.5), 
             rgb(0.8,0.8,0.8,0.5), NA))

Increasing sample size to 100 increases power.

plot_results(results, "tp", n_size = "big", scale_p = TRUE)
legend("topleft", legend=c("60 proteins, 20 triplets", 
                           "99 proteins, 33 triplets",
                           "100 replicates"), cex = 1,  
       pch=c(15, 15, 0), 
       col=c(rgb(0.1,0.1,0.1,0.5), 
             rgb(0.8,0.8,0.8,0.5), NA))

