In jhrcook/wext: Weighted Exclusivity Test (WExT)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(0)

library(wext)
library(tidygraph)
library(tibble)
library(dplyr)
library(bench)

This is just a quick comparison of the edge swapping algorithms used in this package.

The first step is to create a bipartite graph for the test.

v1 <- sample(seq(2, 1e3, 2), 1e4, replace = TRUE)
v2 <- sample(seq(1, 1e3+1, 2), 1e4, replace = TRUE)
bgr <- tibble(node1 = as.character(v1),
              node2 = as.character(v2)) %>% 
  unique() %>%
  as_tbl_graph(directed = FALSE) %N>% 
  mutate(type = name %in% unique(as.character(v1)))
bgr

This graph has r igraph::vcount(bgr) nodes and r igraph::ecount(bgr) edges.

The first bipartite edge swapping method is bipartite_edge_swap() which relies on the 'igraph' and 'tidygraph' APIs for handling all of the data and logically manipulates a tidygraph graph object. This was the easiest implementation, though I expected that the manipulation of high-level data types would be a time-suck. This first test runs the edge swapping algorithm to swap 100 edges 20 times.

m1 <- bench::mark(bipartite_edge_swap(bgr, N = 100),
            iterations = 20, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
m1

The second edge swapping algorithm implementation is in bipartite_edge_swap2(). This method took more work on my end, though its use of vector manipulations rapidly increased the speed of the algorithm. (As it is much faster, I now run 50 iterations for the benchmark.)

m2 <- bench::mark(bipartite_edge_swap2(bgr, N = 100),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
m2

This first test ran 100 edge swaps on the graph bgr. The first method took r round(m1$median, 2) s on average (a median of 20 iterations) while the second took r round(m2$median, 3)*1000 ms on average. Thus, the second method is much faster. Another striking difference is the amount of memory each method was allocated - the first method took about 2 GB of RAM! It also called the garbage collector an order of magnitude more frequently.

The function bipartite_edge_swap3() implements the same algorithm used in method 2 in C++.

m2_C <- bench::mark(bipartite_edge_swap3(bgr, N = 100),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
m2_C

This is the fastest method, requiring r round(m2_C$median, 3)*1000 ms on average with even less RAM required.

To specifically test everything expect for the edge swapping, I re-ran the same test but only conducting 1 edge swap in each iteration.

# first method
bench::mark(bipartite_edge_swap(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
# second method
bench::mark(bipartite_edge_swap2(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
# second method in C++
bench::mark(bipartite_edge_swap3(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)

Interestingly, the first method (based around the 'igraph' and 'tidygraph' APIs) is substantially faster to do all of the other steps in the edge swapping process. This is likely because the second method must first transform the bipartite graph bgr into the list of vectors while the first method works directly on the tidygraph object (it only does this once per call to bipartite_edge_swap2). Still, it may be worth poking around on the setup steps in the second method to improve its efficiency.

To further test this hypothesis, the difference should shrink if I use a smaller bipartite graph.

v1 <- sample(seq(2, 50, 2), 100, replace = TRUE)
v2 <- sample(seq(1, 51, 2), 100, replace = TRUE)
bgr <- tibble(node1 = as.character(v1),
              node2 = as.character(v2)) %>% 
  unique() %>%
  as_tbl_graph(directed = FALSE) %N>% 
  mutate(type = name %in% unique(as.character(v1)))
bgr

This smaller graph has r igraph::vcount(bgr) nodes and r igraph::ecount(bgr) edges. I now re-run the test with only 1 edge swap.

# first method
bench::mark(bipartite_edge_swap(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
# second method
bench::mark(bipartite_edge_swap2(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
# second method in C++
bench::mark(bipartite_edge_swap3(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)

Now, the second method is actually faster because it was easier to change to the vector manipulation than lug around the relatively "heavy" tidygraph object (see the mem_alloc columns). You can see that the garbage collection intervened a few times (in the gc/sec and n_gc column).

For comparison, here is the same test using the heavily optimized edge swapping algorithm in 'igraph'. While fast, this test does not preserve the bipartite nature of the graph, thus I cannot use it for 'WExT'.

bench::mark(
  igraph::rewire(bgr, igraph::keeping_degseq(loops = FALSE, niter = 100)),
  iterations = 30, filter_gc = FALSE
) %>% 
  dplyr::select(-expression)