knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(matchr)

The functions in matchr can be very computation- and time-intensive. While small or medium-sized datasets can be processed "out of the box" with no considerations for optimizing performance, processing large sets of images may be frustratingly slow without some care to leveraging multicore processors and to memory requirements.

The key takeaways are:

The following discussion will rely on examples taken from two large sets of images. The paths_low vector contains low-resolution images (roughly 200 x 150 pixels, and 8 kB), while the paths_high vector contains higher-resolution images (between 640 x 480 and 1200 x 800, and 170 kB). To reproduce the benchmarks and analysis in this vignette, point the following lines at a pair of folders with low- and high- resolution images, respectively.

paths_low <- list.files("example_low", full.names = TRUE)
paths_high <- list.files("example_high", full.names = TRUE)

The benchmarks were taken on a 2019 Mac Pro with a 3.2 GHz 16-core Intel Xeon W and 384 GB of RAM, running R 4.1.1 in RStudio RStudio 2021.09.0, reading images from a SATA-3 SSD drive.

BLAS

R relies on "BLAS" (Basic Linear Algebra Subprograms) and "LAPACK" (Linear Algebra Package) libraries for carrying out common linear algebra tasks such as matrix multiplication and solving systems of linear equations. However, the BLAS and LAPACK libraries included in R are far slower than other stable, free alternatives. Certain functions in matchr (in particular match_signatures) rely heavily on linear algebra, and consequently will run much faster if the default BLAS library (and perhaps to a lesser extent the default LAPACK library) are replaced with faster ones.

The guidelines for replacing these libraries vary from platform to platform, but are usually quite straightforward. In general, OpenBLAS is a simple choice for Windows and Linux, and Apple's vecLib is a simple choice for macOS.

Using OpenBLAS or vecLib may speed up match_signatures by approximately 400%, depending on the system and the task, and other functions by a smaller amount. All the benchmarks which follow below were run using the vecLib BLAS and LAPACK libraries on macOS 11.6.

Parallel processing

With the exception of match_signatures, all of the time-consuming functions in matchr support parallel and remote processing via the future package. If a multisession, multicore, or cluster plan is set through future, matchr functions will generally perform substantially faster. For local workloads (i.e. workloads not being executed on a remote computing cluster), the most straightforward way to leverage parallel processing will be to run the following two lines before executing matchr functions:

library(future)
plan(multisession)

This will initiate a set of background processes which will share the computational workload of matchr functions. By default, one process will be initiated for each (real or virtual) CPU core on the local computer.

While in theory running a function using four cores instead of a single core might be expected to quadruple its execution speed, in practice the gains from parallel processing are usually much less than this, because of communication overhead between the main process and the background processes. In particular, if large data objects need to be passed between processes, the time this takes can swamp the gains from splitting calculations across processes. The functions in matchr send the minimum possible amount of data between processes—for example, paths to images on disk rather than the image data itself—but even so it is often true that for quick tasks a non-parallelized approach will be faster.

(The preceding paragraph only applies to "multisession" or "socketed" versions of parallel processing, where R creates a set of new processes and then copies the necessary objects to memory for each process. There is also a paradigm of parallel processing referred to as "forked" or "multicore", which is only available on Unix and macOS, and can be set from {future} with plan(multicore). In forked parallel processing, the parallel processes work off the single R process containing all the session's data and share memory. This avoids the need to copy memory and thus a substantial amount of inter-process overhead, but at the cost of significant instability, particularly when R is run with a graphical interface in RStudio. Under extensive testing, matchr functions complete much more quickly with forked parallelism (plan(multicore)) than with socketed parallelism (plan(multisession)), but will occasionally fail to complete in the former instance, which in practice swamps the speed gains. It is therefore recommended to avoid forked parallelism in matchr.)

By default, regardless of the plan which has been set with future::plan, matchr will only enable parallel processing in scenarios where testing has indicated that this will reduce computation time. For example, in create_signature, parallel processing is enabled by default when operating on file paths, but disabled by default when operating on images which have already been loaded into memory with load_image, since in the latter case the communication overhead of passing data to and from each parallel worker negates the speed gains from parallelizing the relatively quick calculations. To override this default, set the matchr.force_parallel option to TRUE (with options(matchr.force_parallel = TRUE)) for a temporary override, or add the line MATCHR_FORCE_PARALLEL = TRUE to the .Renviron file for a permanent override.

In order to explore the impacts of parallel processing on computation time, the following benchmarks enable the force_parallel option.

options(matchr.force_parallel = TRUE)

load_image

We begin with timings for reading vectors of 100, 1000 and 10,000 file paths with load_image. Because the memory requirements of importing the higher-resolution images are so high, only 1000 images are read in that case.

library(future)
library(ggplot2)

li_bench <- 
  bench::press(
    length = c(100, 1000, 10000),
    workers = c(1, 2, 4, 8, 16, 32), {
      if (workers == 1) plan(sequential) else plan(multisession, workers = workers)
      bench::mark(
        low = load_image(paths_low[1:length], quiet = TRUE),
        high = if (length <= 1000) load_image(paths_high[1:length], quiet = TRUE),
        iterations = 20, check = FALSE)
      })

li_bench |> 
  dplyr::filter(expression == "low" | length < 10000) |>  
  tidyr::unnest(c(time, gc)) |> 
  ggplot() +
  geom_boxplot(aes(time, as.factor(length), colour = as.factor(workers))) +
  scale_colour_viridis_d(name = "workers") +
  bench:::scale_x_bench_time() +
  scale_y_continuous("length") +
  facet_wrap(vars(expression), nrow = 2, scales = "free_y") +
  theme_minimal() +
  theme(legend.position = "bottom")

``` {r li_timing_graph, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 6} library(bench) library(ggplot2)

load(here::here("vignettes", "li_bench.Rdata"))

li_bench_graph |> ggplot() + geom_boxplot(aes(time, as.factor(length), colour = as.factor(workers))) + scale_colour_viridis_d(name = "workers") + bench:::scale_x_bench_time() + scale_y_discrete("length") + facet_wrap(vars(expression), nrow = 2, scales = "free_y") + theme_minimal() + theme(legend.position = "bottom")

The figure shows the relationship between the number of parallel threads ("workers"), the size and length of the input vector, and the computational time of the load_image function. Under each scenario except for loading 10,000 images, the non-parallel version of `load_image` is the fastest. For this reason, `load_image` by default will operate sequentially unless 5,000 or more images are being loaded, or unless parallel processing is forced with the option `matchr.force_parallel`.

### create_signature

Because the memory requirements of reading many images into memory at once are so high (see [Memory requirements](#memory) below), it is rarely feasible to use `load_image` directly on a large set of images. Instead, `create_signature` combines the (memory-intensive) step of importing images with the (less memory-intensive) step of calculating signatures based on the colours and shades present in each image, and proceeds one image at a time. Each parallel process reads one image into memory, calculates the image signature, and discards the image from memory, before moving on to the next image. This makes it possible to process arbitrarily large sets of images with `create_signature`, and increases the usefulness of parallel processing.

```r
cs_bench <- 
  bench::press(
    length = c(1000, 10000),
    workers = c(1, 2, 4, 8, 16, 32), {
      if (workers == 1) plan(sequential) else plan(multisession, workers = workers)
      bench::mark(
        low = create_signature(paths_low[1:length], backup = FALSE, quiet = TRUE),
        high = create_signature(paths_high[1:length], backup = FALSE, quiet = TRUE),
        iterations = 20, check = FALSE)
      })

cs_bench |> 
  tidyr::unnest(c(time, gc)) |> 
  ggplot() +
  geom_boxplot(aes(time, as.factor(length), colour = as.factor(workers))) +
  scale_colour_viridis_d(name = "workers") +
  bench:::scale_x_bench_time() +
  scale_y_discrete("length") +
  facet_wrap(vars(expression), nrow = 2) +
  theme_minimal() +
  theme(legend.position = "bottom")

``` {r cs_timing_graph, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 6} load(here::here("vignettes", "cs_bench.Rdata"))

d_by_60 <- function(x) x / 60

cs_low_1k_1 <- cs_bench |> dplyr::filter(expression == "low", length == 1000, workers == 1) |> dplyr::pull(median) |> as.numeric() |> round(1)

cs_low_10k_1 <- cs_bench |> dplyr::filter(expression == "low", length == 10000, workers == 1) |> dplyr::pull(median) |> as.numeric() |> round(1)

cs_high_1k_1 <- cs_bench |> dplyr::filter(expression == "high", length == 1000, workers == 1) |> dplyr::pull(median) |> as.numeric() |> d_by_60() |> round(1)

cs_high_10k_1 <- cs_bench |> dplyr::filter(expression == "high", length == 10000, workers == 1) |> dplyr::pull(median) |> as.numeric() |> d_by_60() |> round(1)

cs_high_1k_dif_2 <- cs_bench |> dplyr::filter(expression == "high", length == 1000, workers %in% 1:2) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_10k_dif_2 <- cs_bench |> dplyr::filter(expression == "high", length == 10000, workers %in% 1:2) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_1k_dif_4 <- cs_bench |> dplyr::filter(expression == "high", length == 1000, workers %in% 1:4) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_10k_dif_4 <- cs_bench |> dplyr::filter(expression == "high", length == 10000, workers %in% 1:4) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_1k_dif_8 <- cs_bench |> dplyr::filter(expression == "high", length == 1000, workers %in% 1:8) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_10k_dif_8 <- cs_bench |> dplyr::filter(expression == "high", length == 10000, workers %in% 1:8) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_1k_dif_8 <- cs_bench |> dplyr::filter(expression == "high", length == 1000, workers %in% 1:8) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_high_10k_dif_8 <- cs_bench |> dplyr::filter(expression == "high", length == 10000, workers %in% 1:8) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_low_1k_dif_8 <- cs_bench |> dplyr::filter(expression == "low", length == 1000, workers %in% 1:8) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_low_10k_dif_8 <- cs_bench |> dplyr::filter(expression == "low", length == 10000, workers %in% 1:8) |> dplyr::summarize(dif = max(median) / min(median)) |> dplyr::pull(dif) |> as.numeric() |> round(1)

cs_bench_graph |> ggplot() + geom_boxplot(aes(time, as.factor(length), colour = as.factor(workers))) + scale_colour_viridis_d(name = "workers") + bench:::scale_x_bench_time() + scale_y_discrete("length") + facet_wrap(vars(expression), nrow = 2) + theme_minimal() + theme(legend.position = "bottom")

The results demonstrate, first of all, that the completion time of `create_signature` scales roughly linearly with the number of images being read. The non-parallel (`workers == 1`) version of the function reads 1,000 low-resolution images in `r cs_low_1k_1` seconds, and 10,000 images in `r cs_low_10k_1` seconds. It reads 1,000 higher-resolution images in `r cs_high_1k_1` minutes and 10,000 in `r cs_high_10k_1` minutes

Secondly, parallel versions of `create_signature` produce dramatic speed increases, particularly when reading higher-resolution images. For low core counts, the increases are nearly linear: two workers process the 1,000 and 10,000 higher-resolution images `r cs_high_1k_dif_2` and `r cs_high_10k_dif_2` times faster than a single worker, four workers are `r cs_high_1k_dif_4` and `r cs_high_1k_dif_4` times faster, and eight workers are `r cs_high_1k_dif_8` and `r cs_high_1k_dif_8` times faster. For large workloads (e.g. 100,000 images or more), this can translate into `create_signature` finishing in minutes instead of hours. The same dynamics are present with low-resolution images (e.g. eight workers process the 1,000 and 10,000 lower-resolution images `r cs_low_1k_dif_8` and `r cs_low_10k_dif_8` times faster than a single worker, respectively). While the marginal performance improvements to adding more workers decline at higher core counts, there is no point within the testing parameters at which adding more workers reduces performance. For this reason, by default `create_signature` always uses the maximum number of available cores when it is processing raw file paths. (By contrast, when `create_signature` operates on images which have already been loaded into memory with `load_image`, the communication overhead introduced by passing the data to each worker consistently swamps the performance gains from parallel processing, so parallel processing is never enabled in these cases.)

### match_signatures

The core functionality of `match_signatures` is to calculate Hamming distances between the image signatures in a pair of `matchr_signature` vectors. Because each signature in the left-hand vector needs to be compared to each signature in the right-hand vector, the computational requirements increase quadratically with vector size. (E.g. comparing two 100-length vectors implies 100 ^ 2 = 10,000 sets of calculations, while comparing two 200-length vectors implies 200 ^ 2 = 40,000 sets of calculations.) Matrix algebra is one of the operations which can be enormously sped up with a high-performance BLAS library, and the BLAS library is correspondingly the single most important factor influencing the performance of `match_signatures`.

The `match_signatures` function includes an argument `compare_ar`, which uses k-means clustering to split the input image signatures into groups with similar aspect ratios. This has two important implications. First, assuming the image sets being compared don't feature any extremely distorted scaling (e.g. landscape pictures stretched into portrait orientation), splitting by aspect ratio will tend to reduce the possibility for false positive matches, since, e.g., portrait and landscape images will not be compared to each other. Second, using `compare_ar = TRUE` can significantly shrink the number of image comparisons which need to be made, and thus speed up computation time. As a simple example, two sets of 100 images require 100 ^ 2 = 10,000 sets of calculations to be compared, but if each set is split into 50 portrait and 50 landscape images, only 50 ^ 2 * 2 = sets of 5,000 calculations will be needed. The function evaluates a range of different clustering possibilities to identify the one which minimizes total computation time.

```r
sig_high <- create_signature(paths_high)
sig_low <- create_signature(paths_low)
shs <- sig_high[1:1000]
sls <- sig_low[1:1000]
shm <- sig_high[1:20000]
slm <- sig_low[1:20000]
shl <- rep(sig_high, 2)
sll <- rep(sig_low, 2)

# Need to run this twice, once with a fast BLAS and once with reference BLAS
ms_bench_blas <- bench::press(
  compare = c(TRUE, FALSE), BLAS = TRUE, bench::mark(
    small = match_signatures(shs, sls, compare_ar = compare, quiet = TRUE),
    medium = match_signatures(shm, slm, compare_ar = compare, quiet = TRUE),
    large = match_signatures(shl, sll, compare_ar = compare, quiet = TRUE),
    iterations = 20, check = FALSE))

ms_bench_no_blas <- bench::press(
  compare = c(TRUE, FALSE), BLAS = FALSE, bench::mark(
    small = match_signatures(shs, sls, compare_ar = compare, quiet = TRUE),
    medium = match_signatures(shm, slm, compare_ar = compare, quiet = TRUE),
    large = match_signatures(shl, sll, compare_ar = compare, quiet = TRUE),
    iterations = 20, check = FALSE))

ms_bench <- bind_rows(ms_bench_blas, ms_bench_no_blas)

ms_bench |> 
  dplyr::mutate(
    compare = if_else(compare, "compare_ar = TRUE", "compare_ar = FALSE")) |> 
  tidyr::unnest(c(time, gc)) |> 
  ggplot() +
  geom_boxplot(aes(time, expression, colour = as.factor(BLAS))) +
  scale_colour_viridis_d(name = "Fast BLAS") +
  bench:::scale_x_bench_time() +
  scale_y_discrete("length") +
  facet_wrap(vars(compare), nrow = 2) +
  theme_minimal() +
  theme(legend.position = "bottom")

``` {r ms_timing_graph, echo = FALSE, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 6}

load(here::here("vignettes", "ms_bench.Rdata"))

ms_blas_range <- ms_bench |> dplyr::arrange(median) |> dplyr::group_by(expression, compare) |> dplyr::summarize(time_dif = as.numeric(median[2] / median[1])) |> dplyr::pull(time_dif) |> range() |> round(1)

ms_ar_mean <- ms_bench |> dplyr::filter(expression != "small") |> dplyr::arrange(median) |> dplyr::group_by(expression, BLAS) |> dplyr::summarize(time_true = median[compare == TRUE], time_false = median[compare == FALSE], time_dif = as.numeric(time_false / time_true)) |> dplyr::pull(time_dif) |> mean() |> round(1)

ms_blas_time <- ms_bench |> dplyr::filter(expression != "small") |> dplyr::arrange(median) |> dplyr::group_by(expression, compare) |> dplyr::summarize(time_BLAS = median[BLAS], time_no_BLAS = median[!BLAS], .groups = "drop") |> dplyr::summarize(dplyr::across(c(time_BLAS, time_no_BLAS), ~round(as.numeric(mean(time_no_BLAS / .x)), 1)))

ms_bench_graph |> dplyr::mutate(compare = dplyr::if_else( compare, "compare_ar = TRUE", "compare_ar = FALSE")) |> ggplot() + geom_boxplot(aes(time, expression, colour = BLAS)) + scale_colour_viridis_d(name = "Fast BLAS") + bench:::scale_x_bench_time() + scale_y_discrete("length") + facet_wrap(vars(compare), nrow = 2) + theme_minimal() + theme(legend.position = "bottom")

The figure demonstrates a wide range of timings corresponding to the different performance scenarios. The key points are:

- The BLAS-optimized version of `match_signatures` dramatically outperforms the non-BLAS-optimized versions, achieving a `r ms_blas_range[1]`-`r ms_blas_range[2]` speed up in comparison to the non-BLAS-optimized version.
- Setting `compare_ar = TRUE` to cluster input image signatures by aspect ratio yields major performance benefits for large input vectors. In the small test the extra overhead of evaluating clusters and splitting the data for analysis means that `compare_ar` does not have any significant performance implications, although the function execution times are generally one second or less in any case, so the difference isn't practically meaningful. For larger input vectors, by contrast, `compare_ar = TRUE` is on average `r ms_ar_mean` times faster than the non-clustered version. Since the clustered version of `match_signatures` will also avoid potential false positives and thus tend to yield more accurate results, in general `compare_ar` should be set to `FALSE` only when images with significantly different aspect ratios *must* be compared to each other.

`match_signatures` is the most memory-intensive function in matchr, since it needs to compute and then store the results of roughly x * y correlations between the two input vectors. Special considerations for dealing with memory constraints in `match_signatures` are discussed below in [Memory requirements](#memory).


## Memory requirements {#memory}

Although matchr is designed to be performant on even modest hardware, two matchr functions have particularly intense memory requirements when used with large datasets, and require special consideration.

### load_image

Loading large numbers of images into memory with `load_image` will almost never be feasible, since a single 640x480 image requires more than 5 MB of memory. Even for low-resolution images, memory requirements quickly balloon into the multi-gigabyte range when several thousand images are loaded.

``` {r li_memory_graph, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 4}
li_bench |> 
  dplyr::filter((expression == "low" | length < 10000)) |> 
  dplyr::mutate(mem_alloc = mem_alloc / 1024 ^ 2) |> 
  ggplot() +
  geom_jitter(aes(mem_alloc, as.factor(length), colour = expression), height = 0.05) +
  scale_colour_viridis_d(name = "resolution", end = 0.8) +
  scale_x_log10("memory (MB)") +
  scale_y_discrete("length") +
  theme_minimal() +
  theme(legend.position = "bottom")

However, using load_image on large sets of images is rarely necessary for a standard matchr workflow, which relies on the image signatures created by create_signature rather than the raw image data itself. The character vector method for create_signature (which is used if the first argument to the function is a character vector of file paths instead of a matchr_image vector) processes batches of 1000 file paths at a time to keep memory requirements low. It loads the first 1000 images with load_image and extracts the corresponding 1000 image signatures (which require negligible memory to store), then allows the raw images to be released from memory and processes the next 1000 file paths.

For large-scale image matching workflows, therefore, use create_signature directly on a vector of file paths to keep memory usage reasonable. By contrast, load_image is best used for special-case examination of small image sets.

match_signatures

match_signatures is the matchr function with the highest potential to generate memory pressure, because of the necessity for generating a distance matrix between all the elements in the two input image signatures.

``` {r ms_memory_calcs, echo = FALSE} ms_ar_mem <- ms_bench |> dplyr::filter(expression != "small") |> dplyr::group_by(expression, BLAS) |> dplyr::summarize(mem_dif = as.numeric(mem_alloc[compare] / mem_alloc[!compare]), .groups = "drop") |> dplyr::pull(mem_dif) |> mean() |> {(x) x * 100}() |> round(1)

``` {r ms_memory_graph, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 4}
ms_bench |> 
  dplyr::mutate(mem_alloc = mem_alloc / 1024 ^ 3, compare = dplyr::if_else(
    compare, "compare_ar = TRUE", "compare_ar = FALSE")) |> 
  ggplot() +
  geom_jitter(aes(mem_alloc, as.factor(expression), colour = BLAS), 
              height = 0.1, width = 0.05) +
  scale_colour_viridis_d("Fast BLAS", end = 0.8) +
  scale_x_log10("memory (GB)") +
  scale_y_discrete("length") +
  facet_wrap(vars(compare), nrow = 2) +
  theme_minimal() +
  theme(legend.position = "bottom")

For larger datasets (where memory pressure is likelier to be an issue), the clustered version of match_signatures (with compare_ar = TRUE) uses an average of r paste0(ms_ar_mem, "%") of the memory of the non-clustered version. Using compare_ar = TRUE is thus the simplest way to keep memory requirements low. (The choice of fast or reference BLAS does not have any impact on memory usage.)

There are two further levels of memory management in match_signatures. The first (which is controlled with the mem_scale function argument) is that match_signatures will attempt to detect total system memory and limit its peak memory usage to some fraction of the total by splitting the underlying matchr_signature vectors into smaller pieces. This technique will reduce the amount of memory used by the intermediate steps in the calculations (since the slices of the data being computed upon will be smaller), and in general the default mem_scale value of 0.2 should only be changed with caution, since higher values have a high likelihood of causing system instability with large datasets.

The second level of memory management in match_signatures is a "failsafe" (which can be disabled with mem_override = TRUE) which will throw an error if the function detects insufficient system memory to succeed even with splitting the inputs as described above. Unless the system is somehow reporting an incorrect amount of RAM, when this error is triggered there is no possibility of match_signatures succeeding, given the input vectors and the total available memory, because the minimal results of the function (i.e. the output matchr_matrix vector itself) would exceed total system memory even leaving aside the necessity for keeping some subset of the input vectors in memory for performing calculations. In this case a viable alternative is to run identify_matches directly on the input signature vectors. The matchr_signature method for identify_matches will not store the complete distance matrices, but will process slices of the input image signature vectors to 1) generate distance matrices, 2) identify matches with low distance values, and then 3) allow the distance matrices to be released from memory. This allows the function to succeed even in heavily memory-constrained environments, but at the cost of not storing the matchr_matrix vector of distance matrices which match_signatures usually produces. In practice, the only significant downside of this method is that it makes it more difficult to experiment with different thresholds for identifying matches in identify_matches.

Progress reporting

All time-consuming matchr functions support progress reporting via the progressr package. To enable progress reporting, simply run the following line of code before running matchr functions:

progressr::handlers(global = TRUE)

Progress reporting carries a small performance overhead, and while matchr scales the amount of progress reports adaptively to the size of the workload, matchr functions will run slightly faster if progress reporting is disabled. To disable progress reporting on a per-function basis, set the argument quiet = TRUE, which is available in every matchr function which supports progress reporting.



UPGo-McGill/matchr documentation built on July 19, 2023, 1:02 p.m.