Considerations on Performance

source("helper.R")

This chapter describes various attempts to speed up code execution in the segmentr package, both by writing the cost function in native code and by experimenting with parallel execution.

The microbenchmark package allows us to execute code and compare run times. Therefore, it's used in this work to compare the performance of different code snippets.

library(microbenchmark)
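
As a quick illustration of how microbenchmark is used (a toy comparison, not part of the package's analysis), consider timing two ways of summing a vector:

# Toy comparison of two ways to sum a vector; microbenchmark runs each
# expression `times` times and reports the distribution of run times.
microbenchmark(
  sum(1:10000),
  Reduce(`+`, 1:10000),
  times = 10
)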

Native Code

In the algorithms chapter, we discussed the time complexity of the different types of algorithms. In that scenario, the most relevant variable is the number of columns in the data set, which dictates whether the computation of the exact algorithm will be prohibitive or not. The number of samples, however, influences the computation time linearly. Since the cost function is the most frequently executed operation in any of the algorithms described in that chapter, it is in essence the bottleneck of the whole computation. Therefore, real performance gains can be obtained by improving the performance of the cost function itself.

In R statistical programming, it's commonplace to use the functions provided by the programming environment, which are usually performant algorithms implemented in compiled languages such as C. However, there are times when the programmer must implement a custom function in R itself, which, being a dynamic and interpreted language, typically has slower execution times.

Therefore, as a first attempt to speed up code execution and to compare the results, segmentr implements the [multivariate()] likelihood function (used to calculate the cost) in the compiled C++ programming language, as well as the equivalent [r_multivariate()] implemented purely in R.

data <- makeRandom(20, 100)

bench <- microbenchmark(
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "exact"),
  segment(data,
          cost = function(x) -r_multivariate(x),
          algorithm = "exact"),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "hierarchical"),
  segment(data,
          cost = function(x) -r_multivariate(x),
          algorithm = "hierarchical"),
  times = 1
)

print_benchmark(
  bench,
  caption="Execution time comparison
    between native C++ and interpreted R code"
)

In Table \@ref(tab:benchmark-native), we notice performance improvements with the native [multivariate()] function, the biggest of which takes place in the exact algorithm results. Therefore, whenever the resources are available, it's recommended to implement the cost function (or portions of it) directly in a native programming language such as C++. We recommend the Rcpp package, which makes it easy to implement functions that interface with the R programming environment.
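
As an illustration of this approach, the sketch below compiles a simple C++ cost function at run time with Rcpp::cppFunction(). The sum-of-squared-deviations statistic is chosen only for illustration; it is not segmentr's [multivariate()] implementation.

# Minimal sketch: a C++ cost heuristic compiled at run time with Rcpp.
# The statistic (sum of squared deviations from each column mean) is an
# illustrative assumption, not the likelihood used by multivariate().
Rcpp::cppFunction("
double sum_squares_cost(NumericMatrix x) {
  double total = 0;
  for (int j = 0; j < x.ncol(); j++) {
    double mean = 0;
    for (int i = 0; i < x.nrow(); i++) mean += x(i, j);
    mean /= x.nrow();
    for (int i = 0; i < x.nrow(); i++) {
      double d = x(i, j) - mean;
      total += d * d;
    }
  }
  return total;
}")

# The compiled function can be passed to segment() like any R function:
# segment(data, cost = sum_squares_cost, algorithm = "hierarchical")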

Parallelization

Given the increasing popularity of parallel programming in the past few years, it's natural to try to take advantage of it by parallelizing the algorithm execution. Parallelizing the algorithms implemented in this package is complicated, because many of the execution steps in the algorithm iterations depend on values computed in previous steps, while the golden rule of parallel computing is that faster results are generally obtained when computation steps are independent from one another, i.e. there are no dependencies between the tasks. Even so, a few steps of the computation can be performed in parallel, and their parallelization was implemented with the help of the foreach package, as sketched below.
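
For readers unfamiliar with the foreach idiom, the following standalone sketch shows the general pattern (a toy example, not segmentr's internal code): independent iterations are dispatched with %dopar% to whatever parallel backend is registered.

library(foreach)
doMC::registerDoMC(2)

# %dopar% runs independent iterations on the registered workers;
# .combine = c gathers the per-iteration results into a single vector.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}
squares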

Therefore, whether or not the computation runs in parallel depends on whether a parallel backend is registered in the R session with a package such as doMC, and on whether the allow_parallel argument is set to TRUE. With that in mind, a few comparisons are evaluated.

data <- makeRandom(100, 10)
doMC::registerDoMC(4)
bench <- microbenchmark(
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "exact",
          allow_parallel = FALSE),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "exact",
          allow_parallel = TRUE),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "hierarchical",
          allow_parallel = FALSE),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "hierarchical",
          allow_parallel = TRUE),
  times = 1
)
print_benchmark(
  bench,
  caption="Execution time comparison
    between parallel and single-threaded computation
    with a data set of 100 samples and 10 columns"
)

Analyzing Table \@ref(tab:benchmark-parallel-narrow), we notice good improvements, especially for the "exact" algorithm. However, when the number of rows isn't very large compared with the number of columns, the results of the parallel computation are a lot more modest. Consider the following case.

data <- makeRandom(5, 100)
doMC::registerDoMC(4)
bench <- microbenchmark(
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "exact",
          allow_parallel = FALSE),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "exact",
          allow_parallel = TRUE),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "hierarchical",
          allow_parallel = FALSE),
  segment(data,
          cost = function(x) -multivariate(x),
          algorithm = "hierarchical",
          allow_parallel = TRUE),
  times = 1
)

print_benchmark(
  bench,
  caption="Execution time comparison
    between parallel and single-threaded computation
    with a data set of 5 samples and 100 columns"
)

In Table \@ref(tab:benchmark-parallel-wide), parallelization achieves the opposite effect, actually making the computation slower: the work of synchronizing the necessary information between all the processes in the cluster takes more time than is saved by running the algorithm in parallel.

Taking all of that into consideration, the decision of whether or not to use parallelization is left to the user, after evaluating the nature of the data set. For data sets with few samples and many columns, parallelization is not recommended. When there are many samples, parallelization may yield better computation times.
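
That judgment can be encoded directly in the call, for example (the factor of 10 below is a hypothetical rule of thumb, not a segmentr default):

# Hypothetical heuristic: only parallelize when rows greatly outnumber
# columns; the threshold of 10 is an illustrative assumption.
use_parallel <- nrow(data) > 10 * ncol(data)
segment(data,
        cost = function(x) -multivariate(x),
        algorithm = "hierarchical",
        allow_parallel = use_parallel)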

An alternative not developed in this project would be a genetic algorithm approach to estimating the optimal set of segments, a probabilistic process that converges to the result after enough computation time has passed. The main advantage of such a method is that its computations can be more easily parallelized. This approach might be explored in future developments of this package.


