source("helper.R")
This chapter describes various attempts to speed up code execution in the segmentr package, both by writing the cost function in native code and by experimenting with parallel execution.
The microbenchmark package allows us to execute code and compare the run times of alternatives, so it is used throughout this work to compare the performance of different code snippets.
```r
library(microbenchmark)
```
In the algorithms chapter, we discussed the time complexity of the different types of algorithms. In that analysis, the most relevant variable is the number of columns in the data set, which dictates whether the computation of the exact algorithm is prohibitive or not. The number of samples, by contrast, influences the computation time only linearly. Since the cost function is the most frequently executed operation in every algorithm described in that chapter, it is in essence the bottleneck of the whole computation. Therefore, real performance gains can be obtained by improving the performance of the cost function.
In R statistical programming, it's commonplace to use the functions provided by the programming environment, which are usually performant algorithms implemented in compiled languages such as C. However, there are times when the programmer must implement a custom function in R itself, which, being a dynamic and interpreted language, typically executes more slowly.
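The gap between compiled built-ins and interpreted R code is easy to observe with microbenchmark itself. The snippet below is a generic illustration, not code from segmentr: it compares R's built-in sum(), implemented in C, against a hypothetical equivalent written as a plain R loop.

```r
# Hypothetical pure-R reimplementation of sum(), for illustration only.
r_sum <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total
}

x <- runif(1e5)

# The compiled built-in is typically orders of magnitude faster
# than the interpreted loop.
microbenchmark(
  builtin = sum(x),
  interpreted = r_sum(x),
  times = 100
)
```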
Therefore, as a first attempt to make the code run faster, and to allow the results to be compared, segmentr implements the [multivariate()] likelihood function (used to calculate the cost) in the compiled C++ programming language, along with the equivalent [r_multivariate()] implemented in plain R.
```r
data <- makeRandom(20, 100)

bench <- microbenchmark(
  segment(data, cost = function(x) -multivariate(x), algorithm = "exact"),
  segment(data, cost = function(x) -r_multivariate(x), algorithm = "exact"),
  segment(data, cost = function(x) -multivariate(x), algorithm = "hierarchical"),
  segment(data, cost = function(x) -r_multivariate(x), algorithm = "hierarchical"),
  times = 1
)

print_benchmark(
  bench,
  caption = "Execution time comparison between native C++ and interpreted R code"
)
```
In Table \@ref(tab:benchmark-native), we notice performance improvements in the native [multivariate()] function, with the biggest improvement taking place in the exact algorithm results. Therefore, whenever the resources are available, it's recommended to implement the cost function (or portions of it) directly in a native programming language such as C++. We recommend the Rcpp package, which makes it easy to implement functions that interface with the R programming environment.
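As a taste of how little boilerplate Rcpp requires, the sketch below compiles a small C++ function from within an R session. The sum_of_squares() function is a hypothetical stand-in for a numeric kernel one might move to native code; it is not segmentr's actual implementation of [multivariate()].

```r
library(Rcpp)

# Compile a C++ function and expose it to the R session.
cppFunction("
double sum_of_squares(NumericMatrix x) {
  double total = 0;
  for (int i = 0; i < x.nrow(); i++) {
    for (int j = 0; j < x.ncol(); j++) {
      total += x(i, j) * x(i, j);
    }
  }
  return total;
}
")

sum_of_squares(matrix(1:6, nrow = 2))
```

A function compiled this way behaves like any other R function, so it can be used inside a cost function passed to segment().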
Given the increasing trend of parallel programming in the past few years, it's natural to try to take advantage of it by parallelizing the algorithm execution. In the algorithms implemented in this work, parallelization is complicated because many of the execution steps in the algorithm iterations depend on values computed in previous steps, whereas the golden rule of parallel computing is that faster results are generally obtained when computation steps are independent from one another, i.e. there is no dependency between the tasks. Even so, a few steps of the computation can be performed in parallel, and their parallelization was implemented with the help of the foreach package, as sketched below.
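To make that concrete, the snippet below shows the general foreach pattern, assuming a doMC backend; it is a generic illustration rather than segmentr's internal loop. The %dopar% operator runs iterations in parallel when a backend is registered and falls back to sequential execution otherwise.

```r
library(foreach)
library(doMC)

# Register a parallel backend with two worker processes.
registerDoMC(2)

# Each iteration is independent, so they can run on separate workers;
# .combine = c collects the results into a single vector.
results <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}
results
```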
Therefore, whether or not the computation is performed in parallel depends on whether a parallel backend is registered in the R session with packages such as doMC, and on whether the allow_parallel argument is set to TRUE. With that in mind, a few comparisons are evaluated.
```r
data <- makeRandom(100, 10)
doMC::registerDoMC(4)

bench <- microbenchmark(
  segment(data, cost = function(x) -multivariate(x), algorithm = "exact", allow_parallel = FALSE),
  segment(data, cost = function(x) -multivariate(x), algorithm = "exact", allow_parallel = TRUE),
  segment(data, cost = function(x) -multivariate(x), algorithm = "hierarchical", allow_parallel = FALSE),
  segment(data, cost = function(x) -multivariate(x), algorithm = "hierarchical", allow_parallel = TRUE),
  times = 1
)

print_benchmark(
  bench,
  caption = "Execution time comparison between parallel and single-threaded computation with a data set of 100 samples and 10 columns"
)
```
Analyzing Table \@ref(tab:benchmark-parallel-narrow), we notice good improvements, especially for the exact algorithm. However, when the number of rows isn't large in comparison with the number of columns, the results of the parallel computation are much more modest. Consider the following case.
```r
data <- makeRandom(5, 100)
doMC::registerDoMC(4)

bench <- microbenchmark(
  segment(data, cost = function(x) -multivariate(x), algorithm = "exact", allow_parallel = FALSE),
  segment(data, cost = function(x) -multivariate(x), algorithm = "exact", allow_parallel = TRUE),
  segment(data, cost = function(x) -multivariate(x), algorithm = "hierarchical", allow_parallel = FALSE),
  segment(data, cost = function(x) -multivariate(x), algorithm = "hierarchical", allow_parallel = TRUE),
  times = 1
)

print_benchmark(
  bench,
  caption = "Execution time comparison between parallel and single-threaded computation with a data set of 5 samples and 100 columns"
)
```
In Table \@ref(tab:benchmark-parallel-wide), parallelization achieved the opposite effect, actually making the computation slower: the work spent synchronizing the necessary information between the processes in the cluster now takes more time than is saved by running the algorithm in parallel.
Taking all of that into consideration, the decision of whether or not to use parallelization rests with the user, after evaluating the nature of the data set. For data sets with few samples and many columns, parallelization is not recommended. When there are many samples, parallelization might give better computation times. One possible way to encode that trade-off is sketched below.
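For instance, a user could toggle allow_parallel with a simple dimension-based heuristic. The threshold below is an arbitrary illustration based on the trade-off observed in the two benchmarks above, not a rule provided by segmentr.

```r
data <- makeRandom(100, 10)

# Hypothetical rule of thumb: only parallelize when samples clearly
# outnumber columns, since synchronization overhead dominates otherwise.
allow <- nrow(data) > 10 * ncol(data)

segment(
  data,
  cost = function(x) -multivariate(x),
  algorithm = "hierarchical",
  allow_parallel = allow
)
```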
An alternative not developed in this project would be to use a genetic algorithm approach to estimate the optimal segment set, a probabilistic process that converges to the result after enough computation time has passed. The main advantage of such a method is that its computations can be parallelized more easily. This might be explored in future developments of this package.