Calling an Rcpp Function with a doParallel backend

The Rcpp2doParallel R package provides an example of packaging a C++ function and calling it in parallel from R using foreach with the doParallel backend. That said, any of the do* backends (doFuture, doMC, doMPI, doRedis, doRNG, doSNOW) can be substituted for the doParallel backend used as the driving example here.

Usage

To install the package, you must first have a compiler on your system that is compatible with R. For help obtaining a compiler, consult the macOS or Windows guides.

With a compiler in hand, one can then install the package from GitHub by:

# install.packages("devtools")
devtools::install_github("coatless-rd-rcpp/rcpp-and-doparallel")
library("Rcpp2doParallel")

Implementation Details

Within this project, a C++ function created using Rcpp is called from the doParallel region within the R package. Packaging the C++ function lowers the cost of parallelizing the code, as the workers do not each have to compile the code locally before being able to execute it. Moreover, by packaging the parallelization code, the algorithm is deployed through R's package management instead of a monolithic R script.

.
├── DESCRIPTION                      # Package metadata
├── LICENSE                          # Code license
├── NAMESPACE                        # Function and dependency registration
├── R                                # R functions
│   ├── Rcpp2doParallel-package.R    # Package documentation
│   ├── RcppExports.R                # Autogenerated R to C++ bindings by Rcpp
│   └── mean_parallel_compute.R      # doParallel cluster formation and C++ call
├── README.md
├── Rcpp2doParallel.Rproj
├── man                              # Package Documentation
│   ├── Rcpp2doParallel-package.Rd
│   └── mean_parallel_compute.Rd
└── src                              # Compiled Code
    ├── RcppExports.cpp              # Autogenerated R Bindings
    └── mean_rcpp.cpp                # Construct a C++ function to compute the mean

R Function

Parallelized R functions require a cluster, or set of workers, to be set up so that the jobs in the parallelized region can be distributed to them. The approach taken here keeps both the setup and the execution of the parallel workers inside the function. Encapsulating both steps makes the function self-contained, but every call pays the cost of setting up the cluster again. An alternative approach would be to pass an initialized cluster into the function.

When constructing a parallelized region with foreach, one must:

  1. Start up a cluster with cl = parallel::makeCluster(n_workers).
  2. Register the parallel backend with the do* package using do*::registerDo*().
     - In the case of doParallel, this would be doParallel::registerDoParallel(cl).
  3. Denote the parallelization region with foreach() %dopar%.
  4. Pay close attention to any variables or packages that must be exported. Supply such data using foreach(..., .packages = c("pkgA", "pkgB"), .export = c("var1", "var2")).
     - As an example, Rcpp2doParallel is loaded on each worker by using foreach(..., .packages = "Rcpp2doParallel").
  5. Shut down the cluster with parallel::stopCluster(cl).
     - Alternatively, define a handler for the end of the function that stops the cluster with on.exit(parallel::stopCluster(cl)).
  6. Return the results of the estimation.

mean_parallel_compute = function(n, mean = 0, sd = 1,
                                 n_sim = 1000,
                                 n_cores = parallel::detectCores()) {

  # Construct cluster
  cl = parallel::makeCluster(n_cores)

  # After the function is run, shutdown the cluster.
  on.exit(parallel::stopCluster(cl))

  # Register parallel backend
  doParallel::registerDoParallel(cl)   # Modify with any do*::registerDo*()

  # Compute estimates
  estimates = foreach::foreach(i = iterators::icount(n_sim), # Perform n_sim simulations
                               .combine = "rbind",           # Combine results row-wise
                                                             # Load this package on each worker
                               .packages = "Rcpp2doParallel") %dopar% {
    random_data = rnorm(n, mean, sd)

    result = mean_rcpp(random_data) # or use Rcpp2doParallel::mean_rcpp()
    result
  }

  # Release results
  return(estimates)
}
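
The alternative mentioned earlier, passing an already initialized cluster into the function, might look like the following minimal sketch. The name mean_parallel_compute_cl and its cl argument are hypothetical and not part of the package. A plain R expression stands in for mean_rcpp() so that the sketch runs without the package installed; with the package installed, one would instead supply .packages = "Rcpp2doParallel" and call mean_rcpp().

```r
library(foreach)  # provides foreach() and %dopar%

# Hypothetical variant: the caller owns the cluster, so it is created once
# and reused across calls instead of being rebuilt inside the function.
mean_parallel_compute_cl = function(cl, n, mean = 0, sd = 1, n_sim = 1000) {

  # Register the supplied cluster as the foreach backend
  doParallel::registerDoParallel(cl)

  # Each iteration simulates n normal draws and returns their mean
  foreach(i = iterators::icount(n_sim), .combine = "rbind") %dopar% {
    random_data = rnorm(n, mean, sd)
    sum(random_data) / length(random_data)  # stand-in for mean_rcpp(random_data)
  }
}

# The caller creates the cluster once ...
cl = parallel::makeCluster(2)

# ... reuses it for several calls ...
est_small = mean_parallel_compute_cl(cl, n = 10, n_sim = 20)
est_large = mean_parallel_compute_cl(cl, n = 1000, n_sim = 20)

# ... and shuts it down when finished.
parallel::stopCluster(cl)
```

The trade-off is that the caller becomes responsible for stopping the cluster, whereas the self-contained version cleans up after itself through on.exit().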

C++ Function Construction

The C++ function must be placed within the package's src/ directory and exported into R with Rcpp Attributes. Beyond these two requirements, nothing else needs to be done, as the parallelization is handled by R and not within the C++ code.

#include <Rcpp.h>

// [[Rcpp::export]]
double mean_rcpp(Rcpp::NumericVector x){
  int n = x.size(); // Size of vector
  double sum = 0;   // Sum value

  // For loop; note that C++ indices start at 0
  for(int i = 0; i < n; i++){
    // Shorthand for sum = sum + x[i]
    sum += x[i];
  }

  return sum/n;  // Obtain and return the Mean
}
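
For a quick interactive check outside of the package, the same function can be compiled on the fly with Rcpp::cppFunction() and compared against base R's mean(). This is only a sketch for experimentation and assumes a working compiler; inside the package the function is compiled once at install time, which is the point of packaging it.

```r
# Compile the function in the current R session (requires a C++ compiler)
Rcpp::cppFunction('
double mean_rcpp(Rcpp::NumericVector x){
  int n = x.size(); // Size of vector
  double sum = 0;   // Running sum

  for(int i = 0; i < n; i++){
    sum += x[i];
  }

  return sum/n;
}')

x = c(1, 2, 3, 4)

# Both calls should agree up to floating point error
all.equal(mean_rcpp(x), mean(x))
```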

DESCRIPTION

The doParallel backend brings in several dependencies, and which ones are required depends on the features you wish to use. In particular, the doParallel package requires foreach and parallel to operate. Only iterators can be removed from the dependency list, provided there is sufficient RAM to allocate the index values, e.g. 1:n, instead of creating a low-cost iterator with n elements through iterators::icount().

LinkingTo: 
    Rcpp
Imports: 
    doParallel,
    Rcpp,
    foreach,
    iterators,
    parallel
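
To illustrate the remark about iterators, the sketch below runs the same loop twice, once over a plain integer index and once over iterators::icount(). It uses the sequential %do% operator so that no cluster is needed; the variable names are chosen for illustration only.

```r
library(foreach)  # provides foreach() and %do%

n_sim = 5

# Plain integer index: no call into the iterators package is needed
res_seq = foreach(i = seq_len(n_sim), .combine = "c") %do% {
  i^2
}

# Counting iterator: hands out one index at a time
res_icount = foreach(i = iterators::icount(n_sim), .combine = "c") %do% {
  i^2
}

# Both forms should produce the same results
identical(res_seq, res_icount)
```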

NAMESPACE

As discussed in DESCRIPTION, the doParallel backend has a few dependencies. The following roxygen2 tags import the functions that the package needs in order to run successfully.

#' @importFrom foreach %dopar% foreach
#' @importFrom iterators icount
#' @importFrom doParallel registerDoParallel
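
When roxygen2 processes these tags, it writes matching directives into the NAMESPACE file. A sketch of the expected entries is given below; the exact ordering and any further lines, such as the useDynLib() directive needed for the compiled code, depend on the package's other roxygen2 tags and are not shown here.

```r
importFrom(doParallel,registerDoParallel)
importFrom(foreach,"%dopar%")
importFrom(foreach,foreach)
importFrom(iterators,icount)
```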

Author

James Joseph Balamuta

License

GPL (>= 2)



r-pkg-examples/rcpp-and-doparallel documentation built on March 13, 2024, 4:32 p.m.