The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage. Big caches are efficient. Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics).
[...]
Give it up. The whole "parallel computing is the future" is a bunch of crock.
\framesubtitle{Imagine a \texttt{gsub("DBMs", "", tweet)} to complement further...}
\centering{\includegraphics[width=\textwidth,height=0.8\textheight,keepaspectratio]{images/big-data-big-machine-tweet.png}}
\framesubtitle{\texttt{http://cran.r-project.org/web/views/HighPerformanceComputing.html}}
Things R does well:
\medskip
In the fairly early days of Rcpp, we also put out RInside as a simple C++ class wrapper around the R-embedding API.
It got one clever patch taking this (i.e., R wrapped in C++ with its own main() function) and encapsulating it within MPI.
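As a reminder of what that wrapper looks like in practice, here is a minimal sketch in the spirit of the standard RInside "hello world" example (the variable name txt is just illustrative):

#include <RInside.h>                    // embedded R via RInside

int main(int argc, char *argv[]) {
    RInside R(argc, argv);              // create an embedded R instance

    R["txt"] = "Hello, world!\n";       // assign a C string to the R symbol 'txt'
    R.parseEvalQ("cat(txt)");           // evaluate R code, discarding the result

    return 0;
}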
HP Vertica also uses Rcpp and RInside in DistributedR.
Rcpp is now easy to deploy; Rcpp Attributes played a key role:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double piSugar(const int N) {
    NumericVector x = runif(N);
    NumericVector y = runif(N);
    NumericVector d = sqrt(x*x + y*y);
    return 4.0 * sum(d < 1.0) / N;
}
Rcpp Attributes also support "plugins"
OpenMP is easy to use and widely supported (on suitable OS / compiler combinations).
So we added support via a plugin. Its use is still not as widespread.
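For illustration, a hedged sketch of what using the OpenMP plugin can look like (the function name sqrtOMP is hypothetical); note that the parallel loop only touches raw doubles and never calls back into the R API:

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
NumericVector sqrtOMP(NumericVector x) {
    int n = x.size();
    NumericVector out(n);

    // raw pointers obtained on the main thread; the loop below
    // works on plain doubles only and makes no R API calls
    double* xp = REAL(x);
    double* op = REAL(out);

#pragma omp parallel for
    for (int i = 0; i < n; i++)
        op[i] = std::sqrt(xp[i]);

    return out;
}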
The errors users run into have a common cause: calling back into R from threaded code.
\framesubtitle{NOT like this...}
using namespace boost;

void task() {
    lock_guard<boost::mutex> lock(mutex);
    // etc...
}

threadpool::pool tp(thread::hardware_concurrency());
for (int i = 0; i < slices; i++)
    tp.schedule(&task);
Goals:
\footnotesize
|                            |     TBB     |     OMP     |     RAW     |
|----------------------------|:-----------:|:-----------:|:-----------:|
| Task level parallelism     | \textbullet | \textbullet |             |
| Data decomposition support | \textbullet | \textbullet |             |
| Non loop parallel patterns | \textbullet |             |             |
| Generic parallel patterns  | \textbullet |             |             |
| Nested parallelism support | \textbullet |             |             |
| Built in load balancing    | \textbullet | \textbullet |             |
| Affinity support           |             | \textbullet | \textbullet |
| Static scheduling          |             | \textbullet |             |
| Concurrent data structures | \textbullet |             |             |
| Scalable memory allocator  | \textbullet |             |             |
R is single-threaded and includes this warning in Writing R Extensions when discussing the use of OpenMP:
Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe. In particular, code which makes use of the stack-checking mechanism must not be called from threaded code.
However, we don't really want to force Rcpp users to resort to reading the Rcpp and R source code to assess thread-safety issues.
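To make the hazard concrete, here is a hypothetical sketch of the anti-pattern (the names badTask and runBadTasks are made up for illustration):

#include <Rcpp.h>
#include <thread>

// DON'T do this: Rcpp::rnorm() calls back into the R API, and R's memory
// allocation, garbage collection and stack checking are not thread-safe.
void badTask() {
    Rcpp::NumericVector draws = Rcpp::rnorm(100);   // R API call off the main thread
    // ...
}

void runBadTasks() {
    std::thread t1(badTask), t2(badTask);   // will eventually corrupt R's state
    t1.join();
    t2.join();
}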
Since R vectors and matrices are just raw contiguous arrays, it's easy to create thread-safe C++ wrappers for them:
RVector<T> is a very thin wrapper over a C array.
RMatrix<T> is the same but also provides Row<T> and Column<T> accessors/iterators.
The implementations of these classes are extremely lightweight and never call into Rcpp or the R API (so they are always thread-safe).
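As a rough illustration only (not the actual RcppParallel source), a wrapper of this kind needs little more than a pointer and a length captured on the main thread:

#include <cstddef>

// Hypothetical sketch of a thin, thread-safe vector wrapper: it holds a raw
// pointer and a length, so worker threads can read and write elements
// without ever touching the R API.
template <typename T>
class ThinVector {
public:
    ThinVector(T* data, std::size_t length) : data_(data), length_(length) {}

    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }

    T* begin() { return data_; }
    T* end()   { return data_ + length_; }
    std::size_t length() const { return length_; }

private:
    T* data_;
    std::size_t length_;
};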
Two high-level operations are provided (with TBB and TinyThread implementations of each):
parallelFor -- converts the work of a standard serial "for" loop into a parallel one
parallelReduce -- used for accumulating aggregate or other values
Not surprisingly, the TBB versions of these operations perform about 50% better than the "naive" parallel implementation provided by TinyThread.
Create a Worker class with an operator() that RcppParallel uses to operate on discrete slices of the input data on different threads:
class MyWorker : public RcppParallel::Worker {
    void operator()(size_t begin, size_t end) {
        // do some work from begin to end
        // within the input data
    }
};
A Worker would typically take input and output data in its constructor and save them as members (for reading/writing within operator()):
NumericMatrix matrixSqrt(NumericMatrix x) {
    NumericMatrix output(x.nrow(), x.ncol());
    SquareRootWorker worker(x, output);
    parallelFor(0, x.length(), worker);
    return output;
}
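The SquareRootWorker used above is not shown on this slide; here is a sketch of what it would look like, closely following the pattern of the Rcpp Gallery parallel-matrix-transform article:

#include <Rcpp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <algorithm>
#include <cmath>
using namespace Rcpp;
using namespace RcppParallel;

struct SquareRootWorker : public Worker {
    // source and destination matrices, wrapped as thread-safe accessors
    const RMatrix<double> input;
    RMatrix<double> output;

    // capture input and output on the main thread
    SquareRootWorker(const NumericMatrix input, NumericMatrix output)
        : input(input), output(output) {}

    // take the square root of the elements in the requested range
    void operator()(std::size_t begin, std::size_t end) {
        std::transform(input.begin() + begin, input.begin() + end,
                       output.begin() + begin, ::sqrt);
    }
};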
For parallelReduce you need to specify how data is to be combined. Typically you save data in a member within operator(), then fuse it with another Worker instance in the join function.
class SumWorker : public RcppParallel::Worker {
    // join my value with that of another SumWorker
    void join(const SumWorker& rhs) {
        value += rhs.value;
    }
};
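Fleshed out (a sketch following the Rcpp Gallery parallel-vector-sum article), the worker also needs a "splitting" constructor so that parallelReduce can create per-thread copies, plus a caller:

#include <Rcpp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <numeric>

struct SumWorker : public RcppParallel::Worker {
    // source vector (thread-safe accessor)
    const RcppParallel::RVector<double> input;

    // value accumulated by this worker
    double value;

    // main constructor and the "splitting" constructor used by parallelReduce
    SumWorker(const Rcpp::NumericVector input) : input(input), value(0) {}
    SumWorker(const SumWorker& sum, RcppParallel::Split)
        : input(sum.input), value(0) {}

    // accumulate the requested range
    void operator()(std::size_t begin, std::size_t end) {
        value += std::accumulate(input.begin() + begin, input.begin() + end, 0.0);
    }

    // join my value with that of another SumWorker
    void join(const SumWorker& rhs) { value += rhs.value; }
};

// [[Rcpp::export]]
double parallelVectorSum(Rcpp::NumericVector x) {
    SumWorker sum(x);
    RcppParallel::parallelReduce(0, x.length(), sum);
    return sum.value;
}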
All available on the Rcpp Gallery http://gallery.rcpp.org
Tested with 4 cores on a 2.6GHz Haswell MacBook Pro
Note that benchmarks will be 30-50% slower on Windows, because TinyThread is used there rather than the more sophisticated scheduling of TBB.
\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-matrix-transform}}
void operator()(size_t begin, size_t end) {
    std::transform(input.begin() + begin, input.begin() + end,
                   output.begin() + begin, ::sqrt);
}
                     test replications elapsed relative
2  parallelMatrixSqrt(m)           100   0.294    1.000
1           matrixSqrt(m)          100   0.755    2.568
\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-vector-sum}}
void operator()(size_t begin, size_t end) {
    value += std::accumulate(input.begin() + begin,
                             input.begin() + end, 0.0);
}

void join(const Sum& rhs) {
    value += rhs.value;
}
                    test replications elapsed relative
2  parallelVectorSum(v)           100   0.182    1.000
1          vectorSum(v)           100   0.857    4.709
\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-distance-matrix}}
                        test reps elapsed relative
3  rcpp_parallel_distance(m)    3   0.110    1.000
2           rcpp_distance(m)    3   0.618    5.618
1                distance(m)    3  35.560  323.273
Parallel algorithms: parallel_scan, parallel_while, parallel_do, parallel_pipeline, parallel_sort (a short parallel_sort sketch follows below)
Concurrent containers: concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map
Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex
Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store
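For illustration, a hedged sketch of using one of these directly (TBB's parallel_sort; the function name sortValues is made up, and this assumes a platform where the bundled TBB headers are available):

#include <vector>
#include <tbb/parallel_sort.h>

// Sort a large vector using all available cores; TBB handles the
// range partitioning and load balancing internally.
void sortValues(std::vector<double>& values) {
    tbb::parallel_sort(values.begin(), values.end());
}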
Additional (portable to Win32 via TinyThread) wrappers for other TBB constructs?
Alternatively, sort out Rtools configuration issues required to get TBB working on Windows.
Education: Parallel Programming is hard.
Simple parallelFor and parallelReduce are reasonably easy to grasp, but more advanced idioms aren't trivial to learn and use (though for some applications the upside is large enough to be worth the effort).