Partial dependence plots in Spark


Partial dependence (PD) plots are straightforward to construct in practice, as discussed in @RJ-2017-016. However, it is difficult to apply the brute force algorithm directly when the data are stored in a Spark data frame, as is the case when fitting models using Spark's MLlib library [@meng-2015-mllib]. Fortunately, the same computations can be done using a couple of simple Spark operations: a cross-join, followed by a group-by and aggregation step.
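
Before turning to Spark itself, the logic of the cross-join formulation can be sketched in a few lines of plain Python. This is only a minimal illustration and deliberately uses no Spark: the `predict` function, the toy data, and the grid of `x1` values are all hypothetical stand-ins for a fitted model and its training data. In PySpark the three steps would roughly correspond to `DataFrame.crossJoin`, scoring with the fitted model, and `groupBy` followed by `agg`.

```python
from collections import defaultdict
from itertools import product


def predict(row):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return 2.0 * row["x1"] + row["x1"] * row["x2"]


# Hypothetical toy training data and grid of values for the feature of interest.
data = [
    {"x1": 1.0, "x2": 0.0},
    {"x1": 2.0, "x2": 1.0},
    {"x1": 3.0, "x2": 2.0},
]
grid = [0.0, 1.0, 2.0]

# Step 1 (cross-join): pair every grid value with every training row,
# overwriting x1 with the grid value.
crossed = [{**row, "x1": value} for value, row in product(grid, data)]

# Step 2 (score) and step 3 (group by x1, then average the predictions).
sums = defaultdict(float)
counts = defaultdict(int)
for row in crossed:
    sums[row["x1"]] += predict(row)
    counts[row["x1"]] += 1
pd_values = {value: sums[value] / counts[value] for value in grid}


def brute_force_pd(value):
    """Brute force PD at one grid value: fix x1, average over the data."""
    return sum(predict({**row, "x1": value}) for row in data) / len(data)


# Both formulations give the same partial dependence values.
brute_force = {value: brute_force_pd(value) for value in grid}
print(pd_values)
```

The cross-joined table has one row per combination of grid value and training row, so its size grows as the product of the two; keeping the grid small is what makes this approach practical on large data.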

To illustrate...

