Partial dependence plots in Spark

library(pdp)

# Set global chunk options
knitr::opts_chunk$set(
  cache = TRUE,
  comment = "#>",
  error = FALSE, 
  fig.path = "figure/", 
  cache.path = "cache/", 
  dpi = 300,
  fig.align = "center", 
  fig.asp = 0.618,
  fig.pos = "!htb",
  fig.width = 6, 
  fig.show = "hold",
  message = FALSE,
  out.width = "100%",
  par = TRUE,  # defined below
  size = "small",
  # size = "tiny",
  tidy = FALSE,
  warning = FALSE
)

# Set general hooks
knitr::knit_hooks$set(
  par = function(before, options, envir) {
    if (before && options$fig.show != "none") {
      par(
        mar = c(4, 4, 0.1, 0.1), 
        cex.lab = 0.95, 
        cex.axis = 0.8,  # was 0.9
        mgp = c(2, 0.7, 0), 
        tcl = -0.3, 
        las = 1
      )
      if (is.list(options$par)) {
        do.call(par, options$par)
      }
    }
  }
)

Introduction

Partial dependence (PD) plots are rather straightforward to construct in practice, as discussed in @RJ-2017-016. However, is is difficult to apply the brute force algorithm as is to situations where the data are stored in a Spark data frame; such is the case when fitting models using Spark's MLlib library [@meng-2015-mllib]. Fortunately, the same computations can be done using a couple of simple Spark operations; in particular, a cross-join, followed by a group-by and aggregation step.

To illustrate...



Try the pdp package in your browser

Any scripts or data that you put into this service are public.

pdp documentation built on June 8, 2022, 1:07 a.m.