partial: Partial Dependence Functions

View source: R/partial.R

Description

Compute partial dependence functions (i.e., marginal effects) for various model fitting objects.

Usage

partial(object, ...)

## Default S3 method:
partial(object, pred.var, pred.grid, pred.fun = NULL,
  grid.resolution = NULL, quantiles = FALSE, probs = 1:9/10,
  trim.outliers = FALSE, type = c("auto", "regression", "classification"),
  which.class = 1L, prob = FALSE, recursive = TRUE, plot = FALSE,
  smooth = FALSE, rug = FALSE, chull = FALSE, train, cats = NULL,
  check.class = TRUE, progress = "none", parallel = FALSE,
  paropts = NULL, ...)

Arguments

object

A fitted model object of appropriate class (e.g., "gbm", "lm", "randomForest", "train", etc.).

...

Additional optional arguments to be passed onto predict.

pred.var

Character string giving the names of the predictor variables of interest. For reasons of computation/interpretation, this should include no more than three variables.

pred.grid

Data frame containing the joint values of interest for the variables listed in pred.var.

pred.fun

Optional prediction function that requires two arguments: object and newdata. If specified, then the function must return a single prediction or a vector of predictions (i.e., not a matrix or data frame). Default is NULL.

grid.resolution

Integer giving the number of equally spaced points to use (only used for the continuous variables listed in pred.var when pred.grid is not supplied). If left NULL, it will default to the minimum between 51 and the number of unique data points for each of the continuous independent variables listed in pred.var.

quantiles

Logical indicating whether or not to use the sample quantiles of the numeric predictors listed in pred.var. Can only be specified when grid.resolution = NULL.

probs

Numeric vector of probabilities with values in [0,1]. (Values up to 2e-14 outside that range are accepted and moved to the nearby endpoint.) Default is 1:9/10 which corresponds to the deciles of the predictor variables.

trim.outliers

Logical indicating whether or not to trim off outliers from the numeric predictors (using the simple boxplot method) before creating the grid of joint values for which the partial dependence is computed. Default is FALSE.

type

Character string specifying the type of supervised learning. Current options are "auto", "regression" or "classification". If type = "auto" then partial will try to extract the necessary information from object.

which.class

Integer specifying which column of the matrix of predicted probabilities to use as the "focus" class. Default is to use the first class. Only used for classification problems (i.e., when type = "classification").

prob

Logical indicating whether or not partial dependence for classification problems should be returned on the probability scale, rather than the centered logit. If FALSE, the partial dependence is on a scale similar to the logit. Default is FALSE.

recursive

Logical indicating whether or not to use the weighted tree traversal method described in Friedman (2001). This only applies to objects that inherit from class "gbm". Default is TRUE which is much faster than the exact brute force approach used for all other models. (Based on the C++ code behind plot.gbm.)

plot

Logical indicating whether to return a data frame containing the partial dependence values (FALSE) or plot the partial dependence function directly (TRUE). Default is FALSE. See plotPartial for plotting details.

smooth

Logical indicating whether or not to overlay a LOESS smooth. Default is FALSE.

rug

Logical indicating whether or not to include rug marks on the predictor axes. Only used when plot = TRUE. Default is FALSE.

chull

Logical indicating whether or not to restrict the first two variables in pred.var to lie within the convex hull of their training values; this affects pred.grid. Default is FALSE.

train

An optional data frame containing the original training data. This may be required depending on the class of object. For objects that do not store a copy of the original training data, this argument is required.

cats

Character string indicating which columns of train should be treated as categorical variables. Only used when train inherits from class "matrix" or "dgCMatrix".

check.class

Logical indicating whether or not to make sure each column in pred.grid has the correct class, levels, etc. Default is TRUE.

progress

Character string giving the name of the progress bar to use. See create_progress_bar for details. Default is "none".

parallel

Logical indicating whether or not to run partial in parallel using a backend provided by the foreach package. Default is FALSE.

paropts

List containing additional options passed onto foreach when parallel = TRUE. Default is NULL.
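
As a rough illustration of the grid-related arguments above (grid.resolution, quantiles, probs, and trim.outliers), the following base R sketch builds three candidate grids for a single numeric predictor. It is not the code used inside partial; the variable names (x, grid.equal, grid.quant, grid.trim) are hypothetical, and the use of boxplot.stats for "the simple boxplot method" is an assumption.

```r
# Illustrative only: hypothetical names, not pdp internals
x <- mtcars$hp  # a numeric predictor from a built-in data set

# Equally spaced grid; the size mirrors the documented default for
# grid.resolution: min(51, number of unique values)
n <- min(51, length(unique(x)))
grid.equal <- seq(min(x), max(x), length.out = n)

# Quantile-based grid, as with quantiles = TRUE and probs = 1:9/10 (deciles)
grid.quant <- unname(quantile(x, probs = 1:9/10))

# Assumed reading of trim.outliers = TRUE: drop points flagged by the
# boxplot rule, then build the grid from the remaining range
x.trim <- x[!x %in% boxplot.stats(x)$out]
grid.trim <- seq(min(x.trim), max(x.trim), length.out = 5)
```

The quantile grid places more points where the data are dense, while the equally spaced grid covers the full range evenly; trimming first keeps a few extreme values from stretching the grid.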

Value

If plot = FALSE (the default) partial returns a data frame with the additional class "partial" that is specially recognized by the plotPartial function. If plot = TRUE then partial returns a "trellis" object (see lattice for details) with an additional attribute, "partial.data", containing the data displayed in the plot.

Note

In some cases it is difficult for partial to extract the original training data from object. In these cases an error message is displayed requesting the user to supply the training data via the train argument in the call to partial. In most cases where partial can extract the required training data from object, it is taken from the same environment in which partial is called. Therefore, it is important to not change the training data used to construct object before calling partial. This problem is completely avoided when the training data are passed to the train argument in the call to partial.
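
The following base R sketch shows why modifying the training data after fitting can be a problem. It assumes, purely for illustration, that the training data are recovered by evaluating the data argument stored in the model call; the actual mechanism used by partial may differ, and the names df and recovered are hypothetical.

```r
# Illustrative only: assumes training data are looked up via the stored call
df <- mtcars
fit <- lm(mpg ~ wt, data = df)

# The training data are changed after the model has been fitted
df$wt <- df$wt * 1000

# Evaluating the stored `data` argument finds the *modified* df
recovered <- eval(fit$call$data)

# The recovered values no longer match what the model was fitted on
max(recovered$wt) > max(fit$model$wt)
```

Passing the original data explicitly (here, mtcars) via the train argument sidesteps this lookup entirely.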

It is recommended to call partial with plot = FALSE and store the results; this allows for more flexible plotting, and the user will not have to waste time calling partial again if the default plot is not sufficient.

It is possible to retrieve the last printed "trellis" object, such as those produced by plotPartial, using trellis.last.object().

If the prediction function given to pred.fun returns a prediction for each observation in newdata, then the result will be a PDP for each observation. These are called individual conditional expectation (ICE) curves; see Goldstein et al. (2015) and ice for details.
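
To make the relationship between ICE curves and the partial dependence function concrete, here is a minimal brute force sketch in base R using lm and mtcars. It does not call partial; the names wt.grid, ice, pd, and c.ice are hypothetical, and the centering step (subtracting each curve's first value) is one common convention rather than a statement about pdp internals.

```r
# Illustrative only: hand-rolled ICE curves for a linear model
fit <- lm(mpg ~ wt * hp, data = mtcars)
wt.grid <- seq(2, 5, by = 1)

# One row per observation, one column per grid value: each row is an ICE curve
ice <- sapply(wt.grid, function(value) {
  newdata <- mtcars
  newdata$wt <- value              # hold wt fixed at one grid value
  predict(fit, newdata = newdata)  # one prediction per observation
})

# Averaging the ICE curves at each grid value gives the partial dependence
pd <- colMeans(ice)

# Centered ICE curves: shift each curve so that it starts at zero
c.ice <- ice - ice[, 1]
```

Because of the wt:hp interaction, the individual curves have different slopes, which is exactly the kind of heterogeneity that the averaged curve hides.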

References

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29: 1189-1232, 2001.

Goldstein, A., Kapelner, A., Bleich, J., and Pitkin, E. Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics, 24(1): 44-65, 2015.

Examples

## Not run: 

#
# Regression example (requires randomForest package to run)
#

# Fit a random forest to the Boston housing data
library(pdp)  # for partial, plotPartial, and the boston data
library(randomForest)
data(boston)  # load the Boston housing data
set.seed(101)  # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston)

# Using randomForest's partialPlot function
partialPlot(boston.rf, pred.data = boston, x.var = "lstat")

# Using pdp's partial function
head(partial(boston.rf, pred.var = "lstat"))  # returns a data frame
partial(boston.rf, pred.var = "lstat", plot = TRUE, rug = TRUE)

# The partial function allows for multiple predictors
partial(boston.rf, pred.var = c("lstat", "rm"), grid.resolution = 40,
        plot = TRUE, chull = TRUE, progress = "text")

# The plotPartial function offers more flexible plotting
pd <- partial(boston.rf, pred.var = c("lstat", "rm"), grid.resolution = 40)
plotPartial(pd, levelplot = FALSE, zlab = "cmedv", drape = TRUE,
            colorkey = FALSE, screen = list(z = -20, x = -60))

# The autoplot function can be used to produce graphics based on ggplot2
library(ggplot2)
autoplot(pd, contour = TRUE,
         legend.title = "Partial\ndependence")

#
# Individual conditional expectation (ICE) curves
#

# Use partial to obtain ICE curves
pred.ice <- function(object, newdata) predict(object, newdata)
rm.ice <- partial(boston.rf, pred.var = "rm", pred.fun = pred.ice)
plotPartial(rm.ice, rug = TRUE, train = boston, alpha = 0.2)
autoplot(rm.ice, center = FALSE, alpha = 0.2, rug = TRUE, train = boston)

#
# Centered ICE curves (c-ICE curves) (requires dplyr, ggplot2, and gridExtra
# to run)
#

# Post-process rm.ice to obtain c-ICE curves
library(dplyr)  # for group_by and mutate functions
rm.cice <- rm.ice %>%
  group_by(yhat.id) %>%  # perform next operation within each yhat.id
  mutate(yhat.centered = yhat - first(yhat))  # so each curve starts at yhat = 0

# ICE curves with their average
library(ggplot2)
p1 <- ggplot(rm.ice, aes(rm, yhat)) +
  geom_line(aes(group = yhat.id), alpha = 0.2) +
  stat_summary(fun.y = mean, geom = "line", col = "red", size = 1)
# c-ICE curves with their average
p2 <- ggplot(rm.cice, aes(rm, yhat.centered)) +
  geom_line(aes(group = yhat.id), alpha = 0.2) +
  stat_summary(fun.y = mean, geom = "line", col = "red", size = 1)
gridExtra::grid.arrange(p1, p2, ncol = 2)  # grid.arrange is from gridExtra

# Or just use autoplot (the default is to center the curves first)
autoplot(rm.ice, alpha = 0.2, rug = TRUE, train = boston)

#
# Classification example (requires randomForest package to run)
#

# Fit a random forest to the Pima Indians diabetes data
data(pima)  # load the Pima Indians diabetes data
set.seed(102)  # for reproducibility
pima.rf <- randomForest(diabetes ~ ., data = pima, na.action = na.omit)

# Partial dependence of diabetes test result (neg/pos) on glucose and age
partial(pima.rf, pred.var = c("glucose", "age"), plot = TRUE, chull = TRUE,
        progress = "text")

# Partial dependence of positive diabetes test result on glucose, plotted on
# the probability scale, rather than the centered logit
pfun <- function(object, newdata) {
  mean(predict(object, newdata, type = "prob")[, "pos"], na.rm = TRUE)
}
partial(pima.rf, pred.var = "glucose", pred.fun = pfun,
        plot = TRUE, chull = TRUE, progress = "text")


## End(Not run)

pdp documentation built on May 30, 2017, 7:18 a.m.
