Model Predictor Effects and Diagnostics

Calculation of performance metrics on test sets or by resampling, as discussed previously, is one method of assessing model performance. Others available include measures of predictor variable importance, partial dependence plots, calibration curves comparing observed and predicted response values, and receiver operating characteristic analysis.

Variable Importance

The importance of predictor variables in a model fit is estimated with the varimp() function and displayed graphically with plot(). Variable importance is a relative measure of the contributions of model predictors and is scaled by default to have a maximum value of 100, where higher values represent more important variables. Model-specific implementations of variable importance are available in many cases, although their definitions may differ. In the case of a GBMModel, importance of each predictor is based on the sum of squared empirical improvements over all internal tree nodes created by splitting on that variable [@greenwell:2019:GBM].

## Predictor variable importance
(vi <- varimp(surv_fit, method = "model"))

plot(vi)
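
If unscaled values are preferred, the default scaling to a maximum of 100 can be disabled. A minimal sketch, assuming the scale argument of varimp():

## Unscaled model-specific importance (scale argument assumed)
varimp(surv_fit, method = "model", scale = FALSE)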

In contrast, for statistical models that produce p-values, such as CoxModel, importance is based on negative log-transformed p-values. For other models, variable importance may be defined and calculated by their underlying source packages or not defined at all, as is the case for SVMModel. Logical indicators of model-specific variable importance are given in the information displayed by model constructors and returned by modelinfo().

## SVMModel indicator of model-specific variable importance
modelinfo(SVMModel)[[1]]$varimp
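
As an illustration of a p-value-based importance measure, a Cox regression could be fit to the same training data and its model-specific importance requested. The sketch below assumes the surv_fo formula and surv_train data from earlier sections:

## Model-specific (p-value-based) importance for a Cox regression fit
cox_fit <- fit(surv_fo, data = surv_train, model = CoxModel)
varimp(cox_fit, method = "model")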

Variable importance can be computed with model-agnostic permutation methods [@fisher:2019:AMW] as an alternative to model-specific methods. Permutation-based importance, the default method in MachineShop, is obtained by permuting the values of a predictor and measuring the resulting decrease in the model's predictive performance, so that larger decreases indicate more important variables.

## Permutation-based variable importance
varimp(surv_fit)
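
For intuition, the permutation scheme can be sketched in a few lines of base R. This is a conceptual illustration only, not the package's implementation; the perm_importance() helper, the mse() metric, and the lm()/mtcars example are assumptions made for the sketch.

## Conceptual sketch of permutation-based importance (illustrative only)
perm_importance <- function(model, data, y, metric,
                            predict_fun = function(m, d) predict(m, newdata = d)) {
  baseline <- metric(y, predict_fun(model, data))
  vapply(names(data), function(var) {
    permuted <- data
    permuted[[var]] <- sample(permuted[[var]])
    metric(y, predict_fun(model, permuted)) - baseline
  }, numeric(1))
}

## Worked example: increase in mean squared error after permuting each predictor
mse <- function(obs, pred) mean((obs - pred)^2)
lm_fit <- lm(mpg ~ ., data = mtcars)
sort(perm_importance(lm_fit, mtcars[, -1], mtcars$mpg, mse), decreasing = TRUE)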

There are a number of advantages to permutation-based variable importance. In particular, it can be computed for any

  1. model,
  2. performance metric defined for the given response variable type, and
  3. predictor variable in the original training set.

Conversely, model-specific methods are not defined for some models, are generally limited to metrics implemented in their source packages, and might be computed on derived, rather than original, predictor variables. These differences can make comparisons of variable importance across classes of models difficult if not impossible. The trade-off for the advantages of a permutation-based approach is increased computation time. To speed up computations, the algorithm will run in parallel if a compatible backend is loaded as described in the Parallel Processing section.
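
As one illustration, registering a foreach backend such as the one provided by the doParallel package would enable parallel computation of the importance values; the exact backends supported are described in the Parallel Processing section, so treat the following as a sketch rather than a definitive setup.

## Illustrative parallel setup (supported backends are version-dependent)
library(doParallel)
registerDoParallel(cores = 2)
varimp(surv_fit)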

Recursive Feature Elimination

Recursive feature elimination (RFE) is a wrapper method of variable selection. In wrapper methods, a given model is fit to subsets of predictor variables in order to select the subset whose fit is optimal. Forward, backward, and step-wise variable selection are examples of wrapper methods. RFE is a type of backward selection in which subsets are formed from decreasing numbers of the most important predictor variables. This algorithm is implemented in the package-supplied rfe() function.

This RFE algorithm differs from others in that variables are "eliminated" by permuting their values rather than by removing them from the dataset. Using a permutation approach for both the elimination of variables and computation of variable importance enables application of the rfe() function to any variable specification (traditional formula, design matrix, model frame, or recipe) and any model available in the package. The syntax for rfe() is similar to resample() as illustrated in the following example.

## Recursive feature elimination
(surv_rfe <- rfe(surv_fo, data = surv_train, model = GBMModel,
                 control = surv_means_control))
rfe_summary <- summary(surv_rfe)
rfe_summary$terms[rfe_summary$selected]
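
The selected terms can then be used to refit a reduced model. The following sketch assumes the terms column of the summary contains character term labels and that surv_fo, surv_train, and GBMModel are as defined previously:

## Refit with only the RFE-selected terms (illustrative)
selected <- unlist(rfe_summary$terms[rfe_summary$selected])
surv_fo_sel <- update(surv_fo,
                      as.formula(paste(". ~", paste(selected, collapse = " + "))))
surv_fit_sel <- fit(surv_fo_sel, data = surv_train, model = GBMModel)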

Partial Dependence Plots

Partial dependence plots show the marginal effects of predictors on a response variable. Dependence for a select set of one or more predictor variables $X_S$ is computed as $$ \bar{f}_S(X_S) = \frac{1}{N}\sum_{i=1}^N f(X_S, x_{iS'}), $$ where $f$ is a fitted prediction function and $x_{iS'}$ are values of the remaining predictors in a dataset of $N$ cases. The response scale displayed in dependence plots will depend on the response variable type: probability for predicted factors and survival probabilities, original scale for numerics, and survival time for predicted survival means. By default, dependence is computed for each selected predictor individually over a grid of 10 approximately evenly spaced values and averaged over the dataset on which the prediction function was fit.

## Partial dependence plots
pd <- dependence(surv_fit, select = c(thickness, age))
plot(pd)

Estimated predictor effects are marginal in that they are averaged over the remaining variables, whose distribution depends on the population represented by the dataset. Consequently, partial dependence plots for a given model can vary across datasets and populations. The package allows averaging over different datasets to estimate marginal effects in other case populations, over different numbers of predictor values, and over quantile spacing of the values.

pd <- dependence(surv_fit, data = surv_test, select = thickness, n = 20,
                 intervals = "quantile")
plot(pd)

In addition, dependence may be computed for combinations of multiple predictors to examine interaction effects and for summary statistics other than the mean.
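
For example, a joint dependence over thickness and age with a median summary statistic might be requested as below; the interaction and stats arguments are assumed here, so consult the dependence() help page for the exact interface.

## Joint dependence with a median summary (argument names assumed)
pd_int <- dependence(surv_fit, select = c(thickness, age), interaction = TRUE,
                     stats = median)
plot(pd_int)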

Calibration Curves

Agreement between model-predicted and observed values can be visualized with calibration curves. Calibration curves supplement individual performance metrics with information on model fit in different regions of predicted values. They also provide more direct assessment of agreement than some performance metrics, like ROC AUC, that do not account for scale and location differences. In the construction of binned calibration curves, cases are partitioned into equal-width intervals according to their (resampled) predicted responses. Mean observed responses are then calculated within each of the bins and plotted on the vertical axis against the bin midpoints on the horizontal axis.

## Binned calibration curves
cal <- calibration(res_probs, breaks = 10)
plot(cal, se = TRUE)
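
For intuition, the binning underlying these curves can be illustrated in base R for a binary outcome. This is a conceptual illustration with simulated data, not the package's implementation:

## Conceptual illustration of binned calibration for a binary outcome
set.seed(123)
p_hat <- runif(200)                       # hypothetical predicted probabilities
y <- rbinom(200, size = 1, prob = p_hat)  # hypothetical observed outcomes
breaks <- seq(0, 1, by = 0.1)
bins <- cut(p_hat, breaks = breaks, include.lowest = TRUE)
data.frame(midpoint = head(breaks, -1) + diff(breaks) / 2,
           observed = as.numeric(tapply(y, bins, mean)))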

As an alternative to discrete bins, curves can be smoothed by setting breaks = NULL to compute weighted averages of observed values. Smoothing has the advantage of producing more precise curves by including more observed values in the calculation at each predicted value.

## Smoothed calibration curves
cal <- calibration(res_probs, breaks = NULL)
plot(cal)

Calibration curves close to the 45$^\circ$ line indicate agreement between observed and predicted responses; a model with such a curve is said to be well calibrated.

Confusion Matrices

Confusion matrices of cross-classified observed and predicted categorical responses are available with the confusion() function. They can be constructed with predicted class membership or with predicted class probabilities. In the latter case, predicted class membership is derived from predicted probabilities according to a probability cutoff value for binary factors (default: cutoff = 0.5) and according to the class with highest probability for factors with more than two levels.

## Confusion matrices
(conf <- confusion(res_probs, cutoff = 0.7))
plot(conf)

Confusion matrices are the data structure upon which many of the performance metrics described earlier for factor responses are based. Metrics commonly reported for confusion matrices are generated by the summary() function.

## Summary performance metrics
summary(conf)

Summaries can also be obtained with the performance() function for default or user-specified metrics.

## Confusion matrix-specific metrics
metricinfo(conf) %>% names

## User-specified metrics
performance(conf, metrics = c("Accuracy" = accuracy,
                              "Sensitivity" = sensitivity,
                              "Specificity" = specificity))

Performance Curves

Tradeoffs between correct and incorrect classifications of binary responses, across the range of possible cutoff probabilities, can be studied with performance curves. In general, any two binary response metrics may be specified for the construction of a performance curve.
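
For example, sensitivity could be paired with specificity directly, using the same metric functions shown earlier:

## Example of a user-specified pair of metrics
curve_ss <- performance_curve(res_probs, metrics = c(sensitivity, specificity))
plot(curve_ss)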

ROC Curves

Receiver operating characteristic (ROC) curves are one example in which true positive rates (sensitivity) are plotted against false positive rates (1 - specificity) [@fawcett:2006:IRA]. True positive rate (TPR) and false positive rate (FPR) are defined as $$ \begin{aligned} TPR &= \text{sensitivity} = \Pr(\hat{p} > c \mid D^+) \\ FPR &= 1 - \text{specificity} = \Pr(\hat{p} > c \mid D^-), \end{aligned} $$ where $\hat{p}$ is the model-predicted probability of being positive, $0 \le c \le 1$ is a probability cutoff value for classification as positive or negative, and $D^+/D^-$ is positive/negative case status. ROC curves show tradeoffs between the two rates over the range of possible cutoff values. Higher curves are indicative of better predictive performance.

## ROC curves
roc <- performance_curve(res_probs)
plot(roc, diagonal = TRUE)

ROC curves show the relation between the two rates being plotted but not their relationships with specific cutoff values. The latter may be helpful for the selection of a cutoff to apply in practice. Accordingly, separate plots of each rate versus the range of possible cutoffs are available with the type = "cutoffs" option.

plot(roc, type = "cutoffs")

Area under the ROC curve (ROC AUC) is an overall measure of model predictive performance. It is interpreted as the probability that a randomly selected positive case will have a higher predicted value than a randomly selected negative case. AUC values of 0.5 and 1.0 indicate chance and perfect concordance, respectively, between predicted probabilities and observed responses.

auc(roc)
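
ROC AUC can also be requested as a resampled performance metric, which yields summary statistics across the resamples; the roc_auc metric function is assumed here to be the one supplied by the package:

## ROC AUC as a resampled performance metric (roc_auc assumed)
summary(performance(res_probs, metrics = roc_auc))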

Precision Recall Curves

Precision recall curves plot precision (positive predictive value) against recall (sensitivity) [@davis:2006:RPR], where $$ \begin{aligned} \text{precision} &= PPV = \Pr(D^+ \mid \hat{p} > c) \\ \text{recall} &= \text{sensitivity} = \Pr(\hat{p} > c \mid D^+). \end{aligned} $$ These curves tend to be used when primary interest lies in detecting positive cases and such cases are rare.

## Precision recall curves
pr <- performance_curve(res_probs, metrics = c(precision, recall))
plot(pr)
auc(pr)

Lift Curves

Lift curves depict the rate at which positive cases are found as a function of the proportion predicted to be positive in the population. In particular, they plot true positive rate (sensitivity) against positive prediction rate (PPR) for all possible classification probability cutoffs, where $$ \begin{aligned} TPR &= \Pr(\hat{p} > c \mid D^+) \\ PPR &= \Pr(\hat{p} > c). \end{aligned} $$ Models more efficient (lower cost) at identifying positive cases find them at a higher proportion ($TPR$) while predicting fewer in the overall population to be positive ($PPR$). In other words, higher lift curves are signs of model efficiency.

## Lift curves
lf <- lift(res_probs)
plot(lf, find = 0.75)

