Calculation of performance metrics on test sets or by resampling, as discussed previously, is one method of assessing model performance. Others available include measures of predictor variable importance, partial dependence plots, calibration curves comparing observed and predicted response values, and receiver operating characteristic analysis.
The importance of predictor variables in a model fit is estimated with the r rdoc_url("varimp()")
function and displayed graphically with r rdoc_url("plot()")
. Variable importance is a relative measure of the contributions of model predictors and is scaled by default to have a maximum value of 100, where higher values represent more important variables. Model-specific implementations of variable importance are available in many cases, although their definitions may differ. In the case of a r rdoc_url("GBMModel")
, importance of each predictor is based on the sum of squared empirical improvements over all internal tree nodes created by splitting on that variable [@greenwell:2019:GBM].
## Predictor variable importance (vi <- varimp(surv_fit, method = "model")) plot(vi)
In contrast, importance is based on negative log-transformed p-values for statistical models, like r rdoc_url("CoxModel")
, that produce them. For other models, variable importance may be defined and calculated by their underlying source packages or not defined at all, as is the case for r rdoc_url("SVMModel")
. Logical indicators of model-specific variable importance are given in the information displayed by model constructors and returned by r rdoc_url("modelinfo()")
.
SVMModel modelinfo(SVMModel)[[1]]$varimp
Variable importance can be computed with model-agnostic permutation methods [@fisher:2019:AMW] as an alternative to model-specific methods. The following algorithm for permutation-based variable importance is implemented and the default method in MachineShop.
Fit a model to a training dataset.
Compute training performance.
For $s = 1, \ldots, S$.
Optionally sample a subset of the training set without replacement.
For predictor variable $p = 1, \ldots, P$.
Randomly permute the variable values.
Compute performance on the permutation set.
Compute importance as the difference or ratio between the permutation and training performances.
Reset the variable to its original values.
Return the mean or other summary statistics of importance for each variable.
## Permutation-based variable importance varimp(surv_fit)
There are a number of advantages to permutation-based variable importance. In particular, it can be computed for any
Conversely, model-specific methods are not defined for some models, are generally limited to metrics implemented in their source packages, and might be computed on derived, rather than original, predictor variables. These differences can make comparisons of variable importance across classes of models difficult if not impossible. The trade-off for the advantages of a permutation-based approach is increased computation time. To speed up computations, the algorithm will run in parallel if a compatible backend is loaded as described in the Parallel Processing section.
Recursive feature elimination (RFE) is a wrapper method of variable selection. In wrapper methods, a given model is fit to subsets of predictor variables in order to select the subset whose fit is optimal. Forward, backward, and step-wise variable selection are examples of wrapper methods. RFE is a type of backward selection in which subsets are formed from decreasing numbers of the most important predictor variables. The RFE algorithm implemented in the package-supplied rfe()
function is summarized below.
Compute variable importance for all predictors.
For predictor subsets of sizes $S = S_n > ... > S_1$.
Eliminate predictors whose variable importance is not in the top $S$ by randomly permuting their values.
Compute a resampled estimate of model predictive performance.
Optionally recompute variable importance for the top $S$ predictors, and set importance equal to zero for those eliminated.
Select the predictor set with highest predictive performance.
This RFE algorithm differs from others in that variables are "eliminated" by permuting their values rather than by removing them from the dataset. Using a permutation approach for both the elimination of variables and computation of variable importance enables application of the rfe()
function to any variable specification (traditional formula, design matrix, model frame, or recipe) and any model available in the package. The syntax for rfe()
is similar to resample()
as illustrated in the following example.
## Recursive feature elimination (surv_rfe <- rfe(surv_fo, data = surv_train, model = GBMModel, control = surv_means_control)) rfe_summary <- summary(surv_rfe) rfe_summary$terms[rfe_summary$selected]
Partial dependence plots show the marginal effects of predictors on a response variable. Dependence for a select set of one or more predictor variables $X_S$ is computed as $$ \bar{f}S(X_S) = \frac{1}{N}\sum{i=1}^N f(X_S, x_{iS'}), $$ where $f$ is a fitted prediction function and $x_{iS'}$ are values of the remaining predictors in a dataset of $N$ cases. The response scale displayed in dependence plots will depend on the response variable type: probability for predicted factors and survival probabilities, original scale for numerics, and survival time for predicted survival means. By default, dependence is computed for each selected predictor individually over a grid of 10 approximately evenly spaced values and averaged over the dataset on which the prediction function was fit.
## Partial dependence plots pd <- dependence(surv_fit, select = c(thickness, age)) plot(pd)
Estimated predictor effects are marginal in that they are averaged over the remaining variables, whose distribution depends on the population represented by the dataset. Consequently, partial dependence plots for a given model can vary across datasets and populations. The package allows averaging over different datasets to estimate marginal effects in other case populations, over different numbers of predictor values, and over quantile spacing of the values.
pd <- dependence(surv_fit, data = surv_test, select = thickness, n = 20, intervals = "quantile") plot(pd)
In addition, dependence may be computed for combinations of multiple predictors to examine interaction effects and for summary statistics other than the mean.
Agreement between model-predicted and observed values can be visualized with calibration curves. Calibration curves supplement individual performance metrics with information on model fit in different regions of predicted values. They also provide more direct assessment of agreement than some performance metrics, like ROC AUC, that do not account for scale and location differences. In the construction of binned calibration curves, cases are partitioned into equal-width intervals according to their (resampled) predicted responses. Mean observed responses are then calculated within each of the bins and plotted on the vertical axis against the bin midpoints on the horizontal axis.
## Binned calibration curves cal <- calibration(res_probs, breaks = 10) plot(cal, se = TRUE)
As an alternative to discrete bins, curves can be smoothed by setting breaks = NULL
to compute weighted averages of observed values. Smoothing has the advantage of producing more precise curves by including more observed values in the calculation at each predicted value.
## Smoothed calibration curves cal <- calibration(res_probs, breaks = NULL) plot(cal)
Calibration curves close to the 45$^\circ$ line represent agreement between observed and predicted responses and a model that is said to be well calibrated.
Confusion matrices of cross-classified observed and predicted categorical responses are available with the r rdoc_url("confusion()")
function. They can be constructed with predicted class membership or with predicted class probabilities. In the latter case, predicted class membership is derived from predicted probabilities according to a probability cutoff value for binary factors (default: cutoff = 0.5
) and according to the class with highest probability for factors with more than two levels.
## Confusion matrices (conf <- confusion(res_probs, cutoff = 0.7))
plot(conf)
Confusion matrices are the data structure upon which many of the performance metrics described earlier for factor predictor variables are based. Metrics commonly reported for confusion matrices are generated by the r rdoc_url("summary()")
function.
## Summary performance metrics summary(conf)
Summaries can also be obtained with the r rdoc_url("performance()")
function for default or use-specified metrics.
## Confusion matrix-specific metrics metricinfo(conf) %>% names ## User-specified metrics performance(conf, metrics = c("Accuracy" = accuracy, "Sensitivity" = sensitivity, "Specificity" = specificity))
Tradeoffs between correct and incorrect classifications of binary responses, across the range of possible cutoff probabilities, can be studied with performance curves. In general, any two binary response metrics may be specified for the construction of a performance curve.
Receiver operating characteristic (ROC) curves are one example in which true positive rates (sensitivity) are plotted against false positive rates (1 - specificity) [@fawcett:2006:IRA]. True positive rate (TPR) and false positive rate (FPR) are defined as $$ \begin{aligned} TPR &= \text{sensitivity} = \Pr(\hat{p} > c \mid D^+) \ FPR &= 1 - \text{specificity} = \Pr(\hat{p} > c \mid D^-), \end{aligned} $$ where $\hat{p}$ is the model-predicted probability of being positive, $0 \le c \le 1$ is a probability cutoff value for classification as positive or negative, and $D^+/D^-$ is positive/negative case status. ROC curves show tradeoffs between the two rates over the range of possible cutoff values. Higher curves are indicative of better predictive performance.
## ROC curves roc <- performance_curve(res_probs) plot(roc, diagonal = TRUE)
ROC curves show the relation between the two rates being plotted but not their relationships with specific cutoff values. The latter may be helpful for the selection of a cutoff to apply in practice. Accordingly, separate plots of each rate versus the range of possible cutoffs are available with the type = "cutoffs"
option.
plot(roc, type = "cutoffs")
Area under the ROC curve (ROC AUC) is an overall measure of model predictive performance. It is interpreted as the probability that a randomly selected positive case will have a higher predicted value than a randomly selected negative case. AUC values of 0.5 and 1.0 indicate chance and perfect concordance between predicted probabilities and observed responses.
auc(roc)
Precision recall curves plot precision (positive predictive value) against recall (sensitivity) [@davis:2006:RPR], where $$ \begin{aligned} \text{precision} &= PPV = \Pr(D^+ \mid \hat{p} > c) \ \text{recall} &= \text{sensitivity} = \Pr(\hat{p} > c \mid D^+). \end{aligned} $$ These curves tend to be used when primary interest lies in detecting positive cases and such cases are rare.
## Precision recall curves pr <- performance_curve(res_probs, metrics = c(precision, recall)) plot(pr)
auc(pr)
Lift curves depict the rate at which positive cases are found as a function of the proportion predicted to be positive in the population. In particular, they plot true positive rate (sensitivity) against positive prediction rate (PPR) for all possible classification probability cutoffs, where $$ \begin{aligned} TPR &= \Pr(\hat{p} > c \mid D^+) \ PPR &= \Pr(\hat{p} > c). \end{aligned} $$ Models more efficient (lower cost) at identifying positive cases find them at a higher proportion ($TPR$) while predicting fewer in the overall population to be positive ($PPR$). In other words, higher lift curves are signs of model efficiency.
## Lift curves lf <- lift(res_probs) plot(lf, find = 0.75)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.