There is no well-established way of determining variable importance for nearest-neighbour classifiers. In an effort to generate a useful metric for comparison with other models, we developed this leave-one-out-type approach; it is analogous to a variable inflation metric. The algorithm proceeds as follows:
Given a model, find the formula it was generated with.
Compute the overall accuracy as well as the accuracy-by-class (producer accuracy) for that model.
Re-generate the model using all but one of the input variables...
... and save the overall and class level accuracies.
Repeat for each input variable so we have an accuracy metric for a model in which all-but-one variable is included.
Score each variable (column) by the change in accuracy relative to the complete model:
Acc_complete - Acc_without.variable.x
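The loop above can be sketched in a few lines of base R. This is a minimal illustration only, not the package's implementation: it substitutes a toy nearest-centroid classifier fit on the built-in iris data (both assumptions of this sketch), computes only overall accuracy (not the per-class producer accuracies), and skips refitting niceties such as cross-validation.

```r
## Toy stand-in for "the model": nearest-centroid accuracy on a data frame,
## using only the columns named in `vars`. Base R, no extra packages.
nearest_centroid_acc <- function(data, class_col, vars) {
  X <- as.matrix(data[, vars, drop = FALSE])
  y <- data[[class_col]]
  ## Class centroids: one row per class, one column per variable.
  centroids <- apply(X, 2, function(col) tapply(col, y, mean))
  centroids <- matrix(centroids, ncol = length(vars),
                      dimnames = list(levels(y), vars))
  ## Predict each observation as its nearest centroid (squared distance).
  pred <- levels(y)[apply(X, 1, function(row) {
    which.min(colSums((t(centroids) - row)^2))
  })]
  mean(pred == y)  # overall accuracy
}

vars <- names(iris)[1:4]

## Step 1-2: accuracy of the complete model.
acc_complete <- nearest_centroid_acc(iris, "Species", vars)

## Steps 3-5: refit with each variable withheld in turn, and score each
## variable by the drop in accuracy: Acc_complete - Acc_without.variable.x
vimp <- sapply(vars, function(v) {
  acc_without <- nearest_centroid_acc(iris, "Species", setdiff(vars, v))
  acc_complete - acc_without
})
```

A positive entry in `vimp` means removing that variable hurt accuracy; a negative entry means the reduced model actually did better (see Limitations).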
model: the classifier to test
calc: (optional) should the function recalculate the VIMP? That is, if the model is one of the types that computes VIMP during generation (currently: randomForest, randomForestSRC, and gbm), should we recalculate it, or use the pre-existing VIMP?
echo: (optional) should the function inform the user about its progress
However, once the nearest-neighbour estimates were obtained we realized that we still couldn't compare them because of the different algorithms used; hence we expanded this function to accept any of the models this package deals with. But see Limitations for a (short) discussion of why this method is not the greatest.
Returns a data frame with the variables as rows and the classes as columns. The first column is the overall VIMP for that variable.
While this algorithm is simple, easy to compute, and applicable to any model, it has some (serious) limitations:
The most notable is that it is not clear that a variable's importance to a model is well correlated with the drop in overall accuracy when that variable is removed! Sure, this can be one definition of a VIMP metric, but it is much coarser than those used by, for example, random forest. Hence, while this function gives some idea of which variables are contributing to accuracy, it is not clear that the scale is consistent between variables: a VIMP of 0.1 for a given variable on a given class may not mean the same as a VIMP of 0.1 for a different variable.
Many of the classifiers used in this package are sensitive to collinearity in the data; and there is typically much collinearity in remotely sensed data! We can easily generate 20 or more variables (layers) from the imagery and elevation data alone. It is clear that these variables cannot all be orthogonal. Hence, removing a variable when there is significant collinearity may have no effect on the overall accuracy, or worse, it may have only a 'stochastic' effect, which is to say, it depends on small perturbations in the input data.
It is not uncommon to see models improve when variables are removed. To a certain extent this is to be expected; however, in our experience it occurs more frequently than one might anticipate. Our hypothesis is that the limitations discussed above contribute to this effect.
Hence, our recommendation is to consider the VIMP data produced by this function with caution: perhaps only consider the rank order, or look only at gross effects. We have considered standardizing the output in other ways, e.g. by the largest value in each row (each variable that has been removed) so the maximum in each row is unity, but it isn't immediately clear what the results mean. So, while there is considerable literature on using leave-one-out type approaches for cross-validation, until the technique of using it as a VIMP metric can be further studied, the results of this function fall under caveat emptor. Enjoy!
Given that each model type, and even each model implementation, uses different algorithms for computing VIMP, the output from this function does not fall on the same scale as that given by other methods. This is not a limitation of our function, it is a limitation of VIMP in general. In fact, that our function can compute VIMP on all the models used is one of its strengths. However, do not try to compare (or plot) the VIMP output from different models/packages against each other; they are only valid as relative values.
For VIMP information about packages used that have the metric built-in: importance, rfsrc, and summary.gbm.