npelVIMP: Generate an estimate of variable importance for nearest neighbour classifiers


Description

There is no well-established way of determining variable importance for nearest neighbour classifiers. In an effort to generate some useful metric for comparison with other models, we developed this leave-one-out type approach; it is analogous to a variable inflation metric. The algorithm proceeds as follows:

  1. Given a model, find the formula it was generated with.

  2. Compute the overall accuracy as well as the accuracy-by-class (producer accuracy) for that model.

  3. Re-generate the model using all but one of the input variables.

  4. Save the overall and class-level accuracies of this reduced model.

  5. Repeat for each input variable, so that every variable has an accuracy metric for the model fitted without it.

  6. Standardize each variable (column) by the change from the accuracy of the complete model:

    Acc_complete - Acc_without.variable.x
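The steps above amount to a simple leave-one-out loop. A minimal sketch in R, where `fitModel()` and `accuracy()` are hypothetical stand-ins for whatever model-fitting and accuracy-assessment code is in use (they are not part of this package):

```r
# Leave-one-out VIMP sketch; fitModel() and accuracy() are hypothetical stand-ins
looVIMP <- function(data, yName, xNames, fitModel, accuracy) {
  full <- fitModel(reformulate(xNames, yName), data)
  accFull <- accuracy(full, data)              # overall accuracy of the complete model
  vimp <- sapply(xNames, function(drop) {
    reduced <- fitModel(reformulate(setdiff(xNames, drop), yName), data)
    accFull - accuracy(reduced, data)          # drop in accuracy when 'drop' is removed
  })
  sort(vimp, decreasing = TRUE)                # larger drop => more important variable
}
```

The real function also records the per-class (producer) accuracies at each step, yielding one column per class rather than a single vector.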

Usage

npelVIMP(model, calc = FALSE, echo = TRUE)

Arguments

model

the classifier to test

calc

(optional) should the function recalculate the VIMP? That is, if the model is one of the types that computes VIMP during generation (currently randomForest, randomForestSRC, and gbm), should it be recalculated, or should the pre-existing VIMP be used?

echo

(optional) should the function inform the user about its progress

Details

Although this approach was developed for nearest-neighbour models, once those estimates were available we realized they still could not be compared across model types because of the different algorithms used; hence this function was expanded to accept any of the models this package deals with. See Limitations for a (short) discussion of why this method is not the greatest.

Value

Returns a data frame with one row per variable and one column per class; the first column is the overall VIMP for that variable.

Limitations

While this algorithm is simple, easy to compute, and applicable to any model, it has some serious limitations.

Hence, our recommendation is to treat the VIMP data produced by this function with caution: perhaps consider only the rank order, or look only at gross effects. We have considered standardizing the output in other ways, e.g. by the largest value in each row (each variable that has been removed) so the maximum in each row is unity, but it is not immediately clear what such results would mean. So, while there is considerable literature on using leave-one-out approaches for cross-validation, until the technique of using them as a VIMP metric can be further studied, the results of this function fall under caveat emptor. Enjoy!
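If, as suggested, only the rank order is trusted, the returned data frame can be converted to per-class ranks column by column. A small self-contained sketch with toy values standing in for npelVIMP() output:

```r
# Toy VIMP output: rows are variables, columns are overall and per-class scores
vimp <- data.frame(overall = c(0.10, 0.03, 0.07),
                   classA  = c(0.12, 0.01, 0.05),
                   row.names = c('dem', 'slp', 'asp'))

# Rank within each column: rank 1 = largest accuracy drop = most important
vimpRanks <- apply(vimp, 2, function(col) rank(-col))
vimpRanks
```

Comparing these ranks, rather than the raw values, sidesteps the scaling problems discussed above.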

Warning

Given that each model type, and even each model implementation, uses different algorithms for computing VIMP, the output from this function does not fall on the same scale as that given by other methods. This is not a limitation of our function; it is a limitation of VIMP in general. In fact, that this function can compute VIMP on all the models used is one of its strengths. However, do not try to compare (or plot) the VIMP output from different models/packages against each other; the values are only valid relative to one another within a single model.

See Also

For VIMP information about packages used that have the metric built-in: importance, rfsrc, and summary.gbm.

Examples

data ('siteData')
modelRun <- generateModels (data = siteData,
                            modelTypes = suppModels,
                            x = c('brtns','grnns','wetns','dem','slp','asp','hsd'),
                            y = 'ecoType',
                            grouping = ecoGroup[['domSpecies','transform']])
npelVIMP (modelRun$rfsrc, calc = FALSE)   # use the VIMP stored in the model
npelVIMP (modelRun$rfsrc, calc = TRUE)    # recalculate with the leave-one-out approach

henkelstone/NPEL.Classification documentation built on May 17, 2019, 3:42 p.m.