View source: R/SOptim_ClassificationFunctions.R
calibrateClassifier (R Documentation)
Main function used for classifier training and evaluation for both single and multi-class problems.
calibrateClassifier(
  calData,
  classificationMethod = "RF",
  classificationMethodParams = NULL,
  balanceTrainData = FALSE,
  balanceMethod = "ubOver",
  evalMethod = "HOCV",
  evalMetric = "Kappa",
  trainPerc = 0.8,
  nRounds = 20,
  minTrainCases = 30,
  minCasesByClassTrain = 10,
  minCasesByClassTest = 10,
  runFullCalibration = FALSE,
  verbose = TRUE
)
calData
An input calibration dataset used for classification.

classificationMethod
An input string defining the classification algorithm to be used (default: "RF"). The available options correspond to the algorithms listed in the Details section (random forest, generalized boosted modelling, support vector machines, k-nearest neighbour and flexible discriminant analysis).

classificationMethodParams
A list object with a customized set of parameters to be used for the classification algorithms (default: NULL). See also generateDefaultClassifierParams to check which parameters can be changed and how to structure the list object.

balanceTrainData
Defines if data balancing is to be used (only available for single-class problems; default: FALSE).

balanceMethod
A character string used to set the data balancing method (default: "ubOver"). Available methods are based on under-sampling and over-sampling of the train data; see the ubBalance function (Details section) for specifics.

evalMethod
A character string defining the evaluation method. The available methods include holdout cross-validation ("HOCV", the default) and k-fold cross-validation.

evalMetric
A character string setting the evaluation metric (default: "Kappa"), or a function that calculates the performance score based on two vectors, one with observed and the other with predicted values (see below for more details). This option defines the outcome value of the genetic algorithm fitness function and the output of the grid or random search optimization routines.

trainPerc
A decimal number defining the training proportion (default: 0.8; used for holdout cross-validation).

nRounds
Number of training rounds used for holdout cross-validation (default: 20).

minTrainCases
The minimum number of training cases required for calibration (default: 30). If the number of rows in the training data is lower than this value the classifier will not run.

minCasesByClassTrain
Minimum number of cases per class in each train data split required for the classifier to run.

minCasesByClassTest
Minimum number of cases per class in each test data split required for the classifier to run.

runFullCalibration
Run a full calibration? Check the Details section (default: FALSE).

verbose
Print progress messages? (default: TRUE)
Two working modes can be used:

i) for "internal" GA optimization or grid/random search: runFullCalibration = FALSE, or,

ii) for performing a full segmented image classification: runFullCalibration = TRUE.
Typically, the first option is used internally for optimizing segmentation parameters in
gaOptimizeSegmentationParams, where the output value from the selected evaluation metric
is passed as the fitness function outcome for GA optimization.
The second option should be used to perform a final image classification and to get full evaluation
statistics (slot: 'PerfStats'), confusion matrices (slot: 'ConfMat'), train/test partition sets (slot: 'TrainSets'),
classifier objects (slot: 'ClassObj') and parameters (slot: 'ClassParams'). In addition to the evaluation rounds
(depending on the evaluation method selected), this option will also run a "full" round in which all the data (i.e.,
no train/test split) are used for training. Results from this option can then be used in predictSegments.
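For illustration, a full calibration run might look like the sketch below (assuming that calData is a calibration dataset previously prepared for SegOptim; the argument values shown are only examples):

# Sketch only: full calibration with holdout cross-validation and default RF settings
# 'calData' is a hypothetical, previously prepared calibration dataset
classifFit <- calibrateClassifier(
  calData              = calData,
  classificationMethod = "RF",
  evalMethod           = "HOCV",
  evalMetric           = "Kappa",
  trainPerc            = 0.8,
  nRounds              = 20,
  runFullCalibration   = TRUE,
  verbose              = TRUE
)
# The resulting object can then be used in predictSegments for the final image classification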
This function can also perform data balancing for single-class problems (check out the options balanceTrainData and balanceMethod).
See the ubBalance function for further details regarding data balancing.
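For example, balancing could be switched on for a single-class problem as in the sketch below (again assuming a hypothetical calData object holding a single-class, 0/1 calibration dataset):

# Sketch only: enable data balancing (single-class problems only)
classifFitBal <- calibrateClassifier(
  calData            = calData,     # hypothetical single-class (0/1) calibration dataset
  balanceTrainData   = TRUE,        # balancing is applied to the train splits only
  balanceMethod      = "ubOver",    # default balancing method
  runFullCalibration = TRUE
)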
For more details about the classification algorithms check out the following functions:
randomForest for random forest algorithm,
gbm for generalized boosted modelling,
svm for details related to support vector machines,
knn for k-nearest neighbour classification, and,
fda for flexible discriminant analysis.
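As a sketch of how customized classifier parameters might be passed (the exact list structure should be checked with generateDefaultClassifierParams; the parameter names below come from the randomForest package and are assumptions for illustration):

# Sketch only: customized random forest parameters (names assumed from the randomForest
# package; verify the expected list structure with generateDefaultClassifierParams)
rfParams <- list(mtry = 4, ntree = 500)

classifFitRF <- calibrateClassifier(
  calData                    = calData,   # hypothetical calibration dataset
  classificationMethod       = "RF",
  classificationMethodParams = rfParams,
  runFullCalibration         = TRUE
)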
If runFullCalibration = FALSE, then a single average value (across evaluation replicates/folds) for the
selected evaluation metric will be returned (typically used for GA optimization).

If runFullCalibration = TRUE, then an object of class SOptim.Classifier is returned with the following elements:

AvgPerf - average value of the evaluation metric selected;

PerfStats - numeric vector with performance statistics (for the selected metric) for each evaluation round plus one more round using the "full" train dataset;

Thresh - for single-class problems only; numeric vector with the threshold values (one for each round plus the "full" dataset) that maximize the selected evaluation metric;

ConfMat - a list object with confusion matrices generated at each round; for single-class problems this matrix is generated by dichotomizing the probability predictions (into 0/1) using the threshold that optimizes the selected evaluation metric (see 'Thresh' above);

obsTestSet - observed values for the test set (one integer vector for each evaluation round plus the full evaluation round);

predTestSet - predicted values for the test set (one integer or numeric vector for each evaluation round plus the full evaluation round);

TrainSets - a list object with row indices identifying train splits for each test round;

ClassObj - a list containing classifier objects for each round;

ClassParams - classification parameters used for running calibrateClassifier.
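Assuming the returned SOptim.Classifier object can be accessed like a named list (an assumption made here for illustration), the elements above could be inspected as follows:

# Sketch only: inspect the output of a full calibration run (list-like access assumed)
classifFit$AvgPerf      # average value of the selected evaluation metric
classifFit$PerfStats    # per-round performance statistics plus the "full" round
classifFit$Thresh       # single-class only: thresholds maximizing the evaluation metric
classifFit$ConfMat      # confusion matrices generated at each round
classifFit$TrainSets    # row indices of the train splits used in each round
classifFit$ClassParams  # parameters used to run calibrateClassifier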
In argument evalMetric it is possible to define a custom function. This must take two vectors: one containing
the observed/ground-truth values (first argument) and the other containing the values predicted by the trained classifier
(second argument), both for the test set (from holdout or k-fold CV). If the classification task is single-class (e.g., 1: forest / 0: non-forest,
1: water / 0: non-water) then the predicted values will be probabilities (ranging in [0,1]) for the class of interest (coded as
1's). If the task is multi-class, then the predicted values will be integer codes for each class.

To be considered valid, an evaluation function for single-class problems must:

Have at least two input arguments (observed and predicted);

Produce a non-null and valid numerical result;

Return a scalar output;

Return an attribute named 'thresh' defining the numerical threshold used to
binarize the classifier predictions (i.e., to convert from continuous probability
to discrete 0/1). Calculating this threshold makes it possible to maximize the
value of the performance metric instead of using a naive 0.5 cutoff value.
Here is an example function that calculates the maximum overall accuracy across multiple threshold values:

calcMaxAccuracy <- function(obs, pred){
  accuracies <- c()
  i <- 0
  N <- length(obs)
  thresholds <- seq(0, 1, 0.05)
  for(thresh in thresholds){
    i <- i + 1
    # Binarize the predicted probabilities with the current threshold
    pred_bin <- as.integer(pred > thresh)
    # Force both 0/1 levels so the confusion matrix is always 2 x 2
    confusionMatrix <- as.matrix(table(factor(obs, levels = c(0, 1)),
                                       factor(pred_bin, levels = c(0, 1))))
    # Overall accuracy: correctly classified cases over the total number of cases
    accuracies[i] <- sum(diag(confusionMatrix)) / N
  }
  bestAccuracy <- max(accuracies)
  # Attach the threshold that maximizes accuracy as the 'thresh' attribute
  attr(bestAccuracy, "thresh") <- thresholds[which.max(accuracies)]
  return(bestAccuracy)
}

x <- sample(0:1, 100, replace = TRUE)
y <- runif(100)
calcMaxAccuracy(obs = x, pred = y)
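A custom function such as calcMaxAccuracy above can then be passed directly through the evalMetric argument, for example (sketch only, with a hypothetical calData object):

# Sketch only: use the custom single-class metric during calibration
classifFitCustom <- calibrateClassifier(
  calData            = calData,          # hypothetical single-class calibration dataset
  evalMetric         = calcMaxAccuracy,  # custom metric defined above
  runFullCalibration = TRUE
)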
Valid multi-class functions must:

Have at least two input arguments (observed and predicted);

Produce a non-null and valid numerical result;

Return a scalar output.
An example of a valid custom function to calculate the overall accuracy:

calcAccuracy <- function(obs, pred){
  N <- length(obs)
  # Use a common set of class levels so the confusion matrix is always square
  classLevels <- sort(union(obs, pred))
  confusionMatrix <- as.matrix(table(factor(obs, levels = classLevels),
                                     factor(pred, levels = classLevels)))
  # Overall accuracy: correctly classified cases over the total number of cases
  acc <- sum(diag(confusionMatrix)) / N
  return(acc)
}

x <- sample(0:1, 100, replace = TRUE)
y <- sample(0:1, 100, replace = TRUE)
calcAccuracy(obs = x, pred = y)
1) By default, 25% or more of the calibration/evaluation rounds must produce valid results, otherwise the
optimization algorithm will return NA.
2) Data balancing is only performed on the train dataset to avoid bias in performance evaluation derived from this procedure.