```r
library(sits)
library(sitsdata)
library(sits.docs)
```

Validation techniques

Validation is a process undertaken on models to estimate the error associated with them, and it is used widely across scientific disciplines. Here, we are interested in estimating the prediction error associated with a classification model. For this purpose, we concentrate on cross-validation, probably the most widely used validation technique [@Hastie2009].

Cross-validation estimates the expected prediction error. It uses part of the available samples to fit the classification model and a different part to test it. In the so-called k-fold validation, we split the data into $k$ partitions of approximately equal size and then fit and test the model $k$ times. At each step, we take one distinct partition for testing and the remaining ${k-1}$ partitions for training, and compute the prediction error obtained when classifying the test partition. A simple average of these errors gives an estimate of the expected prediction error.
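
In symbols, if $\mathrm{Err}_j$ denotes the prediction error measured when partition $j$ is used for testing and the remaining ${k-1}$ partitions for training, the k-fold estimate of the expected prediction error is simply the average

$$
\widehat{\mathrm{Err}}_{\text{cv}} = \frac{1}{k} \sum_{j=1}^{k} \mathrm{Err}_j.
$$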

A natural question that arises is: how good is this estimate? According to @Hastie2009, there is a bias-variance trade-off in the choice of $k$. If $k$ is set to the number of samples, we obtain the so-called leave-one-out validation; the estimator has low bias with respect to the true expected error, but high variance. It is also computationally expensive, since it requires as many model fits as there are samples. On the other hand, if we choose ${k=2}$, we get an estimate with low variance but high bias, which overestimates the true prediction error. The recommended choices are ${k=5}$ or ${k=10}$ [@Hastie2009], which still somewhat overestimate the true prediction error.
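
The sketch below, written in plain base R and independent of sits, illustrates how samples are assigned to folds; the sample size and variable names are only illustrative. It also makes clear why leave-one-out validation ($k$ equal to the number of samples) requires one model fit per sample.

```r
# assign each of n_samples samples to one of k folds of roughly equal size
set.seed(42)
n_samples <- 20
k <- 5
folds <- sample(rep(1:k, length.out = n_samples))
table(folds)   # approximately equal fold sizes
# leave-one-out is the special case k == n_samples: each fold holds a
# single sample, so the model must be fitted n_samples times
```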

The function sits_kfold_validate() supports k-fold validation in sits. The following code shows how to run a k-fold cross-validation in the package. It performs a five-fold validation of the cerrado_2classes data set using a random forest model with default parameters. The output text shows the corresponding confusion matrix and the accuracy statistics (overall and by class).

```r
# perform a five-fold validation of the "cerrado_2classes" data set
# using the random forest method with default parameters
prediction.mx <- sits_kfold_validate(cerrado_2classes,
                                     folds = 5,
                                     ml_method = sits_rfor())
```
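
Depending on the sits version, the object returned by sits_kfold_validate() is an accuracy assessment; printing it should display the confusion matrix and the accuracy statistics mentioned above. A minimal check, using the variable name from the chunk above:

```r
# print the confusion matrix and the overall and per-class accuracy
print(prediction.mx)
```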

Comparing different validation methods

A useful feature of sits is the ability to compare the validation results of different machine learning methods and store them in an Excel (XLSX) file for further analysis. The following example shows how to do this, using the Mato Grosso data set.

```r
# Retrieve the set of samples for the Mato Grosso region (provided by EMBRAPA)
data("samples_matogrosso_mod13q1")

# create a list to store the results
results <- list()

# adjust the multicores parameter to suit your machine

# SVM model
conf_svm.tb <- sits_kfold_validate(samples_matogrosso_mod13q1,
                                   folds = 5,
                                   multicores = 2,
                                   ml_method = sits_svm(kernel = "radial",
                                                        cost = 10))

# give a name to the SVM model
conf_svm.tb$name <- "svm_10"

# store the result
results[[length(results) + 1]] <- conf_svm.tb

# Random Forest model
conf_rfor.tb <- sits_kfold_validate(samples_matogrosso_mod13q1,
                                    folds = 5,
                                    multicores = 1,
                                    ml_method = sits_rfor(num_trees = 500))

# give a name to the model
conf_rfor.tb$name <- "rfor_500"

# store the result in the list
results[[length(results) + 1]] <- conf_rfor.tb

# save the assessments to an XLSX file
sits_to_xlsx(results, file = "./accuracy_mt_ml.xlsx")
```
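
The same pattern extends to any other machine learning method available in sits. As a sketch, assuming your sits version provides sits_xgboost(), a third assessment could be appended to the list before writing the spreadsheet; sits_to_xlsx() saves each named assessment, so the resulting file should contain one entry per model.

```r
# Extreme Gradient Boosting model (default parameters are used here;
# sits_xgboost() is assumed to be available in your sits version)
conf_xgb.tb <- sits_kfold_validate(samples_matogrosso_mod13q1,
                                   folds = 5,
                                   multicores = 2,
                                   ml_method = sits_xgboost())

# give a name to the model and store the result
conf_xgb.tb$name <- "xgboost"
results[[length(results) + 1]] <- conf_xgb.tb

# rewrite the XLSX file, now with three assessments
sits_to_xlsx(results, file = "./accuracy_mt_ml.xlsx")
```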


