```r
library(sits)
library(sitsdata)
library(sits.docs)
```
Validation is a process used to estimate the error associated with a model and is widely applied across scientific disciplines. Here, we are interested in estimating the prediction error associated with a model. For this purpose, we concentrate on cross-validation, probably the most widely used validation technique [@Hastie2009].
Cross-validation estimates the expected prediction error. It uses part of the available samples to fit the classification model and a different part to test it. In the so-called k-fold validation, we split the data into $k$ partitions of approximately equal size and fit and test the model $k$ times. At each step, we take one distinct partition for testing and the remaining ${k-1}$ for training the model, and calculate its prediction error when classifying the test partition. A simple average of these errors gives an estimate of the expected prediction error.
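To make the mechanics concrete, the sketch below implements k-fold cross-validation in plain R, using the built-in `iris` data set and a decision tree from the `rpart` package (both chosen here only for illustration; they are not part of `sits`). The `sits_kfold_validate()` function presented below automates the same procedure for satellite image time series samples.

```r
# A minimal sketch of k-fold cross-validation in plain R (illustration only)
library(rpart)

k <- 5
set.seed(42)
# assign each sample to one of k partitions of roughly equal size
folds <- sample(rep(1:k, length.out = nrow(iris)))

errors <- numeric(k)
for (i in 1:k) {
    test  <- iris[folds == i, ]
    train <- iris[folds != i, ]
    # fit the model on the remaining k-1 partitions
    model <- rpart(Species ~ ., data = train, method = "class")
    # classify the held-out partition and record its error rate
    pred <- predict(model, test, type = "class")
    errors[i] <- mean(pred != test$Species)
}
# the average over the k folds estimates the expected prediction error
mean(errors)
```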
A natural question that arises is: how good is this estimate? According to @Hastie2009, there is a bias-variance trade-off in the choice of $k$. If $k$ is set to the number of samples, we obtain the so-called leave-one-out validation; the estimator has low bias for the true expected error but high variance. It is also computationally expensive, since it requires as many model fits as there are samples. On the other hand, if we choose ${k=2}$, we get a highly biased estimate that overstates the true prediction error, but with low variance. The recommended choices of $k$ are $5$ or $10$ [@Hastie2009], which somewhat overestimate the true prediction error.
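As a rough illustration of this computational cost, the sketch below runs leave-one-out validation on the same toy example used above (`iris` and `rpart`, chosen only for illustration): each of the $n$ samples is held out once, so the model is fitted $n$ times.

```r
# Leave-one-out validation: one model fit per sample (illustration only)
library(rpart)

loo_errors <- sapply(seq_len(nrow(iris)), function(i) {
    # fit on all samples except the i-th, then classify the i-th sample
    model <- rpart(Species ~ ., data = iris[-i, ], method = "class")
    pred  <- predict(model, iris[i, , drop = FALSE], type = "class")
    as.numeric(pred != iris$Species[i])
})
# average over all held-out samples: 150 model fits for 150 samples
mean(loo_errors)
```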
The function `sits_kfold_validate()` supports k-fold validation in `sits`. The following code shows how to run a k-fold cross-validation in the package. It performs a five-fold validation using a random forest model as the classifier. The output text shows the corresponding confusion matrix and the accuracy statistics (overall and by class).
```r
# Perform a five-fold validation for the "cerrado_2classes" data set
# using the random forest machine learning method with default parameters
prediction.mx <- sits_kfold_validate(
    cerrado_2classes,
    folds = 5,
    ml_method = sits_rfor()
)
```
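Printing the returned object displays the confusion matrix and the accuracy statistics mentioned above. The component access in the last line below is an assumption based on the `caret::confusionMatrix` convention and may need to be adjusted to the actual structure of the `sits` accuracy object.

```r
# Print the validation result to see the confusion matrix and the
# overall and per-class accuracy statistics
prediction.mx

# Assumed caret-style component holding overall accuracy and kappa
prediction.mx$overall
```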
A useful feature of `sits` is the ability to compare different validation results and store them in an Excel file for further analysis. The following example shows how to do this using the Mato Grosso data set.
```r
data("samples_matogrosso_mod13q1")
results <- list()
conf_svm.tb <- sits_kfold_validate( samples_matogrosso_mod13q1, folds = 5, multicores = 2, ml_method = sits_svm(kernel = "radial", cost = 10))
conf_svm.tb$name <- "svm_10"
results[[length(results) + 1]] <- conf_svm.tb
conf_rfor.tb <- sits_kfold_validate( samples_matogrosso_mod13q1, folds = 5, multicores = 1, ml_method = sits_rfor(num_trees = 500))
conf_rfor.tb$name <- "rfor_500"
results[[length(results) + 1]] <- conf_rfor.tb
sits_to_xlsx(results, file = "./accuracy_mt_ml.xlsx") ````
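The list of results can also be summarized directly in R before further analysis in the spreadsheet. The sketch below compares the overall accuracy of the two models; the `$overall` element and its `"Accuracy"` entry are assumptions based on the `caret::confusionMatrix` convention, not a documented part of the `sits` accuracy object.

```r
# Quick comparison of the stored validations
# (assumes each result keeps a caret-style $overall element)
for (res in results) {
    cat(res$name, "- overall accuracy:",
        round(res$overall["Accuracy"], 3), "\n")
}
```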