KODAMA    R Documentation
KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, KODAMA is driven by an integrated procedure of cross validation of the results.
KODAMA(data, M = 100, Tcycle = 20,
       FUN_VAR = function(x) { ceiling(ncol(x)) },
       FUN_SAM = function(x) { ceiling(nrow(x) * 0.75) },
       bagging = FALSE, FUN = c("PLS-DA", "KNN"), f.par = 5,
       W = NULL, constrain = NULL, fix = NULL, epsilon = 0.05,
       dims = 2, landmarks = 5000)
data: a matrix.
M: number of iterative processes (steps I-III).
Tcycle: number of iterative cycles that lead to the maximization of the cross-validated accuracy.
FUN_VAR: function that selects the number of variables to be chosen randomly. By default all variables are taken.
FUN_SAM: function that selects the number of samples to be chosen randomly. By default 75 per cent of the samples are taken.
bagging: should sampling be done with replacement? Set bagging = TRUE for sampling with replacement; the default is bagging = FALSE.
FUN: classifier to be used. Choices are "PLS-DA" and "KNN".
f.par: parameters of the classifier.
W: a vector of nrow(data) elements used to initialize the class labels of the samples. By default (W = NULL) no a priori knowledge is used and each element is initialized with a different label.
constrain: a vector of nrow(data) elements. Supervised constraints can be imposed by linking samples: samples sharing the same constrain value are forced to stay in the same class during the maximization of the cross-validated accuracy (a constrained-run sketch follows this argument list).
fix: a vector of nrow(data) logical elements. Samples with fix = TRUE keep the class label assigned in W during the maximization of the cross-validated accuracy; by default all elements are FALSE.
epsilon: cut-off value for low proximity. High proximities are typical of intracluster relationships, whereas low proximities are expected for intercluster relationships. Very low proximities between samples are ignored by (default) setting epsilon = 0.05.
dims: dimensions of the configuration of Sammon's non-linear mapping based on the KODAMA dissimilarity matrix.
landmarks: number of landmarks to use.
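A minimal sketch of a constrained run, assuming the iris data shipped with R; the pairing of rows into replicates below is purely illustrative:

data(iris)
X <- as.matrix(iris[, -5])

## Samples sharing a constrain code are forced into the same class;
## here consecutive pairs of rows are (artificially) treated as replicates.
constrain <- rep(seq_len(nrow(X) / 2), each = 2)

kk_con <- KODAMA(X, FUN = "KNN", constrain = constrain)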
KODAMA consists of five steps, which can in turn be divided into two parts: (i) the maximization of the cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part embodies the core idea of KODAMA, that is, the partitioning of the data guided by the maximization of the cross-validated accuracy. At the beginning of this part, a fraction of the total samples (defined by FUN_SAM) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated M times to average out the effects of the randomness of the iterative procedure; each time this part is repeated, a different fraction of samples is selected. The second part collects and processes these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure (steps IV and V). Finally, Sammon's non-linear mapping is used to visualise the results of the KODAMA dissimilarity matrix.
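A rough sketch of the repetition described above; the M and Tcycle values are illustrative choices, not recommendations:

data(iris)

## Steps I-III are repeated M times, each time on a different random
## fraction of samples; Tcycle bounds the accuracy-maximization cycles
## within each repetition.
kk_quick  <- KODAMA(iris[, -5], M = 20,  Tcycle = 10, FUN = "KNN")  # faster, noisier
kk_stable <- KODAMA(iris[, -5], M = 200, Tcycle = 20, FUN = "KNN")  # slower, more stable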
The function returns a list with the following items:
dissimilarity: a dissimilarity matrix.
acc: a vector with the M cross-validated accuracies (see the inspection sketch after this list).
proximity: a proximity matrix.
v: a matrix containing all the classifications obtained by maximizing the cross-validation accuracy.
pp: a matrix containing the scores of Sammon's non-linear mapping.
res: a matrix containing all the classification vectors obtained by maximizing the cross-validation accuracy.
f.par: parameters of the classifier.
entropy: Shannon's entropy of the KODAMA proximity matrix.
landpoints: indexes of the landmarks used.
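A short sketch of how the returned items might be inspected, assuming the kk object created in the Examples below:

summary(kk$acc)        # cross-validated accuracies over the M repetitions
kk$entropy             # Shannon's entropy of the proximity matrix
dim(kk$dissimilarity)  # dissimilarity matrix fed to Sammon's mapping
head(kk$pp)            # coordinates of the Sammon configuration (dims columns)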
Stefano Cacciatore and Leonardo Tenori
Cacciatore S, Luchinat C, Tenori L. Knowledge discovery by accuracy maximization. Proc Natl Acad Sci U S A 2014;111(14):5117-22. doi: 10.1073/pnas.1220873111.

Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA. KODAMA: an updated R package for knowledge discovery and data mining. Bioinformatics 2017;33(4):621-623. doi: 10.1093/bioinformatics/btw705.
data(iris)
data <- iris[, -5]
labels <- iris[, 5]
kk <- KODAMA(data, FUN = "KNN")
plot(kk$pp, col = as.numeric(labels),
     xlab = "First component", ylab = "Second component", cex = 2)
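A further example along the same lines; the choice of two PLS components here is arbitrary:

## Same data, but with the PLS-DA classifier and two latent components
kk2 <- KODAMA(data, FUN = "PLS-DA", f.par = 2)
plot(kk2$pp, col = as.numeric(labels),
     xlab = "First component", ylab = "Second component", cex = 2)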