KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semi-supervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, KODAMA is driven by an integrated cross-validation of its own results.

KODAMA(data, M = 100, Tcycle = 20, FUN_VAR = function(x) { ceiling(ncol(x)) }, FUN_SAM = function(x) { ceiling(nrow(x) * 0.75) }, bagging = FALSE, FUN = c("PLS-DA", "KNN"), f.par = 5, W = NULL, constrain = NULL, fix = NULL, epsilon = 0.05, dims = 2, landmarks = 5000)

`data` |
a matrix. |

`M` |
number of repetitions of the iterative process (steps I-III). |

`Tcycle` |
number of iterative cycles that lead to the maximization of cross-validated accuracy. |

`FUN_VAR` |
function returning the number of variables to select randomly. By default, all variables are taken. |

`FUN_SAM` |
function returning the number of samples to select randomly. By default, 75 per cent of all samples are taken. |

`bagging` |
Should sampling be with replacement? The default is `FALSE`. |

`FUN` |
classifier to be considered. Choices are "`PLS-DA`" and "`KNN`". |

`f.par` |
parameters of the classifier. |

`W` |
a vector of |

`constrain` |
a vector of |

`fix` |
a vector of |

`epsilon` |
cut-off value for low proximity. High proximities are typical of intracluster relationships, whereas low proximities are expected for intercluster relationships. Very low proximities between samples are ignored by (default) setting `epsilon = 0.05`. |

`dims` |
dimensions of the configurations of Sammon's non-linear mapping based on the KODAMA dissimilarity matrix. |

`landmarks` |
number of landmarks to use. |
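As a minimal sketch of how the `FUN_VAR`, `FUN_SAM` and `bagging` arguments interact, the default selector functions from the usage line can be evaluated on a small matrix. The subsetting with `sample()` below is an illustration only and is not taken from the package internals; the names `x`, `n_var`, `n_sam`, `idx_sam`, `idx_var` and `x_sub` are hypothetical.

```r
# Default selectors, copied from the usage line of KODAMA().
FUN_VAR <- function(x) { ceiling(ncol(x)) }
FUN_SAM <- function(x) { ceiling(nrow(x) * 0.75) }

x <- as.matrix(iris[, -5])   # 150 samples, 4 variables

n_var <- FUN_VAR(x)          # all 4 variables
n_sam <- FUN_SAM(x)          # ceiling(150 * 0.75) = 113 samples

# Assumption: a random subset of that size is drawn, with or without
# replacement depending on `bagging`.
set.seed(1)
bagging <- FALSE
idx_sam <- sample(nrow(x), n_sam, replace = bagging)
idx_var <- sample(ncol(x), n_var)

x_sub <- x[idx_sam, idx_var, drop = FALSE]
dim(x_sub)                   # 113 rows, 4 columns
```

A custom selector follows the same shape: any function that takes the data matrix and returns a single count, for example `function(x) ceiling(sqrt(ncol(x)))`, could be passed as `FUN_VAR`.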

KODAMA consists of five steps, which can be divided into two parts: (i) the maximization of cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part contains the core idea of KODAMA: partitioning the data under the guidance of cross-validated accuracy maximization. At the beginning of this part, a fraction of the total samples (defined by `FUN_SAM`) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated `M` times to average out the effects of the randomness of the iterative procedure. Each time this part is repeated, a different fraction of samples is selected. The second part collects and processes these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure (steps IV and V). Finally, Sammon's non-linear mapping is used to visualise the KODAMA dissimilarity matrix.
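To make the idea of averaging over repeated runs more concrete, the following toy sketch builds a proximity matrix from a handful of classification vectors. It assumes that proximity is the fraction of repetitions in which two samples received the same class label, counted only over repetitions in which both were drawn; this is a simplified illustration, not the package's implementation. The matrix `res` and the helper matrices `same` and `both` are made-up values and names.

```r
n <- 6   # number of samples

# Hypothetical classification vectors: one row per repetition. NA marks
# samples that were not drawn in that repetition.
res <- rbind(
  c(1, 1, 1, 2, 2, NA),
  c(1, 1, NA, 2, 2, 2),
  c(NA, 1, 1, 2, 2, 2),
  c(1, 1, 2, 2, NA, 2)
)

same <- matrix(0, n, n)   # times a pair shared a class label
both <- matrix(0, n, n)   # times a pair was drawn together

for (m in seq_len(nrow(res))) {
  sel <- which(!is.na(res[m, ]))
  lab <- res[m, sel]
  both[sel, sel] <- both[sel, sel] + 1
  same[sel, sel] <- same[sel, sel] + outer(lab, lab, "==")
}

# Fraction of shared labels; pairs never drawn together get proximity 0.
proximity <- ifelse(both > 0, same / both, 0)
round(proximity, 2)
```

In this toy input, samples 1 and 2 always share a label, samples 1 and 4 never do, and samples 2 and 3 agree in two of the three repetitions in which both were drawn.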

The function returns a list with the following items:

`dissimilarity` |
a dissimilarity matrix. |

`acc` |
a vector with the |

`proximity` |
a proximity matrix. |

`v` |
a matrix containing all classifications obtained by maximizing the cross-validation accuracy. |

`pp` |
a matrix containing the scores of Sammon's non-linear mapping. |

`res` |
a matrix containing all classification vectors obtained through maximizing the cross-validation accuracy. |

`f.par` |
parameters of the classifier. |

`entropy` |
Shannon's entropy of the KODAMA proximity matrix. |

`landpoints` |
indexes of the landmarks used. |
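The `pp` item is described as the scores of Sammon's non-linear mapping computed from a dissimilarity matrix. The sketch below shows how such a mapping can be obtained with `sammon()` from the MASS package, using Euclidean distances on the `USArrests` data as a stand-in for a KODAMA dissimilarity matrix. Whether KODAMA calls this exact function internally is an assumption; the names `d`, `fit` and `pp` are only illustrative.

```r
library(MASS)  # provides sammon()

# Stand-in dissimilarity (Euclidean distances on scaled data). With
# KODAMA, this role would be played by the `dissimilarity` item.
d <- dist(scale(USArrests))

dims <- 2                                  # same meaning as the `dims` argument
fit <- sammon(d, k = dims, trace = FALSE)  # starts from classical MDS by default

pp <- fit$points   # one row per sample, one column per dimension
dim(pp)
fit$stress         # final Sammon stress of the configuration
```

Note that `sammon()` stops with an error when two distinct samples have zero dissimilarity, so duplicated rows need to be handled before calling it.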

Stefano Cacciatore and Leonardo Tenori

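library(KODAMA)                   # load the package
data(iris)
data <- iris[, -5]                # the four numeric measurements
labels <- iris[, 5]               # species, used only to colour the plot
kk <- KODAMA(data, FUN = "KNN")   # run KODAMA with the KNN classifier
plot(kk$pp, col = as.numeric(labels), xlab = "First component", ylab = "Second component", cex = 2)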

