# DetermineWeight_SimClust: Determines an optimal weight for weighted clustering by... In IntClust: Integration of Multiple Data Sets with Clustering Techniques

## Description

The function `DetermineWeight_SimClust` determines an optimal weight for weighted similarity clustering. For each given weight, the clustering of each separate data source is compared to the clustering on the weighted dissimilarity matrix, and a Jaccard coefficient is calculated. The weight whose ratio of Jaccard coefficients is closest to one is taken as the optimal weight.

## Usage

```r
DetermineWeight_SimClust(List, type = c("data", "dist", "clusters"),
  distmeasure = c("tanimoto", "tanimoto"), normalize = c(FALSE, FALSE),
  method = c(NULL, NULL), weight = seq(0, 1, by = 0.01), nrclusters = NULL,
  clust = "agnes", linkage = c("flexible", "flexible"), linkageF = "ward",
  alpha = 0.625, gap = FALSE, maxK = 15, names = NULL, StopRange = FALSE,
  plottype = "new", location = NULL)
```

## Arguments

| Argument | Description |
| --- | --- |
| `List` | A list of matrices of the same type. It is assumed that the rows correspond to the objects. |
| `type` | Indicates whether the matrices in `List` are data matrices, distance matrices or clustering results obtained from the data. If type="dist", the calculation of the distance matrices is skipped; if type="clusters", the single-source clustering is skipped. Should be one of "data", "dist" or "clusters". |
| `distmeasure` | A vector of the distance measures to be used on each data matrix. Each should be one of "tanimoto", "euclidean", "jaccard", "hamming". Defaults to c("tanimoto","tanimoto"). |
| `normalize` | Logical. Indicates whether to normalize the distance matrices. Defaults to c(FALSE, FALSE) for two data sets. Normalization is recommended if different distance types are used. More details on normalization in `Normalization`. |
| `method` | A method of normalization. Should be one of "Quantile", "Fisher-Yates", "standardize", "Range" or any of the first letters of these names. Default is c(NULL,NULL) for two data sets. |
| `weight` | Optional. A list of different weight combinations for the data sets in `List`. If NULL, the weights are set equal for each data set. It is also possible to fix the weights for some data matrices and let them vary for the remaining data sets. Defaults to seq(0,1,by=0.01), as shown in Usage. An example is provided in the details. |
| `nrclusters` | The number of clusters to cut the dendrogram into. This is necessary for the computation of the Jaccard coefficient. Default is NULL. |
| `clust` | Choice of clustering function (character). Defaults to "agnes". |
| `linkage` | Choice of inter-group dissimilarity (character) for the individual clusterings. Defaults to c("flexible","flexible"). |
| `linkageF` | Choice of inter-group dissimilarity (character) for the final clustering. Defaults to "ward". |
| `alpha` | The parameter alpha to be used in the "flexible" linkage of the agnes function. Defaults to 0.625 and is only used if the linkage is set to "flexible". |
| `gap` | Logical. Whether to calculate the gap statistic in the clustering on each data matrix separately. Only used if type="data". Default is FALSE. |
| `maxK` | The maximal number of clusters to consider in calculating the gap statistic. Only used if type="data". Default is 15. |
| `names` | The labels to give to the elements in `List`. Default is NULL. |
| `StopRange` | Logical. Controls whether distance matrices with values not between zero and one are standardized to that range. If FALSE, range normalization is performed (see `Normalization`). If TRUE, the distance matrices are not changed. This is recommended if different types of data are used, so that they are comparable. Default is FALSE. |
| `plottype` | Should be one of "pdf", "new" or "sweave". If "pdf", a location should be provided in `location` and the figure is saved there. If "new", a new graphic device is opened. If "sweave", the figure is made compatible with a Sweave or knitr document, i.e. no new device is opened and the plot appears in the current device or document. Default is "new". |
| `location` | If plottype is "pdf", the location where the figure is saved. Default is NULL. |
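To make the effect of `StopRange = FALSE` more concrete, the following is a minimal sketch of a range normalization, assuming a plain min-max rescaling to [0, 1]. The helper `range_normalize` is hypothetical and written for illustration only; the package's `Normalization` function may differ in detail.

```r
# Hypothetical helper: min-max rescaling of a distance matrix to [0, 1].
# This is an assumed form of "range normalization", not IntClust's own code.
range_normalize <- function(d) {
  rng <- range(d)
  (d - rng[1]) / (rng[2] - rng[1])
}

# Euclidean distances between three toy points are not bounded by one.
pts <- matrix(c(0, 0,
                3, 4,
                6, 8), ncol = 2, byrow = TRUE)
d_euclid <- as.matrix(dist(pts))

# After rescaling, the values lie between zero and one.
d_scaled <- range_normalize(d_euclid)
range(d_scaled)
```

Rescaling in this way puts a Euclidean distance matrix on the same scale as, for example, a Tanimoto distance matrix before the two are combined.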

## Details

If the type of List is data, a hierarchical clustering is performed on each data matrix separately. After obtaining clustering results for the two data matrices, the distance matrices are extracted. If these are not calculated with the same distance measure, they are normalized to be in the same range. For each weight, a weighted linear combination of the distance matrices is taken and hierarchical clustering is performed once again. The resulting clustering is compared to each of the separate clustering results and a Jaccard coefficient is computed. The weight whose ratio of Jaccard coefficients is closest to one is taken as the optimal weight. A plot of all the ratios is produced, with the optimal weight marked.
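The procedure described above can be sketched in base R for two data sources. This is an illustration under several assumptions, not the package's implementation: the toy data are random, `stats::hclust` with "ward.D2" stands in for `agnes`, the "binary" distance of `stats::dist` stands in for the Tanimoto distance, the Jaccard coefficient is the pair-counting version, and the weight `w` is assumed to apply to the first source and `1 - w` to the second. The helpers `cut_tree` and `jaccard_partitions` are hypothetical names.

```r
set.seed(1)

# Two hypothetical binary data sources describing the same 12 objects.
n <- 12
src1 <- matrix(rbinom(n * 8, 1, 0.5), nrow = n)
src2 <- matrix(rbinom(n * 6, 1, 0.5), nrow = n)

# Distance matrices with values in [0, 1].
d1 <- as.matrix(dist(src1, method = "binary"))
d2 <- as.matrix(dist(src2, method = "binary"))

# Hierarchical clustering of a distance matrix, cut into k clusters.
cut_tree <- function(d, k, method = "ward.D2") {
  cutree(hclust(as.dist(d), method = method), k = k)
}

# Pair-counting Jaccard coefficient between two label vectors:
# pairs co-clustered in both / pairs co-clustered in at least one.
jaccard_partitions <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  upper <- upper.tri(same_a)
  both <- sum(same_a[upper] & same_b[upper])
  either <- sum(same_a[upper] | same_b[upper])
  both / either
}

k <- 3
lab1 <- cut_tree(d1, k)  # clustering of source 1 alone
lab2 <- cut_tree(d2, k)  # clustering of source 2 alone

# For each weight, cluster the weighted combination and compare it
# with both single-source clusterings.
weights <- seq(0, 1, by = 0.1)
result <- do.call(rbind, lapply(weights, function(w) {
  labw <- cut_tree(w * d1 + (1 - w) * d2, k)
  j1 <- jaccard_partitions(labw, lab1)
  j2 <- jaccard_partitions(labw, lab2)
  data.frame(weight = w, J1 = j1, J2 = j2, ratio = j1 / j2)
}))

# The weight whose Jaccard ratio is closest to one.
best <- result$weight[which.min(abs(result$ratio - 1))]
```

Here `result` plays the role of the data frame of Jaccard coefficients and ratios, and `best` the role of the reported optimal weight. A ratio near one means the weighted clustering resembles both single-source clusterings to a similar degree.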

The weight combinations should be provided as elements in a list. For three data matrices an example could be: weight=list(c(0.5,0.2,0.3),c(0.1,0.5,0.4)). To fix the weight for some data sets and let it vary for others, the element "x" indicates a free parameter. An example is weight=list(c(0.7,"x","x")). The weight 0.7 is now fixed for the first data matrix, while the remaining 0.3 is divided over the other two data sets. This implies that every combination of the sequence from 0 to 0.3 with steps of 0.1 will be reported and clustering will be performed for each.
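As an illustration of how a specification with free parameters could be expanded, the sketch below enumerates every combination for c(0.7,"x","x") in steps of 0.1. The helper `expand_weights` is hypothetical and not part of IntClust; how the package enumerates the combinations internally is an assumption here.

```r
# Hypothetical helper: expand a weight vector containing "x" placeholders
# into all combinations that sum to one, on a grid with the given step.
expand_weights <- function(w, step = 0.1) {
  free <- w == "x"
  fixed <- as.numeric(w[!free])
  n_free <- sum(free)
  if (n_free == 0) return(matrix(fixed, nrow = 1))

  # Count the remaining weight in integer multiples of `step`
  # to avoid floating-point drift.
  units <- round((1 - sum(fixed)) / step)

  # All ways to distribute `units` over the free positions.
  grid <- expand.grid(rep(list(0:units), n_free))
  grid <- grid[rowSums(grid) == units, , drop = FALSE]

  out <- matrix(NA_real_, nrow = nrow(grid), ncol = length(w))
  out[, !free] <- rep(fixed, each = nrow(grid))
  out[, free] <- as.matrix(grid) * step
  out
}

# c(0.7, "x", "x") is a character vector in R, so "x" can be matched directly.
combos <- expand_weights(c(0.7, "x", "x"))
combos
```

Each row of `combos` is one weight combination: the first column stays at 0.7 and the other two columns share the remaining 0.3.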

## Value

The returned value is a list with three elements:

| Element | Description |
| --- | --- |
| `ClustSep` | The result of `Cluster` for each single element of List |
| `Result` | A data frame with the Jaccard coefficients and their ratios for each weight |
| `Weight` | The optimal weight |

## References

\insertRef{PerualilaTan2016IntClust}{IntClust}

## Examples

```r
## Not run: 
data(fingerprintMat)
data(targetMat)

MCF7_F = Cluster(fingerprintMat,type="data",distmeasure="tanimoto",normalize=FALSE,
method=NULL,clust="agnes",linkage="flexible",alpha=0.625,gap=FALSE,maxK=55,StopRange=FALSE)

MCF7_T = Cluster(targetMat,type="data",distmeasure="tanimoto",normalize=FALSE,
method=NULL,clust="agnes",linkage="flexible",alpha=0.625,gap=FALSE,maxK=55,StopRange=FALSE)

L=list(MCF7_F,MCF7_T)

MCF7_Weight=DetermineWeight_SimClust(List=L,type="clusters",weight=seq(0,1,by=0.01),
nrclusters=c(7,7),distmeasure=c("tanimoto","tanimoto"),normalize=c(FALSE,FALSE),
method=c(NULL,NULL),clust="agnes",linkage=c("flexible","flexible"),linkageF="ward",
alpha=0.625,gap=FALSE,maxK=50,names=c("FP","TP"),StopRange=FALSE,plottype="new",location=NULL)

## End(Not run)
```

IntClust documentation built on May 2, 2019, 5:51 a.m.