Distance: Distance calculation
In IntClust: Integration of Multiple Data Sets with Clustering Techniques


The Distance function calculates the pairwise distances between the data objects. The included distance measures include the Euclidean distance for continuous data and the Tanimoto coefficient or Jaccard index for binary data.


Distance(Data, distmeasure = c("tanimoto", "jaccard", "euclidean", "hamming",
  "cont tanimoto", "MCA_coord", "gower", "chi.squared", "cosine"),
  normalize = FALSE, method = NULL)

`Data`	A data matrix. The rows are assumed to correspond to the objects.
`distmeasure`	Choice of metric for the dissimilarity matrix (character). Should be one of "tanimoto", "euclidean", "jaccard", "hamming", "cont tanimoto", "MCA_coord", "gower", "chi.squared" or "cosine".
`normalize`	Logical. Indicates whether to normalize the distance matrices or not, default is FALSE. This is recommended if different distance types are used. More details on normalization in `Normalization`.
`method`	A method of normalization. Should be one of "Quantile","Fisher-Yates", "standardize","Range" or any of the first letters of these names. Default is NULL.
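As an illustration of the arguments above, the following minimal sketch calls Distance on a small binary matrix. The matrix `toy` and its row and column names are invented for this example, and the call is only made when the IntClust package is installed; the argument names are taken from the usage shown above.

```r
# Toy binary matrix: rows are objects, columns are features.
# The object and feature names are made up for illustration.
toy <- matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                0, 0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("obj1", "obj2", "obj3"),
                              c("f1", "f2", "f3", "f4")))

# Only call the package when it is installed, so the sketch runs either way.
if (requireNamespace("IntClust", quietly = TRUE)) {
  d_tanimoto <- IntClust::Distance(Data = toy, distmeasure = "tanimoto",
                                   normalize = FALSE, method = NULL)
  print(d_tanimoto)
}
```

Since all objects here are described by the same binary features, a single distance measure is used and normalize is left at its default of FALSE.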

The Euclidean distance is included for continuous matrices, while for binary matrices one has the choice of the Jaccard index, the Tanimoto coefficient or the Hamming distance. The Hamming distance is obtained by applying the hamming.distance function of the e1071 package to the rows of the data matrix. It counts the number of positions in which two rows differ in their zero and one values. The Jaccard index is calculated according to the formula of the dist.binary function in the ade4 package and the Tanimoto coefficient as described by Li2011. For both, first the similarity is calculated as

$$s = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$

with $n_{11}$ the number of features the two objects have in common, $n_{10}$ the number of features present only in the first object and $n_{01}$ the number of features present only in the second object. These similarities are converted to distances by:

$$J = \sqrt{1 - s}$$

for the jaccard index and by:

$$T = 1 - s$$

for the Tanimoto coefficient. The higher the similarity value s, the more features are shared between the two objects and the more alike they are. Since clustering is based on dissimilarity, the conversion to distances is performed. If normalize=TRUE and the distance measure is Euclidean, the data matrix is normalized beforehand. Further, a version of the Tanimoto coefficient is also available for continuous data.
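The formulas above can be sketched in base R for a single pair of binary rows. This is not the package's own implementation; the function names `binary_similarity`, `jaccard_distance`, `tanimoto_distance` and `hamming_distance` are hypothetical helpers written for this example, and the vectors `x` and `y` are made up.

```r
# Similarity s for two binary rows x and y (0/1 vectors of equal length).
# Note: s is NaN when neither object has any feature (0 / 0).
binary_similarity <- function(x, y) {
  n11 <- sum(x == 1 & y == 1)  # features present in both objects
  n10 <- sum(x == 1 & y == 0)  # features present only in the first object
  n01 <- sum(x == 0 & y == 1)  # features present only in the second object
  n11 / (n11 + n10 + n01)
}

# Conversions from similarity to distance, following the formulas above.
jaccard_distance  <- function(x, y) sqrt(1 - binary_similarity(x, y))
tanimoto_distance <- function(x, y) 1 - binary_similarity(x, y)

# Hamming distance: number of positions where the two rows differ.
hamming_distance  <- function(x, y) sum(x != y)

x <- c(1, 1, 0, 1, 0)
y <- c(1, 0, 0, 1, 1)

s <- binary_similarity(x, y)  # n11 = 2, n10 = 1, n01 = 1, so s = 2/4 = 0.5
tanimoto_distance(x, y)       # 1 - 0.5 = 0.5
jaccard_distance(x, y)        # sqrt(0.5), about 0.707
hamming_distance(x, y)        # positions 2 and 5 differ, so 2
```

For the same pair of objects the Jaccard distance is the square root of the Tanimoto distance, so both measures rank pairs of objects in the same order.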

The returned value is a distance matrix.
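As a rough sketch of how such a distance matrix might be used for clustering, the example below feeds a square, symmetric matrix into base R's hclust. The matrix `d` is a hypothetical stand-in with invented values, not actual output of Distance, and the choice of average linkage and two clusters is an assumption made for illustration.

```r
# Hypothetical 3 x 3 distance matrix standing in for the output of Distance.
objs <- c("obj1", "obj2", "obj3")
d <- matrix(c(0.0, 0.5, 1.0,
              0.5, 0.0, 0.8,
              1.0, 0.8, 0.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(objs, objs))

# A distance matrix should be symmetric with zeros on the diagonal.
stopifnot(isSymmetric(d), all(diag(d) == 0))

# Convert to a "dist" object and run hierarchical clustering (stats package).
hc <- hclust(as.dist(d), method = "average")

# Cut the tree into two clusters; obj1 and obj2 are closest and end up together.
clusters <- cutree(hc, k = 2)
print(clusters)
```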