cmp.cluster: cluster compounds using a descriptor database
In girke-lab/ChemmineR-release: Cheminformatics Toolkit for R

Description Usage Arguments Details Value Author(s) See Also Examples

'cmp.cluster' uses structural compound descriptors and clusters the compounds based on their pairwise distances. cmp.cluster uses single linkage to measure distance between clusters when it merges clusters. It accepts both a single cutoff and a cutoff vector. By using a cutoff vector, it can generate results similar to hierarchical clustering after tree cutting.

1 2	cmp.cluster(db, cutoff, is.similarity = TRUE, save.distances = FALSE, use.distances = NULL, quiet = FALSE, ...)

`db`	The desciptor database, in the format returned by 'cmp.parse'.
`cutoff`	The clustering cutoff. Can be a single value or a vector. The cutoff gives the maximum distance between two compounds in order to group them in the same cluster.
`is.similarity`	Set when the cutoff supplied is a similarity cutoff. This cutoff is the minimum similarity value between two compounds such that they will be grouped in the same cluster.
`save.distances`	whether to save distance for future clustering. See details below.
`use.distances`	Supply pre-computed distance matrix.
`quiet`	Whether to suppress the progress information.
`...`	Further arguments to be passed to `cmp.similarity`.

cmp.cluster will compute distances on the fly if use.distances is not set. Furthermore, if save.distances is not set, the distance values computed will never be stored and any distance between two compounds is guaranteed not to be computed twice. Using this method, cmp.cluster can deal with large databases when a distance matrix in memory is not feasible. The speed of the clustering function should be slowed when using a transient distance calculation.

When save.distances is set, cmp.cluster will be forced to compute the distance matrix and save it in memory before the clustering. This is useful when additional clusterings are required in the future without re-computed the distance matrix. Set save.distances to TRUE if you only want to force the clustering to use this 2-step approach; otherwise, set it to the filename under which you want the distance matrix to be saved. After you save it, when you need to reuse the distance matrix, you can 'load' it, and supply it to cmp.cluster via the use.distances argument.

cmp.cluster supports a vector of several cutoffs. When you have multiple cutoffs, cmp.cluster still guarantees that pairwise distances will never be recomputed, and no copy of distances is kept in memory. It is guaranteed to be as fast as calling cmp.cluster with a single cutoff that results in the longest processing time, plus some small overhead linear in processing time.

Returns a data.frame. Besides a variable giving compound ID, each of the other variables in the data frame will either give the cluster IDs of compounds under some clustering cutoff, or the size of clusters that the compounds belong to. When N cutoffs are given, in total 2*N+1 variables will be generated, with N of them giving the cluster ID of each compound under each of the N cutoffs, and the other N of them giving the cluster size under each of the N cutoffs. The rows are sorted by cluster sizes.

Y. Eddie Cao, Li-Chang Cheng

cmp.parse1, cmp.parse, cmp.search, cmp.similarity

## Load sample SD file
# data(sdfsample); sdfset <- sdfsample

## Generate atom pair descriptor database for searching
# apset <- sdf2ap(sdfset) 

## Loads atom pair and atom pair fingerprint samples provided by library
data(apset) 
db <- apset
fpset <- desc2fp(apset)

## Clustering of 'APset' object with multiple cutoffs
clusters <- cmp.cluster(db=apset, cutoff=c(0.5, 0.85))

## Clustering of 'FPset' object with multiple cutoffs. This method allows to call 
## various similarity methods provided by the fpSim function.
clusters2 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7), method="Tversky") 

## Saves the distance matrix before clustering:
clusters <- cmp.cluster(db, cutoff=0.65, save.distances="distmat.rda")
# Later one reload the matrix and pass it the clustering function. 
load("distmat.rda")
clusters <- cmp.cluster(db, cutoff=0.60, use.distances=distmat)