clusterCMP: cluster compounds using a descriptor database
In BioMedR: Generating Various Molecular Representations for Chemicals, Proteins, DNAs/RNAs and Their Interactions

Description Usage Arguments Details Value Author(s) References See Also Examples

'clusterCMP' uses structural compound descriptors and clusters the compounds based on their pairwise distances. clusterCMP uses single linkage to measure distance between clusters when it merges clusters. It accepts both a single cutoff and a cutoff vector. By using a cutoff vector, it can generate results similar to hierarchical clustering after tree cutting.

1 2	clusterCMP(db, cutoff, is.similarity = TRUE, save.distances = FALSE, use.distances = NULL, quiet = FALSE, ...)

`db`	The desciptor database
`cutoff`	The clustering cutoff. Can be a single value or a vector. The cutoff gives the maximum distance between two compounds in order to group them in the same cluster.
`is.similarity`	Set when the cutoff supplied is a similarity cutoff. This cutoff is the minimum similarity value between two compounds such that they will be grouped in the same cluster.
`save.distances`	whether to save distance for future clustering. See details below.
`use.distances`	Supply pre-computed distance matrix.
`quiet`	Whether to suppress the progress information.
`...`	Further arguments to be passed to similarity.

clusterCMP will compute distances on the fly if use.distances is not set. Furthermore, if save.distances is not set, the distance values computed will never be stored and any distance between two compounds is guaranteed not to be computed twice. Using this method, clusterCMP can deal with large databases when a distance matrix in memory is not feasible. The speed of the clustering function should be slowed when using a transient distance calculation. When save.distances is set, clusterCMP will be forced to compute the distance matrix and save it in memory before the clustering. This is useful when additional clusterings are required in the future without re-computed the distance matrix. Set save.distances to TRUE if you only want to force the clustering to use this 2-step approach; otherwise, set it to the filename under which you want the distance matrix to be saved. After you save it, when you need to reuse the distance matrix, you can 'load' it, and supply it to clusterCMP via the use.distances argument. clusterCMP supports a vector of several cutoffs. When you have multiple cutoffs, clusterCMP still guarantees that pairwise distances will never be recomputed, and no copy of distances is kept in memory. It is guaranteed to be as fast as calling clusterCMP with a single cutoff that results in the longest processing time, plus some small overhead linear in processing time.

Returns a data.frame. Besides a variable giving compound ID, each of the other variables in the data frame will either give the cluster IDs of compounds under some clustering cutoff, or the size of clusters that the compounds belong to. When N cutoffs are given, in total 2*N+1 variables will be generated, with N of them giving the cluster ID of each compound under each of the N cutoffs, and the other N of them giving the cluster size under each of the N cutoffs. The rows are sorted by cluster sizes.

Min-feng Zhu <wind2zhu@163.com>

...

See clusterStat for generate statistics on sizes of clusters.

data(sdfbcl)
apbcl <- convSDFtoAP(sdfbcl)
fpbcl <- convAPtoFP(apbcl)
clusters <- clusterCMP(db = apbcl, cutoff = c(0.5, 0.85))
clusters2 <- clusterCMP(fpbcl, cutoff = c(0.5, 0.7), method = "Tversky")
clusters <- clusterCMP(apbcl, cutoff = 0.65, save.distances = "distmat.rda")
load("distmat.rda")
clusters <- clusterCMP(apbcl, cutoff = 0.60, use.distances = distmat)