Description Usage Arguments Details Value Author(s) References See Also Examples
View source: R/1002-clusterCMP.R
'clusterCMP' uses structural compound descriptors and clusters the compounds
based on their pairwise distances. clusterCMP
uses single linkage to
measure distance between clusters when it merges clusters. It accepts both a
single cutoff and a cutoff vector. By using a cutoff vector, it can generate
results similar to hierarchical clustering after tree cutting.
1 2 |
db |
The desciptor database |
cutoff |
The clustering cutoff. Can be a single value or a vector. The cutoff gives the maximum distance between two compounds in order to group them in the same cluster. |
is.similarity |
Set when the cutoff supplied is a similarity cutoff. This cutoff is the minimum similarity value between two compounds such that they will be grouped in the same cluster. |
save.distances |
whether to save distance for future clustering. See details below. |
use.distances |
Supply pre-computed distance matrix. |
quiet |
Whether to suppress the progress information. |
... |
Further arguments to be passed to similarity. |
clusterCMP
will compute distances on the fly if use.distances
is not set. Furthermore, if save.distances
is not set, the distance
values computed will never be stored and any distance between two compounds is
guaranteed not to be computed twice. Using this method, clusterCMP
can
deal with large databases when a distance matrix in memory is not feasible. The
speed of the clustering function should be slowed when using a transient distance
calculation.
When save.distances
is set, clusterCMP
will be forced to compute
the distance matrix and save it in memory before the clustering. This is useful
when additional clusterings are required in the future without re-computed the
distance matrix. Set save.distances
to TRUE if you only want to force the
clustering to use this 2-step approach; otherwise, set it to the filename under
which you want the distance matrix to be saved. After you save it, when you need
to reuse the distance matrix, you can 'load' it, and supply it to clusterCMP
via the use.distances
argument.
clusterCMP
supports a vector of several cutoffs. When you have multiple
cutoffs, clusterCMP
still guarantees that pairwise distances will never be
recomputed, and no copy of distances is kept in memory. It is guaranteed to be as
fast as calling clusterCMP
with a single cutoff that results in the longest
processing time, plus some small overhead linear in processing time.
Returns a data.frame
. Besides a variable giving compound ID, each of the other
variables in the data frame will either give the cluster IDs of compounds under some
clustering cutoff, or the size of clusters that the compounds belong to. When N cutoffs
are given, in total 2*N+1 variables will be generated, with N of them giving the cluster
ID of each compound under each of the N cutoffs, and the other N of them giving the
cluster size under each of the N cutoffs. The rows are sorted by cluster sizes.
Min-feng Zhu <wind2zhu@163.com>
...
See clusterStat
for generate statistics on sizes of clusters.
1 2 3 4 5 6 7 8 | data(sdfbcl)
apbcl <- convSDFtoAP(sdfbcl)
fpbcl <- convAPtoFP(apbcl)
clusters <- clusterCMP(db = apbcl, cutoff = c(0.5, 0.85))
clusters2 <- clusterCMP(fpbcl, cutoff = c(0.5, 0.7), method = "Tversky")
clusters <- clusterCMP(apbcl, cutoff = 0.65, save.distances = "distmat.rda")
load("distmat.rda")
clusters <- clusterCMP(apbcl, cutoff = 0.60, use.distances = distmat)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.