Clustering sequences based on pairwise distances


Sequences are clustered by hierarchical clustering based on a set of pairwise distances. The distances must take values between 0.0 and 1.0, and all pairs not listed are assumed to have distance 1.0.





dist.table: A data.frame with pairwise distances. The columns Sequence.A and Sequence.B contain tags identifying pairs of sequences. The column Distance contains the distances, always a number from 0.0 to 1.0.


linkage: A text indicating what type of clustering to perform, either "single" (default), "average" or "complete".


threshold: Specifies the maximum size of a cluster. Must be a distance, i.e. a number between 0.0 and 1.0.


Computing clusters (gene families) is an essential step in many comparative studies. bClust assigns sequences to gene families by a hierarchical clustering approach. Since the number of sequences may be huge, a full all-against-all distance matrix may be impossible to hold in memory. However, most sequence pairs have an ‘infinite’ distance between them, and only the pairs with a finite (smallish) distance need to be considered.
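To get a feel for why a full distance matrix becomes impractical, here is a rough back-of-the-envelope calculation. The sequence count is a made-up figure, and 8 bytes per distance assumes each value is stored as a double.

```r
# Hypothetical number of sequences, chosen only for illustration
n.sequences <- 1e6

# Number of distinct pairs in an all-against-all comparison
n.pairs <- n.sequences * (n.sequences - 1) / 2

# Assumption: each distance is stored as an 8-byte double
bytes.needed <- n.pairs * 8
terabytes <- bytes.needed / 1e12
print(terabytes)
```

A table listing only the pairs with distance below 1.0 grows with the number of similar pairs, not with the square of the number of sequences.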

This function takes as input the distances in a data.frame where only the interesting distances are listed. Typically, this data.frame is the output from bDist. All pairs of sequences not listed are assumed to have distance 1.0, which is considered the ‘infinite’ distance. Note that dist.table must have the columns Sequence.A, Sequence.B and Distance. The first two contain text identifying the sequences, the last contains the distances. Every sequence must be listed at least once. This should pose no problem, since every sequence has distance 0.0 to itself and should be listed once with this distance.
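As a minimal sketch of the input format described above, a small dist.table could be built by hand as follows. The sequence tags and distances are invented for illustration, and the helper variables (required.columns, has.columns, in.range, all.tags) are not part of the package.

```r
# Hypothetical sequence tags and distances, for illustration only
dist.table <- data.frame(
  Sequence.A = c("seq1", "seq2", "seq3", "seq4", "seq1", "seq2"),
  Sequence.B = c("seq1", "seq2", "seq3", "seq4", "seq2", "seq3"),
  Distance   = c(0.0, 0.0, 0.0, 0.0, 0.2, 0.6),
  stringsAsFactors = FALSE
)

# Check the requirements stated in the text
required.columns <- c("Sequence.A", "Sequence.B", "Distance")
has.columns <- all(required.columns %in% colnames(dist.table))
in.range <- all(dist.table$Distance >= 0 & dist.table$Distance <= 1)

# seq4 has no listed neighbour, so its distance to every other sequence
# is taken to be 1.0; it is still present through its self-distance row
all.tags <- unique(c(dist.table$Sequence.A, dist.table$Sequence.B))
print(all.tags)
```

A data.frame of this shape could then be passed as the dist.table argument.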

The linkage defines the type of clusters produced. The threshold limits the size of the clusters. A single linkage clustering means every member of a cluster has at least one other member of the same cluster within the distance threshold of itself. An average linkage means all members of a cluster are within the distance threshold of the center of the cluster. A complete linkage means all members of a cluster are no more than the distance threshold away from any other member of the same cluster.

Typically, single linkage produces big clusters whose members may differ a lot, since each member is only required to be close to something, which is close to something, ..., which is close to some other member. At the other extreme, complete linkage produces small and tight clusters, since all members must be similar to all others. The average linkage falls in between, but is closer to complete linkage. If you want the threshold to specify directly the maximum distance tolerated between two members of the same gene family, you must use complete linkage. The single linkage is the fastest alternative to compute. Using the default setting of single linkage and maximum threshold (1.0) will produce the largest and fewest clusters possible.
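The difference between single and complete linkage can be illustrated on a tiny chain of three sequences. This sketch uses hclust and cutree from base R as a stand-in; it is not how bClust itself is implemented, and the tags, distances and the cut.tree helper are made up for the example.

```r
# Three hypothetical sequences forming a chain: A-B and B-C are close,
# but A-C are far apart
tags <- c("seqA", "seqB", "seqC")
d <- matrix(1.0, nrow = 3, ncol = 3, dimnames = list(tags, tags))
diag(d) <- 0
d["seqA", "seqB"] <- d["seqB", "seqA"] <- 0.3
d["seqB", "seqC"] <- d["seqC", "seqB"] <- 0.3
d["seqA", "seqC"] <- d["seqC", "seqA"] <- 0.8

threshold <- 0.5

# Illustrative helper: cluster with a given linkage, cut at the threshold
cut.tree <- function(method) {
  cutree(hclust(as.dist(d), method = method), h = threshold)
}

single.clusters   <- cut.tree("single")    # chaining joins all three
complete.clusters <- cut.tree("complete")  # A and C are too far apart

print(single.clusters)
print(complete.clusters)
```

With single linkage the chain is enough to put all three sequences in one cluster, while complete linkage refuses to join seqA and seqC because their mutual distance exceeds the threshold.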


The function returns a vector of integers, indicating the cluster membership of every unique sequence from the Sequence.A and Sequence.B columns of the input dist.table. The name of each element indicates the sequence. Sequences having the same number are in the same cluster.
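A result of this shape can be summarised with a few base R calls. The vector below is a hand-made stand-in for real output, with invented tags and cluster numbers.

```r
# Hypothetical result: a named integer vector, one element per sequence
clustering <- c(seq1 = 1L, seq2 = 1L, seq3 = 2L, seq4 = 3L, seq5 = 2L)

# Number of clusters and the size of each
n.clusters <- length(unique(clustering))
cluster.sizes <- table(clustering)

# Sequence tags grouped by cluster number
members <- split(names(clustering), clustering)

# Cluster number of a single sequence, looked up by its tag
seq3.cluster <- clustering[["seq3"]]

print(cluster.sizes)
print(members)
```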


The igraph package is required by this function.


Lars Snipen and Kristian Hovde Liland.

See Also

bDist, hclust, dClust, isOrtholog.


# Loading the micropan package and its example distance data
library(micropan)
data("Mpneumoniae.blast.distances", package = "micropan")

# Clustering with default settings
clustering.blast.single <- bClust(Mpneumoniae.blast.distances)

# Clustering with complete linkage and a liberal threshold
clustering.blast.complete <- bClust(Mpneumoniae.blast.distances, linkage = "complete", threshold = 0.75)
