An parallelised implementation of a Jarník (Prim/Dijkstra)-like algorithm for determining a(*) minimum spanning tree (MST) of a complete undirected graph representing a set of n points with weights given by a pairwise distance matrix.
(*) Note that there might be multiple minimum trees spanning a given graph.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
either a numeric matrix (or an object coercible to one,
e.g., a data frame with numeric-like columns) or an
object of class
further arguments passed to or from other methods.
metric used to compute the linkage, one of:
logical; whether to compute the distances using 32-bit instead of 64-bit precision floating-point arithmetic (up to 2x faster).
logical; whether to print diagnostic messages and progress information.
d is a numeric matrix of size n p,
the n (n-1)/2 distances are computed on the fly, so that O(n M)
memory is used.
The algorithm is parallelised; set the
Sys.setenv to control the number of threads
Time complexity is O(n^2) for the method accepting an object of
dist and O(p n^2) otherwise.
M >= 2, then the mutual reachability distance m(i,j) with smoothing
M (see Campello et al. 2015)
is used instead of the chosen "raw" distance d(i,j).
It holds m(i, j)=\max(d(i,j), c(i), c(j)), where c(i) is
d(i, k) with k being the (
M-1)-th nearest neighbour of i.
This makes "noise" and "boundary" points being "pulled away" from each other.
Genie++ clustering algorithm (see
with respect to the mutual reachability distance gains the ability to
identify some observations are noise points.
Note that the case
M = 2 corresponds to the original distance, but we are
determining the 1-nearest neighbours separately as well, which is a bit
suboptimal; you can file a feature request if this makes your data analysis
tasks too slow.
Matrix of class
mst with n-1 rows and 3 columns:
dist. It holds
dist is sorted nondecreasingly.
The i-th row gives the i-th edge of the MST.
(from[i], to[i]) defines the vertices (in 1,...,n)
dist[i] gives the weight, i.e., the
distance between the corresponding points.
method attribute gives the name of the distance used.
Labels attribute gives the labels of all the input points.
M > 1, the
nn attribute gives the indices of the
nearest neighbours of each point.
Jarník V., O jistém problému minimálním, Práce Moravské Přírodovědecké Společnosti 6 (1930) 57–63.
Olson C.F., Parallel algorithms for hierarchical clustering, Parallel Comput. 21 (1995) 1313–1325.
Prim R., Shortest connection networks and some generalisations, Bell Syst. Tech. J. 36 (1957) 1389–1401.
Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1) (2015) 5:1–5:51.
1 2 3 4
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.