seriate: Seriate Dissimilarity Matrices, Matrices or Arrays

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/seriate.R

Description

Tries to find an linear order for objects using data in form of a dissimilarity matrix (two-way one mode data), a data matrix (two-way two-mode data) or a data array (k-way k-mode data).

Usage

1
2
3
4
5
6
7
8
## S3 method for class 'dist'
seriate(x, method = "Spectral", control = NULL, ...)
## S3 method for class 'matrix'
seriate(x, method = "PCA", control = NULL,
margin = c(1,2), ...)
## S3 method for class 'array'
seriate(x, method = "PCA", control = NULL,
margin = seq(length(dim(x))), ...)

Arguments

x

the data.

method

a character string with the name of the seriation method (default: varies by data type).

control

a list of control options passed on to the seriation algorithm.

margin

a vector giving the margins to be seriated. For matrix, 1 indicates rows, 2 indicates columns, c(1,2) indicates rows and columns. For array, margin gets a vector with the dimensions to seriate.

...

further arguments are added to the control list.

Details

Seriation methods are available via a registry. See list_seriation_methods for help.

Many seriation methods (heuristically) optimize (minimize or maximize) an objective function. The value of the function for a given seriation can be calculated using criterion. In this manual page we only state the measure which is optimized (using bold font). A definition of the measures can be found in the criterion manual page.

Two-way two-mode data has to be provided as a dist object (not as a symmetric matrix). Similarities have to be transformed in a suitable way into dissimilarities. Currently the following methods are implemented for dist (for a more detailed description and an experimental comparison see Hahsler (2017)):

"ARSA"

Anti-Robinson seriation by simulated annealing to minimize the linear seriation criterion (simulated annealing initialization used in Brusco et al 2008).

"BBURCG"

Anti-Robinson seriation by branch-and-bound to minimize the unweighted gradient measure (Brusco and Stahl 2005). This is only feasible for a relatively small number of objects.

"BBWRCG"

Anti-Robinson seriation by branch-and-bound to minimize the weighted gradient measure (Brusco and Stahl 2005). This is only feasible for a relatively small number of objects.

"TSP"

Traveling salesperson problem solver to minimize the Hamiltonian path length. The solvers in TSP are used (see solve_TSP). The solver method can be passed on via the control argument, e.g. control = list(method = "two_opt"). Default is the est of 10 runs of arbitrary insertion heuristic with 2-opt improvement.

Since a tour returned by a TSP solver is a connected circle and we are looking for a path representing a linear order, we need to find the best cutting point. Climer and Zhang (2006) suggest to add a dummy city with equal distance to each other city before generating the tour. The place of this dummy city in an optimal tour with minimal length is the best cutting point (it lies between the most distant cities).

"R2E"

Rank-two ellipse seriation (Chen 2002).

This method starts with generating a sequence of correlation matrices R^1, R^2, …. R^1 is the correlation matrix of the original distance matrix D (supplied to the function as x), and

R^{n+1} = φ R^n,

where φ calculates the correlation matrix.

The rank of the matrix R^n falls with increasing n. The first R^n in the sequence which has a rank of 2 is found. Projecting all points in this matrix on the first two eigenvectors, all points fall on an ellipse. The order of the points on this ellipse is the resulting order.

The ellipse can be cut at the two interception points (top or bottom) of the vertical axis with the ellipse. In this implementation the top most cutting point is used.

"MDS", "MDS_metric", "MDS_nonmetric", "MDS_angle"

Multidimensional scaling (MDS).

Use multidimensional scaling techniques to find an linear order by minimizing stress. Note MDS algorithms used for a single dimension tend to end up in local optima and unidimensional scaling (see Maier and De Leeuw, 2015) would be more appropriate. However, generally, ordering along the first component of MDS provides good results.

By default, metric MDS (cmdscale in stats) is used. In case of of general dissimilarities, non-metric MDS can be used. The choices are isoMDS and sammon from MASS. The method can be specified as the element method ("cmdscale", "isoMDS" or "sammon") in control.

For convenience, "MDS_metric" performs cmdscale and "MDS_nonmetric" performs isoMDS.

"MDS_angle" projects the data on the first two components found by MDS and then orders by the angle in this space. The order is split by the larges gap between adjacent angles. A similar method was used for ordering correlation matrices by Friendly (2002).

"HC", "HC_single", "HC_complete", "HC_average","HC_ward"

Hierarchical clustering.

Using the order of the leaf nodes in a dendrogram obtained by hierarchical clustering can be used as a very simple seriation technique. This method applies hierarchical clustering (hclust) to x. The clustering method can be given using a "method" element in the control list. If omitted, the default "average" is used.

For convenience the other methods are provided as shortcuts.

"GW", "OLO"

Hierarchical clustering (by default using average-link) with additional leaf-node reordering to minimize Hamiltonian path length (restricted).

A dendrogram (binary tree) has 2^{n-1} internal nodes (subtrees) and the same number of leaf orderings. That is, at each internal node the left and right subtree (or leaves) can be swapped, or, in terms of a dendrogram, be flipped.

Method "GW" uses an algorithm developed by Gruvaeus and Wainer (1972) and implemented in package gclus (Hurley 2004). The clusters are ordered at each level so that the objects at the edge of each cluster are adjacent to that object outside the cluster to which it is nearest. The method produces an unique order.

Method "OLO" (Optimal leaf ordering, Bar-Joseph et al., 2001) produces an optimal leaf ordering with respect to the minimizing the sum of the distances along the (Hamiltonian) path connecting the leaves in the given order. The time complexity of the algorithm is O(n^3). Note that non-finite distance values are not allowed.

Both methods start with a dendrogram created by hclust. As the "method" element in the control list a clustering method (default "average") can be specified. Alternatively, a hclust object can be supplied using an element named "hclust".

For convenience "GW_single", "GW_average", "GW_complete", "GW_ward" and "OLO_single", "OLO_average", "OLO_complete", "OLO_ward" are provided.

"VAT"

Visual Assessment of (Clustering) Tendency (Bezdek and Hathaway (2002)).

Creates an order based on Prim's algorithm for finding a minimum spanning tree (MST) in a weighted connected graph representing the distance matrix. The order is given by the order in which the nodes (objects) are added to the MST.

"SA"

Simulated Annealing for diverse criterion measures.

Implement simulated annealing similar to the ARSA method, however, it works for any criterion measure defined in seriation. By default the algorithm optimizes for raw gradient measure and is warm started with the result of spectral seriation (2-Sum problem) since Hahsler (2017) shows that 2-Sum solutions are similar to solutions for the gradient measure.

Local neighborhood functions are LS_insert, LS_swap, LS_reverse, and LS_mix (1/3 insertion, 1/3 swap and 1/3 reverse). Any neighborhood function can be defined. It needs to take as the only argument the order (integer vector) and return a random neighbor.

Note that this is an R implementation repeatedly calling criterion, and therefore is relatively slow.

"Spectral", "Spectral_norm"

Spectral seriation (Ding and He 2004).

Spectral seriation uses a relaxation to minimize the 2-Sum Problem (Barnard, Pothen, and Simon 1993). It uses the order of the Fiedler vector of the similarity matrix's (normalized) Laplacian.

Spectral seriation gives a good trade-off between seriation quality, speed and scalability (see Hahsler, 2017).

"SPIN_NH", "SPIN_STS"

Sorting Points Into Neighborhoods (SPIN) (Tsafrir 2005). Given a weight matrix W, the algorithms try to minimize the energy for a permutation (matrix P) given by

F(P) = tr(PDP^TW),

where tr denotes the matrix trace.

"SPIN_STS" implements the Side-to-Side algorithm which tries to push out large distance values. The default weight matrix suggested in the paper with W=XX^T and X_i=i-(n+1)/2 is used. We run the algorithm from step (25) iteration and restart the algorithm nstart (10) with random initial permutations (default values in parentheses). Via control the parameters step, nstart, X and verbose.

"SPIN_NH" implements the neighborhood algorithm (concentrate low distance values around the diagonal) with a Gaussian weight matrix W_{ij} = exp(-(i-j)^2/nσ), where n is the size of the dissimilarity matrix and σ is the variance around the diagonal that control the influence of global (large σ) or local (small σ) structure.

We use the heuristic suggested in the paper for the linear assignment problem. We do not terminate as indicated in the algorithm, but run all the iterations since the heuristic does not guarantee that the energy is strictly decreasing. We also implement the heuristic "annealing" scheme where σ is successively reduced. The parameters in control are sigma which can be a single value or a decreasing sequence (default: 20 to 1 in 10 steps) and step which defines how many update steps are performed before for each value of alpha. Via W_function a custom function to create W with the function signature function(n, sigma, verbose) can be specified. The parameter verbose can be used to display progress information.

"QAP_LS", "QAP_2SUM", "QAP_BAR", "QAP_Inertia"

Quadratic assignment problem formulations for seriation using a simulated annealing solver. These methods minimize the Linear Seriation Problem (LS) formulation (Hubert and Schultz 1976), the 2-Sum Problem formulation (Barnard, Pothen, and Simon 1993), the banded anti-Robinson form (BAR) or the inertia criterion.

The parameters in control are passed on to qap in qap. An important parameter is rep to return the best result out of the given number of repetitions with random restarts. Default is 1, but bigger numbers result in better and more stable results.

"GA"

Use a genetic algorithm to optimize for various criteria. The GA code has to be first registered. A detailed description can be found in the manual page for register_GA.

"DendSer"

Use heuristic dendrogram seriation to optimize for various criteria. The DendSer code has to be first registered. A detailed description can be found in the manual page for register_DendSer.

"Identity"

Produces an identity permutation.

"Random"

Produces a random permutation.

Two-way two mode data are general positive matrices. Currently the following methods are implemented for matrix:

"BEA"

Bond Energy Algorithm (BEA; McCormick 1972). The algorithm tries to maximize the Measure of Effectiveness. of a non-negative matrix. Due to the definition of this measure, the tasks of ordering rows and columns is separable and can be solved independently.

A row is arbitrarily placed; then rows are positioned one by one. When this is completed, the columns are treated similarly. The overall procedure amounts to two approximate traveling salesperson problems (TSP), one on the rows and one on the columns. The so-called 'best insertion strategy' is used: rows (or columns) are inserted into the current permuted list of rows (or columns). Several consecutive runs of the algorithm might improve the energy.

Note that Arabie and Hubert (1990) question its use with non-binary data if the objective is to find a seriation or one-dimensional orderings of rows and columns.

The BEA code used in this package was implemented by Fionn Murtagh.

In control as element "rep" the number of runs can be specified. The results of the best run will be returned.

"BEA_TSP"

Use a TSP to optimize the Measure of Effectiveness (Lenstra 1974).

In control as element "method" a TSP solver method can be specified (see package TSP).

"PCA", "PCA_angle"

Principal component analysis.

Uses the projection of the data on its first principal component to determine the order.

Note that for a distance matrix calculated from x with Euclidean distance, this methods minimizes the least square criterion.

"PCA_angle" projects the data on the first two principal components and then orders by the angle in this space. The order is split by the larges gap between adjacent angles. A similar method was used for ordering correlation matrices by Friendly (2002).

"Identity"

Produces an identity permutation.

"Random"

Produces a random permutation.

For array no built-in methods are currently available.

Value

Returns an object of class ser_permutation.

Author(s)

Michael Hahsler

References

Arabie, P. and L.J. Hubert (1990): The bond energy algorithm revisited, IEEE Transactions on Systems, Man, and Cybernetics, 20(1), 268–274. doi: 10.1109/21.47829

Bar-Joseph, Z., E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29. doi: 10.1093/bioinformatics/17.suppl_1.S22

Barnard, S. T., A. Pothen, and H. D. Simon (1993): A Spectral Algorithm for Envelope Reduction of Sparse Matrices. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, 493–502. Supercomputing '93. New York, NY, USA: ACM. doi: 10.1109/SUPERC.1993.1263497

Bezdek, J.C. and Hathaway, R.J. (2002): VAT: a tool for visual assessment of (cluster) tendency. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), Volume: 3, 2225–2230. doi: 10.1109/IJCNN.2002.1007487

Brusco, M., Koehn, H.F., and Stahl, S. (2008): Heuristic Implementation of Dynamic Programming for Matrix Permutation Problems in Combinatorial Data Analysis. Psychometrika, 73(3), 503–522. doi: 10.1007/s11336-007-9049-5

Brusco, M., and Stahl, S. (2005): Branch-and-Bound Applications in Combinatorial Data Analysis. New York: Springer. doi: 10.1007/0-387-28810-4

Chen, C. H. (2002): Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices. Statistica Sinica, 12(1), 7–29.

Ding, C. and Xiaofeng He (2004): Linearized cluster assignment via spectral ordering. Proceedings of the Twenty-first International Conference on Machine learning (ICML '04). doi: 10.1145/1015330.1015407

Climer, S. and Xiongnu Zhang (2006): Rearrangement Clustering: Pitfalls, Remedies, and Applications, Journal of Machine Learning Research, 7(Jun), 919–943.

Friendly, M. (2002): Corrgrams: Exploratory Displays for Correlation Matrices. The American Statistician, 56(4), 316–324. doi: 10.1198/000313002533

Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206. doi: 10.1111/j.2044-8317.1972.tb00491.x

Hahsler, M. (2017): An experimental comparison of seriation methods for one-mode two-way data. European Journal of Operational Research, 257, 133–143. doi: 10.1016/j.ejor.2016.08.066

Hubert, Lawrence, and James Schultz (1976): Quadratic Assignment as a General Data Analysis Strategy. British Journal of Mathematical and Statistical Psychology 29(2). Blackwell Publishing Ltd. 190–241. doi: 10.1111/j.2044-8317.1976.tb00714.x

Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806. doi: 10.1198/106186004X12425

Lenstra, J.K (1974): Clustering a Data Array and the Traveling-Salesman Problem, Operations Research, 22(2) 413–414. doi: 10.1287/opre.22.2.413

Mair P., De Leeuw J. (2015). Unidimensional scaling. In Wiley StatsRef: Statistics Reference Online, Wiley, New York. doi: 10.1002/9781118445112.stat06462.pub2

McCormick, W.T., P.J. Schweitzer and T.W. White (1972): Problem decomposition and data reorganization by a clustering technique, Operations Research, 20(5), 993–1009. doi: 10.1287/opre.20.5.993

Tsafrir, D., Tsafrir, I., Ein-Dor, L., Zuk, O., Notterman, D.A. and Domany, E. (2005): Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices, Bioinformatics, 21(10) 2301–8. doi: 10.1093/bioinformatics/bti329

See Also

list_seriation_methods, criterion, register_GA, register_DendSer, solve_TSP in TSP, hclust in stats.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
## show available seriation methods (for dist and matrix)
show_seriation_methods("dist")
show_seriation_methods("matrix")

##seriate dist
data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(1:nrow(x)),]
d <- dist(x)

## default seriation
order <- seriate(d)
order

## plot
pimage(d, main = "Random")
pimage(d, order, main = "Reordered")

## compare quality
rbind(
        random = criterion(d),
        reordered = criterion(d, order)
     )

## seriate matrix
data("iris")
x <- as.matrix(iris[-5])

## to make the variables comparable, we scale the data
x <- scale(x, center = FALSE)

## try some methods
pimage(x, main = "original data")
criterion(x)

order <- seriate(x, method = "BEA_TSP")
pimage(x, order, main = "TSP to optimize ME")
criterion(x, order)

order <- seriate(x, method = "PCA")
pimage(x, order, main = "First principal component")
criterion(x, order)

## 2 TSPs
order <- c(
    seriate(dist(x), method = "TSP"),
    seriate(dist(t(x)), method = "TSP")
)
pimage(x, order, main = "2 TSPs")
criterion(x, order)

Example output

$dist_ARSA
name:        ARSA
kind:        dist
description: Minimize the linear seriation criterion using simulated annealing

$dist_BBURCG
name:        BBURCG
kind:        dist
description: Minimize the unweighted row/column gradient by branch-and-bound

$dist_BBWRCG
name:        BBWRCG
kind:        dist
description: Minimize the weighted row/column gradient by branch-and-bound

$dist_GW
name:        GW
kind:        dist
description: Hierarchical clustering reordered by Gruvaeus and Wainer heuristic

$dist_GW_average
name:        GW_average
kind:        dist
description: Hierarchical clustering (avg. link) reordered by Gruvaeus and Wainer heuristic

$dist_GW_complete
name:        GW_complete
kind:        dist
description: Hierarchical clustering (complete link) reordered by Gruvaeus and Wainer heuristic

$dist_GW_single
name:        GW_single
kind:        dist
description: Hierarchical clustering (single link) reordered by Gruvaeus and Wainer heuristic

$dist_GW_ward
name:        GW_ward
kind:        dist
description: Hierarchical clustering (Ward's method) reordered by Gruvaeus and Wainer heuristic

$dist_HC
name:        HC
kind:        dist
description: Hierarchical clustering

$dist_HC_average
name:        HC_average
kind:        dist
description: Hierarchical clustering (avg. link)

$dist_HC_complete
name:        HC_complete
kind:        dist
description: Hierarchical clustering (complete link)

$dist_HC_single
name:        HC_single
kind:        dist
description: Hierarchical clustering (single link)

$dist_HC_ward
name:        HC_ward
kind:        dist
description: Hierarchical clustering (Ward's method)

$dist_Identity
name:        Identity
kind:        dist
description: Identity permutation

$dist_MDS
name:        MDS
kind:        dist
description: MDS

$dist_MDS_angle
name:        MDS_angle
kind:        dist
description: MDS (angle)

$dist_MDS_metric
name:        MDS_metric
kind:        dist
description: MDS (metric)

$dist_MDS_nonmetric
name:        MDS_nonmetric
kind:        dist
description: MDS (non-metric)

$dist_OLO
name:        OLO
kind:        dist
description: Hierarchical clustering (single link) with optimal leaf ordering

$dist_OLO_average
name:        OLO_average
kind:        dist
description: Hierarchical clustering (avg. link) with optimal leaf ordering

$dist_OLO_complete
name:        OLO_complete
kind:        dist
description: Hierarchical clustering (complete link) with optimal leaf ordering

$dist_OLO_single
name:        OLO_single
kind:        dist
description: Hierarchical clustering with optimal leaf ordering

$dist_OLO_ward
name:        OLO_ward
kind:        dist
description: Hierarchical clustering (Ward's method) with optimal leaf ordering

$dist_QAP_2SUM
name:        QAP_2SUM
kind:        dist
description: QAP (2-SUM)

$dist_QAP_BAR
name:        QAP_BAR
kind:        dist
description: QAP (Banded anti-Robinson form)

$dist_QAP_Inertia
name:        QAP_Inertia
kind:        dist
description: QAP (Inertia)

$dist_QAP_LS
name:        QAP_LS
kind:        dist
description: QAP (Linear Seriation)

$dist_R2E
name:        R2E
kind:        dist
description: Rank-two ellipse seriation

$dist_Random
name:        Random
kind:        dist
description: Random permutation

$dist_SA
name:        SA
kind:        dist
description: Minimize a specified measure using simulated annealing (with warm start).

$dist_SPIN_NH
name:        SPIN_NH
kind:        dist
description: SPIN (Neighborhood algorithm)

$dist_SPIN_STS
name:        SPIN_STS
kind:        dist
description: SPIN (Side-to-Side algorithm)

$dist_Spectral
name:        Spectral
kind:        dist
description: Spectral seriation

$dist_Spectral_norm
name:        Spectral_norm
kind:        dist
description: Spectral seriation (normalized)

$dist_TSP
name:        TSP
kind:        dist
description: Minimize Hamiltonian path length with a TSP solver

$dist_VAT
name:        VAT
kind:        dist
description: Visual assesment of clustering tendency (VAT)

$matrix_BEA
name:        BEA
kind:        matrix
description: Bond Energy Algorithm to maximize ME

$matrix_BEA_TSP
name:        BEA_TSP
kind:        matrix
description: TSP to maximize ME

$matrix_Identity
name:        Identity
kind:        matrix
description: Identity permutation

$matrix_PCA
name:        PCA
kind:        matrix
description: First principal component

$matrix_PCA_angle
name:        PCA_angle
kind:        matrix
description: First two principal components (angle)

$matrix_Random
name:        Random
kind:        matrix
description: Random permutation

object of class 'ser_permutation', 'list'
contains permutation vectors for 1-mode data

  vector length seriation method
1           150         Spectral
              2SUM AR_deviations AR_events       BAR       Cor_R Gradient_raw
random    29692269    926266.454    541477 163595.77 0.009082024        19108
reordered 17821593      9887.392     54924  56609.97 0.371953911       992058
          Gradient_weighted   Inertia      LS Lazy_path_length Least_squares
random             46537.75 219523528 5637291        28962.077      78788422
reordered        1771427.16 356910343 4487365         6705.889      76488569
                ME Moore_stress Neumann_stress Path_length       RGAR
random    5960.360    11424.705      5666.9027   375.55576 0.49109106
reordered 7253.697     1111.651       538.7757    91.28657 0.04981317
         Cor_R             ME   Moore_stress Neumann_stress 
     0.2137794   1005.9980469    357.1093200    133.6494969 
         Cor_R             ME   Moore_stress Neumann_stress 
  5.375674e-02   1.026830e+03   1.674042e+02   5.905914e+01 
         Cor_R             ME   Moore_stress Neumann_stress 
     0.2651909   1021.4874878    175.8500473     67.4722984 
         Cor_R             ME   Moore_stress Neumann_stress 
     -0.235639    1025.855347     165.923879      58.181556 

seriation documentation built on Oct. 23, 2020, 6:34 p.m.