View source: R/01-labelMatcher.R

These functions provide a set of tools to find the best match between the labels used by two different algorithms to cluster the same set of samples.

```
labelMatcher(tab, verbose = FALSE)
matchLabels(tab)
countAgreement(tab)
labelAccuracy(data, labels, linkage = "ward.D2")
bestMetric(data, labels)
remap(fix, vary)
```

`tab`
A contingency table, represented as a square matrix of counts.

`verbose`
A logical value; should the routine print something out periodically so you know it's still working?

`data`
A matrix whose columns represent objects to be clustered and whose rows represent the anonymous features used to perform the clustering.

`labels`
A factor (or character vector) of class labels for the objects in the `data` matrix.

`linkage`
A linkage rule accepted by the `hclust` function.

`fix`
A vector of cluster assignments.

`vary`
A vector of cluster assignments.

In the most general sense, clustering can be viewed as a function from
the space of "objects" of interest into a space of "class labels". In
less mathematical terms, this simply means that each object gets
assigned an (arbitrary) class label. This is all well-and-good until
you try to compare the results of running two different clustering
algorithms that use different labels (or even worse, use the same
labels – typically the integers *1, 2, …, K* – with
different meanings). When that happens, you need a way to decide
which labels from the different sets are closest to meaning the
"same thing".

That's where this set of functions comes in. The core algorithm is
implemented in the function `labelMatcher`, which works on a
contingency table whose entries *N_{ij}* are the number of samples
with row-label = *i* and column-label = *j*. To find the
best match, one computes (heuristically) the values *F_{ij}* that
describe the fraction of all entries in row *i* and column *j*
represented by *N_{ij}*. Perfectly matched labels would consist
of a row *i* and a column *j* where *N_{ij}* is the only
nonzero entry in its row and column, so *F_{ij} = 1*. The largest
value for *F_{ij}* (with ties broken simply by which entry is
closer to the upper-left corner of the matrix) defines the best
match. The matched row and column are then removed from the matrix and
the process repeats recursively.
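The greedy heuristic described above can be sketched in a few lines of base R. This is an illustrative reimplementation, not the Thresher code; the function name `greedyMatch` and the tie-breaking detail (first hit in column-major order stands in for "closest to the upper-left corner") are assumptions.

```r
# Illustrative sketch of the greedy matching heuristic; the Thresher
# implementation in labelMatcher() may differ in details.
greedyMatch <- function(tab) {
  rows <- integer(0); cols <- integer(0)
  liveR <- seq_len(nrow(tab)); liveC <- seq_len(ncol(tab))
  while (length(liveR) > 0 && length(liveC) > 0) {
    sub <- tab[liveR, liveC, drop = FALSE]
    # Frac[i, j] = N[i, j] as a fraction of all entries in its row and column
    Frac <- sub / (outer(rowSums(sub), colSums(sub), `+`) - sub)
    Frac[is.nan(Frac)] <- 0
    # ties broken by the first hit in column-major order (approximates
    # "closest to the upper-left corner")
    best <- which(Frac == max(Frac), arr.ind = TRUE)[1, ]
    rows <- c(rows, liveR[best[1]]); cols <- c(cols, liveC[best[2]])
    liveR <- liveR[-best[1]]; liveC <- liveC[-best[2]]  # remove and recurse
  }
  list(rows = rows, cols = cols)
}
```

For a table whose only nonzero entries are a permutation (each row-label pairing with exactly one column-label), every matched pair has Frac = 1 and the function recovers the permutation.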

We apply this method to determine which distance metric, when used in hierarchical clustering, best matches a "gold standard" set of class labels. (These may not really be gold, of course; they can also be a set of labels determined by k-means or another clustering algorithm.) The idea is to cluster the samples using a variety of different metrics, and select the one whose label assignments best match the standard.
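The metric-selection idea can be sketched with only the metrics built into `stats::dist`; `labelAccuracy` in Thresher uses a larger hard-coded set via `distanceMatrix`. The function name `pickMetric` and the crude per-cluster overlap score are assumptions for illustration.

```r
# Illustrative sketch of metric selection: cluster with each metric,
# score agreement with the given labels, return the best metric's name.
pickMetric <- function(data, labels,
                       metrics = c("euclidean", "manhattan", "maximum")) {
  k <- length(unique(labels))
  acc <- sapply(metrics, function(m) {
    # columns of 'data' are the objects, so cluster t(data)
    hc <- hclust(dist(t(data), method = m), method = "ward.D2")
    found <- cutree(hc, k = k)
    # crude agreement score: largest class overlap within each cluster
    # (a full version would match labels as labelMatcher does)
    sum(sapply(split(labels, found), function(x) max(table(x)))) /
      length(labels)
  })
  names(which.max(acc))
}
```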

The `labelMatcher` function returns a list of two vectors of the
same length. These contain the matched label-indices, in the order
they were matched by the algorithm.

The `matchLabels` function is a user-friendly front-end to the
`labelMatcher` function. It returns a matrix, with the rows and
columns reordered so the labels match.

The `countAgreement` function returns an integer, the number of
samples with the "same" labels, computed by summing the diagonal of
the reordered matrix produced by `matchLabels`.
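The diagonal-sum idea is easy to see on a small example. The matrix below is illustrative, standing in for the reordered output of `matchLabels`:

```r
# Once rows and columns of a contingency table have been reordered so
# matched labels align, agreement is just the diagonal sum.
reordered <- matrix(c(5, 1, 0,
                      0, 4, 1,
                      1, 0, 6), nrow = 3, byrow = TRUE)
agreement <- sum(diag(reordered))   # 15 of the 18 samples agree
fraction  <- agreement / sum(reordered)
```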

The `labelAccuracy` function returns a vector indexed by the set
of nine distance metrics hard-coded in the function. Each entry is
the fraction of samples whose hierarchical clusters match the
prespecified `labels`.

The `bestMetric` function is a user-friendly front-end to the
`labelAccuracy` function. It returns the name of the distance
metric whose hierarchical clusters best match the prespecified
`labels`.

The `remap` function takes two sets of integer cluster
assignments and returns a new set of labels for the target (`vary`)
that best matches the source (`fix`).
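A self-contained sketch of this relabeling idea, assuming integer cluster labels as the documentation states; the function name `remapSketch` and its greedy pairing by largest shared count are assumptions, not the package's implementation.

```r
# Illustrative sketch of what remap() does: rename the 'vary' labels
# so they best match 'fix', pairing labels greedily by shared count.
remapSketch <- function(fix, vary) {
  tab <- table(fix, vary)
  map <- integer(ncol(tab))
  usedR <- rep(FALSE, nrow(tab)); usedC <- rep(FALSE, ncol(tab))
  for (step in seq_len(min(dim(tab)))) {
    masked <- tab
    masked[usedR, ] <- -1; masked[, usedC] <- -1  # block matched labels
    best <- which(masked == max(masked), arr.ind = TRUE)[1, ]
    map[best[2]] <- as.integer(rownames(tab)[best[1]])
    usedR[best[1]] <- TRUE; usedC[best[2]] <- TRUE
  }
  # translate each vary label through the mapping
  map[match(vary, as.integer(colnames(tab)))]
}
```

For example, if `vary` is a pure relabeling of `fix` (every 1 became 2, every 2 became 3, and so on), the sketch recovers the original `fix` labels exactly.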

The `labelAccuracy` function should probably allow the user
to supply a list of distance metrics instead of relying on the
internal hard-coded list.

Kevin R. Coombes <[email protected]>

Hierarchical clustering is implemented in the `hclust`
function. We use the extended set of distance metrics provided by the
`distanceMatrix` function from the ClassDiscovery package.
This set includes all of the metrics from the `dist` function.

Thresher documentation built on March 7, 2019, 5:07 p.m.
