# Cluster Sequences By Distance or Sequence

### Description

Groups the sequences represented by a distance matrix into clusters of similarity.

### Usage

1 2 3 4 5 6 7 8 9 10 |

### Arguments

`myDistMatrix` |
A symmetric |

`method` |
An agglomeration method to be used. This should be (an abbreviation of) one of |

`cutoff` |
A vector with the maximum edge length separating the sequences in the same cluster. Multiple cutoffs may be provided in ascending or descending order. (See details section below.) |

`showPlot` |
Logical specifying whether or not to plot the resulting dendrogram. Not applicable if |

`type` |
Character string indicating the type of output desired. This should be (an abbreviation of) one of |

`myXStringSet` |
If |

`collapse` |
Numeric controlling which edges of the tree are removed by collapsing their nodes. If |

`model` |
One or more of the available |

`processors` |
The number of processors to use, or |

`verbose` |
Logical indicating whether to display progress. |

### Details

`IdClusters`

groups the input sequences into clusters using a set dissimilarities representing the distance between *N* sequences. Initially a phylogenetic tree is formed using the specified `method`

. Then each leaf (sequence) of the tree is assigned to a cluster based on its edge lengths to the other sequences. The available clustering methods are described as follows:

Ultrametric methods: The method `complete`

assigns clusters using complete-linkage so that sequences in the same cluster are no more than `cutoff`

percent apart. The method `single`

assigns clusters using single-linkage so that sequences in the same cluster are within `cutoff`

of at least one other sequence in the same cluster. `UPGMA`

(the default) or `WPGMA`

assign clusters using average-linkage which is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to connect distant relatives that do not appear to be closely related. `UPGMA`

produces an unweighted tree, where each leaf contributes equally to the average edge lengths, whereas `WPGMA`

produces a weighted result.

Additive methods: `NJ`

uses the Neighbor-Joining method proposed by Saitou and Nei that does not assume lineages evolve at the same rate (the molecular clock hypothesis). The `NJ`

method is typically the most phylogenetically accurate of the above distance-based methods. `ML`

creates a neighbor-joining tree and then iteratively maximizes the likelihood of the tree given the aligned sequences (`myXStringSet`

). This is accomplished through a combination of optimizing edge lengths with Brent's method and improving tree topology with nearest-neighbor interchanges (NNIs). When `method="ML"`

, one or more `MODELS`

of DNA evolution must be specified. Model parameters are iteratively optimized to maximize likelihood, except base frequencies which are empirically determined. If multiple `model`

s are given, the best `model`

is automatically chosen based on BIC calculated from the likelihood and the sample size (defined as the number of variable sites in the DNA sequence).

Sequence-only method: `inexact`

uses a heuristic algorithm to directly assign sequences to clusters without a distance matrix. First the sequences are ordered by length and the longest sequence becomes the first cluster seed. If the second sequence is less than `cutoff`

percent distance then it is added to the cluster, otherwise it becomes a new cluster representative. The remaining sequences are matched to cluster representatives based on their k-mer distribution and then aligned to find the closest sequence. This approach is repeated until all sequences belong to a cluster. In the vast majority of cases, this process results in clusters with members separated by less than `cutoff`

distance, where distance is defined as the percent dissimilarity between the overlapping region of a “glocal” alignment.

Multiple cutoffs may be provided if they are in increasing or decreasing order. If `cutoff`

s are provided in *descending* order then clustering at each new value of `cutoff`

is continued within the prior `cutoff`

's clusters. In this way clusters at lower values of `cutoff`

are completely contained within their umbrella clusters at higher values of `cutoff`

. This is useful for defining taxonomy, where lower level groups (e.g., genera) are expected not to straddle multiple higher level groups (e.g., families). If multiple cutoffs are provided in *ascending* order then clustering at each level of `cutoff`

is independent of the prior level. This may result in fewer high-level clusters for `NJ`

and `ML`

methods, but will have no impact on ultrametric methods. Providing `cutoff`

s in descending order makes `inexact`

clustering faster, but has negligible impact on the other `method`

s.

### Value

If `type`

is `"clusters"`

(the default), then a data.frame is returned with a column for each cutoff specified. This data.frame has dimensions *N*M*, where each one of *N* sequences is assigned to a cluster at the *M*-level of cutoff. The row.names of the data.frame correspond to the dimnames of `myDistMatrix`

.
If `type`

is `"dendrogram"`

, then an object of class `dendrogram`

is returned that can be used for plotting. Leaves of the dendrogram are colored by cluster number.
If `type`

is `"both"`

then a list is returned containing both the `"clusters"`

and `"dendrogram"`

outputs.

### Author(s)

Erik Wright DECIPHER@cae.wisc.edu

### References

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. *Journal of Molecular Evolution*, **17(6)**, 368-376.

Ghodsi, M., Liu, B., & Pop, M. (2011) DNACLUST. *BMC Bioinformatics*, **12(1)**, 271. doi:10.1186/1471-2105-12-271.

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. *Molecular Biology and Evolution*, **4(4)**, 406-425.

### See Also

`DistanceMatrix`

, `Add2DB`

, `MODELS`

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | ```
# using the matrix from the original paper by Saitou and Nei
m <- matrix(0,8,8)
m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
m[4:8,3] <- c(5, 7, 10, 7, 11)
m[5:8,4] <- c(8, 11, 8, 12)
m[6:8,5] <- c(5, 6, 10)
m[7:8,6] <- c(9, 13)
m[8,7] <- c(8)
# returns an object of class "dendrogram"
tree <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, type="dendrogram")
# example of specifying multiple cutoffs
clusters <- IdClusters(m, cutoff=c(2,6,10,20)) # returns a data frame
head(clusters)
# example of 'inexact' clustering
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
IdClusters(myXStringSet=dna, method="inexact", cutoff=0.05)
``` |