Description Usage Arguments Details Value Author(s) References See Also Examples

Groups the sequences into clusters of similarity or creates a tree from a set of sequences.

1 2 3 4 5 6 7 8 9 10 |

`myDistMatrix` |
A symmetric |

`method` |
An agglomeration method to be used. This should be (an abbreviation of) one of |

`cutoff` |
A vector with the maximum edge length separating the sequences in the same cluster. Multiple cutoffs may be provided in ascending or descending order. (See details section below.) |

`showPlot` |
Logical specifying whether or not to plot the resulting dendrogram. Not applicable if |

`type` |
Character string indicating the type of output desired. This should be (an abbreviation of) one of |

`myXStringSet` |
If |

`collapse` |
Numeric controlling which internal edges of the tree are removed by collapsing their nodes. If |

`model` |
One or more of the available |

`processors` |
The number of processors to use, or |

`verbose` |
Logical indicating whether to display progress. |

`IdClusters`

groups the input sequences into clusters using a set of dissimilarities representing the distance between *N* sequences. In most cases, a phylogenetic tree is formed using the specified `method`

. Then each leaf (sequence) of the tree is assigned to a cluster based on its edge lengths to the other sequences. The available clustering methods are described as follows:

Ultrametric methods: The method `complete`

assigns clusters using complete-linkage so that sequences in the same cluster are no more than `cutoff`

distance apart. The method `single`

assigns clusters using single-linkage so that sequences in the same cluster are within `cutoff`

of at least one other sequence in the same cluster. `UPGMA`

(the default) or `WPGMA`

assign clusters using average-linkage which is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to connect distant relatives that do not appear to be closely related. `UPGMA`

produces an unweighted tree, where each leaf contributes equally to the average edge lengths, whereas `WPGMA`

produces a weighted result.

Additive methods: `NJ`

uses the Neighbor-Joining method proposed by Saitou and Nei that does not assume lineages evolve at the same rate (the molecular clock hypothesis). The `NJ`

method is typically the most phylogenetically accurate of the above distance-based methods. `ML`

creates a neighbor-joining tree and then iteratively maximizes the likelihood of the tree given the aligned sequences (`myXStringSet`

). This is accomplished through a combination of optimizing edge lengths with Brent's method and improving tree topology with nearest-neighbor interchanges (NNIs). When `method="ML"`

, one or more `MODELS`

of DNA evolution must be specified. Model parameters are iteratively optimized to maximize likelihood, except base frequencies which are empirically determined. If multiple `model`

s are given, the best `model`

is automatically chosen based on BIC calculated from the likelihood and the sample size (defined as the number of variable sites in the DNA sequence).

Sequence-only method: `inexact`

uses a heuristic algorithm to directly assign sequences to clusters without a distance matrix. First the sequences are ordered by length and the longest sequence becomes the first cluster seed. If the second sequence is less than `cutoff`

percent distance then it is added to the cluster, otherwise it becomes a new cluster representative. The remaining sequences are matched to cluster representatives based on their k-mer distribution and then aligned to find the closest sequence. This approach is repeated until all sequences belong to a cluster. In the vast majority of cases, this process results in clusters with members separated by less than `cutoff`

distance, where distance is defined as the percent dissimilarity between the overlapping region of a “glocal” alignment.

For all `method`

s, multiple cutoffs may be provided in sorted order. If the `cutoff`

s are provided in *descending* order then clustering at each new value of `cutoff`

is continued within the prior `cutoff`

's clusters. In this way clusters at lower values of `cutoff`

are completely contained within their umbrella clusters at higher values of `cutoff`

. This is useful for defining taxonomy, where groups need to be hierarchically nested. If multiple cutoffs are provided in *ascending* order then clustering at each level of `cutoff`

is independent of the prior level. This may result in fewer high-level clusters for `NJ`

and `ML`

methods, but will have no impact on ultrametric methods.

If `type`

is `"clusters"`

(the default), then a data.frame is returned with a column for each cutoff specified. This data.frame has dimensions *N*M*, where each one of *N* sequences is assigned to a cluster at the *M*-level of cutoff. The row.names of the data.frame correspond to the dimnames of `myDistMatrix`

.
If `type`

is `"dendrogram"`

, then an object of class `dendrogram`

is returned that can be used for plotting. Leaves of the dendrogram are colored by cluster number, but the same color may be repeated for different clusters. Note that the cophenetic distance between leaves of the tree is equal to the sum of branch lengths separating the leaves. This is different than trees produced by `hclust`

where leaves are merged at a height equal to their cophenetic distance.
If `type`

is `"both"`

then a list is returned containing both the `"clusters"`

and `"dendrogram"`

outputs.

Erik Wright [email protected]

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. *Journal of Molecular Evolution*, **17(6)**, 368-376.

Ghodsi, M., Liu, B., & Pop, M. (2011) DNACLUST. *BMC Bioinformatics*, **12(1)**, 271. doi:10.1186/1471-2105-12-271.

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. *Molecular Biology and Evolution*, **4(4)**, 406-425.

`DistanceMatrix`

, `Add2DB`

, `MODELS`

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | ```
# using the matrix from the original paper by Saitou and Nei
m <- matrix(0,8,8) # only the lower triangle is used
m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
m[4:8,3] <- c(5, 7, 10, 7, 11)
m[5:8,4] <- c(8, 11, 8, 12)
m[6:8,5] <- c(5, 6, 10)
m[7:8,6] <- c(9, 13)
m[8,7] <- c(8)
# returns an object of class "dendrogram"
tree <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, type="dendrogram")
# example of specifying multiple cutoffs
clusters <- IdClusters(m, type="clusters", cutoff=c(2,6,10,20)) # returns a data frame
head(clusters)
# example of creating a complete-linkage tree from an alignment
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
dna # input alignment
d <- DistanceMatrix(dna, type="dist") # returns an object of class 'dist'
IdClusters(d, method="complete", cutoff=0.05, showPlot=TRUE, type="dendrogram")
# example of 'inexact' clustering
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
aa <- translate(dna)
inexact <- IdClusters(myXStringSet=aa, method="inexact", cutoff=seq(0.7, 0.1, -0.1))
head(inexact)
apply(inexact, 2, max) # number of clusters per cutoff
``` |

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.