IdClusters: Cluster Sequences By Distance or Sequence
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description Usage Arguments Details Value Author(s) References See Also Examples

Groups the sequences into clusters of similarity or creates a tree from a set of sequences.

IdClusters(myDistMatrix = NULL,
           method = "UPGMA",
           cutoff = -Inf,
           showPlot = FALSE,
           type = "clusters",
           myXStringSet = NULL,
           model = MODELS,
           collapse = 0,
           reconstruct = FALSE,
           root = 0,
           processors = 1,
           verbose = TRUE)

`myDistMatrix`	A symmetric N x N distance matrix with the values of dissimilarity between N sequences, an object of class `'dist'`, or `NULL` if `method` is `"inexact"`.
`method`	An agglomeration method to be used. This should be (an abbreviation of) one of `"complete"`, `"single"`, `"UPGMA"`, `"WPGMA"`, `"NJ"`, `"ML"`, or `"inexact"`. (See details section below.)
`cutoff`	A vector with the maximum edge length separating the sequences in the same cluster. Multiple cutoffs may be provided in ascending or descending order. (See details section below.)
`showPlot`	Logical specifying whether or not to plot the resulting dendrogram. Not applicable if `method='inexact'`.
`type`	Character string indicating the type of output desired. This should be (an abbreviation of) one of `"clusters"`, `"dendrogram"`, or `"both"`. Not applicable if `method='inexact'`. (See value section below.)
`myXStringSet`	If `method` is `"ML"` or `reconstruct` is not `FALSE`, the `DNAStringSet` or `RNAStringSet` used in the creation of `myDistMatrix`. If `method` is `"inexact"`, the `DNAStringSet`, `RNAStringSet`, or `AAStringSet` to cluster.
`model`	One or more of the available `MODELS` of evolution. Only applicable if `method` is `"ML"` or `reconstruct` is not `FALSE`.
`collapse`	Numeric controlling which internal edges of the tree are removed by collapsing their nodes. If `collapse` is zero (the default) then nodes at the same height will be collapsed to a single node, resulting in a multifurcating tree. When `collapse` is greater than zero, nodes that are within `collapse` difference in height are made into a single node. A value of `collapse` less than zero will ensure that the `dendrogram` is purely bifurcating. Note that `collapse` has no effect on cluster numbers or `cutoff`.
`reconstruct`	Logical or numeric determining whether to perform ancestral state reconstruction. If `TRUE`, maximum likelihood character states are determined at internal nodes of the `dendrogram` and provided as the `"state"` attribute. A numeric value between zero and one (exclusive) can be provided, in which case that fraction of the likelihood for a state must be greater than the likelihood of all alternative states, otherwise a more ambiguous degeneracy code is used. Requires specification of a `DNAStringSet` or `RNAStringSet` via `myXStringSet`. Only applicable if `type` is `"dendrogram"` or `"both"`.
`root`	Integer specifying the index of the outgroup in `myDistMatrix`, or `0` (the default) to midpoint root the dendrogram.
`processors`	The number of processors to use, or `NULL` to automatically detect and use all available processors.
`verbose`	Logical indicating whether to display progress.

IdClusters groups the input sequences into clusters using a set of dissimilarities representing the distance between N sequences. For all methods except "inexact", a phylogenetic tree is formed and each leaf (sequence) of the tree is assigned to a cluster based on its edge lengths to the other sequences. The available clustering methods are described as follows:

Ultrametric methods: The method complete assigns clusters using complete-linkage so that sequences in the same cluster are no more than cutoff distance apart. The method single assigns clusters using single-linkage so that sequences in the same cluster are within cutoff of at least one other sequence in the same cluster. UPGMA (the default) or WPGMA assign clusters using average-linkage which is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to connect distant relatives that do not appear to be closely related. UPGMA produces an unweighted tree, where each leaf contributes equally to the average edge lengths, whereas WPGMA produces a weighted result.

Additive methods: NJ uses the Neighbor-Joining method proposed by Saitou and Nei that does not assume lineages evolve at the same rate (the molecular clock hypothesis). The NJ method is typically the most phylogenetically accurate of the above distance-based methods. ML creates a neighbor-joining tree and then iteratively maximizes the likelihood of the tree given the aligned sequences (myXStringSet). This is accomplished through a combination of optimizing edge lengths with Brent's method and improving tree topology with nearest-neighbor interchanges (NNIs). When method="ML", one or more MODELS of DNA evolution must be specified. Model parameters are iteratively optimized to maximize likelihood, except base frequencies which are empirically determined. If multiple models are given, the best model is automatically chosen based on BIC calculated from the likelihood and the sample size (defined as the number of variable sites in the DNA sequence).

Sequence-only method: inexact uses a heuristic algorithm to directly assign sequences to clusters without a distance matrix. First the sequences are ordered by length and the longest sequence becomes the first cluster seed. If the second sequence is less than cutoff k-mer distance then it is added to the cluster, otherwise it becomes a new cluster representative. The remaining sequences are matched to cluster representatives in a similar fashion until all sequences belong to a cluster. In the majority of cases, this process results in clusters with members separated by less than cutoff distance.

For all methods, multiple cutoffs may be provided in sorted order. If the cutoffs are provided in descending order then clustering at each new value of cutoff is continued within the prior cutoff's clusters. In this way clusters at lower values of cutoff are completely contained within their umbrella clusters at higher values of cutoff. This is useful for defining taxonomy, where groups need to be hierarchically nested. If multiple cutoffs are provided in ascending order then clustering at each level of cutoff is independent of the prior level. This may result in fewer high-level clusters for NJ and ML methods, but will have no impact on ultrametric methods unless root is greater than zero (i.e., outgroup rooted).

If type is "clusters" (the default), then a data.frame is returned with a column for each cutoff specified. This data.frame has dimensions N*M, where each one of N sequences is assigned to a cluster at the M-level of cutoff. The row.names of the data.frame correspond to the dimnames of myDistMatrix. If type is "dendrogram", then an object of class dendrogram is returned that can be used for plotting. Leaves of the dendrogram are colored by cluster number, but the same color may be repeated for different clusters. Note that the cophenetic distance between leaves of the tree is equal to the sum of branch lengths separating the leaves. This is different than trees produced by hclust where leaves are merged at a height equal to their cophenetic distance. If type is "both" then a list is returned containing both the "clusters" and "dendrogram" outputs.

Erik Wright eswright@pitt.edu

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6), 368-376.

Ghodsi, M., Liu, B., & Pop, M. (2011) DNACLUST. BMC Bioinformatics, 12(1), 271. doi:10.1186/1471-2105-12-271.

Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425.

Cophenetic, DistanceMatrix, Add2DB, MODELS

# using the matrix from the original paper by Saitou and Nei
m <- matrix(0,8,8) # only the lower triangle is used
m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
m[4:8,3] <- c(5, 7, 10, 7, 11)
m[5:8,4] <- c(8, 11, 8, 12)
m[6:8,5] <- c(5, 6, 10)
m[7:8,6] <- c(9, 13)
m[8,7] <- c(8)

# returns an object of class "dendrogram"
tree <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, type="dendrogram")

# example of specifying multiple cutoffs
clusters <- IdClusters(m, type="clusters", cutoff=c(2,6,10,20)) # returns a data frame
head(clusters)

# example of creating a complete-linkage tree from an alignment
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
dna # input alignment
d <- DistanceMatrix(dna, type="dist") # returns an object of class 'dist'
IdClusters(d, method="complete", cutoff=0.05, showPlot=TRUE, type="dendrogram")

# example of 'inexact' clustering
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
aa <- translate(dna)
inexact <- IdClusters(myXStringSet=aa, method="inexact", cutoff=seq(0.7, 0.1, -0.1))
head(inexact)
apply(inexact, 2, max) # number of clusters per cutoff

Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: XVector

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: RSQLite
================================================================================

Time difference of 0.08 secs

================================================================================

Time difference of 0.01 secs

  cluster2UPGMA cluster6UPGMA cluster10UPGMA cluster20UPGMA
1             7             5              2              1
2             5             3              2              1
3             4             2              2              1
4             3             2              2              1
5             2             1              1              1
6             1             1              1              1
DNAStringSet object of length 175:
      width seq                                             names               
  [1]  1596 ----------------------...---------------------- uncultured bacter...
  [2]  1596 ----------------------...---------------------- uncultured bacter...
  [3]  1596 ----------------------...---------------------- uncultured bacter...
  [4]  1596 ----------------------...---------------------- uncultured bacter...
  [5]  1596 ----------------------...---------------------- uncultured bacter...
  ...   ... ...
[171]  1596 ----------------------...---------------------- uncultured bacter...
[172]  1596 ----------------------...---------------------- uncultured bacter...
[173]  1596 ----------------------...---------------------- uncultured bacter...
[174]  1596 ----------------------...---------------------- uncultured bacter...
[175]  1596 ----------------------...---------------------- uncultured bacter...
================================================================================

Time difference of 0.16 secs

================================================================================

Time difference of 0.53 secs

'dendrogram' with 2 branches and 175 members total, at height 0.1694984 
================================================================================
Time difference of 0.15 secs

                                    cluster0_7inexact cluster0_6inexact
Rickettsia prowazekii str. Dachau                   8                11
Porphyromonas gingivalis W83                       12                 5
Porphyromonas gingivalis TDC60                     12                 5
Porphyromonas gingivalis ATCC 33277                12                 5
Pasteurella multocida 671/90                       13                 4
Pasteurella multocida 36950                        13                 4
                                    cluster0_5inexact cluster0_4inexact
Rickettsia prowazekii str. Dachau                  23                21
Porphyromonas gingivalis W83                        3                 6
Porphyromonas gingivalis TDC60                      3                 6
Porphyromonas gingivalis ATCC 33277                 3                 6
Pasteurella multocida 671/90                        0                 2
Pasteurella multocida 36950                         0                 2
                                    cluster0_3inexact cluster0_2inexact
Rickettsia prowazekii str. Dachau                  27                24
Porphyromonas gingivalis W83                        3                 7
Porphyromonas gingivalis TDC60                      3                 7
Porphyromonas gingivalis ATCC 33277                 3                 7
Pasteurella multocida 671/90                       15                 2
Pasteurella multocida 36950                        15                 2
                                    cluster0_1inexact
Rickettsia prowazekii str. Dachau                  30
Porphyromonas gingivalis W83                        3
Porphyromonas gingivalis TDC60                      3
Porphyromonas gingivalis ATCC 33277                 3
Pasteurella multocida 671/90                       94
Pasteurella multocida 36950                        94
cluster0_7inexact cluster0_6inexact cluster0_5inexact cluster0_4inexact 
               15                33                53                71 
cluster0_3inexact cluster0_2inexact cluster0_1inexact 
               85                94                98

DECIPHER documentation built on Nov. 8, 2020, 8:30 p.m.

DECIPHER index

Package overview Classify Sequences Design Group-Specific FISH Probes Design Group-Specific Primers Design Microarray Probes Design Primers That Yield Group-Specific Signatures Finding Chimeric Sequences Getting Started DECIPHERing The Art of Multiple Sequence Alignment in R The Magic of Gene Finding

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

IdClusters: Cluster Sequences By Distance or Sequence
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Example output

Related to IdClusters in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER Tools for curating, analyzing, and manipulating biological sequences

IdClusters: Cluster Sequences By Distance or Sequence In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Example output

Related to IdClusters in DECIPHER...

R Package Documentation

Browse R Packages

We want your feedback!

DECIPHER
Tools for curating, analyzing, and manipulating biological sequences

IdClusters: Cluster Sequences By Distance or Sequence
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences