knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

ClustPhy is an R package for clustering phylogenetic trees (using PAM or EM clustering), comparing different clusterings (using gap statistics), and visualizing the clusters (in a phylogenetic tree or in a 2D biplot). This document gives a tour of the ClustPhy package.

To download ClustPhy, use the following commands:

require("devtools")
install_github("rainali475/ClustPhy", build_vignettes = TRUE)
library("ClustPhy")

To list all functions available in the package:

ls("package:ClustPhy")

To list all sample datasets available in the package:

data(package = "ClustPhy")


Components

There are 6 functions available in this package.

The 2 clustering functions, clustPAM and clustEM, take a phylogenetic tree in Newick format, either as a character string or as a file path, and cluster it using the PAM (k-medoids) or EM (expectation maximization) algorithm, respectively. Users can specify the number of clusters they want.

The functions plotClustersTree and plotClusters2D visualize tree clusters on a phylogram or a 2D biplot, respectively. Users can specify whether to show designated cluster centers, the symbols used to represent these centers, and the text size for these symbols. plotClusters2D first converts the distance matrix of the tree to a coordinate matrix, then uses principal component analysis to reduce the dimensionality of the matrix so that the data points can be plotted in 2 dimensions.

The compareGap function takes as input a distance matrix representation of a phylogenetic tree and outputs a set of gap statistics for each number of clusters from 1 to k.max. These can be used to select the best clustering scheme for the target tree. The plotGapStat function takes the gap statistics output from compareGap and plots them, with a vertical dashed line marking the best number of clusters.
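
The two-step reduction described for plotClusters2D (distance matrix to coordinates, then PCA) can be illustrated with base R alone. The sketch below is not ClustPhy's implementation: it swaps in classical multidimensional scaling (`cmdscale`) for the coordinate step, and the points and the names `pts`, `coords`, `xy` and `varExplained` are made up for illustration. The toy distances come from real 3D points so that the embedding is exact; distances taken from a tree are generally not exactly Euclidean, so the recovered coordinates would only approximate them.

```r
# Made-up 3D points stand in for taxa; their pairwise distances play the
# role of the tree's distance matrix (hypothetical data, not from ClustPhy).
pts <- matrix(
  c(0, 0, 0,
    1, 0, 0,
    0, 2, 0,
    0, 0, 3,
    4, 4, 1,
    5, 3, 2),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:6), NULL)
)
distM <- as.matrix(dist(pts))

# Step 1: distance matrix -> coordinate matrix (classical MDS).
coords <- cmdscale(as.dist(distM), k = 3)

# Step 2: PCA on the coordinates, keeping the first two components.
pca <- prcomp(coords)
xy <- pca$x[, 1:2]

# Proportion of variance captured by each principal component.
varExplained <- pca$sdev^2 / sum(pca$sdev^2)

# 2D scatter plot of the taxa, labelled by name.
plot(xy, xlab = "PC1", ylab = "PC2", pch = 19)
text(xy, labels = rownames(xy), pos = 3, cex = 0.8)
```

Inspecting `varExplained` gives a rough idea of how much of the structure in the distance matrix survives the projection onto two axes.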

Here is an example that shows how to use ClustPhy to cluster a tree via EM and PAM:

> pam <- clustPAM(6, text = NwkTree2)
> em <- clustEM(6, text = NwkTree2)
> str(pam)
List of 5
 $ distM     : num [1:72, 1:72] 0 138 169 164 208 195 173 195 190 196 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
  .. ..$ : chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
 $ phyloTree :List of 4
  ..$ edge       : int [1:141, 1:2] 73 73 74 75 76 76 75 77 77 78 ...
  ..$ edge.length: num [1:141] 55 55 15 1 12 43 10 29 55 18 ...
  ..$ Nnode      : int 70
  ..$ tip.label  : chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
  ..- attr(*, "class")= chr "phylo"
  ..- attr(*, "order")= chr "cladewise"
 $ clustering: Named int [1:72] 1 1 1 1 1 1 2 2 2 2 ...
  ..- attr(*, "names")= chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
 $ medoids   : chr [1:6] "Orycteropus" "Manis" "Presbytis" "Mus" ...
 $ stats     : num [1:6, 1:5] 6 17 21 6 9 13 138 174 102 93 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "size" "max_diss" "av_diss" "diameter" ...
 - attr(*, "class")= chr "PAMclusts"
> em <- clustEM(6, text = NwkTree2)
> str(em)
List of 6
 $ distM     : num [1:72, 1:72] 0 138 169 164 208 195 173 195 190 196 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
  .. ..$ : chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
 $ phyloTree :List of 4
  ..$ edge       : int [1:141, 1:2] 73 73 74 75 76 76 75 77 77 78 ...
  ..$ edge.length: num [1:141] 55 55 15 1 12 43 10 29 55 18 ...
  ..$ Nnode      : int 70
  ..$ tip.label  : chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
  ..- attr(*, "class")= chr "phylo"
  ..- attr(*, "order")= chr "cladewise"
 $ clustering: Named num [1:72] 1 2 1 1 1 1 2 2 2 2 ...
  ..- attr(*, "names")= chr [1:72] "Edentata" "Orycteropus" "Trichechus" "Procavia" ...
 $ mean      : num [1:72, 1:6] 147.2 84.8 98.6 87.6 92 ...
 $ bic       : num -47979
 $ model     : chr "spherical, unequal volume"
 - attr(*, "class")= chr "EMclusts"
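
The visible column names of the stats element in the PAM output above (size, max_diss, av_diss, diameter) match those of the `clusinfo` table returned by `pam` in the cluster package. Whether clustPAM calls that function internally is an assumption, but the following sketch shows how k-medoids clustering on a distance matrix produces the same kind of information. The points and taxon names are hypothetical.

```r
library(cluster)

# Hypothetical taxa: two tight groups of three points each.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
rownames(pts) <- paste0("taxon", 1:6)

# pam() accepts a dissimilarity object directly when diss = TRUE.
d <- dist(pts)
fit <- pam(d, k = 2, diss = TRUE)

fit$clustering  # cluster assignment for each taxon
fit$id.med      # row indices of the medoids
fit$clusinfo    # per-cluster summary: size, max_diss, av_diss, ...
```

Because a medoid is always one of the input observations, each cluster center can be reported as a taxon name, which is what the medoids element in the output above contains.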

Users can then use the plotClustersTree function to plot phylograms of both clustering schemes:

### plot the pam clusters
plotClustersTree(pam$phyloTree, pam$clustering, show.centers = pam$medoids, center.symbol = pam$medoids)
### plot the em clusters
plotClustersTree(em$phyloTree, em$clustering)

The PAM clusters phylogram:

The EM clusters phylogram:

Users can also use the plotClusters2D function to plot the tree data points on a 2D plot.

The dimensionality reduction information is stored in the returned object.

The gap statistics computed by compareGap can be plotted with plotGapStat.
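
For readers who want to see the idea behind a gap statistic comparison without the package at hand, here is a minimal sketch using `clusGap` and `maxSE` from the cluster package. It is not the code of compareGap or plotGapStat: the data are simulated 2D coordinates rather than a tree's distance matrix, the names `pamCluster`, `kMax` and `bestK` are hypothetical, and the "firstSEmax" selection rule is just one of the options `maxSE` offers.

```r
library(cluster)

set.seed(42)  # the reference distribution in clusGap is simulated

# Hypothetical 2D coordinates: three loose groups of ten points each.
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.5), ncol = 2)
)

# clusGap expects a function(x, k) that returns a list with a `cluster` entry.
pamCluster <- function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))

# Gap statistics for k = 1, ..., kMax, using B simulated reference data sets.
kMax <- 5
gap <- clusGap(x, FUNcluster = pamCluster, K.max = kMax, B = 50)

# Pick the number of clusters with the "firstSEmax" rule.
bestK <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

# Gap curve with a vertical dashed line at the selected number of clusters.
plot(1:kMax, gap$Tab[, "gap"], type = "b",
     xlab = "Number of clusters k", ylab = "Gap statistic")
abline(v = bestK, lty = 2)
```

The selected value depends on the rule passed to `maxSE` and on the simulated reference data, so it is worth checking the whole curve rather than relying on the dashed line alone.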

Package References

Li, Y. (2021) ClustPhy: A Phylogenetic Tree Clustering Package. Unpublished. https://github.com/rainali475/ClustPhy

Other References

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Kaufman, L., & Rousseeuw, P. J. (2005). Finding groups in data: An introduction to cluster analysis. Wiley.

Legendre17. (1960, August 1). Finding the coordinates of points from distance matrix. Mathematics Stack Exchange. Retrieved November 17, 2021, from https://math.stackexchange.com/questions/156161/finding-the-coordinates-of-points-from-distance-matrix

rainali475/ClustPhy documentation built on Dec. 22, 2021, 12:03 p.m.