clustering: Clustering genomes

View source: R/clustering.R

clustering R Documentation

Clustering genomes

Description

This function clusters genomes using mash, accnet or igraph data. Objects produced by the accnet, mash and knnn functions can be clustered. accnet objects are clustered using the Jaccard distance computed from gene/protein presence/absence data. mash objects are clustered using the Mash distance values directly as the dissimilarity measure. igraph objects can be clustered using the methods available in igraph.
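
As a rough illustration of the Jaccard distance mentioned above, the following base R sketch computes it for a small presence/absence matrix. The matrix, the genome names and the gene names are made up for the example, and dist(method = "binary") is used as a stand-in; it is not necessarily how clustering() computes the distance internally.

```r
# Hypothetical presence/absence matrix: rows are genomes, columns are genes.
pa <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("genomeA", "genomeB", "genomeC"),
                  c("gene1", "gene2", "gene3", "gene4"))
)

# For 0/1 data, the "binary" method of dist() is the Jaccard distance:
# (genes present in exactly one genome) / (genes present in at least one).
jaccard <- dist(pa, method = "binary")

# The same value computed by hand for genomeA vs genomeB.
a <- pa["genomeA", ] == 1
b <- pa["genomeB", ] == 1
manual_ab <- 1 - sum(a & b) / sum(a | b)
```

Genomes that share most of their genes get a distance close to 0, and genomes with no genes in common get a distance of 1.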

Usage

clustering(data, method, n_clust, d_reduction = FALSE)

Arguments

data

An object of class accnet/mash/igraph

method

Method of clustering

  • for accnet objects:

    • mclust: Performs clustering using Gaussian finite mixture models. It can be combined with d_reduction. This method uses the mclust package and is set up to find the optimal number of clusters.

    • upgma: Performs hierarchical clustering using the UPGMA algorithm. n_clust must be provided.

    • ward.D2: Performs hierarchical clustering using Ward's algorithm. n_clust must be provided.

    • hdbscan: Performs hierarchical density-based spatial clustering of applications with noise (HDBSCAN) using the dbscan package. It finds the optimal number of clusters.

  • for mash objects:

    • mclust: Performs clustering using Gaussian finite mixture models. It can be combined with d_reduction. This method uses the mclust package and is set up to find the optimal number of clusters.

    • upgma: Performs hierarchical clustering using the UPGMA algorithm. n_clust must be provided.

    • ward.D2: Performs hierarchical clustering using Ward's algorithm. n_clust must be provided.

    • hdbscan: Performs hierarchical density-based spatial clustering of applications with noise (HDBSCAN) using the dbscan package. It finds the optimal number of clusters.

  • for igraph objects:

    • greedy: Community structure via greedy optimization of modularity

    • louvain: This method implements the multi-level modularity optimization algorithm for finding community structure

    • walktrap: Community structure via short random walks
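
The igraph methods listed above can be tried directly on a toy network. This is only a sketch: the graph is made up, and the mapping of "greedy", "louvain" and "walktrap" to cluster_fast_greedy(), cluster_louvain() and cluster_walktrap() is an assumption based on the igraph function descriptions, not a statement of what clustering() calls internally.

```r
library(igraph)

# Hypothetical toy network: two fully connected groups of five nodes
# joined by a single edge, so two communities are expected.
g <- make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6))

# Community detection with the three igraph algorithms (assumed mapping).
comm_greedy   <- cluster_fast_greedy(g)
comm_louvain  <- cluster_louvain(g)
comm_walktrap <- cluster_walktrap(g)

# membership() gives the community index of each node.
memb_greedy   <- as.integer(membership(comm_greedy))
memb_walktrap <- as.integer(membership(comm_walktrap))
```

On a real k-nn network, the three algorithms will not necessarily agree on the number of communities.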

n_clust

Number of clusters (only for hierarchical methods)
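
To show why the hierarchical methods need a cluster number, here is a minimal base R sketch that builds UPGMA and Ward trees and cuts them into a fixed number of groups. The coordinates and genome names are hypothetical, and hclust()/cutree() are used for illustration only; the internals of clustering() may differ.

```r
# Hypothetical data: six genomes forming two well-separated groups.
coords <- rbind(
  g1 = c(0.0, 0.0), g2 = c(0.1, 0.0), g3 = c(0.0, 0.1),
  g4 = c(5.0, 5.0), g5 = c(5.1, 5.0), g6 = c(5.0, 5.1)
)
d <- dist(coords)

# UPGMA corresponds to average linkage in hclust().
tree_upgma <- hclust(d, method = "average")
tree_ward  <- hclust(d, method = "ward.D2")

# A tree has no clusters until it is cut; the cluster number sets the cut.
n_clust  <- 2
cl_upgma <- cutree(tree_upgma, k = n_clust)
cl_ward  <- cutree(tree_ward, k = n_clust)

# Arrange the result as a table with "Source" and "Cluster" columns.
membership <- data.frame(Source = names(cl_upgma),
                         Cluster = unname(cl_upgma),
                         stringsAsFactors = FALSE)
```

Changing n_clust changes only where the tree is cut; the tree itself stays the same.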

d_reduction

Logical. If TRUE, perform a dimensionality reduction (UMAP) before the clustering process.

Value

A membership data.frame with the columns "Source" and "Cluster"
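
A table with this shape can be summarised with base R. The sketch below uses a hand-made data.frame with the two documented columns; the genome names and cluster labels are hypothetical and do not come from a real clustering() call.

```r
# Hypothetical membership table with the two columns described above.
membership <- data.frame(
  Source  = c("genome1", "genome2", "genome3", "genome4", "genome5"),
  Cluster = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Number of genomes per cluster.
cluster_sizes <- table(membership$Cluster)

# Genome names grouped by cluster.
genomes_by_cluster <- split(membership$Source, membership$Cluster)
```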

Note

Clustering of igraph objects depends on how the network was built (see the knnn function), and the number of clusters may vary between different settings of the k-nn network. Network-based methods are faster than distance-based methods.
Dimensionality reduction tries to overcome the "curse of dimensionality" (more variables than samples; see https://en.wikipedia.org/wiki/Curse_of_dimensionality). Using umap from the uwot package, the dataset is reduced to two dimensions. Note that methods based on HDBSCAN always perform the dimensionality reduction.
There is no universal criterion for selecting the number of clusters, and the best configuration for one dataset may not be the best for others.

To learn more about clustering, we recommend the book "Practical Guide To Cluster Analysis in R" by Alboukadel Kassambara (STHDA ed.).

See Also

For more information: knnn, accnet, mash, igraph.


irycisBioinfo/PATO documentation built on Oct. 19, 2023, 3:07 p.m.