clustering: Clustering genomes
In irycisBioinfo/PATO: Pangenome Analysis Toolkit

clustering

R Documentation

Clustering genomes

Description

This function clusters genomes using mash, accnet or igraph data. Objects produced by the accnet, mash and knnn functions can be clustered. accnet objects are clustered using the Jaccard distance computed from gene/protein presence/absence data. mash objects use the mash distance values as the measure of similarity between genomes. igraph objects can be clustered using the methods available in igraph.
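
As an illustration of the Jaccard distance on presence/absence data, here is a minimal base R sketch. The matrix `pa` and its genome and gene names are made up for the example, and using `dist(method = "binary")` is an assumption about one way to compute the distance, not necessarily what the package does internally.

```r
# Toy presence/absence matrix (hypothetical data):
# rows are genomes, columns are genes, 1 = present, 0 = absent.
pa <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("genomeA", "genomeB", "genomeC"),
                  c("gene1", "gene2", "gene3", "gene4"))
)

# For each pair of rows, dist(method = "binary") returns the proportion of
# columns in which only one row is non-zero among the columns in which at
# least one row is non-zero. That quantity is the Jaccard distance.
jaccard <- dist(pa, method = "binary")
jaccard_matrix <- as.matrix(jaccard)
```

In this toy case genomeA and genomeC share no genes, so their distance is 1, while genomeA and genomeB share two of three genes, giving a distance of 1/3.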

Usage

clustering(data, method, n_clust, d_reduction = FALSE)

Arguments

`data`	An object of class accnet/mash/igraph
`method`	Method of clustering.
For accnet and mash objects:
mclust: Performs clustering using Gaussian finite mixture models. It can be combined with d_reduction. This method uses the `mclust` package and is implemented to find the optimal number of clusters.
upgma: Performs hierarchical clustering using the UPGMA algorithm. n_clust must be provided.
ward.D2: Performs hierarchical clustering using the Ward algorithm. n_clust must be provided.
hdbscan: Performs hierarchical density-based spatial clustering of applications with noise (HDBSCAN) using the `dbscan` package. It finds the optimal number of clusters.
For igraph objects:
greedy: Community structure via greedy optimization of modularity.
louvain: Implements the multi-level modularity optimization algorithm for finding community structure.
walktrap: Community structure via short random walks.
`n_clust`	Number of clusters (only for hierarchical methods).
`d_reduction`	Logical. Perform a dimensionality reduction (UMAP) before the clustering process.
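
To show what the hierarchical methods and `n_clust` amount to, here is a sketch in base R. The helper `hier_membership()` and the toy distance matrix are hypothetical and are not part of PATO; the sketch assumes that upgma corresponds to average linkage in `hclust()` and that the tree is cut into `n_clust` groups with `cutree()`.

```r
# Hypothetical helper, not part of PATO: cluster a "dist" object hierarchically
# and return a membership table with "Source" and "Cluster" columns.
hier_membership <- function(d, method = c("average", "ward.D2"), n_clust) {
  method <- match.arg(method)
  tree <- hclust(d, method = method)   # "average" linkage is UPGMA
  groups <- cutree(tree, k = n_clust)  # cut the tree into n_clust groups
  data.frame(Source = names(groups), Cluster = unname(groups),
             stringsAsFactors = FALSE)
}

# Toy symmetric distance matrix for four genomes forming two close pairs.
m <- matrix(c(0.0, 0.1, 0.9, 0.8,
              0.1, 0.0, 0.8, 0.9,
              0.9, 0.8, 0.0, 0.1,
              0.8, 0.9, 0.1, 0.0),
            nrow = 4,
            dimnames = list(paste0("g", 1:4), paste0("g", 1:4)))
d <- as.dist(m)

upgma_res <- hier_membership(d, method = "average", n_clust = 2)
ward_res  <- hier_membership(d, method = "ward.D2", n_clust = 2)
```

Both calls place g1 with g2 and g3 with g4. On real data the two linkage criteria can disagree, which is one reason the choice of method and `n_clust` is left to the user.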

Value

A membership data.frame with the columns "Source" and "Cluster"

Note

Clustering of igraph objects depends on how the network was built (see the knnn function), and the number of clusters may vary between different settings of the k-nn network. Network-based methods are faster than distance-based methods.
Dimensional reduction tries to overcome "the curse of dimensionality" (more variables than samples: https://en.wikipedia.org/wiki/Curse_of_dimensionality). Using umap from the uwot package, the dataset is reduced to two dimensions. Note that the hdbscan method always performs the dimensional reduction.
There is no universal criterion for selecting the number of clusters, and the best configuration for one dataset may not be the best one for others.
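
For the network-based methods, the following sketch runs igraph community detection on a toy graph. The mapping of greedy, louvain and walktrap to `cluster_fast_greedy()`, `cluster_louvain()` and `cluster_walktrap()` is an assumption based on the method descriptions, and `graph_membership()` is a hypothetical helper, not a PATO function.

```r
library(igraph)

# Toy network (hypothetical): two triangles joined by a single edge.
g <- graph_from_literal(A - B, B - C, A - C, D - E, E - F, D - F, C - D)

# Assumed mapping from the documented method names to igraph functions.
detect <- list(
  greedy   = cluster_fast_greedy,
  louvain  = cluster_louvain,
  walktrap = cluster_walktrap
)

# Hypothetical helper: run one community-detection method and return a
# membership table with "Source" and "Cluster" columns.
graph_membership <- function(graph, method) {
  comm <- detect[[method]](graph)
  data.frame(Source = V(graph)$name,
             Cluster = as.integer(membership(comm)),
             stringsAsFactors = FALSE)
}

greedy_res   <- graph_membership(g, "greedy")
walktrap_res <- graph_membership(g, "walktrap")
```

Because the communities are found on the graph itself, changing how the k-nn network is built changes the edges and can therefore change the number of clusters returned.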

To learn more about clustering, we recommend the book "Practical Guide To Cluster Analysis in R" by Alboukadel Kassambara (STHDA ed.).
