genomeClustering: Clustering genomes based on shared RMS fragments

View source: R/clustering.R

genomeClustering    R Documentation

Clustering genomes based on shared RMS fragments

Description

Alters an RMS object by clustering genomes that are too similar to distinguish.

Usage

genomeClustering(rms.obj, max.corr = 0.8, verbose = TRUE)

Arguments

rms.obj

A list with RMS data structures, see RMSobject.

max.corr

Maximum correlation between genomes, see Details.

verbose

Logical, turning on/off screen report on progress during clustering.

Details

This function will cluster genomes based on how similar they are in RMS fragment content. If two genomes are very similar with respect to RMS fragment content, their corresponding columns in rms.obj$Cpn.mat are highly correlated. If genomes are too correlated it is impossible to estimate their abundance separately, thus such genomes must be seen as a cluster, and we only estimate the abundance of this cluster.

Genomes are first represented as a graph, where two genomes are connected by an edge if they share at least one RMS fragment. For many highly unrelated genomes this step results in a disconnected graph, i.e. groups of genomes that share no fragments with each other. These graph components form the first grouping of the genomes, since genomes sharing no fragments will always end up in different clusters anyway. This step saves a lot of memory when the RMS object contains many unrelated genomes, since all these computations can be done on a sparse Matrix.
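The graph step can be illustrated with a minimal base-R sketch. It assumes a small dense toy copy-number matrix (fragments in rows, genomes in columns) and a hypothetical helper find_components(); it is not the package's own implementation, which is described as working on a sparse Matrix.

```r
# Toy copy-number matrix (made-up data): rows are RMS fragments, columns are genomes
cpn <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 0,
                0, 0, 1, 1,
                0, 0, 0, 1,
                0, 0, 0, 0),
              ncol = 4, byrow = TRUE,
              dimnames = list(paste0("frag", 1:5), paste0("G", 1:4)))

# Two genomes are adjacent if they share at least one fragment
present <- (cpn > 0) * 1
adj <- crossprod(present) > 0
diag(adj) <- TRUE  # every genome is connected to itself

# Hypothetical helper: connected components by repeated minimum-label propagation
find_components <- function(adj) {
  n <- nrow(adj)
  label <- seq_len(n)
  repeat {
    # each genome takes the smallest label among itself and its neighbours
    new.label <- vapply(seq_len(n), function(i) min(label[adj[i, ]]), integer(1))
    if (all(new.label == label)) break
    label <- new.label
  }
  as.integer(factor(label))  # relabel components as 1, 2, ...
}

components <- find_components(adj)
print(setNames(components, colnames(cpn)))
```

In this toy example G1 and G2 share a fragment, as do G3 and G4, so the genomes fall into two components that can then be clustered separately.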

Next, clustering within each graph component is done by hierarchical clustering with complete linkage. The distance metric is 1 minus correlation, i.e. genomes with a large correlation (close to 1.0) have a small distance (close to 0.0). The max.corr argument indicates where to cut the dendrogram to group the genomes: with max.corr = 0.8 the dendrogram is cut at distance 0.2. Note that the distance matrix cannot be a sparse Matrix, and if too many genomes in the rms.obj are in the same graph component, you may run into memory problems.
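The clustering within one graph component can be sketched with base R alone. The toy matrix below is made up, and 1 - cor() stands in for the package's corrDist function, whose exact implementation is not shown here; treat this as an illustration of the idea rather than the package code.

```r
# Toy copy numbers for three genomes over 20 fragments (made-up data)
g1 <- rep(c(1, 0), times = c(10, 10))
g2 <- rep(c(1, 0), times = c(9, 11))      # nearly identical to g1
g3 <- c(rep(0, 8), rep(1, 6), rep(0, 6))  # mostly different fragments
cpn <- cbind(G1 = g1, G2 = g2, G3 = g3)

# Distance is 1 minus the correlation between genome columns
d <- as.dist(1 - cor(cpn))

# Hierarchical clustering with complete linkage
tree <- hclust(d, method = "complete")

# Cut the dendrogram at distance 1 - max.corr
max.corr <- 0.8
clusters <- cutree(tree, h = 1 - max.corr)
print(clusters)
```

Here G1 and G2 have a correlation above 0.8 and are merged into one cluster, while G3 stays on its own.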

More technical details: The problem of too similar genomes will be reflected in a close to singular covariance matrix when de-convolving the genome abundances. Clustering by correlation distance is, in theory, no guarantee against this. Two (or more) fairly uncorrelated genomes may still combine into something very correlated with a third genome. However, in reality this rarely happens with RMS data. Use the conditionValue function on the resulting rms.obj$Cpn.mat to check whether the clustering has brought the condition value down to a tolerable size (e.g. around 1e+3 to 1e+4 or less).
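The theoretical caveat can be seen in a small constructed example. The helper cond_number() below is hypothetical and computes the ratio of the largest to the smallest singular value; the package's conditionValue function may be defined differently, so the numbers are only indicative.

```r
# Made-up copy numbers: g3 is exactly the sum of g1 and g2
g1 <- c(1, 1, 0, 0, 0, 0)
g2 <- c(0, 0, 1, 1, 0, 0)
g3 <- g1 + g2
cpn <- cbind(G1 = g1, G2 = g2, G3 = g3)

# All pairwise correlations are moderate in absolute value
r <- cor(cpn)
max.abs.corr <- max(abs(r[upper.tri(r)]))

# Hypothetical helper: ratio of largest to smallest singular value
cond_number <- function(x) {
  s <- svd(x)$d
  max(s) / min(s)
}

cond.full <- cond_number(cpn)                      # huge: the columns are linearly dependent
cond.reduced <- cond_number(cpn[, c("G1", "G2")])  # small once G3 is left out
print(c(max.abs.corr = max.abs.corr, full = cond.full, reduced = cond.reduced))
```

No pair of genomes would be merged at max.corr = 0.8, yet the full matrix is singular, which is why checking the condition value after clustering is worthwhile.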

Value

An updated RMS object, where Genome.tbl has an additional column named members_genome_id. This Genome.tbl typically has fewer rows than the original, one for each cluster, and the new column indicates which of the original genomes are grouped into each cluster. Each cluster is represented by one of the original genomes (the cluster medoid). The other objects inside the RMS object have also been updated accordingly.

Author(s)

Lars Snipen.

See Also

corrDist, conditionValue.


larssnip/microRMS documentation built on July 19, 2023, 1:06 a.m.