genomeClustering | R Documentation |
Alters an RMS object by clustering genomes that are too similar to distinguish.
genomeClustering(rms.obj, max.corr = 0.8, verbose = TRUE)
rms.obj |
A |
max.corr |
Maximum correlation between genomes, see Details. |
verbose |
Logical, turning on/off screen report on progress during clustering. |
This function clusters genomes based on how similar they are in
RMS fragment content. If two genomes are very similar with respect to RMS
fragment content, their corresponding columns in rms.obj$Cpn.mat are
highly correlated. If genomes are too correlated, it is impossible to estimate
their abundances separately, so such genomes must be treated as a cluster, and
only the abundance of this cluster is estimated.
Genomes are first represented as a graph, where two genomes are connected
with an edge if they share at least 1 RMS fragment. For many highly unrelated
genomes this step will result in a disconnected graph, i.e. groups of genomes
not sharing any fragments between them. These graph components
are the first grouping of the genomes. Genomes sharing no fragments will
always end up in different clusters anyway. This step saves a lot of memory
when the RMS object contains many unrelated genomes, since all these
computations can be done on a sparse Matrix.
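The graph step above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the toy matrix cpn, the genome names and the helper find_components are hypothetical, and a dense base matrix is used here where the text describes a sparse Matrix.

```r
# Toy copy-number matrix (hypothetical data): rows are RMS fragments,
# columns are genomes, entries are fragment copy numbers.
cpn <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 0,
                0, 1, 0, 0,
                0, 0, 2, 1,
                0, 0, 0, 1),
              ncol = 4, byrow = TRUE,
              dimnames = list(NULL, c("G1", "G2", "G3", "G4")))

# Two genomes are connected if they share at least one fragment.
present <- (cpn > 0) * 1          # 0/1 presence matrix
adj <- crossprod(present) > 0     # genome-by-genome adjacency

# Hypothetical helper: label connected components by breadth-first search.
find_components <- function(adj) {
  n <- nrow(adj)
  comp <- rep(NA_integer_, n)
  current <- 0L
  for (start in seq_len(n)) {
    if (!is.na(comp[start])) next  # genome already assigned to a component
    current <- current + 1L
    comp[start] <- current
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      # Unvisited neighbours of the current genome join its component.
      nb <- which(adj[node, ] & is.na(comp))
      comp[nb] <- current
      queue <- c(queue, nb)
    }
  }
  comp
}

components <- find_components(adj)
names(components) <- colnames(cpn)
print(components)
```

In this toy example G1 and G2 share a fragment, as do G3 and G4, while no fragment is shared across the two pairs, so the graph has two components.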
Next, clustering within each graph component is done by hierarchical
clustering with complete linkage. The distance metric is 1 minus correlation,
i.e. genomes with a large correlation (close to 1.0) have a small distance
(close to 0.0). The max.corr argument indicates where to cut the
dendrogram tree to group the genomes. With max.corr = 0.8 we cut the
dendrogram at distance 0.2. Note that the distance matrix cannot be
a sparse Matrix, and if too many genomes in the rms.obj are
in the same graph component, you may run into memory problems.
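A minimal sketch of this clustering step, using only base R. It assumes that plain Pearson correlation from stats::cor is an acceptable stand-in for the package's corrDist function, whose exact computation is not shown here; the toy matrix and genome names are hypothetical.

```r
# Toy copy-number matrix (hypothetical data) for three genomes in one
# graph component. G1 and G2 differ in a single fragment only.
cpn <- cbind(G1 = c(3, 2, 1, 1, 1, 0, 0, 0, 0, 0),
             G2 = c(3, 2, 1, 1, 1, 1, 0, 0, 0, 0),
             G3 = c(0, 0, 1, 0, 0, 0, 1, 1, 2, 1))

max.corr <- 0.8

# Distance is 1 minus correlation between genome columns.
d <- as.dist(1 - cor(cpn))

# Hierarchical clustering with complete linkage.
tree <- hclust(d, method = "complete")

# Cut the dendrogram at distance 1 - max.corr (0.2 for max.corr = 0.8).
clusters <- cutree(tree, h = 1 - max.corr)
print(clusters)
```

Here G1 and G2 have a correlation well above 0.8 and fall into the same cluster, while G3 is negatively correlated with both and remains a cluster of its own.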
More technical details: The problem of too similar genomes will be reflected in
a close to singular covariance matrix when de-convolving the genome
abundances. Clustering by correlation distance is, in theory, no guarantee
against this. Two (or more) fairly uncorrelated genomes may still combine
into something very correlated with a third genome. However, in reality this
rarely happens with RMS data. Use the conditionValue function on the
resulting rms.obj$Cpn.mat to check whether the clustering brought the
condition value down to a tolerable size (e.g. around 1e+3 to 1e+4 or less).
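The theoretical caveat above can be illustrated with a small constructed example. This is a sketch under assumptions: the toy matrix is hypothetical, and the ratio of the largest to the smallest singular value is used as a standard condition number, which may or may not be exactly what conditionValue computes.

```r
# Toy copy-number matrix (hypothetical data) where G3 equals G1 + G2.
cpn <- cbind(G1 = c(1, 1, 0, 0, 0, 0),
             G2 = c(0, 0, 1, 1, 0, 0),
             G3 = c(1, 1, 1, 1, 0, 0))

# All pairwise correlations are modest (absolute value 0.5), so a cut at
# max.corr = 0.8 would keep the three genomes in separate clusters.
r <- cor(cpn)
max.abs.corr <- max(abs(r[upper.tri(r)]))

# Hypothetical helper: condition number as the singular value ratio.
cond_number <- function(x) {
  sv <- svd(x)$d
  max(sv) / min(sv)
}

# The full matrix is numerically singular: huge (or infinite) condition number.
cond.all <- cond_number(cpn)

# Without G3 the remaining columns are well conditioned.
cond.two <- cond_number(cpn[, c("G1", "G2")])

print(max.abs.corr)
print(cond.all)
print(cond.two)
```

Since no pairwise correlation exceeds 0.8 while the matrix is still rank deficient, checking the condition value after clustering is a useful safeguard.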
An updated RMS object, where Genome.tbl has an additional
column named members_genome_id. This Genome.tbl typically has
fewer rows than the original, one for each cluster, and this new column
indicates which of the original genomes are grouped into each cluster. Each
cluster is represented by one of the original genomes (the cluster medoid).
The other objects inside the RMS object have also been updated accordingly.
Lars Snipen.
corrDist, conditionValue.