ClusterByK: Predicted pair trimming using K-means.
In npcooley/SynExtend: Tools for Comparative Genomics

ClusterByK

R Documentation

Predicted pair trimming using K-means.

Description

A relatively simple k-means clustering approach to drop predicted pairs that belong to clusters with a PID centroid below a specified user threshold.

Usage

ClusterByK(SynExtendObject,
           UserConfidence = list("PID" = 0.3),
           ClusterScalar = 4,
           MaxClusters = 15L,
           ColSelect = c("p1featurelength",
                         "p2featurelength",
                         "TotalMatch",
                         "Consensus",
                         "PID",
                         "Score",
                         "Delta_Background"),
           ColNorm = "Score",
           ShowPlot = FALSE,
           Verbose = FALSE)

Arguments

`SynExtendObject`	An object of class `PairSummaries`.
`UserConfidence`	A named list of length 1 whose name identifies a column of the `PairSummaries` object and whose value is the confidence threshold for that column. Every k-means cluster whose center for the selected column is greater than the threshold is retained.
`ClusterScalar`	A numeric value used to scale the choice of how many clusters are used in k-means clustering. A transformed total within-cluster sum of squares value is fit to a right hyperbola, and the half-max of that fit, multiplied by “ClusterScalar”, is used to select the cluster number.
`MaxClusters`	Integer value indicating the largest number of clusters to test in a series of k-means clustering tests.
`ColSelect`	A character vector of column names indicating which columns to use for k-means clustering. When “p1featurelength”, “p2featurelength”, and “TotalMatch” are included together, they are combined into a single value representing the match size as a proportion of the longer of the two sequences.
`ColNorm`	A character vector of column names indicating columns the user would like to unit normalize. By default only set to “Score”.
`ShowPlot`	Logical indicating whether or not to plot the CDFs for the PIDs of all k-means clusters for the determined cluster number.
`Verbose`	Logical indicating whether or not to display a progress bar and print the time difference upon completion.
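
The `ColSelect` description above says the two feature lengths and “TotalMatch” are combined into a match-size proportion. A minimal base R sketch of that idea follows. The exact formula used inside ClusterByK is not shown on this page, so dividing “TotalMatch” by the longer feature length is an assumption, and the toy table and the name `match_prop` are made up for illustration.

```r
# Toy table of candidate pairs; the values are invented for illustration.
pairs <- data.frame(p1featurelength = c(900L, 1200L, 300L),
                    p2featurelength = c(1000L, 600L, 300L),
                    TotalMatch      = c(450L, 300L, 300L))

# Length of the longer feature in each pair.
longer <- pmax(pairs$p1featurelength, pairs$p2featurelength)

# Assumed formula: match size as a proportion of the longer feature.
match_prop <- pairs$TotalMatch / longer
print(match_prop)
```

A value near 1 means the match spans nearly all of the longer feature; small values flag pairs where only a short stretch matches.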

Details

ClusterByK uses a naive k-means routine to select predicted pairs that belong to clusters whose centroids are greater than or equal to the user-specified column-value pair. This means that the confidence is not a per-pair minimum: pairs with values below the user confidence can be retained when the centroid of their cluster passes.

The total within-cluster sum of squares is used to approximate “knee” selection with the “ClusterScalar” value. With a “ClusterScalar” value of 1, the half-max of a right hyperbola fitted to the total within-cluster sum of squares is used to pick the cluster number for evaluation; “ClusterScalar” is multiplied by the half-max to tune cluster number selection.

ClusterByK returns the original object with an appended column and new attributes. The new column “ClusterID” is an integer value indicating which k-means cluster a candidate pair belongs to, while the attribute “Retain” is a named logical vector where the names correspond to ClusterIDs and the logical value indicates whether the cluster center was above the user-supplied column-value pair.

This function is intended to be used at the genome-to-genome comparison level, and not, say, at the level of an all-vs-all comparison of many genomes. It will work well in all-vs-all cases, but it is not optimized for that scale yet.
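
The selection procedure described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the transform of the within-cluster sum of squares, the hyperbola form `y = Bmax * k / (Kd + k)`, the use of `optim()` for the fit, and all variable names and toy data are choices made here for the example.

```r
set.seed(1)

# Made-up candidate pairs: three groups in two already-normalized columns.
n <- 40
toy <- data.frame(
  PID   = c(rnorm(n, 0.2, 0.03), rnorm(n, 0.5, 0.03), rnorm(n, 0.9, 0.03)),
  Score = c(rnorm(n, 0.1, 0.03), rnorm(n, 0.4, 0.03), rnorm(n, 0.8, 0.03))
)

max_clusters   <- 8L   # plays the role of MaxClusters
cluster_scalar <- 1    # plays the role of ClusterScalar
threshold      <- 0.3  # plays the role of UserConfidence = list("PID" = 0.3)

# Total within-cluster sum of squares for k = 1 .. max_clusters.
k <- seq_len(max_clusters)
wss <- vapply(k,
              function(i) kmeans(toy, centers = i, nstart = 10)$tot.withinss,
              numeric(1))

# Assumed transform: fraction of the k = 1 sum of squares removed by k clusters.
y <- (wss[1] - wss) / wss[1]

# Assumed model: y = Bmax * k / (Kd + k), fitted by least squares.
# Kd is the half-max of this curve; exp() keeps it positive during the search.
sse <- function(p) sum((y - p[1] * k / (exp(p[2]) + k))^2)
fit <- optim(c(1, 0), sse)
half_max <- exp(fit$par[2])

# Scale the half-max, round up, and clamp to the tested range.
k_pick <- as.integer(min(max(ceiling(cluster_scalar * half_max), 1), max_clusters))

# Cluster with the chosen k and flag clusters whose PID center passes.
final  <- kmeans(toy, centers = k_pick, nstart = 10)
retain <- setNames(final$centers[, "PID"] >= threshold, seq_len(k_pick))

# Keep every pair that sits in a retained cluster.
kept <- toy[unname(retain[as.character(final$cluster)]), ]
```

Because whole clusters are kept or dropped, rows in `kept` can have a PID below `threshold` when their cluster's center is above it, which matches the behavior described in the text.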

Value

An object of class PairSummaries.

Author(s)

Nicholas Cooley npc19@pitt.edu

Examples

data("Endosymbionts_Pairs01", package = "SynExtend")

Pairs02 <- ClusterByK(SynExtendObject = Endosymbionts_Pairs01)
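
As a follow-up, the “ClusterID” column and “Retain” attribute described in Details can be used to trim the result. The sketch below runs on a mock data frame so that it does not depend on the package being installed; `keep_retained` and `mock` are hypothetical names, and the sketch assumes that the names of “Retain” match the ClusterID values as character strings.

```r
# Keep only rows whose cluster is flagged TRUE in the "Retain" attribute.
keep_retained <- function(x) {
  retain <- attr(x, "Retain")
  x[unname(retain[as.character(x$ClusterID)]), , drop = FALSE]
}

# Mock stand-in for a ClusterByK result; values are invented for illustration.
mock <- data.frame(PID       = c(0.15, 0.22, 0.65, 0.71, 0.93),
                   ClusterID = c(1L, 1L, 2L, 2L, 3L))
attr(mock, "Retain") <- c("1" = FALSE, "2" = TRUE, "3" = TRUE)

trimmed <- keep_retained(mock)
print(trimmed)
```

Applying the same helper to `Pairs02` would additionally assume that `PairSummaries` objects support data frame style row subsetting, which this page does not confirm.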

npcooley/SynExtend documentation built on June 8, 2025, 5:24 a.m.
