clusterKmers: Cluster k-mers


Description

Takes a set of k-mer sequences and returns a list partitioning the input k-mers into clusters of similar k-mers. Hierarchical clustering (average linkage) is performed using the Jaccard distance, treating each k-mer as the set of all sub-k-mers (tetramers by default; see the k argument) that occur as substrings within it.
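The distance and clustering steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the sarks implementation: the helper names subKmers and jaccardDist are hypothetical, and the package's internals may differ.

```r
# Hypothetical helper (not part of sarks): the set of all length-k
# substrings of a single string.
subKmers <- function(s, k = 4) {
    n <- nchar(s)
    if (n < k) return(character(0))
    starts <- seq_len(n - k + 1)
    unique(substring(s, starts, starts + k - 1))
}

# Hypothetical helper: Jaccard distance between two sets, i.e.
# 1 - (size of intersection) / (size of union).
jaccardDist <- function(a, b) {
    u <- length(union(a, b))
    if (u == 0) return(0)
    1 - length(intersect(a, b)) / u
}

kmers <- c('CAGCCTGG', 'CAGCCTG', 'CACCTGC', 'CACCTG')
sets <- lapply(kmers, subKmers, k = 4)

# Pairwise Jaccard distance matrix over the tetramer sets.
d <- outer(seq_along(sets), seq_along(sets),
           Vectorize(function(i, j) jaccardDist(sets[[i]], sets[[j]])))
dimnames(d) <- list(kmers, kmers)

# Average-linkage hierarchical clustering, cut into two clusters.
tree <- hclust(as.dist(d), method = 'average')
clusters <- split(kmers, cutree(tree, k = 2))
```

Here 'CAGCCTGG' and 'CAGCCTG' share four of five distinct tetramers (distance 0.2), so they fall in one cluster, while 'CACCTGC' and 'CACCTG' form the other.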

Usage

clusterKmers(kmers, k = 4, nClusters = NULL, maxClusters = NULL,
    directional = TRUE)

Arguments

kmers

character vector or XStringSet of k-mers to partition into clusters

k

length of sub-k-mers (default k=4 to use tetramers) with which to calculate Jaccard distances for clustering

nClusters

number of clusters to partition kmers into; if set to NULL (default value), selects number of clusters to maximize the average silhouette score (https://en.wikipedia.org/wiki/Silhouette_(clustering)).

maxClusters

if nClusters is not specified, optionally sets the maximum number of clusters considered during silhouette score optimization.
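Choosing the number of clusters by average silhouette score, bounded by maxClusters, might look roughly like the base R sketch below. The helper meanSilhouette and the toy data are assumptions made for illustration; how sarks computes or breaks ties in the silhouette score is not stated here. The silhouette function in the cluster package is an established alternative to the hand-written helper.

```r
# Hypothetical helper: mean silhouette width for a labelling, given a
# full distance matrix. Singleton clusters get width 0 by convention.
meanSilhouette <- function(dmat, labels) {
    widths <- vapply(seq_along(labels), function(i) {
        own <- labels == labels[i]
        if (sum(own) == 1) return(0)
        # Mean distance to the other members of the same cluster
        # (dmat[i, i] is 0, so divide by cluster size minus one).
        a <- sum(dmat[i, own]) / (sum(own) - 1)
        # Smallest mean distance to any other cluster.
        others <- setdiff(unique(labels), labels[i])
        b <- min(vapply(others, function(cl) mean(dmat[i, labels == cl]),
                        numeric(1)))
        (b - a) / max(a, b)
    }, numeric(1))
    mean(widths)
}

# Toy one-dimensional data with three well-separated groups.
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2, 10, 10.1)
dmat <- as.matrix(dist(x))
tree <- hclust(as.dist(dmat), method = 'average')

# Evaluate every cluster count from 2 up to the chosen maximum.
maxClusters <- 5
candidates <- 2:maxClusters
scores <- vapply(candidates,
                 function(k) meanSilhouette(dmat, cutree(tree, k = k)),
                 numeric(1))
bestK <- candidates[which.max(scores)]
```

With these toy data the score peaks at three clusters, matching the three visible groups.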

directional

logical value: if FALSE, treats each k-mer as equivalent to its reverse complement. This is meaningful only for DNA sequences.
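To see why reverse-complement equivalence can change the clustering, the following base R sketch compares tetramer overlap with and without reverse-complementing one k-mer. The helpers revComp and tetramers are hypothetical and only illustrate the idea; the way sarks applies directional = FALSE internally is not shown here.

```r
# Hypothetical helper: reverse complement of DNA strings (A, C, G, T only).
revComp <- function(s) {
    vapply(strsplit(chartr('ACGT', 'TGCA', s), ''),
           function(ch) paste(rev(ch), collapse = ''),
           character(1))
}

# Hypothetical helper: set of tetramers in a single string of length >= 4.
tetramers <- function(s) {
    starts <- seq_len(nchar(s) - 3)
    unique(substring(s, starts, starts + 3))
}

a <- 'TTCCAGC'
b <- 'CTGGAAG'

revComp(a)
# "GCTGGAA"

# No shared tetramers when strand orientation is respected...
sharedDirectional <- length(intersect(tetramers(a), tetramers(b)))
# 0

# ...but three shared tetramers once a is reverse-complemented.
sharedRevComp <- length(intersect(tetramers(revComp(a)), tetramers(b)))
# 3
```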

Value

list of character vectors (or XStringSet objects as per the class of kmers argument) partitioning kmers into clusters: the character vector at the i-th element of the output list contains the elements from kmers assigned to cluster i.

Examples

kmers <- c(
    'CAGCCTGG', 'CCTGGAA', 'CAGCCTG', 'CCTGGAAC', 'CTGGAACT',
    'ACCTGC', 'CACCTGC', 'TGGCCTG', 'CACCTG', 'TCCAGC',
    'CTGGAAC', 'CACCTGG', 'CTGGTCTA', 'GTCCTG', 'CTGGAAG', 'TTCCAGC'
)
clusterKmers(kmers, directional=FALSE)

sarks documentation built on Nov. 8, 2020, 6:54 p.m.