| build_clusters | R Documentation |
Clusters wide-format sequences using pairwise string dissimilarity and either PAM (Partitioning Around Medoids) or hierarchical clustering. Supports 9 distance metrics, including a temporally weighted variant of Hamming distance. When the stringdist package is available, distances are computed at C level, giving a 100-1000x speedup on edit distances.
build_clusters(
  data,
  k,
  dissimilarity = "hamming",
  method = "pam",
  na_syms = c("*", "%"),
  weighted = FALSE,
  lambda = 1,
  seed = NULL,
  q = 2L,
  p = 0.1,
  covariates = NULL,
  ...
)
data |
Input data. Accepts multiple formats:
|
k |
Integer. Number of clusters (must be between 2 and
|
dissimilarity |
Character. Distance metric. One of |
method |
Character. Clustering method. |
na_syms |
Character vector. Symbols treated as missing values.
Default: |
weighted |
Logical. Apply exponential decay weighting to Hamming
distance positions? Only valid when |
lambda |
Numeric. Decay rate for weighted Hamming. Higher values weight earlier positions more strongly. Default: 1. |
seed |
Integer or NULL. Random seed for reproducibility. Default:
|
q |
Integer. Size of q-grams for |
p |
Numeric. Winkler prefix penalty for Jaro-Winkler distance
(clamped to 0–0.25). Default: |
covariates |
Optional. Post-hoc covariate analysis of cluster membership via multinomial logistic regression. Accepts:
Covariates are looked up in |
... |
Additional arguments (currently unused). |
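The weighted and lambda arguments describe an exponential decay over sequence positions. A minimal sketch of that idea follows. The helper weighted_hamming() is hypothetical, and the weight exp(-lambda * (i - 1)) is an assumed form of the decay; the package's actual formula may differ.

```r
# Hypothetical helper, not part of the package: position-weighted Hamming
# distance between two equal-length sequences. The weight
# exp(-lambda * (i - 1)) is an assumed form of "exponential decay".
weighted_hamming <- function(x, y, lambda = 1) {
  stopifnot(length(x) == length(y))
  w <- exp(-lambda * (seq_along(x) - 1))  # weight 1 at position 1, smaller later
  sum(w * (x != y))
}

a <- c("A", "B", "C", "A")
b <- c("B", "B", "C", "A")  # differs from a at position 1 only
d <- c("A", "B", "C", "B")  # differs from a at position 4 only

early <- weighted_hamming(a, b, lambda = 1)  # mismatch at the first position
late  <- weighted_hamming(a, d, lambda = 1)  # mismatch at the last position
plain <- weighted_hamming(a, d, lambda = 0)  # all weights equal 1: plain Hamming
```

Under this assumed weighting, a mismatch at an early position contributes more to the distance than one at a late position, and a larger lambda makes the gap wider.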
An object of class "net_clustering" containing:
The original input data.
Number of clusters.
Named integer vector of cluster assignments.
Overall average silhouette width.
Named integer vector of cluster sizes.
Clustering method used.
Distance metric used.
The computed dissimilarity matrix (dist object).
Integer vector of medoid row indices (PAM only; NULL for hierarchical methods).
Seed used (or NULL).
Logical, whether weighted Hamming was used.
Lambda value used (0 if not weighted).
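To make the returned components more concrete, the sketch below computes comparable quantities by hand with the cluster package: assignments, sizes, average silhouette width, and medoid indices. It illustrates the general PAM workflow on a Hamming distance matrix and is not the package's own implementation; the variable names are chosen for illustration only.

```r
library(cluster)  # recommended package; provides pam()

# Toy wide-format sequences: rows are sequences, columns are positions.
seqs <- data.frame(
  V1 = c("A", "A", "A", "C", "C", "C"),
  V2 = c("B", "B", "B", "A", "A", "A"),
  V3 = c("C", "C", "A", "B", "B", "C"),
  stringsAsFactors = FALSE
)
m <- as.matrix(seqs)
n <- nrow(m)

# Pairwise unweighted Hamming distance: number of positions that differ.
d <- matrix(0, n, n, dimnames = list(rownames(seqs), rownames(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(m[i, ] != m[j, ])
  }
}
d <- as.dist(d)

fit <- pam(d, k = 2, diss = TRUE)  # Partitioning Around Medoids
assignments <- fit$clustering       # named integer vector, one entry per row
sizes <- table(assignments)         # number of sequences in each cluster
avg_sil <- fit$silinfo$avg.width    # overall average silhouette width
medoids <- fit$id.med               # row indices of the medoid sequences
```

For a hierarchical method there are no medoids, which matches the NULL noted above for that component.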
# Small example: five sequences of length three, default settings
seqs <- data.frame(
  V1 = c("A", "B", "C", "A", "B"),
  V2 = c("B", "C", "A", "B", "A"),
  V3 = c("C", "A", "B", "C", "B")
)
cl <- build_clusters(seqs, k = 2)
cl
# Larger example: twenty random sequences of length four
seqs <- data.frame(
  V1 = sample(LETTERS[1:3], 20, TRUE),
  V2 = sample(LETTERS[1:3], 20, TRUE),
  V3 = sample(LETTERS[1:3], 20, TRUE),
  V4 = sample(LETTERS[1:3], 20, TRUE)
)
cl <- build_clusters(seqs, k = 2)
print(cl)
summary(cl)
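The covariates argument describes a post-hoc multinomial logistic regression of cluster membership. The sketch below shows what such an analysis can look like with nnet::multinom() on simulated data. The data frame, the column names, and the choice of nnet are assumptions made for illustration; the package may fit and report the model differently.

```r
library(nnet)  # recommended package; provides multinom()

set.seed(1)
# Hypothetical cluster assignments and covariates for 60 sequences.
membership <- data.frame(
  cluster = factor(rep(1:3, each = 20)),
  age     = c(rnorm(20, 25, 8), rnorm(20, 35, 8), rnorm(20, 45, 8)),
  group   = factor(sample(c("a", "b"), 60, replace = TRUE))
)

# Multinomial logistic regression of cluster membership on the covariates.
# Cluster 1 is the reference level; coefficients are log-odds relative to it.
fit <- multinom(cluster ~ age + group, data = membership, trace = FALSE)
coefs <- coef(fit)         # one row per non-reference cluster
odds_ratios <- exp(coefs)  # odds ratios relative to cluster 1
```

Each row of coefs compares one cluster against the reference cluster, so a positive age coefficient in row "3" means that higher age raises the odds of belonging to cluster 3 rather than cluster 1.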