uniqtag: Abbreviate strings to short, unique identifiers.

View source: R/uniqtag.R

uniqtag    R Documentation

Abbreviate strings to short, unique identifiers.

Description

Abbreviate strings to unique substrings of k characters.

Usage

uniqtag(xs, k = 9, uniq = make_unique_all_or_none, sep = "-")

Arguments

xs

a character vector

k

the size of the identifier, an integer

uniq

a function to make the abbreviations unique, such as make_unique, make_unique_duplicates, make_unique_all_or_none, make_unique_all, or make.unique; to disable this step, pass identity or NULL

sep

a character string used to separate a duplicate string from its sequence number
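To illustrate what sep controls, the following minimal sketch uses base R's make.unique, one of the functions listed for uniq. The input vector tags is a made-up example. The package's own make_unique* functions may number duplicates differently, so this only shows how the separator appears in the result.

```r
# Hypothetical abbreviations that happen to collide
tags <- c("ana", "ana", "bar")

# Base R appends sep and a sequence number to repeated strings
unique_tags <- make.unique(tags, sep = "-")
print(unique_tags)
```

With base R's make.unique, the first occurrence is left unchanged and later duplicates receive the separator followed by a sequence number.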

Details

For each string in a set of strings, determine a unique tag that is a substring of fixed size k unique to that string, if it has one. If no such unique substring exists, the least frequent substring is used. If multiple unique substrings exist, the lexicographically smallest substring is used. This lexicographically smallest substring of size k is called the UniqTag of that string.
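The selection rule described above can be sketched in base R. This is an illustrative reimplementation, not the package's source: the names kmers_of_sketch and uniqtag_sketch are hypothetical, the handling of strings shorter than k (returned whole) is an assumption, and the step that makes duplicate tags unique (the uniq argument) is left out.

```r
# All distinct substrings of length k in one string (hypothetical helper).
# Assumption: a string no longer than k is returned whole.
kmers_of_sketch <- function(x, k) {
  n <- nchar(x)
  if (n <= k) return(x)
  unique(substring(x, 1:(n - k + 1), k:n))
}

# For each string, pick its least frequent k-mer, breaking ties by
# taking the lexicographically smallest one.
uniqtag_sketch <- function(xs, k = 9) {
  kmers <- lapply(xs, kmers_of_sketch, k = k)
  # Number of strings in which each k-mer occurs
  counts <- table(unlist(kmers))
  vapply(kmers, function(km) {
    freq <- as.vector(counts[km])
    candidates <- km[freq == min(freq)]
    # method = "radix" sorts character vectors in C-locale order
    sort(candidates, method = "radix")[1]
  }, character(1))
}

tags <- uniqtag_sketch(c("abcd", "bcde"), k = 3)
print(tags)
```

In this small input, "bcd" occurs in both strings, so each string is tagged with the 3-mer that only it contains.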

The lexicographically smallest substring depends on the locale's sort order. For results that do not vary between systems, you may wish to first call Sys.setlocale("LC_COLLATE", "C").
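As a small illustration of why the collation order matters, the sketch below sorts a made-up vector in C-locale order using sort with method = "radix", which ignores the current locale for character vectors. The result of a plain sort(x) depends on the platform and locale, so it is printed without a stated expectation.

```r
x <- c("b", "A", "a", "B")

# C-locale ordering: all uppercase letters sort before lowercase ones
c_order <- sort(x, method = "radix")
print(c_order)

# Locale-dependent ordering; may differ from c_order on your system
print(sort(x))
```

Because the tie between equally rare substrings is broken by sort order, two systems with different collation settings could pick different tags for the same input.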

Value

a character vector of the UniqTags of the strings in xs

See Also

abbreviate, locales, make.unique

Examples

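library(uniqtag)
# Use the C collation order so tag selection does not depend on the locale
Sys.setlocale("LC_COLLATE", "C")
# State names with the space removed, e.g. "NewYork"
states <- sub(" ", "", state.name)
# Tags with the default k = 9, then with shorter k
uniqtags <- uniqtag(states)
uniqtags4 <- uniqtag(states, k = 4)
uniqtags3 <- uniqtag(states, k = 3)
# Same k, but a different function for making the tags unique
uniqtags3x <- uniqtag(states, k = 3, uniq = make_unique)
# Compare the length distributions of the names and of their tags
table(nchar(states))
table(nchar(uniqtags))
table(nchar(uniqtags4))
table(nchar(uniqtags3))
table(nchar(uniqtags3x))
# Tags from uniqtags3 at the positions where the make_unique result contains the separator
uniqtags3[grep("-", uniqtags3x)]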

uniqtag documentation built on June 10, 2022, 9:06 a.m.