seq_dist: Compute distance metrics between integer sequences
In stringdist: Approximate String Matching, Fuzzy Text Search, and String Distance Functions

seq_dist

R Documentation

Description

seq_dist computes pairwise distances between the integer sequences in a and b, where the argument with fewer elements is recycled. seq_distmatrix computes the distance matrix, with rows corresponding to a and columns corresponding to b.

Usage

seq_dist(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  nthread = getOption("sd_num_thread")
)

seq_distmatrix(
  a,
  b,
  method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
  weight = c(d = 1, i = 1, s = 1, t = 1),
  q = 1,
  p = 0,
  bt = 0,
  useNames = c("names", "none"),
  nthread = getOption("sd_num_thread")
)

Arguments

`a`	(`list` of) `integer` or `numeric` vector(s). Will be converted with `as.integer` (target)
`b`	(`list` of) `integer` or `numeric` vector(s). Will be converted with `as.integer` (source). Optional for `seq_distmatrix`.
`method`	Distance metric. See `stringdist-metrics`.
`weight`	For `method='osa'` or `'dl'`, the penalty for deletion, insertion, substitution and transposition, in that order. When `method='lv'`, the penalty for transposition is ignored. When `method='jw'`, the weights associated with elements of `a`, elements of `b` and the transposition weight, in that order. Weights must be positive and not exceed 1. `weight` is ignored completely when `method='hamming'`, `'qgram'`, `'cosine'`, `'jaccard'`, or `'lcs'`.
`q`	Size of the `q`-gram; must be nonnegative. Only applies to `method='qgram'`, `'jaccard'` or `'cosine'`.
`p`	Prefix factor for Jaro-Winkler distance. The valid range for `p` is `0 <= p <= 0.25`. If `p=0` (default), the Jaro distance is returned. Applies only to `method='jw'`.
`bt`	Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than `bt`. Applies only to `method='jw'` and `p>0`.
`nthread`	Maximum number of threads to use. By default, a sensible number of threads is chosen, see `stringdist-parallelization`.
`useNames`	Label the output matrix with `names(a)` and `names(b)`?
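
As a rough illustration of how these arguments are passed, the sketch below calls seq_dist on the same pair of sequences with several methods. It assumes the stringdist package is installed; the input sequences and the variable names (`d_osa`, `d_w`, `d_q`, `d_jw`) are made up for this example.

```r
library(stringdist)

# Two integer sequences (illustrative data: the code points of "foo" and "fu")
a <- list(utf8ToInt("foo"))
b <- list(utf8ToInt("fu"))

# Default method ("osa") with unit weights
d_osa <- seq_dist(a, b)

# Same method with a cheaper substitution; the weight order is
# deletion, insertion, substitution, transposition
d_w <- seq_dist(a, b, method = "osa", weight = c(d = 1, i = 1, s = 0.5, t = 1))

# q-gram based method; q sets the size of the q-grams
d_q <- seq_dist(a, b, method = "qgram", q = 2)

# Jaro-Winkler: p > 0 switches on the prefix factor (0 <= p <= 0.25)
d_jw <- seq_dist(a, b, method = "jw", p = 0.1)
```

Arguments that do not apply to the chosen method (for example `q` with `method='osa'`) are simply not used.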

Value

seq_dist returns a numeric vector of pairwise distances between a and b, of length max(length(a), length(b)).

For seq_distmatrix there are two options. If b is missing, the dist object corresponding to the length(a) X length(a) distance matrix is returned. If b is specified, the length(a) X length(b) distance matrix is returned.

If any element of a or b is NA_integer_, the distance to any matched integer vector is NA. Missing values inside a sequence are treated as an ordinary number and receive no special handling (see also the examples).
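
The return shapes described above can be checked with a short sketch. It assumes the stringdist package is installed; the lists and their names are hypothetical data chosen for illustration.

```r
library(stringdist)

# Named lists of integer sequences (illustrative data)
a <- list(x = c(1L, 2L, 3L), y = c(1L, 3L), z = c(2L, 2L, 3L))
b <- list(u = c(1L, 2L), v = c(3L, 2L, 1L))

# b omitted: a 'dist' object for the length(a) x length(a) matrix
d <- seq_distmatrix(a)
class(d)

# b given: a length(a) x length(b) matrix
m <- seq_distmatrix(a, b)
dim(m)

# Pairwise distances: the shorter argument is recycled, so the
# result has length max(length(a), length(b))
v <- seq_dist(a[1], b)
length(v)
```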

Notes

Input vectors are converted with as.integer. This causes truncation for numeric vectors (e.g. pi will be treated as 3L).
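
The truncation can be seen with base R alone. The sketch below uses made-up values; rounding beforehand is only one possible workaround and is shown as an assumption about what a caller might want, not as something the package does.

```r
# Numeric input is truncated toward zero, not rounded
x <- c(pi, 2.9, -1.7)
as.integer(x)         # 3 2 -1

# Rounding explicitly first, if that is the intended behaviour
as.integer(round(x))  # 3 3 -2
```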

Examples

# Distances between lists of integer vectors. Note the postfix 'L' to force
# integer storage. The shorter argument (a) is recycled.
a <- list(c(102L, 117L))                        # "fu"
b <- list(c(102L, 111L, 111L), c(102L, 111L))   # "foo", "fo"
seq_dist(a,b)

# translate strings to a list of integer sequences 
a <- lapply(c("foo","bar","baz"),utf8ToInt)
seq_distmatrix(a)

# Note how missing values are treated. NA's as part of the sequence are treated 
# as an integer (the representation of NA_integer_).
a <- list(NA_integer_,c(102L, 107L))
b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_))  
seq_dist(a,b)

## Not run: 
# Distance between sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
a <- "Mary had a little lamb"
a.words <- strsplit(a,"[[:blank:]]+")
a.int <- hashr::hash(a.words)
b <- c("a little lamb had Mary",
           "had Mary a little lamb")
b.int <- hashr::hash(strsplit(b,"[[:blank:]]+"))
seq_dist(a.int,b.int)

## End(Not run)

stringdist documentation built on April 12, 2025, 2:01 a.m.