# seq_dist: Compute distance metrics between integer sequences

In stringdist: Approximate String Matching, Fuzzy Text Search, and String Distance Functions

 seq_dist R Documentation

## Compute distance metrics between integer sequences

### Description

seq_dist computes pairwise distances between the integer sequences in a and b, recycling the argument with fewer elements. seq_distmatrix computes the distance matrix with rows corresponding to a and columns corresponding to b.

### Usage

seq_dist(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread")
)

seq_distmatrix(
a,
b,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw"),
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
useNames = c("names", "none"),
nthread = getOption("sd_num_thread")
)

### Arguments

| Argument | Description |
|---|---|
| a | (list of) integer or numeric vector(s). Will be converted with as.integer (target). |
| b | (list of) integer or numeric vector(s). Will be converted with as.integer (source). Optional for seq_distmatrix. |
| method | Distance metric. See stringdist-metrics. |
| weight | For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', or 'lcs'. |
| q | Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'. |
| p | Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro distance is returned. Applies only to method='jw'. |
| bt | Winkler's boost threshold. Winkler's prefix factor is only applied when the Jaro distance is larger than bt. Applies only to method='jw' and p>0. |
| nthread | Maximum number of threads to use. By default, a sensible number of threads is chosen, see stringdist-parallelization. |
| useNames | Label the output matrix with names(a) and names(b)? |
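As a rough illustration of how method and weight interact, the sketch below compares two sequences that differ by a single swap of adjacent elements. It assumes the stringdist package is installed; the sequences and the transposition weight of 0.5 are made up for this example.

```r
library(stringdist)

# Two sequences that differ by one swap of adjacent elements (2 and 3).
a <- list(c(1L, 2L, 3L, 4L))
b <- list(c(1L, 3L, 2L, 4L))

d_osa     <- seq_dist(a, b, method = "osa")      # transposition allowed: 1
d_lv      <- seq_dist(a, b, method = "lv")       # no transposition, two substitutions: 2
d_hamming <- seq_dist(a, b, method = "hamming")  # two positions differ: 2

# Hypothetical weights: make a transposition half as costly as the other edits.
d_cheap_t <- seq_dist(a, b, method = "osa",
                      weight = c(d = 1, i = 1, s = 1, t = 0.5))

print(c(osa = d_osa, lv = d_lv, hamming = d_hamming, osa_cheap_t = d_cheap_t))
```

The point of the comparison is that the same pair of sequences can be one edit or two edits apart depending on whether the chosen metric recognizes transpositions.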

### Value

seq_dist returns a numeric vector with pairwise distances between a and b of length max(length(a),length(b)).

For seq_distmatrix there are two options. If b is missing, the dist object corresponding to the length(a) X length(a) distance matrix is returned. If b is specified, the length(a) X length(b) distance matrix is returned.

If any element of a or b is NA_integer_, the distance to any integer vector it is paired with is NA. Missing values inside a sequence are treated as ordinary numbers and get no special handling (see also the examples).
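The return shapes described above can be checked with a small sketch. It assumes the stringdist package is installed; the input sequences are arbitrary values chosen for illustration.

```r
library(stringdist)

a <- list(c(1L, 2L, 3L))
b <- list(c(1L, 2L, 3L), c(1L, 3L), c(3L, 2L, 1L))

# a has one element and b has three, so a is recycled over b.
d <- seq_dist(a, b)
length(d)                 # max(length(a), length(b)) = 3

# With b missing, a 'dist' object over the elements of the first argument.
m <- seq_distmatrix(b)
class(m)                  # "dist"

# With both arguments, a length(a) x length(b) matrix.
m2 <- seq_distmatrix(a, b)
dim(m2)                   # 1 3
```

If a full square matrix is more convenient than the dist object, base R's as.matrix(m) converts it.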

### Notes

Input vectors are converted with as.integer. This causes truncation for numeric vectors (e.g. pi will be treated as 3L).
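The effect of this conversion can be seen with base R alone. The sketch below uses made-up values and calls only as.integer, so it shows what the inputs look like after conversion rather than the package internals.

```r
# as.integer() drops the fractional part (truncation toward zero, not rounding).
x <- c(pi, 2.9, -2.9)
x_int <- as.integer(x)
print(x_int)              # 3 2 -2

# Consequence: numeric vectors that differ only after the decimal point
# collapse to the same integer sequence.
same <- identical(as.integer(c(1.2, 5.7)), as.integer(c(1.9, 5.1)))
print(same)               # TRUE
```

Rounding explicitly with round() before passing numeric input is one way to avoid surprises, if rounding rather than truncation is what is intended.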

### See Also

seq_sim, seq_amatch, seq_qgrams

### Examples

# Distances between lists of integer vectors. Note the postfix 'L' to force
# integer storage. The shorter argument (a) is recycled.
a <- list(c(102L, 107L))                        # fk
b <- list(c(102L,111L,111L),c(102L,111L,111L))  # foo, foo
seq_dist(a,b)

# translate strings to a list of integer sequences
a <- lapply(c("foo","bar","baz"),utf8ToInt)
seq_distmatrix(a)

# Note how missing values are treated. NA's as part of the sequence are treated
# as an integer (the representation of NA_integer_).
a <- list(NA_integer_,c(102L, 107L))
b <- list(c(102L,111L,111L),c(102L,111L,NA_integer_))
seq_dist(a,b)

## Not run:
# Distance between sentences based on word order. Note: words must match exactly or they
# are treated as completely different.
#
# For this example you need to have the 'hashr' package installed.
a <- "Mary had a little lamb"
a.words <- strsplit(a,"[[:blank:]]+")
a.int <- hashr::hash(a.words)
b <- c("a little lamb had Mary",