mamatch: Extension to stringdist::amatch, returning multiple matches


Description

When searching for matches in large sets, multiple results are often possible or even likely. Especially when combining multiple criteria, it can be useful to start with a broad search.
An example is matching a list of people to another list of people: a first approach would be to match family names, later extended to include first names, place of origin, etc.
Or, if you are unsure which method to use, you can experiment with one first and then use others to narrow the results, without having to check your entire dataset again. This function therefore returns the most likely matches: the maxmatch lowest-distance matches, at most maxDist away.
For ties, the first matches in table are returned.
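
The selection rule described above can be sketched in base R for a single element of x. This is only an illustration of "the maxmatch lowest distances within maxDist, ties in table order"; top_matches is a hypothetical helper, not part of EmilMisc, and the distances are made up rather than computed by stringdist.

```r
# Hypothetical helper, not part of EmilMisc: pick up to `maxmatch` indices
# with the smallest distances, dropping anything further than `maxDist`.
top_matches <- function(d, maxmatch = 10, maxDist = 0.1) {
  ord <- order(d)                                  # ties keep their original (table) order
  ord <- ord[!is.na(d[ord]) & d[ord] <= maxDist]   # drop NA and too-distant entries
  head(ord, maxmatch)
}

# Made-up distances from one element of x to six elements of table
d <- c(0.30, 0.05, 0.05, 0.80, 0.00, 0.10)
top_matches(d, maxmatch = 3, maxDist = 0.1)
# returns 5 2 3: index 5 is closest, then the tied indices 2 and 3 in table order
```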

Usage

mamatch(x, table, nomatch = NA, matchNA = TRUE, method = c("osa",
  "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
  "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0.1, q = 1, p = 0, bt = 0,
  nthread = getOption("sd_num_thread"), maxmatch = 10, limitMem = 0,
  returnAs = c("matrix", "list"), dupls = TRUE)

Arguments

x, table, matchNA, method, useBytes, weight, maxDist, q, p, bt, nthread

See amatch

nomatch

See amatch. For returnAs=='list', it can also be NULL.

maxmatch

Maximum number of matches to return.
Capped at length(table)

limitMem

Limit memory usage. For large x and table, a lot of memory is needed for the matrix of distances. (Internally, this function calls stringdistmatrix, which means a matrix of length(x) * length(table) * 8 bytes.)
Set this parameter to process the input in chunks, using matrices of approximately limitMem bytes. 0 (default) means unlimited; any other value is a size in bytes.
Note that this is not all the needed memory, but for large x and table, the size of the matrix is the dominant factor.
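
To get a feel for the numbers, the size formula above can be worked through directly. The input sizes and the chunking rule below are assumptions for illustration only; the source does not state how mamatch divides the work internally.

```r
# Illustrative sizes, not taken from the package
n_x <- 50000
n_table <- 200000

# Full distance matrix: one 8-byte double per (x, table) pair
full_bytes <- n_x * n_table * 8
full_bytes / 1e9   # 80 (GB)

# Assumed chunking rule: as many elements of x per chunk as fit in limitMem bytes
limitMem <- 1e9
rows_per_chunk <- max(1, floor(limitMem / (n_table * 8)))
chunks <- split(seq_len(n_x), ceiling(seq_len(n_x) / rows_per_chunk))
length(chunks)     # 80 chunks of 625 elements each
```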

returnAs

Comparable to simplify in sapply: should the result be returned as a list or a matrix?
"matrix" fills any non-matches with nomatch, so it always returns a matrix (even when sapply wouldn't).
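
The relation between the two return shapes can be sketched as follows. The list res is made up to resemble what returnAs="list" might give; the padding step is an assumption about how such a list maps onto the matrix form, not the package's actual code.

```r
# Made-up per-element results; the second element had no match
res <- list(c(3L, 7L), NA_integer_, c(1L, 4L, 9L))
maxmatch <- 4
nomatch <- NA_integer_

# Pad every element to maxmatch entries; vapply binds them as columns
padded <- vapply(
  res,
  function(idx) c(idx, rep(nomatch, maxmatch - length(idx))),
  integer(maxmatch)
)
dim(padded)   # 4 3: maxmatch rows, one column per element of x
```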

dupls

Are there possibly duplicates present? Decides what kind of algorithm is used.
If TRUE, only the distances between the unique values are calculated, but more overhead is needed to map the results back to the original values.
If FALSE, all distances are calculated, which takes longer if duplicates are present, but there is less overhead.

Note that the results may differ in detail in case of ties. To be precise: for dupls==TRUE, indices that duplicate an earlier match are returned before indices of differing values, e.g.:
mamatch('abcde', c('abcd1','abcd2','abcd3','xxx','xxx','abcd1'), method='jw', p=.1)[,1]
returns c(1,2,3,6,NA,NA) for dupls==FALSE, and c(1,2,6,3,NA,NA) for dupls==TRUE, as all four matches are tied.
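
The general idea behind dupls=TRUE, computing on unique values and mapping back, can be sketched with base R's unique and match. This shows the technique only and assumes nothing about the internals of mamatch; nchar stands in for the expensive distance computation.

```r
x <- c("smith", "jones", "smith", "smith", "brown")

ux <- unique(x)        # the expensive step only needs these 3 values
map <- match(x, ux)    # for each original value, its position in ux

# Stand-in for an expensive per-value computation
res_unique <- nchar(ux)

# Expand back to the original length and order
res_full <- res_unique[map]
identical(res_full, nchar(x))   # TRUE
```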

Value

For returnAs=="list", a list of length(x), with elements of length between length(nomatch) and maxmatch, with indices of closest matches in table.

For returnAs=="matrix", a matrix with length(x) columns and maxmatch rows (even if no element has that many matches). Non-matches are filled in with nomatch.

In both cases, for ties the first match gets priority.

See Also

amatch

Examples

set.seed(1)
x <- replicate(paste(letters[ceiling(runif(n = 20)*26)], collapse=''), n = 50)
table <- replicate(paste(letters[ceiling(runif(n = 20)*26)], collapse=''), n = 200)
normal_amatch <- stringdist::amatch(x, table, method='jw', p=.1, maxDist=.5)
multi_match <- mamatch(x, table, method='jw', p=.1, maxDist = .5, maxmatch=10, returnAs='matrix')
print(identical(normal_amatch, multi_match[1,]))
# What do the closest matches for number 1 look like?
print(x[1])
print(table[multi_match[,1]])

EmilBode/EmilMisc documentation built on Feb. 24, 2020, 4:11 p.m.