| jaccard_string_group | R Documentation |
Performs fuzzy string grouping in which similar strings are assigned to the
same group. Uses the cluster_fast_greedy() community detection algorithm
from the igraph package to create the groups. The igraph package must be
installed in order to use this function.
jaccard_string_group(
string,
n_gram_width = 2,
n_bands = 45,
band_width = 8,
threshold = 0.7,
progress = FALSE,
nthread = NULL
)
string |
a character vector on which you wish to perform entity resolution. |
n_gram_width |
the length of the n-grams used in calculating the
Jaccard similarity. For best performance, I set this large enough that the
chance any string has a specific n_gram is low (i.e. |
n_bands |
the number of bands used in the MinHash algorithm (default
is 45). Use this in conjunction with the |
band_width |
the length of each band used in the MinHash algorithm
(default is 8). Use this in conjunction with the |
threshold |
the Jaccard similarity threshold above which two strings should be considered a match (default is 0.7). The similarity is equal to 1
|
progress |
set to TRUE to report the progress of the algorithm |
nthread |
Maximum number of threads to use. If |
a character vector storing the group of each element of the input. Similar strings are assigned to the same group, and each group is given a standardized name.
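The interaction between n_bands and band_width can be explored with a short calculation. The sketch below assumes the function follows the standard banding scheme of MinHash locality-sensitive hashing, in which two strings are compared only if all hashes in at least one band agree. The helper name lsh_candidate_prob is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): probability that two strings
# with Jaccard similarity `s` agree on every hash in at least one band, under
# the textbook MinHash banding analysis.
lsh_candidate_prob <- function(s, n_bands = 45, band_width = 8) {
  # s^band_width: chance that a single band matches in full.
  # (1 - s^band_width)^n_bands: chance that no band matches.
  1 - (1 - s^band_width)^n_bands
}

# Compare a few similarity levels under the default settings.
similarity <- c(0.3, 0.5, 0.7, 0.9)
data.frame(
  similarity = similarity,
  prob_compared = round(lsh_candidate_prob(similarity), 3)
)
```

Under this assumption, a larger band_width makes each band harder to match and so screens out dissimilar pairs, while a larger n_bands gives similar pairs more chances to be compared.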
if (requireNamespace("igraph", quietly = TRUE)) {
string <- c(
"beniamino", "jack", "benjamin", "beniamin",
"jacky", "giacomo", "gaicomo"
)
jaccard_string_group(
string,
threshold = 0.2,
n_bands = 90,
n_gram_width = 1
)
}
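To see why n_gram_width matters for names like those in the example, it can help to compute an exact n-gram Jaccard similarity by hand. This is a minimal base R sketch that assumes each string is treated as a set of overlapping character n-grams. The helpers char_ngrams and ngram_jaccard are hypothetical illustrations, not the package's internal (hash-based) implementation.

```r
# Hypothetical helper: the set of overlapping character n-grams in one string.
char_ngrams <- function(x, n) {
  if (nchar(x) < n) {
    return(character(0))
  }
  starts <- seq_len(nchar(x) - n + 1)
  unique(substring(x, starts, starts + n - 1))
}

# Hypothetical helper: exact Jaccard similarity between two n-gram sets,
# i.e. size of the intersection divided by size of the union.
ngram_jaccard <- function(a, b, n = 2) {
  ga <- char_ngrams(a, n)
  gb <- char_ngrams(b, n)
  length(intersect(ga, gb)) / length(union(ga, gb))
}

ngram_jaccard("jack", "jacky", n = 1)       # 4 shared letters out of 5: 0.8
ngram_jaccard("giacomo", "gaicomo", n = 1)  # same letters, so 1
ngram_jaccard("giacomo", "gaicomo", n = 2)  # 3 shared bigrams out of 9: 1/3
```

With n = 1, any two strings made of the same letters are indistinguishable, which is one reason a very small n_gram_width can group strings too eagerly.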