Aligns two texts side by side in a data.frame that also reports the change type and the distance.
diff_align(text1 = NULL, text2 = NULL, tokenizer = NULL,
  ignore = NULL, clean = NULL, distance = c("lv", "osa", "dl",
  "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
  useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0, q = 1, p = 0, nthread = getOption("sd_num_thread"),
  verbose = TRUE, ...)
text1 |
first text |
text2 |
second text |
tokenizer |
defaults to NULL, which triggers linewise tokenization; accepts a function that turns a text into a token data frame; a token data frame has at least three columns: from (first character of token), to (last character of token), and token (the token) |
ignore |
defaults to NULL, which means that nothing is ignored; accepts a function that takes a token data frame (see above) and returns a possibly subsetted data frame of the same form |
clean |
defaults to NULL, which means that nothing is cleaned; accepts a function that takes a vector of tokens and returns a vector of the same length, potentially cleaned up |
distance |
defaults to Levenshtein ("lv"); see amatch, stringdist-metrics, stringdist |
useBytes |
Perform byte-wise comparison, see
|
weight |
For |
maxDist |
maximum distance beyond which tokens are no longer matched |
q |
Size of the q-gram; must be nonnegative. Only applies to
|
p |
Penalty factor for Jaro-Winkler distance. The valid range for
|
nthread |
Maximum number of threads to use. By default, a sensible
number of threads is chosen, see |
verbose |
whether the function should report its progress via messages |
... |
further arguments passed through to distance function |
A data.frame with tokens aligned according to distance
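The tokenizer, ignore and clean arguments each accept a user-supplied function. The sketch below shows one possible shape for each, using base R only. The helper names (word_tokenizer, ignore_punctuation, clean_lowercase) are hypothetical and not part of the package, and the sketch assumes the tokenizer receives the text as a character vector; check the package source before relying on that.

```r
# Hypothetical helpers for illustration; none of these ship with the package.

# Tokenizer: split a text into whitespace-delimited words and return a
# token data frame with the columns from, to and token.
word_tokenizer <- function(text) {
  # Assumption: text may have several elements; join them into one string
  text <- paste(text, collapse = "\n")
  hits <- gregexpr("[^[:space:]]+", text)[[1]]
  if (hits[1] == -1) {
    # No tokens found: return an empty token data frame
    return(data.frame(from = integer(0), to = integer(0),
                      token = character(0), stringsAsFactors = FALSE))
  }
  from <- as.integer(hits)
  to   <- from + attr(hits, "match.length") - 1L
  data.frame(from = from, to = to, token = substring(text, from, to),
             stringsAsFactors = FALSE)
}

# Ignore: drop tokens that consist of punctuation only; the result keeps
# the same columns as the input.
ignore_punctuation <- function(tokens) {
  tokens[!grepl("^[[:punct:]]+$", tokens$token), , drop = FALSE]
}

# Clean: lower-case the tokens; returns a vector of the same length.
clean_lowercase <- function(tokens) {
  tolower(tokens)
}

tokens  <- word_tokenizer("The quick fox !")
kept    <- ignore_punctuation(tokens)
cleaned <- clean_lowercase(kept$token)
```

Under the assumptions above, the helpers would be passed along the lines of diff_align(text1, text2, tokenizer = word_tokenizer, ignore = ignore_punctuation, clean = clean_lowercase).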