Aligns two texts side by side as a data.frame that also gives the change type and the distance.
diff_align(text1 = NULL, text2 = NULL, tokenizer = NULL, ignore = NULL,
  clean = NULL, distance = c("lv", "osa", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1,
  i = 1, s = 1, t = 1), maxDist = 0, q = 1, p = 0,
  nthread = getOption("sd_num_thread"), verbose = TRUE, ...)
text1 | first text
text2 | second text
tokenizer | defaults to NULL, which triggers line-wise tokenization; accepts a function that turns a text into a token data frame; a token data frame has at least three columns: from (first character of token), to (last character of token), and token (the token)
ignore | defaults to NULL, which means that nothing is ignored; accepts a function that takes a token data frame (see above) and returns a possibly subsetted data frame of the same form
clean | defaults to NULL, which means that nothing is cleaned; accepts a function that takes a vector of tokens and returns a vector of the same length, potentially cleaned up
distance | defaults to Levenshtein ("lv"); see amatch, stringdist-metrics, stringdist
useBytes | Perform byte-wise comparison, see
weight | For
maxDist | [DEPRECATED AND WILL BE REMOVED|2016] Currently kept for backward compatibility. It does not offer any speed gain. (In fact, it currently slows things down when set to anything different from
q | Size of the q-gram; must be nonnegative. Only applies to
p | Penalty factor for Jaro-Winkler distance. The valid range for
nthread | Maximum number of threads to use. By default, a sensible number of threads is chosen, see
verbose | whether the function should report on its progress via messages
... | further arguments passed through to the distance function
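The contracts described for tokenizer, ignore and clean can be illustrated with a minimal base R sketch. The helper names tokenize_words, ignore_short and clean_tokens are hypothetical and not part of the package; they only follow the shapes stated above (a token data frame with from, to and token columns; a filter on that data frame; a length-preserving transformation of a token vector).

```r
# Hypothetical tokenizer: splits a text into whitespace-separated words and
# returns a token data frame with the columns from, to and token.
tokenize_words <- function(text) {
  text <- paste(text, collapse = "\n")
  m <- gregexpr("[^[:space:]]+", text)[[1]]
  if (m[1] == -1) {
    # no tokens found: return an empty token data frame of the same form
    return(data.frame(from = integer(0), to = integer(0),
                      token = character(0), stringsAsFactors = FALSE))
  }
  from <- as.integer(m)
  to   <- from + attr(m, "match.length") - 1L
  data.frame(from = from, to = to,
             token = substring(text, from, to),
             stringsAsFactors = FALSE)
}

# Hypothetical ignore function: drops tokens shorter than two characters and
# returns a data frame of the same form.
ignore_short <- function(tokens) {
  tokens[nchar(tokens$token) >= 2, , drop = FALSE]
}

# Hypothetical clean function: lower-cases tokens; output length equals
# input length.
clean_tokens <- function(tokens) {
  tolower(tokens)
}

tok <- tokenize_words("The quick  fox a")
print(tok)
print(ignore_short(tok))
print(clean_tokens(tok$token))

# Assumed usage, following the signature shown above (not run here):
# diff_align(text1 = a, text2 = b, tokenizer = tokenize_words,
#            ignore = ignore_short, clean = clean_tokens)
```

A word-level tokenizer like this would make the alignment compare words instead of whole lines, which is the default when tokenizer is NULL.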
Returns a data.frame with the tokens aligned according to distance.
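To give a feel for the default "lv" (Levenshtein) distance between tokens, the sketch below uses base R's utils::adist as a stand-in. It is not the distance function the package calls, and it does not cover the other methods listed in the usage.

```r
# Levenshtein distance counts single-character insertions, deletions and
# substitutions. utils::adist is used purely for illustration.
d1 <- as.integer(utils::adist("kitten", "sitting"))
d2 <- as.integer(utils::adist("the", "teh"))

print(d1)  # 3: k->s, e->i, insert g
print(d2)  # 2: a swap of adjacent characters costs two substitutions
```

Under "osa", which also allows transpositions of adjacent characters, the second pair would count as a single edit.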