diff_align: aligning texts


View source: R/diff_align.R

Description

Aligns two texts side by side in a data.frame that also reports the change type and the distance.

Usage

diff_align(text1 = NULL, text2 = NULL, tokenizer = NULL, ignore = NULL,
  clean = NULL, distance = c("lv", "osa", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1,
  i = 1, s = 1, t = 1), maxDist = 0, q = 1, p = 0,
  nthread = getOption("sd_num_thread"), verbose = TRUE, ...)

Arguments

text1

first text

text2

second text

tokenizer

Defaults to NULL, which triggers linewise tokenization. Accepts a function that turns a text into a token data frame. A token data frame has at least three columns: from (first character of token), to (last character of token), and token (the token).
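As an illustration of the token data frame described above, here is a minimal sketch of a custom tokenizer in base R. The name `word_tokenizer` is hypothetical and not part of diffrprojects. The sketch assumes that the tokenizer receives a single character string and that from and to are 1-based character positions.

```r
# Hypothetical helper (not part of diffrprojects): split a text into
# whitespace-delimited tokens and record their character positions.
# Assumption: `text` is a single character string.
word_tokenizer <- function(text) {
  m <- gregexpr("[^[:space:]]+", text)[[1]]
  if (m[1] == -1) {
    # No tokens found: return an empty token data frame of the same form.
    return(data.frame(from = integer(0), to = integer(0),
                      token = character(0), stringsAsFactors = FALSE))
  }
  from <- as.integer(m)
  to <- from + attr(m, "match.length") - 1L
  data.frame(from = from, to = to,
             token = substring(text, from, to),
             stringsAsFactors = FALSE)
}

tokens <- word_tokenizer("the quick  brown fox")
print(tokens)
```

A function of this shape could then be supplied as the tokenizer argument in place of the default linewise tokenization.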

ignore

Defaults to NULL, which means that nothing is ignored. Accepts a function that takes a token data frame (see above) and returns a possibly subsetted data frame of the same form.
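A minimal sketch of such an ignore function in base R follows. The name `ignore_blank` and the sample data are hypothetical and only serve to show the expected input and output shape: a token data frame goes in, and a row subset with the same columns comes out.

```r
# Hypothetical helper (not part of diffrprojects): drop tokens that are
# empty or consist only of whitespace, keeping all columns intact.
ignore_blank <- function(token_df) {
  keep <- grepl("[^[:space:]]", token_df$token)
  token_df[keep, , drop = FALSE]
}

# Illustrative token data frame for the text "alpha\n  \ngamma",
# tokenized line by line.
tokens <- data.frame(
  from  = c(1L, 7L, 10L),
  to    = c(5L, 8L, 14L),
  token = c("alpha", "  ", "gamma"),
  stringsAsFactors = FALSE
)

kept <- ignore_blank(tokens)
print(kept)
```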

clean

Defaults to NULL, which means that nothing is cleaned. Accepts a function that takes a vector of tokens and returns a vector of the same length, potentially cleaned up.
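The following sketch shows one possible clean function in base R. The name `clean_tokens` is hypothetical, and the particular normalization (removing punctuation, trimming whitespace, lowercasing) is just one choice. The important property is that the result has the same length as the input.

```r
# Hypothetical helper (not part of diffrprojects): normalize tokens
# before comparison without changing the number of tokens.
clean_tokens <- function(tokens) {
  tolower(trimws(gsub("[[:punct:]]", "", tokens)))
}

cleaned <- clean_tokens(c("Hello,", "  World! ", "R"))
print(cleaned)
```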

distance

defaults to Levenshtein ("lv"); see amatch, stringdist-metrics, stringdist

useBytes

Perform byte-wise comparison, see stringdist-encoding.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.
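To illustrate how per-operation penalties change an edit distance, the sketch below uses base R's `utils::adist` as a stand-in. This is not the stringdist implementation that the weight argument refers to: `adist` has no transposition penalty, and its costs are not restricted to values of at most 1. It only shows the general effect of weighting operations differently.

```r
# Stand-in illustration with base R's adist(), not stringdist.
a <- "kitten"
b <- "sitting"

# Unit costs for insertion, deletion and substitution.
unit_cost <- adist(a, b)[1, 1]

# Make substitutions twice as expensive as insertions and deletions.
weighted_cost <- adist(
  a, b,
  costs = c(insertions = 1, deletions = 1, substitutions = 2)
)[1, 1]

print(c(unit = unit_cost, weighted = weighted_cost))
```

With unit costs the two substitutions and one insertion give a distance of 3. Once a substitution costs as much as a deletion plus an insertion, the distance rises to 5.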

maxDist

[DEPRECATED AND WILL BE REMOVED|2016] Currently kept for backward compatibility. It does not offer any speed gain. (In fact, it currently slows things down when set to anything different from Inf).

q

Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.

p

Penalty factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.

nthread

Maximum number of threads to use. By default, a sensible number of threads is chosen, see stringdist-parallelization.

verbose

Whether the function should report its progress via messages.

...

further arguments passed through to distance function

Value

A data.frame with tokens aligned according to distance.


diffrprojects documentation built on May 2, 2019, 1:43 p.m.