diff_align: aligning texts


View source: R/diff_align.R

Description

Aligns two texts side by side in a data frame that also gives the change type and the distance for each aligned pair of tokens.

Usage

diff_align(text1 = NULL, text2 = NULL, tokenizer = NULL,
  ignore = NULL, clean = NULL, distance = c("lv", "osa", "dl",
  "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
  useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0, q = 1, p = 0, nthread = getOption("sd_num_thread"),
  verbose = TRUE, ...)

Arguments

text1

first text

text2

second text

tokenizer

defaults to NULL, which triggers line-wise tokenization; accepts a function that turns a text into a token data frame. A token data frame has at least three columns: from (first character of the token), to (last character of the token), and token (the token itself)

ignore

defaults to NULL, which means that nothing is ignored; accepts a function that takes a token data frame (see above) and returns a possibly subsetted data frame of the same form

clean

defaults to NULL, which means that nothing is cleaned; accepts a function that takes a vector of tokens and returns a vector of the same length with the tokens potentially cleaned up

distance

defaults to Levenshtein ("lv"); see amatch, stringdist-metrics, stringdist

useBytes

Perform byte-wise comparison, see stringdist-encoding.

weight

For method='osa' or 'dl', the penalty for deletion, insertion, substitution and transposition, in that order. When method='lv', the penalty for transposition is ignored. When method='jw', the weights associated with characters of a, characters from b and the transposition weight, in that order. Weights must be positive and not exceed 1. weight is ignored completely when method='hamming', 'qgram', 'cosine', 'jaccard', 'lcs', or 'soundex'.

maxDist

maximum distance up to which two tokens are still matched; tokens farther apart than this are not matched

q

Size of the q-gram; must be nonnegative. Only applies to method='qgram', 'jaccard' or 'cosine'.

p

Penalty factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p=0 (default), the Jaro-distance is returned. Applies only to method='jw'.

nthread

Maximum number of threads to use. By default, a sensible number of threads is chosen, see stringdist-parallelization.

verbose

whether the function should report its progress via messages

...

further arguments passed through to the distance function
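
The three function-valued arguments (tokenizer, ignore, clean) can be easier to picture with a small example. The following is a minimal sketch in base R of functions with the shapes described above. The names line_tokenizer, ignore_empty and clean_token are hypothetical and not part of the package, and the sketch does not reproduce the package's own default tokenization.

```r
# Hypothetical helpers; the names are illustrative, not part of diffrprojects.

# tokenizer: text -> token data frame with columns from, to, token
line_tokenizer <- function(text) {
  tokens <- strsplit(text, "\n", fixed = TRUE)[[1]]
  n      <- nchar(tokens)
  # every token except the last is followed by one newline, hence the + 1
  from   <- cumsum(c(1, head(n, -1) + 1))
  to     <- from + n - 1   # for an empty line this gives to = from - 1
  data.frame(from = from, to = to, token = tokens, stringsAsFactors = FALSE)
}

# ignore: token data frame -> possibly subsetted token data frame
ignore_empty <- function(token_df) {
  token_df[nzchar(trimws(token_df$token)), , drop = FALSE]
}

# clean: vector of tokens -> vector of the same length
clean_token <- function(tokens) {
  tolower(trimws(tokens))
}

text    <- "Alpha  \n\nbeta gamma"
tokens  <- line_tokenizer(text)       # three tokens, the second one empty
kept    <- ignore_empty(tokens)       # drops the empty line
cleaned <- clean_token(kept$token)    # trimmed, lower-cased tokens
```

Going by the signature shown under Usage, such functions would presumably be supplied as tokenizer = line_tokenizer, ignore = ignore_empty, clean = clean_token; that call has not been checked against the package here.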

Value

data frame with tokens aligned according to distance
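
To illustrate what aligning tokens "according to distance" under a maxDist threshold can look like, here is a rough sketch using base R's utils::adist (generalized Levenshtein distance). It is a simple greedy matcher written for illustration. The function name align_sketch and its output columns are assumptions, and it is not the algorithm or the output format that diff_align itself uses.

```r
# Illustrative greedy matcher; NOT the algorithm implemented by diff_align.
align_sketch <- function(tokens1, tokens2, maxDist = 0) {
  d      <- utils::adist(tokens1, tokens2)   # pairwise Levenshtein distances
  used2  <- rep(FALSE, length(tokens2))      # tokens of text 2 already matched
  match2 <- rep(NA_integer_, length(tokens1))
  dist   <- rep(NA_real_, length(tokens1))
  for (i in seq_along(tokens1)) {
    # candidates: unmatched tokens of text 2 within the allowed distance
    candidates <- which(!used2 & d[i, ] <= maxDist)
    if (length(candidates) > 0) {
      j         <- candidates[which.min(d[i, candidates])]
      match2[i] <- j
      dist[i]   <- d[i, j]
      used2[j]  <- TRUE
    }
  }
  data.frame(
    token1   = tokens1,
    token2   = tokens2[match2],   # NA where no match was found
    distance = dist,
    stringsAsFactors = FALSE
  )
}

t1 <- c("the quick fox", "jumps over", "the dog")
t2 <- c("the quick fox", "jumped over", "a cat")

exact <- align_sketch(t1, t2, maxDist = 0)   # only identical tokens match
fuzzy <- align_sketch(t1, t2, maxDist = 2)   # small edits are tolerated too
```

With maxDist = 0 only the identical first tokens are paired; raising the threshold to 2 also pairs "jumps over" with "jumped over", while "the dog" and "a cat" stay unmatched.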


petermeissner/diffrprojects documentation built on Dec. 29, 2020, 3:59 a.m.