diff_align: aligning texts
In petermeissner/diffrprojects: Projects for Text Version Comparison and Analytics in R

Aligns two texts side by side and returns the result as a data.frame that also gives the change type and the distance.

diff_align(text1 = NULL, text2 = NULL, tokenizer = NULL,
  ignore = NULL, clean = NULL, distance = c("lv", "osa", "dl",
  "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"),
  useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1),
  maxDist = 0, q = 1, p = 0, nthread = getOption("sd_num_thread"),
  verbose = TRUE, ...)

`text1`	first text
`text2`	second text
`tokenizer`	defaults to NULL, which triggers linewise tokenization; accepts a function that turns a text into a token data frame; a token data frame has at least three columns: from (first character of token), to (last character of token), and token (the token)
`ignore`	defaults to NULL, which means that nothing is ignored; accepts a function that takes a token data frame (see above) and returns a possibly subsetted data frame of the same form
`clean`	defaults to NULL, which means that nothing is cleaned; accepts a function that takes a vector of tokens and returns a vector of the same length, potentially cleaned up
`distance`	defaults to Levenshtein ("lv"); see amatch, stringdist-metrics, stringdist
`useBytes`	Perform byte-wise comparison, see `stringdist-encoding`.
`weight`	For `method='osa'` or `'dl'`, the penalty for deletion, insertion, substitution and transposition, in that order. When `method='lv'`, the penalty for transposition is ignored. When `method='jw'`, the weights associated with characters of `a`, characters from `b` and the transposition weight, in that order. Weights must be positive and not exceed 1. `weight` is ignored completely when `method='hamming'`, `'qgram'`, `'cosine'`, `'jaccard'`, `'lcs'`, or `'soundex'`.
`maxDist`	maximum distance beyond which tokens are no longer matched
`q`	Size of the q-gram; must be nonnegative. Only applies to `method='qgram'`, `'jaccard'` or `'cosine'`.
`p`	Penalty factor for Jaro-Winkler distance. The valid range for `p` is `0 <= p <= 0.25`. If `p=0` (default), the Jaro-distance is returned. Applies only to `method='jw'`.
`nthread`	Maximum number of threads to use. By default, a sensible number of threads is chosen, see `stringdist-parallelization`.
`verbose`	whether the function should report on its progress via messages
`...`	further arguments passed through to distance function
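
The `tokenizer`, `ignore` and `clean` arguments each accept a function with the shape described above. The following base R sketch shows what such functions might look like. The names `word_tokenizer`, `ignore_short` and `clean_lower` are hypothetical and not part of diffrprojects; they only illustrate the expected inputs and outputs.

```r
# Hypothetical tokenizer: splits a text on whitespace and returns a token
# data frame with the three required columns from, to and token.
word_tokenizer <- function(text) {
  text <- paste(text, collapse = "\n")
  m <- gregexpr("[^[:space:]]+", text)[[1]]
  if (m[1] == -1) {
    # no tokens found: return an empty token data frame of the same form
    return(data.frame(from = integer(0), to = integer(0),
                      token = character(0), stringsAsFactors = FALSE))
  }
  from <- as.integer(m)                          # first character of token
  to   <- from + attr(m, "match.length") - 1L    # last character of token
  data.frame(from = from, to = to,
             token = substring(text, from, to),
             stringsAsFactors = FALSE)
}

# Hypothetical ignore function: takes a token data frame and returns a
# subset of the same form, here dropping one-character tokens.
ignore_short <- function(tokens) {
  tokens[nchar(tokens$token) > 1, , drop = FALSE]
}

# Hypothetical clean function: takes a vector of tokens and returns a
# vector of the same length, here lower-cased and without punctuation.
clean_lower <- function(tokens) {
  tolower(gsub("[[:punct:]]", "", tokens))
}

tokens <- word_tokenizer("The quick  brown fox")
print(tokens)

# Assumed usage, following the signature shown under Usage (not run here):
# diff_align(text1, text2, tokenizer = word_tokenizer,
#            ignore = ignore_short, clean = clean_lower)
```

Keeping the three steps as separate functions mirrors the argument list: tokenization decides what a unit of comparison is, `ignore` removes units from the comparison, and `clean` normalizes the units that remain.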

A data.frame with tokens aligned according to distance.
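
To give a feel for the default Levenshtein distance (`"lv"`), here is a small sketch that uses base R's `utils::adist()` as a stand-in. This is an assumption made for illustration only: the argument descriptions point to the stringdist package, and `adist()` takes its costs in a different form than the `weight` argument above.

```r
# Generalized Levenshtein distance from base R (utils::adist),
# used here only as a stand-in for the "lv" distance.
d_plain <- adist("kitten", "sitting")[1, 1]

# Levenshtein has no transposition operation, so swapping two
# neighbouring characters counts as two substitutions.
d_swap <- adist("ab", "ba")[1, 1]

# Edit costs can be weighted; here a substitution costs twice as much
# as an insertion or a deletion.
d_weighted <- adist("kitten", "sitting",
                    costs = c(insertions = 1, deletions = 1,
                              substitutions = 2))[1, 1]

print(c(plain = d_plain, swap = d_swap, weighted = d_weighted))
```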

petermeissner/diffrprojects documentation built on Dec. 29, 2020, 3:59 a.m.
