spelldist: Spelling distance matrix


View source: R/spelldist.R

Description

Calculate a spelling distance matrix from the costs of insertion-deletion (indel) and substitution of characters or character combinations.

Usage

spelldist(words, standard = NULL, asdist = FALSE, indel = NULL,
  sm = NULL, cost_method = c("INDELS", "CONSTANT", "TRATE", "FUTURE",
  "FEATURES", "INDELSLOG"), dist_method = c("OM", "OMloc", "OMslen",
  "OMspell", "OMstran", "HAM", "DHD", "CHI2", "EUCLID", "LCS", "LCP",
  "RLCP", "NMS", "NMSMST", "SVRspell", "TWED"), ...)

Arguments

words

a vector of word forms, case-insensitive.

standard

a vector of word forms in standard spelling, case-insensitive. If NULL, words is used in its place, so the result is a symmetric distance matrix.

asdist

logical, whether to return the spelling distances as a dist object rather than a matrix. If standard is not NULL, only a distance matrix can be returned, and asdist is ignored.

indel

an optional named numeric vector of indel costs. The names are what is being inserted or deleted, e.g. c('a'=1,'bb'=0.2). The names do not have to be single-character.

sm

an optional numeric matrix whose rownames and colnames are the characters being substituted. An element sm[i,j] is the cost of substituting rownames(sm)[i] with colnames(sm)[j]. The names do not have to be single-character.

cost_method

method for calculating those indel and substitution costs that are not explicitly provided by indel and sm. See TraMineR::seqcost for descriptions.

dist_method

method for calculating spelling distance. See TraMineR::seqdist for descriptions.

...

Optional arguments to be passed to TraMineR::seqdist (any of its arguments except seqdata, method, indel and sm).
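As an illustration of the shapes that indel and sm are described as taking, here is a minimal base-R sketch. The cost values and the unit names "l", "ll" and "r" are made up for the example; the zero diagonal and the symmetry of sm are assumptions of this sketch, not requirements stated on this page.

```r
# Hypothetical indel costs: the names are the units being inserted or deleted.
indel <- c("a" = 1, "bb" = 0.2)

# Hypothetical substitution costs among three units, one of them two characters long.
units <- c("l", "ll", "r")
sm <- matrix(1, nrow = length(units), ncol = length(units),
             dimnames = list(units, units))
diag(sm) <- 0          # assumption: substituting a unit with itself costs nothing
sm["ll", "l"] <- 0.1   # cost of substituting "ll" with "l"
sm["l", "ll"] <- 0.1   # assumption: keep the matrix symmetric

sm["ll", "l"]          # cost looked up as sm[row name, column name]
```

Indexing by name, as in sm["ll", "l"], mirrors the rownames(sm)[i] to colnames(sm)[j] convention described for sm.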

Details

The calculation is case-insensitive: words, standard, the names of indel and the dimnames of sm are all coerced to lower case. indel and sm also provide a way to handle multi-character rules. For example, the presence of an element sm['ll','l'] indicates that "ll" will be treated as a single character. For technical reasons, however, the number of distinct multi-character names (across indel and sm combined) must not exceed 26.
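The lower-casing and the 26-name limit described above can be checked before calling the function. The following base-R sketch is only a pre-check written for illustration; it is not the package's internal code, and the cost values and names are hypothetical.

```r
# Hypothetical inputs with mixed-case names.
indel <- c("A" = 1, "BB" = 0.2, "ch" = 0.5)
sm <- matrix(c(0, 0.1, 0.1, 0), nrow = 2,
             dimnames = list(c("LL", "l"), c("LL", "l")))

# Coerce the names to lower case, as the Details describe.
names(indel) <- tolower(names(indel))
dimnames(sm) <- lapply(dimnames(sm), tolower)

# Collect the distinct names from both inputs and keep the multi-character ones.
unit_names <- unique(c(names(indel), rownames(sm), colnames(sm)))
multi <- unit_names[nchar(unit_names) > 1]
multi   # "bb" "ch" "ll"

# The documented limit: at most 26 distinct multi-character names in total.
stopifnot(length(multi) <= 26)
```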

Value

a list with the following elements: m, the matrix (or dist object) of spelling distances, with rows corresponding to words; indel and sm, as they were used in the calculation; elapsed, the elapsed time in seconds.
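A minimal usage sketch follows. It assumes the function is exported from a package named parseR (taken from the repository name on this page) and that the package and TraMineR are installed; the word list is made up, and no output is shown because it depends on the chosen cost and distance methods.

```r
# Hypothetical word forms; case differences are ignored by the function.
words <- c("colour", "color", "Colour", "kolor")

# Only run the call if the package is available (assumed name: parseR).
if (requireNamespace("parseR", quietly = TRUE)) {
  res <- parseR::spelldist(words)   # standard = NULL: words compared with themselves
  str(res$m)                        # distances, with rows corresponding to words
  print(res$elapsed)                # elapsed time in seconds
}
```

The elements m and elapsed are the ones listed under Value; res$indel and res$sm would hold the costs actually used.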


rushkin/parseR documentation built on May 17, 2019, 12:52 p.m.