View source: R/stroke_edit_distance.R
sedist | R Documentation |
Variants of the stroke edit distance proposed by Yencken (2010). Each kanji is encoded as a sequence of stroke types according to its stroke order, using the type attribute from the kanjiVG data. Then the edit distance (a.k.a. Levenshtein distance) between the two sequences is computed and divided by the larger of the two kanji's stroke counts.
sedist(k1, k2, type = c("full", "before_slash", "first"))
k1 , k2 |
atomic vectors or lists of kanji in any format that can be treated by |
type |
the type of stroke edit distance to compute. See details. |
The kanjiVG type attribute is a single string composed of a CJK Strokes Unicode character, an optional Latin letter providing further information, and possibly a variant (another CJK Strokes character with an optional letter) separated by "/". If type is "full", a match is only counted if two strings are exactly the same; "before_slash" ignores any slash and what comes after it; "first" only considers the first character of each string (that is, the first CJK Strokes character) when counting matches.
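The three variants differ only in how much of each stroke type string is compared. A minimal sketch of that reduction step follows; reduce_stroke_type is a hypothetical helper written for illustration, not a function of the package, and the example strings are made up to follow the format described above rather than taken from the kanjiVG data.

```r
# Hypothetical helper (not part of the package): reduce stroke type strings
# to the part that is compared under each variant.
reduce_stroke_type <- function(x, type = c("full", "before_slash", "first")) {
  type <- match.arg(type)
  switch(type,
    full = x,                            # compare the whole string
    before_slash = sub("/.*$", "", x),   # drop the first "/" and everything after it
    first = substr(x, 1, 1)              # keep only the first character
  )
}

# Illustrative type strings: stroke character, optional letter, optional "/variant".
types <- c("\u{31d0}", "\u{31d1}a", "\u{31d4}/\u{31c0}", "\u{31d0}a/\u{31d4}b")

reduce_stroke_type(types, "full")
reduce_stroke_type(types, "before_slash")
reduce_stroke_type(types, "first")
```

Under "first", the second and fourth strings lose their letters, so two strokes that differ only in the letter or the variant count as a match.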
The stroke edit distance used by Yencken (2010) is obtained by setting type = "full" (the default), except that the underlying kanjiVG data has changed significantly since then. Comparing with the values in dstrokedit, we get an agreement of 96.3 percent, whereas the remaining distances disagree by a small amount (usually 1-2 edit operations).
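To make the normalization concrete, here is a minimal sketch of a Levenshtein distance over sequences of stroke type tokens, divided by the length of the longer sequence. The function names seq_edit_distance and normalized_sed are hypothetical, and the sketch assumes unit costs for insertion, deletion and substitution; it illustrates the idea and is not the package's implementation.

```r
# Hypothetical sketch: Levenshtein distance between two token sequences
# (character vectors), using the standard dynamic programming table.
seq_edit_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n  # deleting all tokens of a
  d[1, ] <- 0:m  # inserting all tokens of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1,  # deletion
        d[i + 1, j] + 1,  # insertion
        d[i, j] + cost    # match or substitution
      )
    }
  }
  d[n + 1, m + 1]
}

# Divide by the larger number of strokes (assumes at least one stroke).
normalized_sed <- function(a, b) {
  seq_edit_distance(a, b) / max(length(a), length(b))
}

# Placeholder tokens standing in for stroke types.
s1 <- c("h", "v", "h")
s2 <- c("h", "v", "d", "h")
normalized_sed(s1, s2)  # one insertion over four strokes: 0.25
```

Because the distance is divided by the longer sequence, the result always lies between 0 (identical sequences) and 1 (no stroke types in common at any aligned position).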
A length(k1) x length(k2) matrix of stroke edit distances.
Requires the kanjistat.data package.
Yencken, Lars (2010). Orthographic support for passing the reading hurdle in Japanese. PhD Thesis, University of Melbourne, Australia.
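library(kanjistat)

# Kanji with index 384, converted to its character
ind1 <- 384
k1 <- convert_kanji(ind1, "character")
# dstrokedit contains only the "closest" kanji, so keep the nonzero entries
ind2 <- which(dstrokedit[ind1, ] > 0)
k2 <- convert_kanji(ind2, "character")
# Reference values stored in dstrokedit
row_a <- dstrokedit[ind1, ind2]
if (requireNamespace("kanjistat.data", quietly = TRUE)) {
  # Values computed by sedist with the default type
  row_b <- sedist(k1, k2)
  mat <- rbind(row_a, row_b)
  rownames(mat) <- c(k1, k1)
  colnames(mat) <- k2
  mat
}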