writeprofile: Writing of an orthography profile skeleton
In cysouw/qlcTokenize: Processing of orthographies using orthography profiles


To process strings, it is often very useful to tokenize them into graphemes (i.e. functional units of the orthography), and possibly to replace those graphemes with other symbols in order to harmonize different orthographic representations (‘transcription’). As a quick and easy way to specify, save, and document the decisions taken for the tokenization, we propose using an orthography profile.

Provided here is a function to prepare a skeleton for an orthography profile. This function takes some strings and lists detailed information on the Unicode characters in the strings.

write.profile(strings, 
    normalize = NULL, info = TRUE, editing = FALSE, sep = NULL, 
    file.out = NULL, collation.locale = NULL)

`strings`	A vector of strings on which to base the orthography profile
`normalize`	Should any Unicode normalization be applied before making a profile? By default, no normalization is applied, giving direct feedback on the actual encoding as observed in the strings. Other options are `NFC` and `NFD`. In combination with `sep`, these options can lead to different insights into the structure of your strings (see examples below).
`info`	Add columns with Unicode information on the graphemes: Unicode code points, Unicode names, and frequency of occurrence in the input strings.
`editing`	Add empty columns for further editing of the orthography profile: left context, right context, class, and transliteration. See `tokenize` for detailed information on their usage.
`sep`	Separator used to split the strings. When `NULL` (the default), Unicode character definitions are used to split (as provided by ICU, ported to R by `stringi::stri_split_boundaries`). When `sep` is specified, strings are split by this separator. Often useful is `sep = ""` to split by Unicode code points (see examples below).
`file.out`	Filename for writing the profile to disk. When `NULL` the profile is returned as an R dataframe consisting of strings. When `file.out` is specified (as a path to a file), then the profile is written to disk and the R dataframe is returned invisibly.
`collation.locale`	Specify the ordering to be used in writing the profile. By default, the ordering specified in the current locale is used (check `Sys.getlocale("LC_COLLATE")`).
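
The effect of a collation locale on ordering can be sketched with stringi directly. This is only an illustration: it assumes ICU collation as exposed by `stringi::stri_sort`, and whether `write.profile` calls exactly this function internally is an assumption. The vector `x` is made up for the example.

```r
library(stringi)

# Hypothetical graphemes to be ordered: "z", "ä" (U+00E4), "a"
x <- c("z", "\u00e4", "a")

# German collation treats "ä" as a variant of "a"
sorted_de <- stri_sort(x, locale = "de")

# Swedish collation places "ä" after "z"
sorted_sv <- stri_sort(x, locale = "sv")

print(sorted_de)
print(sorted_sv)
```

The same graphemes can therefore appear in a different row order depending on the locale chosen.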

Strings are divided into default grapheme clusters as defined by the Unicode specification. The underlying code comes from ICU, as ported to R in the stringi package.
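
The difference between grapheme clusters and code points can be inspected with stringi itself. The following is a minimal sketch, not the internals of `write.profile`; the variable names are chosen for illustration only.

```r
library(stringi)

# Precomposed "Ù" (U+00D9) followed by "U" plus a combining grave accent
# (U+0055 U+0300): two renderings of the same grapheme
x <- "\u00d9\u0055\u0300"

# Default grapheme clusters: each rendering counts as one unit
clusters <- stri_split_boundaries(x, type = "character")[[1]]

# Number of code points as encoded, and after NFC and NFD normalization
n_raw <- stri_length(x)
n_nfc <- stri_length(stri_trans_nfc(x))
n_nfd <- stri_length(stri_trans_nfd(x))

print(length(clusters))
print(c(raw = n_raw, NFC = n_nfc, NFD = n_nfd))
```

The number of grapheme clusters stays the same under either normalization, while the number of code points changes.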

A dataframe with strings representing a skeleton of an orthography profile.
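
To give a rough idea of what such a skeleton contains, the sketch below builds a small data frame of graphemes with their frequencies and code points. It is not the function's actual implementation; the column names and the helper variables are assumptions made for illustration, and Unicode names are left out.

```r
library(stringi)

# Hypothetical input: Latin A, Greek Alpha, Cyrillic A, and Latin A again
strings <- c("\u0041\u0391\u0410", "\u0041")

# Split every string into default grapheme clusters
graphemes <- unlist(stri_split_boundaries(strings, type = "character"))

# Count how often each distinct grapheme occurs
uniq <- unique(graphemes)
counts <- tabulate(match(graphemes, uniq), nbins = length(uniq))

# Format the code points of each grapheme as "U+XXXX"
codepoints <- vapply(
  uniq,
  function(g) paste(sprintf("U+%04X", utf8ToInt(g)), collapse = ", "),
  character(1),
  USE.NAMES = FALSE
)

profile <- data.frame(
  Grapheme = uniq,
  Frequency = counts,
  Codepoint = codepoints,
  stringsAsFactors = FALSE
)
print(profile)
```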

Michael Cysouw <cysouw@mac.com>

Moran & Cysouw (forthcoming)

tokenize

# produce statistics, showing three different kinds of "A"s in Unicode (Latin, Greek, Cyrillic).
# look at the output of "example" in the console to get the point!
(example <- "\u0041\u0391\u0410")
write.profile(example)

# note the differences. Again, look at the example in the console!
(example <- "ÙÚÛ\u0055\u0300\u0055\u0301\u0055\u0302")
# default settings
write.profile(example)
# split according to unicode codepoints
write.profile(example, sep = "")
# after NFC normalization unicode codepoints have changed
write.profile(example, normalize = "NFC", sep = "")
# NFD normalization gives yet another structure of the codepoints
write.profile(example, normalize = "NFD", sep = "")
# note that NFC and NFD normalization give the same segmentation into graphemes under Unicode character definitions!
write.profile(example, normalize = "NFD")
write.profile(example, normalize = "NFC")