tokenize_old: Tokenization of character strings based on an orthography...
In cysouw/qlcTokenize: Processing of orthographies using orthography profiles


To process strings, it is often useful to tokenize them into graphemes (i.e. functional units of the orthography), and possibly to replace those graphemes with other symbols in order to harmonize different orthographic representations (‘transcription’). As a quick and easy way to specify, save, and document the decisions taken for the tokenization, we propose using an orthography profile.

This is the main function for producing, testing, and applying orthography profiles.

tokenize_old(strings, 
  orthography.profile = NULL, transliterate = FALSE,
  graphemes = "graphemes", replacements = "replacements",
  sep = " ", sep.replacement = "#", missing = "\u2047",
  normalize = "NFC", size.order = TRUE, context = FALSE,
  global.match = TRUE, file = NULL)

`strings`	vector of strings to be tokenized.
`orthography.profile`	orthography profile specifying the graphemes for the tokenization, and possibly any replacements of the available graphemes. Can be a filename or an object as returned by `read.orthography.profile`. If NULL then the orthography profile will be created on the fly using the defaults of `write.orthography.profile`.
`transliterate`	logical: should orthographic transliteration be performed after tokenization. Defaults to FALSE.
`graphemes`	name (or number) of the column in the orthography profile listing the graphemes to be used for the tokenization. Defaults to "graphemes"
`replacements`	name (or number) of the column in the orthography profile listing the replacements. Defaults to "replacements"
`sep`	separator to be inserted between graphemes. Defaults to space.
`sep.replacement`	replacement for the specified separator (by default: space) when it already occurs in the data. Defaults to hash ("#").
`missing`	character to be inserted at transliteration when no transliteration is specified. Defaults to DOUBLE QUESTION MARK at U+2047.
`normalize`	which normalization to use, defaults to "NFC". Other option is "NFD". Any other input will result in no normalisation being performed.
`size.order`	by default graphemes will be identified largest first. If FALSE then the order as specified in the orthography profile will be used.
`context`	when context = TRUE then the profile is assumed to have columns named "left" and "right" specifying the context for transliteration.
`global.match`	logical: how the tokenization should be performed. By default (TRUE), a global match is used, i.e. each grapheme is replaced throughout the whole string before the next grapheme is taken up. If FALSE, the replacements are performed along each string, similar in result to a finite state transducer: the grapheme that matches the start of the string is identified first, and matching then proceeds along the string.
`file`	filename for results to be written. No suffix should be specified, as various different files with different suffixes are produced.
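
As an illustration of how `sep` and `sep.replacement` relate, the following base R sketch protects separators that already occur in the data before inserting new ones. The helper `join_graphemes` is hypothetical and not part of qlcTokenize, and it simply splits into single characters as a stand-in for real grapheme tokenization; the package may handle this step differently.

```r
# Hypothetical helper (not part of qlcTokenize): one possible way the
# `sep` and `sep.replacement` arguments could interact.
join_graphemes <- function(string, sep = " ", sep.replacement = "#") {
  # Replace separators that already occur in the data, so that they
  # cannot be confused with the separators inserted between graphemes.
  protected <- gsub(sep, sep.replacement, string, fixed = TRUE)
  # Split into single characters (a stand-in for grapheme tokenization).
  chars <- strsplit(protected, "", fixed = TRUE)[[1]]
  # Insert the separator between the units.
  paste(chars, collapse = sep)
}

join_graphemes("ab cd")
# [1] "a b # c d"
```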

The function tokenizes (and optionally replaces, i.e. transliterates) strings into graphemes. First, the graphemes from the .prf table are used for the tokenization, starting with the largest graphemes (measured in Unicode code points, using NFC normalisation by default). Any unmatched sequences in the data are reported with a warning. Any rules specified in the .rules file are applied at the end.
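
The largest-first matching described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: `tokenize_sketch` and its `unmatched` marker are hypothetical names, the sketch walks along the string from left to right (comparable in spirit to `global.match = FALSE`), and it skips normalisation, transliteration, and rules entirely.

```r
# Hypothetical helper (not part of qlcTokenize): greedy tokenization that
# always prefers the longest grapheme matching at the current position.
tokenize_sketch <- function(string, graphemes, sep = " ", unmatched = "?") {
  # Order graphemes longest first, counted in characters (code points).
  lengths <- nchar(graphemes, type = "chars")
  graphemes <- graphemes[order(lengths, decreasing = TRUE)]
  tokens <- character(0)
  while (nchar(string) > 0) {
    # All graphemes that match the start of the remaining string;
    # because of the ordering, the first hit is the longest one.
    hit <- graphemes[startsWith(string, graphemes)]
    if (length(hit) > 0) {
      token <- hit[1]
      advance <- nchar(token)
    } else {
      # No grapheme matches: mark one character as unmatched (assumption).
      token <- unmatched
      advance <- 1
    }
    tokens <- c(tokens, token)
    string <- substring(string, advance + 1)
  }
  paste(tokens, collapse = sep)
}

tokenize_sketch("nangsch", c("a", "n", "ng", "ch", "sch"))
# [1] "n a ng sch"
```

Ordering by length is what makes "ng" win over "n" and "sch" win over "ch" here; with the profile order instead (as with `size.order = FALSE`), "n" would be tried before "ng".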

If file is not specified, the function returns a list of three elements:

`strings`	the vector with the parsed strings
`profile`	a dataframe with the graphemes and some more information
`warnings`	a table with all original strings and the unmatched parts

When file is specified, these three tables are written to three different files: file.txt for the strings, file.prf for the orthography profile, and file_warnings.txt for the warnings. Note that when replacements are made (i.e. when transliterate = TRUE), no orthography profile is produced. Likewise, when there are no warnings, no warnings file is produced.
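
The naming scheme described above can be made concrete with a small base R sketch. The helper `output_files` is hypothetical and only builds the file names from a base name; it does not write anything and is not part of qlcTokenize.

```r
# Hypothetical helper (not part of qlcTokenize): derive the three output
# file names from the base name passed as `file`.
output_files <- function(file) {
  c(strings  = paste0(file, ".txt"),          # tokenized strings
    profile  = paste0(file, ".prf"),          # orthography profile
    warnings = paste0(file, "_warnings.txt")) # unmatched parts, if any
}

output_files("result")
```

This is also why no suffix should be given in `file`: the suffixes are appended to whatever base name is supplied.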

Michael Cysouw

# make an ad-hoc orthography profile
gr <- cbind(c("a","ä","n","ng","ch","sch"),c("a","e","n","N","x","sh"))
colnames(gr) <- c("graphemes","replacements")
( op <- list(graphs = gr, rules = NULL) )

# tokenization
tokenize_old(
    c("nana", "änngschä", "ach")
    , op
    , graphemes = "graphemes")
    
# with replacements and an error message
tokenize_old(
    c("Naná", "änngschä", "ach")
    , op
    , graphemes = "graphemes"
    , replacements = "replacements")