tokenize: Tokenization and transliteration of character strings based...
In qlcData: Processing Data for Quantitative Language Comparison (QLC)

Description Usage Arguments Details Value Note Note Author(s) References See Also Examples

To process strings it is often very useful to tokenise them into graphemes (i.e. functional units of the orthography), and possibly replace those graphemes by other symbols to harmonize the orthographic representation of different orthographic representations (‘transcription/transliteration’). As a quick and easy way to specify, save, and document the decisions taken for the tokenization, we propose using an orthography profile.

This function is the main function to produce, test and apply orthography profiles.

tokenize(strings, 
  profile = NULL, transliterate = NULL,
  method = "global", ordering = c("size", "context", "reverse"),
  sep = " ", sep.replace = NULL, missing = "\u2047", normalize = "NFC",
  regex = FALSE, silent = FALSE,
  file.out = NULL)

`strings`	Vector of strings to the tokenized. It is also possibly to pass a filename, which will then simply be read as `scan(strings, sep = "\n", what = "character")`.
`profile`	Orthography profile specifying the graphemes for the tokenization, and possibly any replacements of the available graphemes. Can be a reference to a file or an R object. If NULL then the orthography profile will be created on the fly using the defaults of `write.profile`.
`transliterate`	Default `NULL`, meaning no transliteration is to be performed. Alternatively, specify the name of the column in the orthography profile that should be used for replacement.
`method`	Method to be used for parsing the strings into graphemes. Currently two options are implemented: `global` and `linear`. See Details for further explanation.
`ordering`	Method for ordering. Currently three different methods are implemented, which can be combined (see Details below): `size`, `context`, `reverse` and `frequency`. Use `NULL` to prevent ordering and use the top to bottom order as specified in the orthography profile.
`sep`	Separator to be inserted between graphemes. Defaults to space. This function assumes that the separator specified here does not occur in the data. If it does, unexpected things might happen. Consider removing the chosen seperator from your strings first, e.g. by using `gsub` or use the option `sep.replace`.
`sep.replace`	Sometimes, the chosen separator (see above) occurs in the strings to be parsed. This is technically not a problem, but the result might show unexpected sequences. When `sep.replace` is specified, this marking is inserted in the string at those places where the `sep` marker occurs. Typical usage in linguistics would be `sep = " ", sep.replace = "#"` adding spaces between graphemes and replacing spaces in the input string by hashes in the output string.
`missing`	Character to be inserted at transliteration when no transliteration is specified. Defaults to DOUBLE QUESTION MARK at U+2047. Change this when this character appears in the input string.
`normalize`	Which normalization to use before tokenization, defaults to "NFC". Other option is "NFD". Any other input will result in no normalisation being performed.
`regex`	Logical: when `regex = FALSE` internally the matching of graphemes is done exact, i.e. without using regular expressions. When `regex = TRUE` ICU-style regular expression (see `stringi-search-regex`) are used for all content in the profile (including the Grapheme-column!), so any reserved characters have to be escaped in the orthography profile. Specifically, add a slash "\" before any occurrence of the characters `[](){}\|+*.-!?<cb><86>$\` in your profile (except of course when these characters are used in their regular expression meaning). Note that this parameter also influences whether contexts should be considered in the tokenization (internally, contextual searching uses regular expressions). By default, when `regex = FALSE`, context is ignored. If `regex = TRUE` then the function checks whether there are columns called `Left` (for the left context) and `Right` (for the right context), and optionally a column called `Class` (for the specification of grapheme-classes) in the orthography profile. These are hard-coded column-names, so please adapt your orthography profile accordingly. The columns `Left` and `Right` allow for regular expression to specify context.

`silent`	Logical: by default missing characters in the strings are reported with a warning. use `silent = TRUE` to supress these warnings.
`file.out`	Filename for results to be written. No suffix should be specified, as various different files with different suffixes are produced (see Details below). When `file.out` is specified, then the data is written to disk AND the R dataframe is returned invisibly.

Given a set of graphemes, there are at least two different methods to tokenize strings. The first is called global here: this approach takes the first grapheme, matches this grapheme globally at all places in the string, and then turns to the next string. The other approach is called linear here: this approach walks through the string from left to right. At the first character it looks through all graphemes whether there is any match, and then walks further to the end of the match and starts again. In some special cases these two methods can lead to different results (see Examples).

The ordering or the lines in the ortography profile is of crucial importance, and different orderings will lead to radically different results. To simply use the top to bottom ordering as specified in the profile, use order = NULL. Currently, there are four different ordering strategies implemented: size, context, reverse and frequency. By specifying more than one in a vector, these orderings are used to break ties, e.g. the default specification c("size", "context", "reverse") will first order by size, and for those with the same size, it will order by whether any context is specifed (with context coming first). For lines that are still tied (i.e. the have the same size and all either have or have no context) the order will be reversed in comparison to the order as attested in the profile. Reversing order can be useful, because hand-written profiles tend to put general rules before specific rules, which mostly should be applied in reverse order.

size: order the lines in the profile by the size of the grapheme, largest first. Size is measured by number of Unicode characters after normalization as specified in the option normalize. For example, <c3><a9> has a size of 1 with normalize = "NFC", but a size of 2 with normalize = "NFD".
context: order the lines by whether they have any context specified, lines with context coming first. Note that this only works when the option regex = TRUE is also chosen.
reverse: order the lines from bottom to top.
frequency: order the lines by the frequency with which they match in the specified strings before tokenization, least frequent coming first. This frequency of course depends crucially on the available strings, so it will lead to different orderings when applied to different data. Also note that this frequency is (necessarily) measured before graphemes are identified, so these ordering frequencies are not the same as the final frequencies shown in the outpur. Frequency of course also strongly differs on whether context is used for the matching through regex = TRUE.

Without specificatino of file.out, the function tokenize will return a list of four:

`strings`	a dataframe with the original and the tokenized/transliterated strings
`profile`	a dataframe with the graphemes with added frequencies. The dataframe is ordered according to the order that resulted from the specifications in `ordering`.
`errors`	a dataframe with all original strings that contain unmatched parts.
`missing`	a dataframe with the graphemes that are missing from the original orthography profilr, as indicated in the errors. Note that the report of missing characters does currently not lead to correct results for transliterated strings.

When file.out is specified, these four tables will be written to three different tab-separated files (with header lines): file_strings.tsv for the strings, file_profile.tsv for the orthrography profile, file_errors.tsv for the strings that have unidentifyable parts, and file_missing.tsv for the graphemes that seem to be missing. When there is nothing missing, then no file for the missing strings is produced.

When regex = TRUE, regular expressions are acceptable in the columns ‘Grapheme', ’Left' and 'Right'. Backreferences in the transliteration column are not possible (yet). When regular expressions are allowed, all literal uses of special regex-characters have to be escaped! Any literal occurrence of the following characters has then to be preceded by a backslash \ .

- (U+002D, HYPHEN-MINUS)
! (U+0021, EXCLAMATION MARK)
? (U+003F, QUESTION MARK)
. (U+002E, FULL STOP)
( (U+0028, LEFT PARENTHESIS)
) (U+0029, RIGHT PARENTHESIS)
\[ (U+005B, LEFT SQUARE BRACKET)
\] (U+005D, RIGHT SQUARE BRACKET)
{ (U+007B, LEFT CURLY BRACKET)
} (U+007D, RIGHT CURLY BRACKET)
| (007C, VERTICAL LINE)
* (U+002A, ASTERISK)
\ (U+005C, REVERSE SOLIDUS)
<cb><86> (U+005E, CIRCUMFLEX ACCENT)
+ (U+002B, PLUS SIGN)
$ (U+0024, DOLLAR SIGN)

Note that overlapping matching does not (yet) work with regular expressions. That means that for example "aa" is only found once in "aaa". In some special cases this might lead to problems that might have to be explicitly specified in the profile, e.g. a grapheme "aa" with a left context "a". See examples below. This problem arises because overlap is only available in literal searches stri_opts_fixed, but the current function uses regex-searching, which does not catch overlap stri_opts_regex.

There is a bash-executable distributed with this package (based on the docopt package) that let you use this function directly in a bash-terminal. The easiest way to use this executable is to softlink the executable to some directory in your bash PATH, for example /usr/local/bin. To softlink the function tokenize to this directory, use something like the following in your bash terminal:

ln -is `Rscript -e 'cat(file.path(find.package("qlcData"), "exec", "tokenize"))'` /usr/local/bin

Michael Cysouw <cysouw@mac.com>

Moran & Cysouw (forthcoming)

See also write.profile for preparing a skeleton orthography profile.

# simple example with interesting warning and error reporting
# the string might look like "AABB" but it isn't...
(string <- "\u0041\u0410\u0042\u0412")
tokenize(string,c("A","B"))

# make an ad-hoc orthography profile
profile <- cbind(
    Grapheme = c("a","<c3><a4>","n","ng","ch","sch"), 
    Trans = c("a","e","n","N","x","sh"))
# tokenization
tokenize(c("nana", "<c3><a4>nngsch<c3><a4>", "ach"), profile)
# with replacements and a warning
tokenize(c("Nan<c3><a1>", "<c3><a4>nngsch<c3><a4>", "ach"), profile, transliterate = "Trans")

# different results of ordering
tokenize("aaa", c("a","aa"), order = NULL)
tokenize("aaa", c("a","aa"), order = "size")

# regexmatching does not catch overlap, which can lead to wrong results
# the second example results in a warning instead of just parsing "ab bb"
# this should occur only rarely in natural language
tokenize("abbb", profile = c("ab","bb"), order = NULL)
tokenize("abbb", profile = c("ab","bb"), order = NULL, regex = TRUE)

# different parsing methods can lead to different results
# note that in natural language this is VERY unlikely to happen
tokenize("abc", c("bc","ab","a","c"), order = NULL, method = "global")$strings
tokenize("abc", c("bc","ab","a","c"), order = NULL, method = "linear")$strings

[1] "A<U+0410>B<U+0412>"
$strings
           originals             tokenized
1 A<U+0410>B<U+0412> A <U+2047> B <U+2047>

$profile
  Grapheme Frequency
1        B         1
2        A         1

$errors
           originals                errors
1 A<U+0410>B<U+0412> A <U+2047> B <U+2047>

$missing
  Grapheme Frequency Codepoint                UnicodeName
1 <U+0410>         1    U+0410  CYRILLIC CAPITAL LETTER A
2 <U+0412>         1    U+0412 CYRILLIC CAPITAL LETTER VE

Warning message:
In tokenize(string, c("A", "B")) : 
There were unknown characters found in the input data.
Check output$errors for a table with all problematic strings.
$strings
               originals                  tokenized
1                   nana                    n a n a
2 <c3><a4>nngsch<c3><a4> <c3><a4> n ng sch <c3><a4>
3                    ach                       a ch

$profile
  Grapheme Trans Frequency
2 <c3><a4>     e         2
6      sch    sh         1
5       ch     x         1
4       ng     N         1
3        n     n         3
1        a     a         3

$errors
NULL

$missing
NULL

$strings
               originals
1            Nan<c3><a1>
2 <c3><a4>nngsch<c3><a4>
3                    ach
                                                                      tokenized
1 <U+2047> a n <U+2047> <U+2047> <U+2047> <U+2047> <U+2047> a <U+2047> <U+2047>
2                                                    <c3><a4> n ng sch <c3><a4>
3                                                                          a ch
                                                                 transliterated
1 <U+2047> a n <U+2047> <U+2047> <U+2047> <U+2047> <U+2047> a <U+2047> <U+2047>
2                                                                    e n N sh e
3                                                                           a x

$profile
  Grapheme Trans Frequency
2 <c3><a4>     e         2
6      sch    sh         1
5       ch     x         1
4       ng     N         1
3        n     n         2
1        a     a         3

$errors
    originals
1 Nan<c3><a1>
                                                                         errors
1 <U+2047> a n <U+2047> <U+2047> <U+2047> <U+2047> <U+2047> a <U+2047> <U+2047>

$missing
  Grapheme Frequency Codepoint            UnicodeName
1        1         1    U+0031              DIGIT ONE
2        3         1    U+0033            DIGIT THREE
3        <         2    U+003C         LESS-THAN SIGN
4        >         2    U+003E      GREATER-THAN SIGN
5        N         1    U+004E LATIN CAPITAL LETTER N
6        c         1    U+0063   LATIN SMALL LETTER C

Warning message:
In tokenize(c("Nan<c3><a1>", "<c3><a4>nngsch<c3><a4>", "ach"), profile,  : 
There were unknown characters found in the input data.
Check output$errors for a table with all problematic strings.
$strings
  originals tokenized
1       aaa     a a a

$profile
  Grapheme Frequency
1        a         3
2       aa         0

$errors
NULL

$missing
NULL

$strings
  originals tokenized
1       aaa      aa a

$profile
  Grapheme Frequency
1       aa         1
2        a         1

$errors
NULL

$missing
NULL

$strings
  originals tokenized
1      abbb     ab bb

$profile
  Grapheme Frequency
1       ab         1
2       bb         1

$errors
NULL

$missing
NULL

$strings
  originals            tokenized
1      abbb ab <U+2047> <U+2047>

$profile
  Grapheme Frequency
1       ab         1
2       bb         0

$errors
  originals               errors
1      abbb ab <U+2047> <U+2047>

$missing
  Grapheme Frequency Codepoint          UnicodeName
1        b         2    U+0062 LATIN SMALL LETTER B

Warning message:
In tokenize("abbb", profile = c("ab", "bb"), order = NULL, regex = TRUE) : 
There were unknown characters found in the input data.
Check output$errors for a table with all problematic strings.
  originals tokenized
1       abc      a bc
  originals tokenized
1       abc      ab c