profile_old: Writing and reading of orthography profiles


Description

To process strings, it is often very useful to tokenize them into graphemes (i.e. functional units of the orthography), and possibly to replace those graphemes by other symbols to harmonize different orthographic representations (‘transcription’). As a quick and easy way to specify, save, and document the decisions taken for the tokenization, we propose using an orthography profile. Functions to write and read orthography profiles are provided here.
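A minimal sketch of the intended round trip (the file name is illustrative):

profile <- write.orthography.profile_old("schon schön", file = "sketch.prf")
# edit sketch.prf by hand if needed, then read it back:
op <- read.orthography.profile_old("sketch.prf")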

Usage

write.orthography.profile_old(strings, 
  replacements = TRUE, sep = NULL, file = NULL, info = TRUE)

read.orthography.profile_old(file, 
  graphemes = "graphemes",  replacements = "replacements",
  left.context = NULL, right.context = NULL)

Arguments

strings

vector of strings to be tokenized.

graphemes

name (or number) of the column in the orthography profile listing the graphemes to be used for the tokenization.

replacements

for writing profiles: logical; should a column with replacements be added? For reading profiles: string with the name (or number) of the column in the orthography profile listing the replacements.

sep

separator used to split the strings. When NULL (the default), Unicode character definitions are used for the splitting. When sep is specified, the strings are split by this separator.

file

filename for the profile to be written, or for the profile to be read. When writing, filenames not ending in .prf get this suffix attached. When reading, specifying the name without the .prf suffix also works.

info

logical: should extra Unicode information (codepoints and Unicode names) be added when generating an orthography profile? Defaults to TRUE.

left.context

name of the column in the orthography profile with the specification of the left context of a transliteration.

right.context

name of the column in the orthography profile with the specification of the right context of a transliteration (see the illustration after this list).
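As an illustration of the reading arguments, a call might look as follows (the file name and the column names are hypothetical and have to match the profile on disk):

op <- read.orthography.profile_old("german.prf",
  graphemes = "graphemes", replacements = "ipa",
  left.context = "left", right.context = "right")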

Details

To produce an orthography profile, consider using write.orthography.profile_old to produce a useful starting point. This function takes a vector of strings and produces a table with all graphemes and their frequencies. Combining diacritics and spacing modifier letters are combined with their preceding characters to obtain a reasonable first guess at the available graphemes. No attempt is made to recognize 'tailored' multigraphs (like 'sch' or 'aa'). Such multigraphs can be specified manually in the output file of this function.
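For instance, a combining diaeresis is attached to its base character, so a decomposed umlaut should surface as a single grapheme (a minimal sketch; the exact shape of the output table may differ):

# decomposed 'ä', i.e. 'a' followed by U+0308, is expected to be listed as one grapheme
write.orthography.profile_old("a\u0308")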

The function read.orthography.profile_old can be used to read any saved orthography profile into R, though mostly this function will be used internally by tokenize. An orthography profile currently consists minimally of a tab-separated table with a column of graphemes to be separated, typically saved with a .prf suffix. Further columns with replacements can be specified. When further rule-based changes are needed (for complex orthographic regularities), these can be specified as regex 'pattern' and 'replacement' columns in a separate two-column tab-separated file with the same name, but using a .rules suffix.
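A hypothetical sketch.rules file, placed next to sketch.prf, might contain tab-separated pattern/replacement pairs like the following (the contents are invented for illustration):

aa	a:
n(?=g)	N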

Value

write.orthography.profile_old produces a dataframe with all grapheme(-clusters) and their frequencies. By default, a column 'replacements' is added, which is identical to the graphemes. This column is a useful starting point to specify orthographic changes to be used by tokenize. Also by default, the Unicode codepoints and names are added to make it easier to find encoding inconsistencies in the provided data. When file is specified, the profile is written to this file.
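A quick way to inspect the result (relying on the default column names, as also shown in the Examples below):

p <- write.orthography.profile_old("nana änngschä ach")
head(p)
p$replacements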

read.orthography.profile_old reads a profile from disk, possibly also including any associated rules file.

Author(s)

Michael Cysouw

Examples

# produce statistics
example <- "nana änngschä ach"
write.orthography.profile_old(example)

# make a better orthography profile
gr <- cbind(c("a","ä","n","ng","ch","sch"),c("a","e","n","N","x","sh"))
colnames(gr) <- c("graphemes","replacements")
( op <- list(graphs = gr, rules = NULL) )
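
# a hand-made profile like this can be saved as a tab-separated .prf file
# for later use (a sketch; the file name is illustrative):
write.table(gr, file = "example.prf", sep = "\t",
  quote = FALSE, row.names = FALSE)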
