transcription: Constructor function for the 'transcription' class.
In soundcorrs: Semi-Automatic Analysis of Sound Correspondences


Takes a data frame containing a transcription and turns it into a transcription object, as required by the soundcorrs constructor function. In the normal workflow, the user should have no need to call this function other than through read.transcription.

transcription(
  data,
  col.grapheme = "GRAPHEME",
  col.meta = "META",
  col.value = "VALUE"
)

`data`	[data.frame] Data frame containing the transcription and its meaning.
`col.grapheme`	[character] Name of the column with graphemes. Defaults to `"GRAPHEME"`.
`col.meta`	[character] Name of the column with the coverage of metacharacters. If empty string or `NA`, the column will be generated automatically. Defaults to `"META"`.
`col.value`	[character] Name of the column with values of graphemes. Defaults to `"VALUE"`.

The main reason a transcription needs to be defined is regular expressions. R has a powerful system of regular expressions, but it is general-purpose and not designed specifically for use in linguistics. Linguistics has its own convention for such patterns, or rather two conventions, and to emulate them soundcorrs needs to know the linguistic value of individual graphemes. One is the traditional, 'European' convention, in which a single character typically represents an entire class of sounds, e.g. "C" stands for 'any consonant', "A" for 'any back vowel', etc. The other is the 'binary', 'American' notation, in which, instead of using single characters, one lists all the distinctive features, e.g. "[+cons]" or "[+vowel,+back]". With the values of graphemes encoded in a transcription object, expandMeta can translate both notations into regular expressions that R understands.
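As a rough illustration of the 'European' convention only, the translation step can be imitated in base R. This is a sketch, not the code of expandMeta: the miniature table `meta` and the helper `expand_sketch` are hypothetical names invented for this example, and the binary "[+cons]" notation is not handled.

```r
# Hypothetical miniature transcription: grapheme -> the string it expands to.
# Regular graphemes map to themselves; metacharacters map to alternations.
meta <- c(p = "p", b = "b", a = "a", o = "o", i = "i",
          C = "(p|b)", A = "(a|o)")

# Replace every known grapheme in a query with its expansion,
# one character at a time (assumes single-character graphemes).
expand_sketch <- function(query, meta) {
  chars <- strsplit(query, "", fixed = TRUE)[[1]]
  known <- chars %in% names(meta)
  chars[known] <- meta[chars[known]]
  paste(chars, collapse = "")
}

rx <- expand_sketch("CA", meta)
rx                               # "(p|b)(a|o)"
grepl(rx, c("pa", "bo", "pi"))   # TRUE TRUE FALSE
```

The query "CA" ('any consonant followed by any back vowel') becomes an ordinary regular expression that grepl can use directly.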

This constructor function is not really intended for the end user. Whenever possible, read.transcription should be used instead. Regardless of the function used, a data frame with two columns is required to create a transcription object: one column for the graphemes, and one for their values. It is probably not strictly necessary, but nevertheless recommended, that graphemes be single characters. (This also excludes combining diacritical marks.) Values must be separated by commas, without spaces. Typically, they will be phonetic features, but in principle they can be anything.

A transcription may also have a third column, which holds the string that the given grapheme is going to be turned into by expandMeta. Regular graphemes should simply be repeated in this column, whereas metacharacters (such as "C" or "A" mentioned above) should be expanded into all the graphemes they represent, separated by a bar ("|") and enclosed in parentheses, e.g. "(a|o|u)". If the third column is missing, this function will generate it automatically. Note, however, that the generation is based on the value column, and any grapheme whose value is a subset of the value of another grapheme will be considered a metacharacter. For example, if "p" is defined as "cons,stop,blab", and "b" as "cons,stop,blab,voiced", "p" will be considered a metacharacter for both "p" and "b", and translated into "(p|b)" by expandMeta.
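The subset rule behind the automatic generation can be sketched in base R as follows. This is an illustration of the rule as described above, not the package's actual implementation; `generate_meta` and the three-row table are hypothetical, with "p" and "b" taken from the example in the text.

```r
# Hypothetical two-column transcription (graphemes and their values).
graphemes <- c("p", "b", "a")
values    <- c("cons,stop,blab", "cons,stop,blab,voiced", "vowel,back")

# Explode the comma-separated values into one feature vector per grapheme.
feats <- strsplit(values, ",", fixed = TRUE)
names(feats) <- graphemes

# For each grapheme, collect every grapheme whose features include all of its
# own features; more than one hit makes it a metacharacter.
generate_meta <- function(feats) {
  vapply(names(feats), function(g) {
    covered <- names(feats)[vapply(feats,
                                   function(f) all(feats[[g]] %in% f),
                                   logical(1))]
    if (length(covered) > 1) {
      paste0("(", paste(covered, collapse = "|"), ")")
    } else {
      covered
    }
  }, character(1))
}

meta <- generate_meta(feats)
meta   # p = "(p|b)", b = "b", a = "a"
```

Because the features of "p" are a subset of those of "b", "p" ends up covering both, which is the side effect the paragraph above warns about. Supplying the third column explicitly avoids it.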

Graphemes cannot contain characters reserved for regular expressions: . + * ^ \ $ ? | ( ) [ ] { }, nor can they contain characters defined as metacharacters in the transcription. For example, if "A" is defined as "vowel,back", and therefore represents all the back vowels in the transcription, a regular grapheme "A:" is forbidden. A metacharacter "A:", on the other hand, is permitted (e.g. for 'any long back vowel'), though it is recommended that such overlapping metacharacters be avoided as much as possible.
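Before building a transcription, both restrictions can be checked by hand. The sketch below uses base R only; `has_reserved` and `contains_meta` are hypothetical helpers written for this example and are not functions exported by soundcorrs.

```r
# Characters reserved for regular expressions, as listed above.
reserved <- c(".", "+", "*", "^", "\\", "$", "?", "|",
              "(", ")", "[", "]", "{", "}")

# TRUE if the grapheme contains any reserved character.
has_reserved <- function(grapheme) {
  chars <- strsplit(grapheme, "", fixed = TRUE)[[1]]
  any(chars %in% reserved)
}

# TRUE if the grapheme contains any of the given metacharacters as a substring.
contains_meta <- function(grapheme, metas) {
  any(vapply(metas, function(m) grepl(m, grapheme, fixed = TRUE), logical(1)))
}

vapply(c("a", "t.", "k("), has_reserved, logical(1))   # FALSE TRUE TRUE
contains_meta("A:", metas = c("C", "A"))               # TRUE
contains_meta("a:", metas = c("C", "A"))               # FALSE
```

A regular grapheme for which either check returns TRUE would need to be renamed. For metacharacters, only the first check applies.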

Lastly, a transcription must contain a so-called linguistic zero. This is a character which signifies an empty segment in a word, a segment added only in order to align the segments of all the words in a pair/triple/.... For example, English passport has two phonemes fewer than Spanish pasaporte id., so in order for the two words to be aligned, the English one needs two filler segments: p|a|s|-|p|o|r|t|- : p|a|s|a|p|o|r|t|e. To designate a character as linguistic zero in a transcription, its value must be "NULL".
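The alignment in this example can be verified with a few lines of base R. The sketch assumes that "-" is the character defined with the value "NULL" in the transcription, as the example suggests. The variable names are illustrative only.

```r
# Segmented forms from the example, with "|" as the segment separator.
en <- strsplit("p|a|s|-|p|o|r|t|-", "|", fixed = TRUE)[[1]]
es <- strsplit("p|a|s|a|p|o|r|t|e", "|", fixed = TRUE)[[1]]

# Assumed linguistic zero for this sketch.
zero <- "-"

length(en) == length(es)   # TRUE: both words have nine segments
sum(en != zero)            # 7 real phonemes in the English word
sum(es != zero)            # 9 real phonemes in the Spanish word
```

The two fillers bring the English word up to the same number of segments as the Spanish one, so the segments can be compared position by position.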

Two sample transcriptions are available: trans-common, trans-ipa; they can be loaded with the help of loadSampleDataset.

[transcription] An object containing the provided data.

data: [data.frame] The original data frame.
cols: [character list] Names of the important columns in the data frame.
meta: [character] A vector of character strings which act as metacharacters in regular expressions. Mostly useful to speed up expandMeta.
values: [character] A named list with values of individual graphemes exploded into vectors.
zero: [character] A regular expression to catch linguistic zeros.

expandMeta, print.transcription, read.transcription

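# path to a sample transcription shipped with the package
fName <- system.file("extdata", "trans-common.tsv", package = "soundcorrs")
# read the file into a data frame and build the transcription object from it
fut <- transcription(read.table(fName, header = TRUE))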