knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(knitr)
library(interlineaR)
library(kableExtra)

Introduction

This package contains utility functions for importing into R interlinearized corpora and dictionaries in the various formats produced by descriptive linguistics software, such as SIL Toolbox or SIL FieldWorks.

All functions that read interlinearized texts return a list of data frames, where each data frame corresponds to a unit (text, sentence, word, morpheme) and each row describes an occurrence of that unit. The set of tables is relational: in each data frame, some columns hold IDs pointing to rows in the other data frames, so you can join morphemes to the words, sentences or texts they belong to.

This pivot format allows for various kinds of reshaping in R (for instance, grouping morphemes by words) as well as conversion between formats.
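As an illustration of the kind of reshaping mentioned above, here is a minimal sketch that groups morphemes by word. It uses small hand-made data frames rather than the output of the package, and the column names (word_id, morpheme_id, txt, gls) are assumptions chosen for the example; the actual names depend on the corpus and on the reading function.

```r
# Toy tables mimicking the relational layout (column names are hypothetical).
words <- data.frame(
  word_id = c(1, 2),
  txt = c("cats", "run"),
  stringsAsFactors = FALSE
)
morphemes <- data.frame(
  morpheme_id = 1:3,
  word_id = c(1, 1, 2),
  txt = c("cat", "-s", "run"),
  gls = c("cat", "PL", "run"),
  stringsAsFactors = FALSE
)

# Group morphemes by word: paste the glosses of each word into one string.
glosses_by_word <- aggregate(gls ~ word_id, data = morphemes,
                             FUN = paste, collapse = "-")

# Attach the grouped glosses to the words table through the shared ID.
words_glossed <- merge(words, glosses_by_word, by = "word_id")
words_glossed
```

Each row of words_glossed then describes one word together with the concatenated glosses of its morphemes.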

Reading EMELD XML interlinear corpus

Turning the EMELD XML document into a set of data frames

EMELD is an XML vocabulary introduced in Baden Hughes, Steven Bird and Catherine Bow, Encoding and Presenting Interlinear Text Using XML Technologies [http://www.aclweb.org/anthology/U03-1008]. It is used by SIL FieldWorks as an export format.

corpuspath <- system.file("exampleData", "tuwariInterlinear.xml", package="interlineaR")
corpus <- read.emeld(corpuspath, vernacular.languages="tww")

The returned object is a named list. Each slot of the list contains a data.frame. The slots are named 'morphemes', 'words', 'sentences' and 'texts' (unless some of them have been discarded through the function arguments). Each row in a data frame describes an occurrence of a linguistic unit (text, sentence, word or morpheme).

Let's look at the first rows of the "morphemes" data.frame:

head(corpus$morphemes)
kable(head(corpus$morphemes))

The first columns contain IDs indicating which text, sentence and word each morpheme belongs to. The other columns contain information extracted from the document. Each column name is made of the field name and the language of the field, separated by a dot, and a field may be repeated in several languages. The parameters of read.emeld allow you to indicate, for each unit, which fields you are interested in and in which language(s).
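The naming convention can be exploited programmatically. The following is a minimal sketch, assuming column names such as "txt.tww" or "gls.en"; these names are hypothetical, and the actual ones depend on the corpus and on the arguments given to read.emeld. ID columns, which carry no language suffix, are left out of the example.

```r
# Hypothetical field columns following the "field.language" convention.
cols <- c("txt.tww", "gls.en", "gls.fr")

# Split each name on the dot into its field and language parts.
parts <- strsplit(cols, ".", fixed = TRUE)
fields <- vapply(parts, `[`, character(1), 1)
languages <- vapply(parts, `[`, character(1), 2)
data.frame(column = cols, field = fields, language = languages)

# Select the names of all columns in a given language, here English.
english_cols <- grep("\\.en$", cols, value = TRUE)
english_cols
```

The same pattern could be applied to names(corpus$morphemes) to pick out, for example, all glosses in one language.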

The "words", "sentences" and "texts" tables are built according to the same principles:

head(corpus$words)
kable(head(corpus$words))
head(corpus$sentences)
kable(head(corpus$sentences), booktabs = TRUE)
head(corpus$texts)
kable(head(corpus$texts))

Constructing data sets combining information from several data frames

This set of data frames is similar to the tables of a database: rows from the various tables point to each other through IDs.

Using these IDs, new data frames aggregating pieces of information from several data frames can be built. For instance, a table containing the columns of both the morphemes and the words can be built using:

morphemes_words <- merge(corpus$morphemes, corpus$words[,-c(1,2)], by="word_id", suffixes = c(".morpheme",".word"))
head(morphemes_words)
kable(head(morphemes_words))
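The same approach extends to more than two tables. The sketch below chains two merges to bring the free translation of each sentence down to the morpheme level. It uses small hand-made data frames rather than the package's example data, and the column names (sentence_id, word_id, morpheme_id, txt, ft) are assumptions for illustration. Columns are selected by name rather than by position, which avoids duplicating ID columns without depending on the column order.

```r
# Toy relational tables (all column names are hypothetical).
sentences <- data.frame(
  sentence_id = 1:2,
  ft = c("Cats run.", "Dogs sleep."),
  stringsAsFactors = FALSE
)
words <- data.frame(
  sentence_id = c(1, 1, 2, 2),
  word_id = 1:4,
  txt = c("cats", "run", "dogs", "sleep"),
  stringsAsFactors = FALSE
)
morphemes <- data.frame(
  word_id = c(1, 1, 2, 3, 3, 4),
  morpheme_id = 1:6,
  txt = c("cat", "-s", "run", "dog", "-s", "sleep"),
  stringsAsFactors = FALSE
)

# Step 1: morphemes + words; "txt" exists in both tables, hence the suffixes.
mw <- merge(morphemes, words, by = "word_id",
            suffixes = c(".morpheme", ".word"))

# Step 2: add the sentence-level columns through sentence_id.
full <- merge(mw, sentences, by = "sentence_id")
full[, c("txt.morpheme", "txt.word", "ft")]
```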

Reading Toolbox interlinear corpus

Toolbox [https://software.sil.org/toolbox/] is widely used for producing interlinearized corpora. It uses a specific text-based format.

corpuspath <- system.file("exampleData", "tuwariToolbox.txt", package="interlineaR")
corpus <- read.toolbox(corpuspath)

As with read.emeld, the result is a list containing the slots 'morphemes', 'words', 'sentences' and 'texts'. These slots are data frames in which each row describes an occurrence:

head(corpus$morphemes)
kable(head(corpus$morphemes))

As with read.emeld, optional fields can be declared. For instance, the Kakabe corpus (by Alexandra Vydrina) also contains morpheme glosses in Russian and French:

path <- system.file("exampleData", "kakabe.txt", package="interlineaR")
corpus <- read.toolbox(path, morpheme.fields.suppl = c("gr", "gf"))
head(corpus$morphemes)
kable(head(corpus$morphemes))

The 'sentences' data frame contains a numeric ID, the reference created for each sentence ("ref" field in Toolbox), plus (as with read.emeld) the original text, the free translation and the note ("tx", "ft" and "nt" fields in Toolbox).

head(corpus$sentences)
kable(head(corpus$sentences))

The 'texts' data frame contains a numeric ID and the title (Toolbox ID) of each text.

Reading LIFT XML dictionary

This XML vocabulary was introduced by SIL [http://code.google.com/p/lift-standard] and is used by SIL FieldWorks as an export format.

read.lift() produces a list of three data frames: "entries", "senses" and "examples" ("relations" should be added). These tables are linked through IDs, as in a relational database. All the fields of the dictionary, in all languages, can be extracted. The arguments of read.lift() allow you to list manually the fields you are interested in for each data frame; with simplify=TRUE, you can also restrict the fields (columns) to those that have non-empty values.

dicpath <- system.file("exampleData", "tuwariDictionary.lift", package="interlineaR")
dictionary <- read.lift(dicpath, vernacular.languages="tww", simplify=TRUE)

The tables of lexical units (entries), senses and examples:

head(dictionary$entries)
kable(head(dictionary$entries))
head(dictionary$senses)
kable(head(dictionary$senses))
head(dictionary$examples)
kable(head(dictionary$examples))

Some fields in LIFT may be repeated. For instance, several "semantic domain" traits can be given in the sense element:

<trait name ="semantic-domain-ddp4" value="1.3 Water"/>
<trait name ="semantic-domain-ddp4" value="6.7 Tool"/>

In that case, the values are concatenated, and the column "semantic-domain-ddp4" contains the value "1.3 Water,6.7 Tool".
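If the individual values are needed again, the concatenated string can be split on the comma. This is a minimal sketch using a hand-made data frame in place of the senses table; it assumes that the values themselves never contain a comma, and the sense_id column is hypothetical.

```r
# Toy stand-in for the senses table (sense_id is a hypothetical column).
# check.names = FALSE keeps the hyphens in the column name.
senses <- data.frame(
  sense_id = 1:2,
  "semantic-domain-ddp4" = c("1.3 Water,6.7 Tool", "1.3 Water"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One character vector of domains per sense.
domain_list <- strsplit(senses[["semantic-domain-ddp4"]], ",", fixed = TRUE)
domains <- domain_list[[1]]
domains

# Number of semantic domains attached to each sense.
lengths(domain_list)
```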

Some other fields are both repeated and expressed as key-value pairs, reflecting categories created for a particular language. In the following chunk, "Noun-infl-class" and "Noun-infl-class2" are two categories created for the nouns of a given language:

<grammatical-info value="Noun">
  <trait name="Noun-infl-class" value="fo"/>
  <trait name="Noun-infl-class2" value="hei"/>
</grammatical-info>

In that case, the column "trait" in the resulting data frame will contain: "Noun-infl-class:fo,Noun-infl-class2:hei".
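Such a string can be turned back into a named vector of key-value pairs. The following is a minimal sketch in base R, assuming that neither keys nor values contain a comma or a colon:

```r
trait <- "Noun-infl-class:fo,Noun-infl-class2:hei"

# Split into "key:value" items, then split each item into key and value.
pairs <- strsplit(strsplit(trait, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)
keys <- vapply(pairs, `[`, character(1), 1)
values <- vapply(pairs, `[`, character(1), 2)

# Named character vector: names are the categories, elements their values.
traits <- setNames(values, keys)
traits[["Noun-infl-class2"]]
```

Individual categories can then be looked up by name, as in the last line.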



sylvainloiseau/interlineaR documentation built on May 20, 2019, 8:33 a.m.