In agricolamz/lingglosses: Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation

library(lingglosses)
# in order to have the same list of glosses through the whole document
options("lingglosses.refresh_glosses_list" = FALSE)

Introduction

The list of abbreviations is an obligatory part of linguistic articles that nobody reads. These lists contain definitions of abbreviations used in the article (e.g. the names of corpora or sign languages), but also a list of linguistic glosses --- abbreviations used in interlinear glossed examples. There is a document proposing standardized glossing rules [@comrie08], which ends with a list of 84 standard abbreviations. A much bigger list of standard abbreviations is present on Wikipedia. However, researchers can deviate from the proposed abbreviations and use their own instead.

The following list of abbreviations, which I came across in a published article, makes it clear that there is room for improvement in compiling such lists:

NOM = nominative, GEN = nominative, DAT = nominative, ACC = accusative, VOC = accusative, LOC = accusative, INS = accusative, PL = plural, SG = singular

Besides the obvious errors, this list contains more problems that I would like to point out:

the lack of alphabetic order;
some abbreviations used in the article (r add_gloss("SBJV"), r add_gloss("IMP")) are absent in the list.

The main goal of the lingglosses R package is to provide an option for creating:

interlinear glossed linguistic glosses for an .html output of rmarkdown [@xie18][^latex];
a semi-automatically compiled list of glosses.

[^latex]: If you want to render a .pdf version you can either use latex and multiple linguistic packages developed for it (see e. g. gb4e, langsci, expex, philex), or you can render .html first and convert it to .pdf afterwards.

You can install the stable version of the package from CRAN:

install.packages("lingglosses")

You can also install the development version of lingglosses from GitHub with:

# install.packages("remotes")
remotes::install_github("agricolamz/lingglosses")

In order to use the package you need to load it with the library() call:

library(lingglosses)

You can go through the examples in this tutorial or you can create a lingglosses example from the rmarkdown template (File > New File > R Markdown... > From Template > lingglosses Document).

Create glossed examples with `gloss_example()`

Basic usage

The main function of the lingglosses package is gloss_example(). This package has the following arguments:

transliteration;
glosses;
free_translation;
comment;
grammaticality;
annotation[^orth];
line_length.

[^orth]: I used annotation for representing orthography, but it also possible to use this tier for the annotation of words, like here:

gloss_example(transliteration = "eze a za a",
              glosses = "NP PRFX ROOT SFX",
              free_translation = "Eze swept... (Igbo, from [@goldsmith79: 209])",
              annotation = "HL H L H")

All arguments except the last one are self-explanatory.

gloss_example(transliteration = "bur-e-**ri** c'in-ne-sːu-w",
              glosses = "fly-NPST-**INF** know-HAB-NEG-M",
              free_translation = "I cannot fly. (Zilo Andi, East Caucasian)",
              comment = "(lit. do not know how to)",
              annotation = "Бурери цIиннессу.",
              grammaticality = "*")

In this first example you can see that:

the transliteration line is italic by default (if you do not want it, just add the argument italic_transliteration = FALSE)[^ital-all];
you can use standard markdown syntax (e.g. **a** for bold);
the free translation line is automatically framed with quotation marks.

[^ital-all]: Sometimes it is make sense to set this option ones for the whole document using the following code options("lingglosses.italic_transliteration" = FALSE).

Since the function arguments' names are optional in R, users can omit them as long as they follow the order of the arguments (you can always find the correct order in ?gloss_example):

gloss_example("bur-e-**ri** c'in-ne-sːu-w",
              "fly-NPST-**INF** know-HAB-NEG-M",
              "I cannot fly. (Zilo Andi, East Caucasian)",
              "(lit. do not know how to)",
              "Бурери цIиннессу.",
              "*")

It is possible to number and call your examples using the standard rmarkdown tool for generating lists (@):

(@) my first example
(@) my second example
(@) my third example

renders as:

(@) my first example (@) my second example (@) my third example

In order to reference examples in the text you need to give them names:

(@my_ex) example for referencing

(@my_ex) example for referencing

With the names settled you can reference the example (@my_ex) in the text using the following code (@my_ex).

So this kind of example referencing can be used with lingglosses examples like in (@lingglosses1) and (@lingglosses2). The only important details are:

change your code chunk argument to echo = FALSE (or specify it for all code chunks with the following comand in the begining of the document knitr::opts_chunk$set(echo = FALSE"));
do not put an empty line between the reference line (with (@...)) and the code chunk with lingglosses code.

(@lingglosses1)

gloss_example("bur-e-**ri** c'in-ne-sːu",
              "fly-NPST-**INF** know-HAB-NEG",
              "I cannot fly. (Zilo Andi, East Caucasian)",
              "(lit. do not know how to)")

(@lingglosses2) Zilo Andi, East Caucasian

gloss_example("bur-e-**ri** c'in-ne-sːu",
              "fly-NPST-**INF** know-HAB-NEG",
              "I cannot fly.",
              "(lit. do not know how to)")

Sometimes people gloss morpheme by morpheme (this is especially useful for polysynthetic languages). It is also possible in lingglosses. You can annotate slots with the annotation argument, see footnote 2 for the details.

(@) Abaza, West Caucasian [@arkadiev20: example 5.2]

gloss_example("s- z- á- la- nəq'wa -wa -dzə -j -ɕa -t'",
              "1SG.ABS POT 3SG.N.IO LOC pass IPF LOC 3SG.M.IO seem(AOR) DCL",
              "It seemed to him that I would be able to pass there.")

The glossing extraction algorithm implemented in lingglosses is case sensitive, so if you want to escape it you can use curly brackets:

(@) Kvankhidatli Andi, [@verhees19: 203]

gloss_example("den=no he.ʃː-qi hartʃ'on-k'o w-uʁi w-uk'o.",
              "{I}=ADD DEM.M-INS watch-CVB M-stand.AOR M-be.AOR",
              "And I stood there, watching him.")

In the example above {I} is just the English word I that will be escaped and will not appear in the gloss list as marker of class I.

It make sense to avoid to use single quotes for the quotation, since it can cause some troubles for the package's functions and use escape slash for quotations, like in the following example:

(@) Kunbzang Japhug, [@jacques21: 1143]

gloss_example("\"a-pi ɲɯ-ɕpaʁ-a\" ti ɲɯ-ŋu",
              "1SG.POSS-elder.sibling SENS-be.thirsty-1SG say:FACT SENS-be",
              "She said: \"Sister, I am thirsty.\"")

After a while I was asked to make it possible to add sole line examples:

gloss_example("Learn to value yourself, which means: to fight for your happiness. (Ayn Rand)",
              line_length = 100)

Multiline examples

Sometimes examples are too long and do not fit onto the page. In that case you need to add the argument results='asis' to your chunk. gloss_example() will then automatically split your example into multiple rows.

(@tsa_ex) Mishlesh Tsakhur, East Caucasian [@maisak07: 386]

gloss_example('za-s jaːluʁ **wo-b** **qa-b-ɨ**; turs-ubɨ qal-es-di ǯiqj-eː jaːluʁ-**o-b** **qa-b-ɨ**', 
               '1SG.OBL-DAT shawl.3 AUX-3 PRF-3-bring.PFV woolen_sock-PL NPL.bring-PL-A.OBL place-IN shawl.3-AUX-3 PRF-3-bring.PFV',
               '(they) **brought** me a shawl; instead of (lit. in place of bringing) woolen socks, (they) **brought** a shawl.',
               '(Woolen socks are considered to be more valuable than a shawl.)')

If you are not satisfied with the result of the automatic split you can change the value of the line_length argument (the default value is 70, that means 70 characters of the longest line).

Add audio and video

It is possible to add a soundtrack to the example using an audio_path argument. It can be both: a path to the file or an URL.

(@) Abaza, West Caucasian (my field recording)

gloss_example("á-ɕa",
              "DEF-brother",
              "This brother",
              audio_path = "abaza_brother.wav")

You can hear the recording if you click on the note icon above. If you do not like the icon, you can change it to any text using an audio_label argument.

Adding video is also possible:

(@) Ukrainian Sign Language (video from https://www.spreadthesign.com)

gloss_example("PIECE",
              "piece",
              video_path = "USL_piece.mp4")

There are additional arguments video_width and video_hight for width and hight.

In-text examples

When an example is small, the author may not want to put it in a separate paragraph, but prefer to display it as part of the running text. This is possible to achieve using the standard for rmarkdown inline code. The result of the R code can be inserted into the rmarkdown document using the backtick symbol and the small r, for example `r 2+2` will be rendered as r 2+2. Currently lingglosses can not automatically detect whether code was provided via code chunk or inline. So if you want to use an in-text glossed example and want the glosses to appear in list, it is possible to write them using the gloss_example() with the intext = TRUE argument. Here is a Turkish example from (@delancey97): r gloss_example("Kemal gel-miş", "Kemal come-MIR", intext = TRUE) that was produced with the following inline code:

`r gloss_example("Kemal gel-miş", "Kemal come-MIR", intext = TRUE)`

In the third section I show how you can create a semi-automatically compiled list of abbreviations for your document. As an example I provide the list for this exact document. Even though the r add_gloss("MIR") gloss appears only in this exact section in the in-text example above, it appears in the lists presented in the third section.

Stand-alone glosses with `add_gloss()`

Sometimes glosses are used in other environments besides examples, e.g. in a table or in the text. So if you want to use in-text glosses and want them to appear in the glosses list, it is possible to add them using the add_gloss() function. As an example I adapted part of the verbal inflection paradigm of Andi (East Caucasian) from Table 2 [@verhees19: 199]:

| | r add_gloss("AFF") | r add_gloss("NEG") | |----------------------|----------------------|----------------------| | r add_gloss("AOR") | -∅ | -sːu | | r add_gloss("MSD") | -r | -sːu-r | | r add_gloss("HAB") | -do | -do-sːu | | r add_gloss("FUT") | -dja | -do-sːja | | r add_gloss("INF") | -du | -du-sːu |

that is generated using the folowing markdown[^poortable] code[^tablesgenerator]:

[^poortable]: The table generated with markdown is visually poor. There are a lot of other ways to generate a table in R: kable() from knitr; kableExtra package, DT package and many others.

[^tablesgenerator]: It is easier to generate Markdown or Latex tables with Libre Office or MS Excel and then use an online table generator website like https://www.tablesgenerator.com/.

|                      | `r add_gloss("AFF")` | `r add_gloss("NEG")` |
|----------------------|----------------------|----------------------|
| `r add_gloss("AOR")` | -∅                   | *-sːu*               |
| `r add_gloss("MSD")` | *-r*                 | *-sːu-r*             |
| `r add_gloss("HAB")` | *-do*                | *-do-sːu*            |
| `r add_gloss("FUT")` | *-dja*               | *-do-sːja*           |
| `r add_gloss("INF")` | *-du*                | *-du-sːu*            |

In the third section I show you how to create a semi-automatically compiled abbreviation list for your document. As an example I provide the list of abbreviations for this exact document. Even though the r add_gloss("FUT") and r add_gloss("MSD") glosses appears only in this exact section in the table above, it appears in the lists presented in the third section.

Glossing Sign languages

Unfortunately, gloss extraction implemented in lingglosses is case sensitive. That makes it hard to use for the glossing of Sign Languages, because:

1) Sign linguists gloss lexical items with capitalized English translations; 2) Sign language glosses are sometimes split into two lines, each of which is associated with one hand (or even more if you want to account for non-manual markers); 3) Sign language glosses should be somehow aligned with video/pictures (see the fascinating signglossR by Calle Börstell); 4) There can be empty space in glosses; 5) There can be some placeholders that corresponds to an utterance by one articulator (e.g. a hand), which are held stationary in the signing space during the articulation made by another articulator.

I will illustrate these problems with an example from Russian Sign Language [@kimmelman12: 421]:

gloss_example(glosses = c("LH: {CHAIR} ________",
                          "RH: {} CL:{SIT}.{ON}"),
              free_translation = "The cat sits on the chair", 
              comment = "[RSL; Eks3–12]",
              drop_transliteration = TRUE)

The capitalization that is not used for morphemic glossing is embraced with curly brackets, so that lingglosses does not treat these items as glosses. Two separate gloss lines for different hands are provided with a vector with two elements (see c() function for the vector creation). It is important to provide the drop_transliteration = TRUE argument, otherwise internal tests within the gloss_example() function will fail.

It is also possible to use pictures in a transliteration line, see an example from Kazakh-Russian Sign Language [@kuznetsova21: 51] (pictures are used with the permission of the author Anna Kuznetsova):

gloss_example("![](when.png) ![](mom.png) ![](tired.png)",
              c("br_raise_______ {} {}",
                "chin_up_______ {} {}",
                "{WHEN} {MOM} {TIRED}"),
              "When was mom tired?")

The first line corresponds to pictures in markdown format that should be located in the same folder (otherwise you need to specify the path to them, e.g. ![](images/your_plot.png)). The next three lines correspond to different lines in the example with some non-manual articulation: as before, all glossing lines are stored as a vector of strings. The user can replace {} with _______ in order to show the scope of non-manual articulation.

Create semi-automatic compiled abbreviation list

After you finished your text, it is possible to call the make_gloss_list() function in order to automatically create a list of abbreviations.

make_gloss_list()

This function works with the built-in dataset glosses_df that is compiled from Leipzig Glosses, Wikipedia page and articles from the open access journal Glossa[^glossa]. Everybody can download and change this dataset for their own purposes. I would be grateful if you leave your proposals for changes to the dataset for this list in the issue tracker on GitHub.

[^glossa]: The script for collecting glosses is available here. The list was manually corrected and merged with glosses from other sources. This kind of glosses are marked in the glosses_df dataset as lingglosses in the source column.

It is possible that the user is not satisfied with the result of the make_gloss_list() function. In this case there are two possible strategies. The first strategy is to copy the result of the make_gloss_list(), modify it and paste it into your rmarkdown document. Sometimes you work on some volume dedicated to a particular group of languages and you want to assure that glosses are the same across all articles. Then you can compile your own table with the columns gloss and definition_en and use it within the make_gloss_list function. As you can see, all glosses specified in the my_abbreviations dataset changed their values in the output below:

my_abbreviations <- data.frame(gloss = c("NPST", "HAB", "INF", "NEG"),
                               definition_en = c("non-past tense", "habitual aspect", "infinitive", "negation marker"))
make_gloss_list(my_abbreviations)

Unfortunately, some glosses can have multiple meanings in different traditions (e.g. r add_gloss("ASS") can be either an associative plural or assertive mood). By default make_gloss_list() shows only some entries that were chosen by the author of this package. You can see all the possibilities if you add the argument all_possible_variants = TRUE. As you can see, there are multiple possible values for r add_gloss("AFF"), r add_gloss("ASS"), r add_gloss("CL"), r add_gloss("IMP"), r add_gloss("IN"), r add_gloss("INS"), and r add_gloss("PRF"):

make_gloss_list(all_possible_variants = TRUE)

You can notice that problematic glosses (those which lack a definition or are duplicated) are colored. This can be switched off adding the argument annotate_problematic = FALSE:

make_gloss_list(all_possible_variants = TRUE, annotate_problematic = FALSE)

In case you want to remove some glosses from the list, you can use the argument remove_glosses:

make_gloss_list(remove_glosses = c("1SG", "3SG"))

It is really important that one should not treat the results of the make_gloss_list() function as carved in stone: once it is compiled you can copy, modify and paste it in your document. You can try to spend time improving the output of the function, but at the final stage it is probably faster to correct it manually.

Other output formats

Both kniting to .pdf and .docx outputs are possible, but there are some known restrictions:

markdown bold and italic annotations do not work;
example numbers appear above the example;
there is no non-breaking space in the list of glosses.

So if you want to avoid these problems, the best solution is to use one of the latex glossing packages listed in the first footnote and the package glossaries for automatic compilation of glosses.

About the `glosses_df` dataset

As mentioned above, the make_gloss_list() function's definitions are based on the glosses_df dataset.

str(glosses_df)

Most definitions are too general on purpose: r add_gloss("ASC"), for example, is defined as associative, which can be associative case, associative plural, associative mood, or associated motion. Since the user can easily replace the output with their own definitions, it is not a problem for the lingglosses package. However, it will make things easier, and more comparable and reproducible, if linguists would create a unified database of glosses, similar to the concepticon [@list21a] (Max Ionov mentioned to me that this could be Ontologies of Linguistic Annotation). If you think that some glosses and definitions should be changed, do not hesitate to open an issue on the GitHub page of the lingglosses project.

Towards a database of interlinear glossed examples

There have been several alternative to lingglosses infrastructure for interlinear glossed examples that might be interesting for the reader:

multiple packages for glossing in LaTeX:
- gb4e,
- langsci,
- expex,
- philex
ODIN project [@lewis10] (looks like this project is not longer active);
a Java-script library Leipzig.js;
a Python library Xigt [@goodman15];
scription format and scription2dlx Java-script library [@hieber20];
a Python library pyigt [@list21b].

Only several of them (ODIN, Xigt, scription and pyigt) are attempts towards creating a standard for the databases of interlinear glossed examples. I also wanted to mention paper by [@round20], where authors provided a script for the automated identification and parsing of interlinear glossed text from scanned page images. The motivation for creating cross-linguistic database of interlinear glossed examples is the following:

Prevent from disappearing of linguistic facts due to the projects fail (for example field notes of the researcher that did not manage to finish his work: article, dictionary, grammar etc.);
Fight with the publication bias, which cause some linguistic facts left unpublished since they not support a basic idea of author;
Make linguistic work more reproducible and linguistic facts reusable (cf. with human genome database, biodiversity databases or astronomical catalogues).

The lingglosses package make an attempt for going in this direction and provide an ability to extract examples in table format that can be further transformed into other formats. Each interlinear glossed example could be easily represented as a table using the convert_to_df() function.

convert_to_df(transliteration = "bur-e-**ri** c'in-ne-sːu",
              glosses = "fly-NPST-**INF** know-HAB-NEG",
              free_translation = "I cannot fly.",
              comment = "(lit. do not know how to)",
              annotation = "Бурери цIиннессу.")

This table lists all the parameters that could be useful for a database, and has the following columns:

id --- unique identifier through the whole table;
example_id --- unique identifier of particular examples;
word_id --- unique identifier of the word in the example (delimited with spaces and other punctuation);
morpheme_id --- unique identifier of the morpheme within the word (delimited with - or =);
transliteration --- language material;
gloss --- glosses;
delimiter --- delimiters: space, - or =
transliteration_orig --- original string with transliteration;
glosses_orig --- original string with glosses;
free_translation --- original string with the free translation;
comment --- original string with a comment;

When you use the gloss_example() function, a table of the structure described above is added to the database, so in the end you can extract it by saving the output of the get_examples_db() function to the file:

get_examples_db()

Of course one can just use a subset of some columns:

unique(get_examples_db()[, c("example_id", "transliteration_orig", "glosses_orig")])

References

agricolamz/lingglosses documentation built on June 2, 2025, 1:14 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

agricolamz/lingglosses
Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation

In agricolamz/lingglosses: Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation

Introduction

Create glossed examples with `gloss_example()`

Basic usage

Multiline examples

Add audio and video

In-text examples

Stand-alone glosses with `add_gloss()`

Glossing Sign languages

Create semi-automatic compiled abbreviation list

Other output formats

About the `glosses_df` dataset

Towards a database of interlinear glossed examples

References

R Package Documentation

Browse R Packages

We want your feedback!

agricolamz/lingglosses Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation

In agricolamz/lingglosses: Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation

Introduction

Create glossed examples with gloss_example()

Basic usage

Multiline examples

Add audio and video

In-text examples

Stand-alone glosses with add_gloss()

Glossing Sign languages

Create semi-automatic compiled abbreviation list

Other output formats

About the glosses_df dataset

Towards a database of interlinear glossed examples

References

R Package Documentation

Browse R Packages

We want your feedback!

agricolamz/lingglosses
Interlinear Glossed Linguistic Examples and Abbreviation Lists Generation

Create glossed examples with `gloss_example()`

Stand-alone glosses with `add_gloss()`

About the `glosses_df` dataset