knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(rezonateR)

Getting started

In this tutorial, we will learn about how time and sequence are handled in rezonateR. More features relating to more fine-grained time may be available later, so watch this space!

This file will use the file saved at the end of vignette("import_save_basics"). You don't have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.

library(rezonateR)
path = system.file("extdata", "rez007_time.Rdata", package = "rezonateR", mustWork = T)
rez007 = rez_load(path)
rez007$tokenDF$pSentOrder[1:14] = 1:14

Token order and sequence

Rezonator by default provides two fields related to the position of a token, which you will see in tokenDF as columns:

Both orders count all tokens. In the Santa Barbara Corpus text we are using, this includes endnotes (such as , and .), transcriptions of vocalisms (such as (H) for in-breaths and @@@ for laughter), and so on. You can see these in the tokenDF of our sample file here:

head(rez007$tokenDF)

'Larger' elements that span multiple tokens have four token sequence-related fields:

You can see these fields in action in the chunkDF:

head(rez007$chunkDF$refexpr %>% select(id, doc, name, text, tokenOrderFirst, docTokenSeqFirst, tokenOrderLast, docTokenSeqLast))

Setting isWord status

The Santa Barbara Corpus by default contains other tokens, provided through the tagMap in the Rezonator edition, such as these:

However, if you are working with your own texts rather than the Santa Barbara Corpus, you will not have access to place by default. We will therefore have to create place by ourselves.

To create something that functions similarly as place, you must first define what a word is. For example, if you are dealing with a very 'clean' transcription that ignores elements like breaths and laughter, then you may simple create a regular expression that captures all the punctuation.

The function addIsWordField adds a column isWord to a rezrDF or rezrObj stating whether a token is a word. For the Santa Barbara Corpus, this is simple since the Kind column is set to "Word" for actual words. Thus, we can use the expression Kind == "Word" for the definition of what counts as a word. This expression is passed to addIsWordField as its second parameter (the first parameter being the rezrDF or rezrObj). Note that if addIsWordField is called for a rezrObj, then both tokenDF and entryDF will have this new field.

By default, addIsWordField also adds the fields docWordSeq and wordOrder (=place) to the tokenDF and entryDF, and the columns wordOrderFirst, wordOrderLast, docWordSeqFirst and docWordSeqLast will be added to unitDF, chunkDF and trackDF. These work similarly to their token counterparts, except non-words are give 0 values, and only words are counted in determining order.

rez007 = addIsWordField(rez007, kind == "Word")
head(rez007$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
head(rez007$chunkDF$refexpr %>% select(id, tokenOrderFirst, tokenOrderLast, docTokenSeqFirst, docTokenSeqLast, wordOrderFirst, wordOrderLast, docWordSeqFirst, docWordSeqLast))

Unit order

The ordering of units, unitSeq, is not available to rezrDFs other than unitDF by default. The addUnitSeq() function adds unitSeq to other fields. This function allows you to set which entity (and optionally, layer). Whichever entity type you specify, everything 'below' it will also get unit orders. For example, if you specify 'track' as the level at which you want unit order, then tokenDF and chunkDF will get it too. Similarly, if you specify 'stack' as the level, then cardDF will also get it.

rez007 = addUnitSeq(rez007, "track")
rez007 = addUnitSeq(rez007, "stack")
head(rez007$trackDF$default %>% select(id, text, unitSeqFirst, unitSeqLast))
head(rez007$stackDF %>% select(id, name, unitSeqFirst, unitSeqLast))

Note that chunkDF and layers depend on it get unitSeqFirst and unitSeqLast, because of a chunk combination feature that will be discussed in the trees chapter.

The function getSeqBounds() is mostly used rezonateR-internally, though more advanced users may use it to create functions similar to addIsWordField() and addUnitSeq().

Advanced sequence operations

In addition to the two most common operations we covered above, rezonateR also has other functions to deal with common problems related to time and sequence. These require knowledge of editing, so if you want to learn more about these functions, you can skip this for now and come back after having read vignette("edit_easyEdit").

Generally, when elements in a file belong to a larger structure, there are three ways of representing them inside the rezrDF that houses the smaller structure: * Sequence of the larger structure: unitSeq in tokenDF is an example. * Order (position) of the smaller element within the larger structure: tokenOrder in tokenDF is an example. * BILUO: Is the current element the beginning (B) of the larger structure, an intermediate (I) element, the last (L) element, the only (U) element, or not within the larger structure (O)? Generally used for artificial intelligence applications.

The following functions are currently available for dealing with these and related issues:

getOrderFromseq() is straightforward. For example, if we want to replicate tokenOrder from unitSeq, we can do this:

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "tokenOrder2", getOrderFromSeq(unitSeq))
#Check that tokenOrder and tokenOrder2 are identical
all(rez007$tokenDF$tokenOrder == rez007$tokenDF$tokenOrder2)

getSeqFromOrder() is also straightforward. If we want to get the prosodic sentence in which a token is found, which is not given in the Rezonator version of the Santa Barbara Corpus, we can get it from the place within the prosodic sentence (pSentOrder):

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSent", getSeqFromOrder(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSent, pSentOrder))

isInitial() and isFinal() tell us whether something is the initial or final member of a larger unit. isInitial() is simple: Its only parameter is the order (i.e. second representation).

For example, the following code determines whether a token is the start of the prosodic sentence:

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentInitial", isInitial(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSentOrder, isPSentInitial))

inLength() gives the length of the larger unit and its result is used by isFinal(), which takes the length of the larger unit in addition to the order value. (Note that for pSentLength, isWord is set to simply text not being zeroes, since we count non-words like breaths, endnotes, etc., as part of the prosodic sentence.)

rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentLength", inLength(pSentOrder, isWord = (text != "<0>")), type = "complex", groupField = "pSent")
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentFinal", isFinal(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, isPSentFinal))

getBiluoFromOrder() is similar in requiring order and length. Let's get the BILUO values for prosodic sentences:

rez007$tokenDF = changeFieldLocal(rez007$tokenDF, "pSentBiluo", getBiluoFromOrder(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, pSentBiluo))

Onwards!

Let's practice saving our data again:

savePath = "rez007.Rdata"
rez_save(rez007, savePath)

Now that you know how to deal with time and sequence, it will be even easier to work with rezonateR. If you're following the whole tutorial series, the next tutorial is vignette("edit_easyEdit"), where you will start learning to add automatic annotations to rezrDFs!



johnwdubois/rezonateR documentation built on Nov. 19, 2024, 11:17 p.m.