knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(rezonateR)
In this tutorial, we will learn about how time and sequence are handled in rezonateR.
More features relating to more fine-grained time may be available later, so watch this space!
This tutorial uses the file saved at the end of vignette("import_save_basics"). You don't have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.
library(rezonateR)
path = system.file("extdata", "rez007_time.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
rez007$tokenDF$pSentOrder[1:14] = 1:14
Rezonator by default provides two fields related to the position of a token, which you will see in tokenDF as columns:
* docTokenSeq - refers to the order of a token within the entire text.
* tokenOrder - refers to the position of a token within its intonation unit.
Both orders count all tokens. In the Santa Barbara Corpus text we are using, this includes endnotes (such as , and .), transcriptions of vocalisms (such as (H) for in-breaths and @@@ for laughter), and so on. You can see these in the tokenDF of our sample file here:
head(rez007$tokenDF)
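To make the difference between the two fields concrete, here is a minimal base-R sketch on a made-up token table. The `toks` data frame and its contents are hypothetical (not taken from rez007), and this only illustrates the idea; it is not how rezonateR computes the fields internally.

```r
# Hypothetical toy tokens: two intonation units, identified by `unit`
toks <- data.frame(
  unit = c(1, 1, 1, 2, 2),
  text = c("(H)", "well", ",", "okay", "."),
  stringsAsFactors = FALSE
)

# Position within the whole text: count every row, including non-words
toks$docTokenSeq <- seq_len(nrow(toks))

# Position within the intonation unit: restart the count in each unit
toks$tokenOrder <- ave(seq_len(nrow(toks)), toks$unit, FUN = seq_along)

toks
```

Note that the in-breath and the endnotes receive positions just like ordinary words, mirroring the statement that both orders count all tokens.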
'Larger' elements that span multiple tokens have four token sequence-related fields:
* docTokenSeqFirst - refers to the docTokenSeq of the first token.
* docTokenSeqLast - refers to the docTokenSeq of the last token.
* tokenOrderFirst - refers to the tokenOrder of the first token.
* tokenOrderLast - refers to the tokenOrder of the last token.
You can see these fields in action in the chunkDF:
head(rez007$chunkDF$refexpr %>% select(id, doc, name, text, tokenOrderFirst, docTokenSeqFirst, tokenOrderLast, docTokenSeqLast))
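As a rough illustration of how the four fields relate to the token-level ones, the sketch below derives them by hand for one made-up chunk. The `toks` table and the chunk span are hypothetical, and rezonateR's own computation may differ.

```r
# Hypothetical tokens from one intonation unit followed by the start of another
toks <- data.frame(
  docTokenSeq = 1:6,
  tokenOrder  = c(1, 2, 3, 4, 1, 2),
  text        = c("the", "old", "house", ",", "it", "fell"),
  stringsAsFactors = FALSE
)

# Hypothetical chunk covering the first three tokens ("the old house")
chunkTokens <- toks[toks$docTokenSeq %in% 1:3, ]

# The chunk-level fields are the token-level values of its first and last token
chunk <- data.frame(
  text             = paste(chunkTokens$text, collapse = " "),
  docTokenSeqFirst = min(chunkTokens$docTokenSeq),
  docTokenSeqLast  = max(chunkTokens$docTokenSeq),
  tokenOrderFirst  = chunkTokens$tokenOrder[which.min(chunkTokens$docTokenSeq)],
  tokenOrderLast   = chunkTokens$tokenOrder[which.max(chunkTokens$docTokenSeq)]
)

chunk
```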
isWord status
The Santa Barbara Corpus by default contains other fields, provided through the tagMap in the Rezonator edition, such as these:
* corpusSeq - the position of a token within the entire corpus.
* place - the position of a word within the intonation unit, excluding elements like endnotes and vocalisms.
* negPlace - the position of a word within the intonation unit, counting backwards from the last token.
However, if you are working with your own texts rather than the Santa Barbara Corpus, you will not have access to place by default. We will therefore have to create place ourselves.
To create something that functions similarly to place, you must first define what a word is. For example, if you are dealing with a very 'clean' transcription that ignores elements like breaths and laughter, then you may simply create a regular expression that captures all the punctuation, so that everything else counts as a word.
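The regular-expression approach could look something like the sketch below. The token vector is made up, and the pattern is only one possible definition of "punctuation only"; adapt it to your own transcription conventions.

```r
# Hypothetical tokens from a 'clean' transcription
text <- c("so", ",", "you", "know", ".", "--")

# A token counts as a word unless it consists solely of punctuation characters
isWord <- !grepl("^[[:punct:]]+$", text)

data.frame(text, isWord)
```

Here only `so`, `you` and `know` are treated as words.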
The function addIsWordField adds a column isWord to a rezrDF or rezrObj stating whether a token is a word. For the Santa Barbara Corpus, this is simple since the kind column is set to "Word" for actual words. Thus, we can use the expression kind == "Word" as the definition of what counts as a word. This expression is passed to addIsWordField as its second parameter (the first parameter being the rezrDF or rezrObj). Note that if addIsWordField is called on a rezrObj, then both tokenDF and entryDF will have this new field.
By default, addIsWordField also adds the fields docWordSeq and wordOrder (= place) to the tokenDF and entryDF, and the columns wordOrderFirst, wordOrderLast, docWordSeqFirst and docWordSeqLast will be added to unitDF, chunkDF and trackDF. These work similarly to their token counterparts, except that non-words are given 0 values, and only words are counted in determining order.
rez007 = addIsWordField(rez007, kind == "Word")
head(rez007$tokenDF %>% select(id, tokenOrder, docTokenSeq, wordOrder, docWordSeq))
head(rez007$chunkDF$refexpr %>% select(id, tokenOrderFirst, tokenOrderLast, docTokenSeqFirst, docTokenSeqLast, wordOrderFirst, wordOrderLast, docWordSeqFirst, docWordSeqLast))
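To see what "non-words get 0 and only words are counted" means in practice, here is a small base-R sketch on made-up data. The `toks` table and the non-word labels in `kind` ("Vocalism", "Endnote") are hypothetical, and this is an illustration of the idea rather than rezonateR's actual implementation.

```r
# Hypothetical toy tokens in two intonation units
toks <- data.frame(
  unit = c(1, 1, 1, 1, 2, 2, 2),
  text = c("(H)", "well", "yeah", ",", "@@@", "okay", "."),
  kind = c("Vocalism", "Word", "Word", "Endnote", "Vocalism", "Word", "Endnote"),
  stringsAsFactors = FALSE
)

toks$isWord <- toks$kind == "Word"

# Count words cumulatively within each unit, then set non-words to 0
toks$wordOrder <- ave(as.integer(toks$isWord), toks$unit, FUN = cumsum) * toks$isWord

# Same idea across the whole document
toks$docWordSeq <- cumsum(toks$isWord) * toks$isWord

toks
```

In this toy table, `okay` is the sixth token of the document but only the third word, and it is the first word of its unit even though a laugh precedes it.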
The ordering of units, unitSeq, is not available to rezrDFs other than unitDF by default. The addUnitSeq() function adds unit order information to other rezrDFs. This function allows you to set which entity (and optionally, which layer) should receive it. Whichever entity type you specify, everything 'below' it will also get unit orders. For example, if you specify 'track' as the level at which you want unit order, then tokenDF and chunkDF will get it too. Similarly, if you specify 'stack' as the level, then cardDF will also get it.
rez007 = addUnitSeq(rez007, "track")
rez007 = addUnitSeq(rez007, "stack")
head(rez007$trackDF$default %>% select(id, text, unitSeqFirst, unitSeqLast))
head(rez007$stackDF %>% select(id, name, unitSeqFirst, unitSeqLast))
Note that chunkDF and the layers that depend on it get unitSeqFirst and unitSeqLast rather than a single unitSeq, because of a chunk combination feature that will be discussed in the trees chapter.
The function getSeqBounds() is mostly used internally by rezonateR, though more advanced users may use it to create functions similar to addIsWordField() and addUnitSeq().
In addition to the two most common operations we covered above, rezonateR also has other functions to deal with common problems related to time and sequence. These require knowledge of editing, so if you are not yet familiar with editing in rezonateR, you can skip this part for now and come back after having read vignette("edit_easyEdit").
Generally, when elements in a file belong to a larger structure, there are three ways of representing them inside the rezrDF that houses the smaller structure:
* Sequence of the larger structure: unitSeq in tokenDF is an example.
* Order (position) of the smaller element within the larger structure: tokenOrder in tokenDF is an example.
* BILUO: Is the current element the beginning (B) of the larger structure, an intermediate (I) element, the last (L) element, the only (U) element, or not within the larger structure (O)? This scheme is generally used in machine learning applications such as sequence labelling.
The following functions are currently available for dealing with these and related issues:
* getOrderFromSeq(): Converts from the first representation to the second one.
* getSeqFromOrder(): Converts from the second representation to the first one.
* isInitial(): Is the current element the initial member of a larger structure?
* isFinal(): Is the current element the final member of a larger structure? Used after inLength().
* getBiluoFromOrder(): Converts from the second representation to the third one.
getOrderFromSeq() is straightforward. For example, if we want to replicate tokenOrder from unitSeq, we can do this:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "tokenOrder2", getOrderFromSeq(unitSeq))
# Check that tokenOrder and tokenOrder2 are identical
all(rez007$tokenDF$tokenOrder == rez007$tokenDF$tokenOrder2)
getSeqFromOrder() is also straightforward. If we want to get the prosodic sentence in which a token is found, which is not given in the Rezonator version of the Santa Barbara Corpus, we can get it from the place within the prosodic sentence (pSentOrder):
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSent", getSeqFromOrder(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSent, pSentOrder))
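The logic behind converting between the first two representations can be sketched in base R as follows. This is an illustration on a made-up vector, not rezonateR's implementation, and it assumes that every element belongs to some larger structure and that orders start at 1.

```r
# Hypothetical positions of tokens within three prosodic sentences
pSentOrder <- c(1, 2, 3, 1, 2, 1, 2, 3, 4)

# Order -> sequence: each time the order resets to 1, a new structure starts
pSent <- cumsum(pSentOrder == 1)

# Sequence -> order: number the elements within each recovered structure
orderAgain <- ave(seq_along(pSent), pSent, FUN = seq_along)

data.frame(pSentOrder, pSent, orderAgain)
```

The round trip recovers the original order values, which is the same kind of check performed on tokenOrder and tokenOrder2 earlier.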
isInitial() and isFinal() tell us whether something is the initial or final member of a larger unit. isInitial() is simple: its only parameter is the order (i.e. the second representation).
For example, the following code determines whether a token is the start of the prosodic sentence:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentInitial", isInitial(pSentOrder))
head(rez007$tokenDF %>% select(id, text, pSentOrder, isPSentInitial))
inLength() gives the length of the larger unit, and its result is used by isFinal(), which takes the length of the larger unit in addition to the order value. (Note that for pSentLength, isWord is simply set to the text not being <0>, i.e. a zero, since we count non-words like breaths, endnotes, etc., as part of the prosodic sentence.)
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentLength", inLength(pSentOrder, isWord = (text != "<0>")), type = "complex", groupField = "pSent")
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "isPSentFinal", isFinal(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, isPSentFinal))
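The reason isFinal() needs the length while isInitial() does not can be seen in a small base-R sketch. The vectors below are made up, and this only mirrors the idea; it does not call rezonateR.

```r
# Hypothetical tokens in three prosodic sentences
pSent      <- c(1, 1, 1, 2, 2, 3)
pSentOrder <- c(1, 2, 3, 1, 2, 1)

# Length of the larger structure: the highest order reached within each group
pSentLength <- ave(pSentOrder, pSent, FUN = max)

# Initial elements can be found from the order alone
isPSentInitial <- pSentOrder == 1

# Final elements are those whose order equals the length of their structure
isPSentFinal <- pSentOrder == pSentLength

data.frame(pSent, pSentOrder, pSentLength, isPSentInitial, isPSentFinal)
```

The length has to be computed per group, which corresponds to the grouping by pSent in the rezonateR call.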
getBiluoFromOrder() is similar in requiring order and length. Let's get the BILUO values for prosodic sentences:
rez007$tokenDF = addFieldLocal(rez007$tokenDF, "pSentBiluo", getBiluoFromOrder(pSentOrder, pSentLength))
head(rez007$tokenDF %>% select(id, text, pSentOrder, pSentLength, pSentBiluo))
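For readers unfamiliar with BILUO, the mapping from order and length to tags can be sketched as below. `toy_biluo()` is a hypothetical helper written for this illustration, not the rezonateR function, and it assumes that an order of 0 marks an element outside any larger structure.

```r
# Hypothetical helper: derive BILUO tags from order and length
toy_biluo <- function(ord, len) {
  out <- rep("I", length(ord))       # default: intermediate element
  out[ord == 1] <- "B"               # first element of a structure
  out[ord == len] <- "L"             # last element of a structure
  out[ord == 1 & len == 1] <- "U"    # the only element of a structure
  out[ord == 0] <- "O"               # assumption: 0 means outside any structure
  out
}

# Made-up data: a three-element structure, an outside element,
# a single-element structure, and a two-element structure
ord <- c(1, 2, 3, 0, 1, 1, 2)
len <- c(3, 3, 3, 0, 1, 2, 2)

toy_biluo(ord, len)
```

The assignments are applied from the most general case to the most specific, so that single-element and outside cases override the earlier tags.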
Let's practice saving our data again:
savePath = "rez007.Rdata"
rez_save(rez007, savePath)
Now that you know how to deal with time and sequence, it will be even easier to work with rezonateR.
If you're following the whole tutorial series, the next tutorial is vignette("edit_easyEdit"), where you will start learning to add automatic annotations to rezrDFs!