knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(rezonateR)

This vignette will use the file saved at the end of vignette("time_seq"). As always, you don't have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.

library(rezonateR)
path = system.file("extdata", "rez007_edit1.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)

Editing rezrDFs: Some preliminaries

Before editing, let's familiarise ourselves with some basic properties of rezrDFs to keep in mind when editing them.

Can you edit this?

As you know, editing data can be tricky. If you accidentally remove information you should not have, the results could be disastrous. Field access labels prevent you from accidentally changing things that you shouldn't be changing. Let's look at the field access values of the unitDF:

fieldaccess(rez007$unitDF)

There are five possible field access values:

Update functions and reloads

Whenever you have auto and foreign fields in a table, you will want them to be automatically updated as your annotations progress. The reload() function is one of the core features of rezonateR and allows you to do this. reload() calls functions known as updateFunctions. You can access the updateFunctions of a table using updateFunct():

updateFunct(rez007$unitDF)

There are three reload functions:

Once we start editing fields, we will experience the power of reloads. Let's now first take a look at how we'll be editing ...

The core four

As you probably guessed from the title, this vignette covers the EasyEdit series of functions in rezonateR, which are simple but powerful functions for editing rezrDFs, and can be learnt even by users with no exposure to dplyr. EasyEdit consists of four core functions, along with a bunch of useful helpers. The four core functions are addFieldLocal(), changeFieldLocal(), addFieldForeign() and changeFieldForeign().

The terms 'local' and 'foreign' are inspired by, but extended from, database terminology. They refer to the source of information you are drawing from to create or change the field. The two 'local' functions add or change fields using information from the current rezrDF, and the two 'foreign' ones add or change fields using information from other rezrDFs. The word 'local' can be dropped when you are using the local functions.

All four basic functions can be applied to both rezrDFs and rezrObjs. In general, whenever your data lives inside a rezrObj, it is safest to apply these functions to the rezrObj itself. However, if you are working with an emancipated rezrDF - that is, a rezrDF stored in a variable outside of a rezrObj - then you will want to apply these functions to a single rezrDF. In practice, when using these functions, the main difference between the rezrDF and rezrObj versions is that the latter will require you to specify entity type and layer. This tutorial will mainly use the rezrObj editions; simply omit the entity type and layer fields when applying these functions to rezrDFs.

The change functions act in more or less the same way as the add functions, the only difference being that they work on an existing field instead of adding a new one. So our tutorial will mostly work with the add functions.

Staying local

Let's start by looking at addFieldLocal() using a simple application: In our tokenDF, let's add a field that automatically calculates the length of a word in characters.

In this function, entity specifies the name of the entity you would like to change, layer specifies the layer within that entity (which is an empty string since there are no token layers), fieldName is the name of the field we're adding, expression is the R expression with which we calculate the new field, and fieldaccess tells rezonateR to make this an auto field with an updateFunction that will be attached to the table. Let's try this, and look at both the results and the updateFunction:

rez007 = addField(rez007, entity = "token", layer = "",
                 fieldName = "orthoLength",
                 expression = nchar(text),
                 fieldaccess = "auto")
print("A fragment of the updated table:")
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
print("The updateFunction:")
updateFunct(rez007$tokenDF, "orthoLength")

You might notice that (...) has an orthoLength of 5. What if we decide that we don't want to count these non-words? One feasible solution is to use isWord, which we added in vignette("time_seq"): if a token is a word, then we set orthoLength to the number of characters in the text column as before; if not, we set it to 0.

Although EasyEdit functions do not require users to use Tidyverse functions, I still suggest that the Tidyverse function dplyr::case_when() is the best for this purpose, and it can be easily combined with EasyEdit functions. It allows you to create a vector whose values are calculated differently depending on certain conditions. The syntax of dplyr::case_when() is simple: each argument of the function is a condition ~ value pair, and if you want an 'else' statement, simply use T as the condition in the last condition-value pair. In this case, we can use this function to create a vector of values that is an empty string when a token is not a word, and the text of the token when it is a word:

rez007 = changeField(rez007, entity = "token", layer = "",
                 fieldName = "orthoLength",
                 expression = nchar(case_when(isWord ~ text, T ~ "")),
                 fieldaccess = "auto")
print("A fragment of the updated table:")
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))

Notice that (...) now has an orthoLength of 0.
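If you would rather not load dplyr at all, the same conditional logic can be sketched in base R with ifelse(). The toy data frame below is made up for illustration and is not part of rez007; only the column names isWord and text follow the tokenDF.

```r
# Toy stand-in for a few tokenDF columns (hypothetical values, not from rez007)
tokens <- data.frame(
  text = c("(...)", "hello", "@@", "world"),
  isWord = c(FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Count characters only for words; non-words are replaced by empty strings first
tokens$orthoLength <- nchar(ifelse(tokens$isWord, tokens$text, ""))
tokens
```

Inside changeField(), the corresponding expression would presumably be nchar(ifelse(isWord, text, "")), though this has not been checked against rezonateR here.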

Now let's spice this up a bit by adding a complex field. A complex field takes information from multiple rows of a table. Let's say we are working with the tokenDF, but want the new column to hold the length of the longest word in the unit that the token comes from. In this case, we set the groupField argument, which we haven't seen before, to unit, and we specify the field type as "complex". The expression uses longestLength(), a rezonateR function that returns the length of the longest string in a series of strings.

rez007 = addField(rez007, entity = "token", layer = "",
                 fieldName = "longestWordInUnit",
                 expression = longestLength(text),
                 type = "complex",
                 groupField = "unit",
                 fieldaccess = "auto")
head(rez007$tokenDF %>% select(id, text, longestWordInUnit))
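To make the idea of a grouped ('complex') field more concrete, here is a small base R sketch of what such a field conceptually computes. It is not how rezonateR implements it: the data are invented, and max(nchar(...)) stands in for longestLength() on the assumption that the latter returns the length of the longest string.

```r
# Toy tokens grouped by unit (hypothetical data, not from rez007)
tokens <- data.frame(
  unit = c("u1", "u1", "u2", "u2", "u2"),
  text = c("I", "know", "that", "is", "disastrous"),
  stringsAsFactors = FALSE
)

# For every token, take the maximum word length within the token's own unit;
# ave() applies FUN per group and returns one value per row
tokens$longestWordInUnit <- ave(nchar(tokens$text), tokens$unit, FUN = max)
tokens
```

Every row in the same unit receives the same value, which is what distinguishes a complex field from a simple row-by-row calculation.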

longestLength() belongs to a small collection of functions useful for extracting information from a bunch of strings:

Some base R functions that might be useful for numeric values include max(), min(), range(), mean(), etc.

Note that both times we added a field, we set the field access to auto. If you do not set the field access, it will automatically be set to flex, which means that column - text in this case - will not be affected by reloads.

Going foreign

Now let's add a simple foreign field. Let's say when we look at the tokenDF, we also want to know what the whole unit's words are. (This will come in handy when we want to do external editing!)

The trickiest part of addFieldForeign is keeping track of the source we're getting information from, and the target we're aiming to add information to. We need to know:

In our specific example:

Let's put it into practice:

rez007 = addFieldForeign(rez007,
                targetEntity = "token", targetLayer = "",
                sourceEntity = "unit", sourceLayer = "",
                targetForeignKeyName = "unit",
                targetFieldName = "unitText", sourceFieldName = "text",
                fieldaccess = "foreign")
head(rez007$tokenDF %>% select(id, text, unitText))
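Conceptually, a simple foreign field is a lookup through a foreign key. The base R sketch below mimics this with match() on two invented tables; the table contents are assumptions for illustration and say nothing about rezonateR's internals.

```r
# Toy source table (units) and target table (tokens); hypothetical data
units <- data.frame(
  id = c("u1", "u2"),
  text = c("I know", "that is disastrous"),
  stringsAsFactors = FALSE
)
tokens <- data.frame(
  id = c("t1", "t2", "t3", "t4", "t5"),
  unit = c("u1", "u1", "u2", "u2", "u2"),  # foreign key pointing at units$id
  text = c("I", "know", "that", "is", "disastrous"),
  stringsAsFactors = FALSE
)

# For each token, find the row of its unit and copy that unit's text over
tokens$unitText <- units$text[match(tokens$unit, units$id)]
tokens
```

Here tokens$unit plays the role of targetForeignKeyName, units$text that of sourceFieldName, and unitText that of targetFieldName.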

Like its local counterpart, addFieldForeign() also has a complex flavour, i.e. we can draw from multiple rows of a different table. This is probably the hardest part of this tutorial, so buckle up!

Here, we're going to add a field in the unitDF that tells us the average length of words within the unit. We're going to base this off the entryDF.

This time, targetForeignKeyName works a bit differently. Because the entries that correspond to each unit are given in the nodeMap, you also need to supply the list of entries inside a unit node - that is, entryList, as you may recall from vignette("import_save_basics").

addFieldForeign() has an argument called complexAction, which is a function performed on the source field of the source table. It can be any aggregating function (including the longestLength() series of functions that we have seen before). In this instance, we use mean(), applied to the number of characters in each entry:

rez007 = addFieldForeign(rez007,
                targetEntity = "unit", targetLayer = "",
                sourceEntity = "entry", sourceLayer = "",
                targetForeignKeyName = "entryList",
                targetFieldName = "averageWordLength",
                sourceFieldName = "text",
                type = "complex",
                complexAction = function(x) mean(nchar(x)),
                fieldaccess = "foreign")
head(rez007$unitDF %>% select(id, text, averageWordLength))
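The complex foreign case can also be sketched in base R. Below, a named list stands in for the entryList stored with each unit, and the same complexAction as above is applied to the entries that belong to each unit. All data and the layout of the list are assumptions made for this illustration.

```r
# Toy source table: one row per entry (hypothetical data)
entries <- data.frame(
  id = c("e1", "e2", "e3", "e4", "e5"),
  text = c("I", "know", "that", "is", "disastrous"),
  stringsAsFactors = FALSE
)
# Toy target table: one row per unit
units <- data.frame(id = c("u1", "u2"), stringsAsFactors = FALSE)
# Stand-in for entryList: the entry IDs contained in each unit
entryList <- list(u1 = c("e1", "e2"), u2 = c("e3", "e4", "e5"))

# The aggregating function applied to each unit's entries
complexAction <- function(x) mean(nchar(x))

# For each unit, collect the text of its entries and aggregate it
units$averageWordLength <- vapply(
  units$id,
  function(u) complexAction(entries$text[match(entryList[[u]], entries$id)]),
  numeric(1),
  USE.NAMES = FALSE
)
units
```

The key difference from the simple case is that each target row now points at several source rows, so an aggregating function is needed to reduce them to a single value.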

Reloads revisited

Having created a bunch of auto fields, naturally we will want to try out our reloads! Let's try replacing the zero sign <0> with ∅ in the text column, which is more commonly used in linguistics papers. After doing this, we can then reload unitDF to look at the impact on our freshly created averageWordLength. Notice that we have to reload rezrDFs in order: first the entryDF, using information from tokenDF, then the unitDF, using information from entryDF (please be patient if running this on your computer, since reloads can take time):

#unitDF before the update
rez007$unitDF %>% filter(str_detect(text, "<0>")) %>% rez_select(id, text, averageWordLength) %>% head

#Change the zero format
rez007$tokenDF = changeFieldLocal(rez007$tokenDF,
                                  fieldName = "text",
                                  expression = case_when(text == "<0>" ~ "∅", T ~ text))
rez007$entryDF = rez007$entryDF %>% reload(rez007)
rez007$unitDF = rez007$unitDF %>% reload(rez007)

#unitDF after the update
rez007$unitDF %>% filter(str_detect(text, "∅")) %>% rez_select(id, text, averageWordLength) %>% head
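Why do reloads have to be run at all, and in order? The toy sketch below shows a derived value going stale after its source changes, and becoming correct again only once it is recomputed. recompute_units() is a hypothetical helper written for this illustration; it is not a rezonateR function and does not reflect how reload() works internally.

```r
# Toy token table (hypothetical data)
tokens <- data.frame(
  unit = c("u1", "u1", "u2"),
  text = c("<0>", "know", "that"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rebuild the unit-level summary from the current tokens
recompute_units <- function(tokens) {
  means <- tapply(nchar(tokens$text), tokens$unit, mean)
  data.frame(id = names(means), averageWordLength = as.numeric(means),
             stringsAsFactors = FALSE)
}

units <- recompute_units(tokens)

# Edit the source: replace the zero sign with the one-character symbol
tokens$text[tokens$text == "<0>"] <- "\u2205"

# The summary still reflects the old text until it is recomputed
stale <- units$averageWordLength
units <- recompute_units(tokens)
fresh <- units$averageWordLength
```

In rez007 there is one more layer in between (tokenDF, then entryDF, then unitDF), which is why entryDF must be reloaded before unitDF.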

Dealing with categorical variables

The tidyverse package forcats is much more powerful for dealing with categorical variables, but for those of us who don't want to learn an entirely new package, rezonateR provides a few easy ways to deal with categories.

mergeCats() allows you to merge two categories. It takes a vector, normally a column, as the first argument. Subsequently, the name of each argument is a new category, and the value of each argument is a vector of names of old categories that the new category will encompass (as character values, even if the original column contains factors).

For example, the Santa Barbara Corpus categorises laughter as separate from other vocalisms. If you want to merge "Laugh" into "Vocalism", keeping everything else, then you can use this code:

#Laughter tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))

rez007 = changeField(rez007, entity = "token", layer = "",
                 fieldName = "kind",
                 expression = mergeCats(kind, Vocalism = c("Laugh", "Vocalism"))) 
#Laughter tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))

renameCats() has identical syntax. For example, if you want to rename Vocalism further to Voc:

#Breath tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))

rez007 = changeField(rez007, entity = "token", layer = "",
                 fieldName = "kind",
                 expression = renameCats(kind, Voc = "Vocalism")) 
#Breath tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
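For readers curious about what merging and renaming amount to, here is a rough base R equivalent on a plain character vector. The values in kind are invented for illustration (in particular "Pause" is an assumption), and the sketch makes no claim about how mergeCats() and renameCats() are written.

```r
# Toy category column (hypothetical values)
kind <- c("Word", "Laugh", "Vocalism", "Word", "Pause")

# Merge: every old category in the set is replaced by the new label
merged <- kind
merged[merged %in% c("Laugh", "Vocalism")] <- "Vocalism"

# Rename: a single old category is replaced by a new label
renamed <- merged
renamed[renamed == "Vocalism"] <- "Voc"

merged
renamed
```

Categories that are not mentioned ("Word" and "Pause" here) are left untouched in both steps.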

Adding rows

Adding rows is not an operation you will do all the time. In general, it is recommended to just re-import the whole thing and run all the code again. In the occasional situation where you do have to add a row, addRow() comes in handy.

You only need to add core and flex fields when you add a row. An ID will be automatically generated, and the foreign/auto fields will be automatically added. After specifying the rezrDF or rezrObj with entity and layer, each argument name is a column name, and its value is the value in the column.

rez007 = addRow(rez007, "trail", "default",
                doc = "sbc007",
                chainCreateSeq = max(rez007$trailDF$default$chainCreateSeq) + 1,
                name = "Danae",
                chainSize = 1)
tail(rez007$trailDF$default)
#Note: chainSize is currently flex as it is supplied by Rezonator and not calculated by rezonateR, but this may change in the future.

Onwards!

The next tutorial, vignette("edit_tidyRez"), will be relatively short, because it assumes familiarity with the Tidyverse package dplyr. If you would like to do something that goes beyond the capabilities of EasyEdit, it is recommended that you familiarise yourself with the relevant dplyr function first, and then read the TidyRez vignette. If you are not familiar with Tidyverse and have no intention of learning it yet, you may skip the next tutorial and go straight to vignette("edit_external").

And lest we forget, always save!

savePath = "rez007.Rdata"
rez_save(rez007, savePath)


johnwdubois/rezonateR documentation built on Nov. 19, 2024, 11:17 p.m.