knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(rezonateR)
This vignette will use the file saved at the end of vignette("time_seq"). As always, you don't have to have read that tutorial beforehand, though it may be helpful if you are new to rezonateR.
library(rezonateR)
path = system.file("extdata", "rez007_edit1.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
Before editing, let's familiarise ourselves with some basic properties of rezrDFs to keep in mind when editing them.
As you know, editing data can be tricky. If you accidentally remove information you should not have, the results could be disastrous. Field access labels prevent you from accidentally changing things that you shouldn't be changing. Let's look at the field access values of the unitDF:
fieldaccess(rez007$unitDF)
There are five possible field access values:
- key: The primary key of the table. You are not allowed to change it (unless you turn it into a non-key field, which is discouraged because it will break almost everything that depends on it). If you try to update these fields using rezonateR functions, I will stop you with an error.
- core: Core fields, mostly generated by Rezonator. You can change them, but I will give you a warning if you do, because changing a core field has strong potential to break things.
- flex: Flexible fields, usually fields whose values you enter into Rezonator, though there are also flex fields automatically generated by Rezonator. If you add fields in rezonateR that you would like to manually correct later, setting them to flex is also a good idea.
- auto: Fields whose values are automatically generated using information from the same rezrDF. This should be used for fields that do not need to be manually annotated or corrected.
- foreign: Fields whose values are automatically generated using information from a different rezrDF. Fields like text and tokenOrderFirst in the unitDF, which we have just seen, come from the entryDF and are therefore foreign.
Whenever you have auto and foreign fields in a table, you will want them to be automatically updated as your annotations progress. The reload() function is one of the core features of rezonateR and allows you to do this. reload() works by calling functions known as updateFunctions. You can access the updateFunctions of a table using updateFunct():
updateFunct(rez007$unitDF)
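To make the idea of an updateFunction more concrete, here is a toy sketch in plain R. It is not how rezonateR actually implements reloads: the names toyUpdateFunctions and toyReload are made up for illustration, and so is the data. The sketch only shows the general pattern of keeping one function per automatically generated column and re-running it after the underlying data change.

```r
# Toy table standing in for a rezrDF (made-up data, not from rez007)
tokens <- data.frame(text = c("<0>", "well", "okay"), stringsAsFactors = FALSE)

# One hypothetical update function per automatically generated column
toyUpdateFunctions <- list(
  orthoLength = function(df) nchar(df$text)
)

# Hypothetical reload: re-run every stored function and overwrite its column
toyReload <- function(df, updateFunctions) {
  for (field in names(updateFunctions)) {
    df[[field]] <- updateFunctions[[field]](df)
  }
  df
}

tokens <- toyReload(tokens, toyUpdateFunctions)
tokens$orthoLength   # lengths based on the original text

# Edit the source column, then reload so the derived column catches up
tokens$text[tokens$text == "<0>"] <- "zero"
tokens <- toyReload(tokens, toyUpdateFunctions)
tokens$orthoLength   # lengths now reflect the edited text
```

The point to take away is that the derived column is never edited by hand: you change the source, then reload.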
There are three reload functions:
- reloadLocal() only takes a rezrDF, and only updates auto fields.
- reloadForeign() takes a rezrDF and a rezrObj, and updates the foreign fields of the rezrDF using the rezrObj (which may or may not contain the rezrDF).
- reload() combines the two.
Once we start editing fields, we will experience the power of reloads. Let's now first take a look at how we'll be editing ...
As you probably guessed from the title, this vignette covers the EasyEdit series of functions in rezonateR, which are simple but powerful functions for editing rezrDFs, and can be learnt even by users with no exposure to dplyr. EasyEdit consists of four core functions, along with a number of useful helpers. The four core functions are:
- addFieldLocal()
- addFieldForeign()
- changeFieldLocal()
- changeFieldForeign()
The terms 'local' and 'foreign' are inspired by, but extended from, database terminology. They refer to the source of information you draw on to create or change the field. The two 'local' functions add or change fields using information from the current rezrDF, and the two 'foreign' ones add or change fields using information from other rezrDFs. The word 'local' can be dropped when you are using the local functions: addField() and changeField() are equivalent to addFieldLocal() and changeFieldLocal().
All four basic functions can be applied to both rezrDFs and rezrObjs. In general, whenever the rezrDF you are editing lives inside a rezrObj, it is safest to apply the functions to the rezrObj itself. However, if you are working with an emancipated rezrDF (that is, a rezrDF stored in a variable outside of a rezrObj), then you will want to apply these functions to the single rezrDF. In practice, the main difference between the rezrDF and rezrObj versions is that the latter require you to specify the entity type and layer. This tutorial will mainly use the rezrObj versions; simply omit the entity type and layer arguments when applying these functions to rezrDFs.
The change functions act in more or less the same way as the add functions, the only difference being that they work on an existing field instead of adding a new one. So our tutorial will mostly work with the add functions.
Let's start by looking at addFieldLocal() using a simple application: In our tokenDF, let's add a field that automatically calculates the length of a word in characters.
In this function, entity specifies the name of the entity you would like to change, layer specifies the layer within that entity (which is an empty string since there are no token layers), fieldName is the name of the field we're adding, expression is the R expression with which we calculate the new field, and fieldaccess tells rezonateR to make this an auto field with an updateFunction that will be attached to the table. Let's try this, and look at both the results and the updateFunction:
rez007 = addField(rez007, entity = "token", layer = "",
                  fieldName = "orthoLength",
                  expression = nchar(text),
                  fieldaccess = "auto")
print("A fragment of the updated table:")
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
print("The updateFunction:")
updateFunct(rez007$tokenDF, "orthoLength")
You might notice that (...) has an orthoLength of 5. What if we decide that we don't want to count these non-words? One feasible solution is to use isWord, which we added in vignette("time_seq"): if a token is a word, then we set orthoLength to the number of characters in the text column as before; if not, we set it to 0.
Although EasyEdit functions do not require users to use Tidyverse functions, I still suggest that the Tidyverse function dplyr::case_when() is the best tool for this purpose, and it can be easily combined with EasyEdit functions. It allows you to create a vector whose values are calculated differently depending on certain conditions. The syntax of dplyr::case_when() is simple: each argument of the function is a condition ~ value pair, and if you want an 'else' statement, simply use TRUE (or its shorthand T) as the condition in the last condition-value pair. In this case, we can use this function to create a vector of values that is an empty string when a token is not a word, and the text of the token when it is a word:
rez007 = changeField(rez007, entity = "token", layer = "",
                     fieldName = "orthoLength",
                     expression = nchar(case_when(isWord ~ text, TRUE ~ "")),
                     fieldaccess = "auto")
print("A fragment of the updated table:")
head(rez007$tokenDF %>% rez_select(id, text, orthoLength))
Notice that (...) now has an orthoLength of 0.
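If you would like to see what the expression does on its own, outside of any rezrDF, here is a small sketch on made-up vectors. The toy text and isWord values below are assumptions for illustration and are not taken from rez007; only dplyr::case_when() and base R's nchar() are used.

```r
library(dplyr)

# Toy stand-ins for the text and isWord columns (made-up values)
text   <- c("(...)", "well", "@@", "okay")
isWord <- c(FALSE, TRUE, FALSE, TRUE)

# Words keep their text; non-words become "", so their length is 0
orthoLength <- nchar(case_when(isWord ~ text, TRUE ~ ""))
orthoLength
```

Inside changeField(), the same expression is simply evaluated with the columns of the tokenDF in place of these toy vectors.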
Now let's spice this up a bit by adding a complex field. A complex field takes information from multiple rows of a table. Let's say we are working with the tokenDF, but want the new column to be the length of the longest word in the unit that the token comes from. In this case, the groupField argument, which we haven't seen before, is unit, and we specify the field type as "complex". The expression uses the function longestLength(), which is a rezonateR function that returns the length of the longest word in a series of words.
rez007 = addField(rez007, entity = "token", layer = "",
                  fieldName = "longestWordInUnit",
                  expression = longestLength(text),
                  type = "complex", groupField = "unit",
                  fieldaccess = "auto")
head(rez007$tokenDF %>% select(id, text, longestWordInUnit))
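As a rough analogue outside of rezonateR, the same kind of grouped calculation can be sketched in base R with ave(). This is only meant to illustrate what a complex field with groupField = "unit" computes, namely that every row receives the aggregate of its own group. The toy table is made up, and max(nchar(...)) is used here as a stand-in for longestLength(), not as its actual definition.

```r
# Toy token table (made-up data): each token belongs to a unit
tokens <- data.frame(
  unit = c("u1", "u1", "u1", "u2", "u2"),
  text = c("I", "really", "do", "no", "way"),
  stringsAsFactors = FALSE
)

# For each row, take the maximum word length within that row's unit
tokens$longestWordInUnit <- ave(nchar(tokens$text), tokens$unit, FUN = max)
tokens
```

All three tokens of u1 receive the same value, because the aggregate is calculated once per group and then written back to every row of that group.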
longestLength() belongs to a small collection of functions useful for extracting information from a group of strings:
- shortestLength(): Find the shortest token's length within the group.
- longestLength(): Find the longest token's length.
- shortest(): Get the shortest token's text.
- longest(): Get the longest token's text.
- concatenateAll(): Concatenate all the tokens together.
- inLength(): Gives the size of the group (may be used with non-strings), possibly with isWord information.
Some base R functions that might be useful for numeric values include max(), min(), range(), mean(), etc.
Note that both times we added a field, we've set the field access to auto. If you do not set the field access, I will automatically set it to flex, which means that column - text in this case - will not be affected by reloads.
Now let's add a simple foreign field. Let's say when we look at the tokenDF, we also want to know what the whole unit's words are. (This will come in handy when we want to do external editing!)
The trickiest part of addFieldForeign() is keeping track of the source we're getting information from, and the target we're aiming to add information to. We need to know:
- The source: sourceEntity, sourceLayer, sourceFieldName.
- The target: targetEntity, targetLayer, targetFieldName.
- The foreign key: targetForeignKeyName. We need to give the name of the column containing IDs of the source table inside the target table, i.e. the column of the target table that tells us which row of the source table to look at. In this case, it is the unit field of tokenDF.
In our specific example:
- We set sourceEntity to "unit", sourceLayer to the empty string, and sourceFieldName to "text".
- We set targetEntity to "token", targetLayer to the empty string, and targetFieldName to "unitText".
- targetForeignKeyName is the unit field of tokenDF, i.e. "unit".
Let's put it into practice:
rez007 = addFieldForeign(rez007,
                         targetEntity = "token", targetLayer = "",
                         sourceEntity = "unit", sourceLayer = "",
                         targetForeignKeyName = "unit",
                         targetFieldName = "unitText",
                         sourceFieldName = "text",
                         fieldaccess = "foreign")
head(rez007$tokenDF %>% select(id, text, unitText))
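If the source/target bookkeeping feels abstract, it may help to see the same lookup written by hand on toy tables. The sketch below uses base R's match() and made-up data; it is an illustration of what a simple foreign field does conceptually, not rezonateR's own code. Here units plays the role of the source table and tokens the target, with tokens$unit acting as the foreign key.

```r
# Source table (made-up): one row per unit
units <- data.frame(
  id   = c("u1", "u2"),
  text = c("I really do", "no way"),
  stringsAsFactors = FALSE
)

# Target table (made-up): one row per token, with a foreign key column 'unit'
tokens <- data.frame(
  id   = c("t1", "t2", "t3", "t4", "t5"),
  unit = c("u1", "u1", "u1", "u2", "u2"),
  text = c("I", "really", "do", "no", "way"),
  stringsAsFactors = FALSE
)

# For each token, find the row of 'units' whose id equals the token's unit,
# then copy that row's text into a new column
tokens$unitText <- units$text[match(tokens$unit, units$id)]
tokens
```

The advantage of addFieldForeign() over a hand-written lookup like this is that the relationship is remembered, so the field can be refreshed by a reload later.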
Like its local counterpart, addFieldForeign() also has a complex flavour, i.e. we can draw on multiple rows of a different table. This is probably the hardest part of this tutorial, so buckle up!
Here, we're going to add a field in the unitDF that tells us the average length of words within the unit. We're going to base this off the entryDF.
This time, targetForeignKeyName works a bit differently. Because the entries that correspond to each unit are given in the nodeMap, you need to supply the name of the list of entries inside a unit node, that is, entryList, as you may recall from vignette("import_save_basics").
addFieldForeign() has an argument called complexAction, which is a function performed on the source field of the source table. It can be any aggregating function (including the longestLength() series of functions that we have seen before). In this instance, we use mean(), applied to the number of characters in each entry:
rez007 = addFieldForeign(rez007,
                         targetEntity = "unit", targetLayer = "",
                         sourceEntity = "entry", sourceLayer = "",
                         targetForeignKeyName = "entryList",
                         targetFieldName = "averageWordLength",
                         sourceFieldName = "text",
                         type = "complex",
                         complexAction = function(x) mean(nchar(x)),
                         fieldaccess = "foreign")
head(rez007$unitDF %>% select(id, text, averageWordLength))
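Here is a hand-rolled sketch of the same idea on toy data, in case it helps to see the moving parts. Everything below is made up for illustration: in a real rezrObj the entry lists live in the nodeMap rather than in a list column, and rezonateR does the matching for you. The sketch just shows a complexAction being applied to all source rows that belong to each target row.

```r
# Source table (made-up): one row per entry
entries <- data.frame(
  id   = c("e1", "e2", "e3", "e4", "e5"),
  text = c("I", "really", "do", "no", "way"),
  stringsAsFactors = FALSE
)

# Target table (made-up): each unit lists the IDs of the entries it contains
units <- data.frame(id = c("u1", "u2"), stringsAsFactors = FALSE)
units$entryList <- list(c("e1", "e2", "e3"), c("e4", "e5"))

# The aggregating function, same shape as the complexAction used above
complexAction <- function(x) mean(nchar(x))

# For each unit, gather the texts of its entries and aggregate them
units$averageWordLength <- sapply(units$entryList, function(ids) {
  complexAction(entries$text[match(ids, entries$id)])
})
units$averageWordLength
```

Swapping complexAction for another aggregating function changes what is computed per unit without touching the lookup logic.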
Having created a bunch of auto and foreign fields, naturally we will want to try out our reloads! Let's try replacing the zero sign <0> in the text column of the tokenDF with ∅, which is more commonly used in linguistics papers. After doing this, we can then reload unitDF to look at the impact on our freshly created averageWordLength. Notice that we have to reload rezrDFs in order: first the entryDF, using information from tokenDF, then the unitDF, using information from entryDF (please be patient if running this on your computer, since reloads can take time):
# unitDF before the update
rez007$unitDF %>%
  filter(str_detect(text, "<0>")) %>%
  rez_select(id, text, averageWordLength) %>%
  head
# Change the zero format
rez007$tokenDF = changeFieldLocal(rez007$tokenDF, fieldName = "text",
                                  expression = case_when(text == "<0>" ~ "∅", TRUE ~ text))
# Reload in order: entryDF first, then unitDF
rez007$entryDF = rez007$entryDF %>% reload(rez007)
rez007$unitDF = rez007$unitDF %>% reload(rez007)
# unitDF after the update
rez007$unitDF %>%
  filter(str_detect(text, "∅")) %>%
  rez_select(id, text, averageWordLength) %>%
  head
The tidyverse package forcats is much more powerful for dealing with categorical variables, but for those of us who don't want to learn an entirely new package, rezonateR provides a few easy ways to deal with categories.
mergeCats() allows you to merge two categories. It takes a vector, normally a column, as the first argument. Subsequently, the name of each argument is a new category, and the value of each argument is a vector of names of old categories that the new category will encompass (as character values, even if the original column contains factors).
For example, the Santa Barbara Corpus categorises laughter as separate from other vocalisms. If you want to merge "Laugh" into "Vocalism", keeping everything else, then you can use this code:
# Laughter tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))
rez007 = changeField(rez007, entity = "token", layer = "", fieldName = "kind",
                     expression = mergeCats(kind, Vocalism = c("Laugh", "Vocalism")))
# Laughter tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "@")))
renameCats() has identical syntax. For example, if you want to rename Vocalism further to Voc:
# Breath tokens before
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
rez007 = changeField(rez007, entity = "token", layer = "", fieldName = "kind",
                     expression = renameCats(kind, Voc = "Vocalism"))
# Breath tokens after
head(rez007$tokenDF %>% select(id, doc, unit, text, kind) %>% filter(str_detect(text, "\\(H\\)")))
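For readers who want to check their understanding of what merging and renaming do to the values themselves, here is a plain base R sketch on a made-up character vector. It does not call mergeCats() or renameCats() and is not their implementation; the category "Pause" is invented here simply to show that untouched categories stay as they are.

```r
# Toy 'kind' column (made-up values)
kind <- c("Word", "Laugh", "Vocalism", "Pause", "Laugh")

# Merging: every old category in the set is replaced by the new label
merged <- kind
merged[merged %in% c("Laugh", "Vocalism")] <- "Vocalism"
merged

# Renaming: a single old label is replaced by a new one
renamed <- merged
renamed[renamed == "Vocalism"] <- "Voc"
renamed
```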
Adding rows is not an operation you will do all the time. In general, it is recommended to just re-import the whole thing and run all the code again. On the rare occasions where you do have to add a row, addRow() comes in handy.
You only need to supply values for core and flex fields when you add a row. An ID will be automatically generated, and the foreign/auto fields will be automatically filled in. After specifying the rezrDF, or the rezrObj with entity and layer, each argument name is a column name, and its value is the value in that column.
rez007 = addRow(rez007, "trail", "default",
                doc = "sbc007",
                chainCreateSeq = max(rez007$trailDF$default$chainCreateSeq) + 1,
                name = "Danae",
                chainSize = 1)
tail(rez007$trailDF$default)
# Note: chainSize is currently flex as it is supplied by Rezonator and not calculated by rezonateR, but this may change in the future.
The next tutorial, vignette("edit_tidyRez"), will be relatively short, because it will assume familiarity with the Tidyverse package dplyr. If you would like to do something that goes beyond the capabilities of EasyEdit, it is recommended that you familiarise yourself with the relevant dplyr function first, and then read the TidyRez vignette. If you are not familiar with Tidyverse and have no intention to learn it yet, you may elect to skip the next tutorial and go straight to vignette("edit_external").
And lest we forget, always save!
savePath = "rez007.Rdata"
rez_save(rez007, savePath)