In johnwdubois/rezonateR: A Support Package for Working with Rezonator in R

This tutorial discusses the handling of trails and tracks in Rezonator using the EasyTrack series of functions. In more generally accepted linguistic terms, a trail is a coreference chain, and a track is a mention or referential expression within a coreference chain.

We will be using the same Santa Barbara Corpus annotations as before:

library(rezonateR)
path = system.file("extdata", "rez007_track.Rdata", package = "rezonateR", mustWork = T)
rez007 = rez_load(path)

The file contains coreference annotations for roughly the first fifth of the text, and the .Rdata file imported here has already been processed to include the tree information described in vignette("trees"). This tutorial will make use of that information.

This tutorial will build towards a very simple toy analysis at the end, using all the changes that have been made to the rezrObj so far, to show the capabilities of rezonateR.

Getting information from previous mentions

Anaphoric and cataphoric distance

In studying coreference, we often want to know the distance from the current mention to the previous mention. EasyTrack takes care of this using a family of functions:

lastMentionUnit() and nextMentionUnit(): Give you the unit ID of the previous and next mention, respectively.
lastMentionToken() and nextMentionToken(): Give you the token ID of the previous and next mention, respectively.
unitsToLastMention() and unitsToNextMention(): Give you the number of units from the current mention to the last mention and to the next mention, respectively.
tokensToLastMention() and tokensToNextMention(): Give you the number of tokens from the current mention to the last mention and to the next mention, respectively.

The first four functions are rarely used in practice, so we will focus on the last four, which build on the first four.

Let's first find out how many units we are from the previous mention of something using unitsToLastMention(). This is equivalent to the gapUnits column that is automatically generated by Rezonator. There are two optional arguments:

unitSeq: The unit order values where the mentions appeared. Here, we use the unitSeqLast column, which is the default value, though unitSeqFirst is also possible.
chain: The column that gives the chain that each track belongs to. Typically there is no reason to touch this parameter; just leave it blank, and the column chain will be used.

The value will be NA if there are no previous mentions:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
rez007$trackDF$default %>% select(id, gapUnits, unitsToLastMention) %>% slice(1:20)
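To make the idea concrete, here is a minimal base R sketch of the underlying computation on a toy table. It is not rezonateR's implementation: the tracks data frame and the helper units_to_last() are hypothetical, and the sketch assumes that the rows are already in document order.

```r
# Toy track table: one row per mention (hypothetical data, not from rez007)
tracks <- data.frame(
  chain       = c("A", "B", "A", "A", "B"),
  unitSeqLast = c(1, 2, 4, 4, 9)
)

# Distance in units to the previous mention in the same chain.
# The first mention of each chain has no predecessor, hence NA.
units_to_last <- function(unitSeq, chain) {
  ave(unitSeq, chain, FUN = function(x) c(NA, diff(x)))
}

tracks$unitsToLast <- units_to_last(tracks$unitSeqLast, tracks$chain)
print(tracks)
# unitsToLast is NA, NA, 3, 0, 7
```

A distance of 0 means that the previous mention is in the same unit as the current one.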

Now let's count the tokens from the last mention using the tokensToLastMention() function instead. This one has a couple of complications, and there are more parameters to fill in this time:

tokenOrder: Similar to unitSeq, but for counting tokens. Common choices are docTokenSeqFirst, docTokenSeqLast, wordTokenSeqFirst and wordTokenseqLast (see vignette("time_seq") for the last two). By default it's docTokenSeqLast.
chain: As above.
zeroProtocol: How the positions of zeroes are determined. By default, it is "literal", i.e. the position at which the zero was inserted. If "unitFinal", zeroes will be treated as being located at the end of the unit. If "unitInitial", they will be treated as the first word of the unit.
zeroCond: A condition for determining whether a token is zero. Normally, this is (word column) == "<0>" if the default Rezonator zero is used, though others may prefer "<ZERO>" or similar.
unitSeq: As above. Required when using the "unitFinal" and "unitInitial" protocols.
unitTokenSeqName: The name of the tokenSeq column to be used in the unitDF.
unitDF: The rezrDF containing the units.

rez007$trackDF$default =  rez007$trackDF$default %>%
  rez_mutate(wordsToLastMention = tokensToLastMention(
    docWordSeqFirst, #What seq to use
    zeroProtocol = "unitInitial", #How to treat zeroes
    zeroCond = (text == "<0>"),
    unitDF = rez007$unitDF,
    unitTokenSeqName = "docWordSeqFirst")) #unitDF and unitTokenSeqName are needed for the unit-based zero protocols
rez007$trackDF$default %>% select(id, wordsToLastMention) %>% slice(1:20)
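The effect of the zero protocol can be illustrated with a small base R sketch. This is only a conceptual illustration under stated assumptions, not how rezonateR computes positions internally: the units and mentions tables, their column names, and the helper zero_position() are all hypothetical.

```r
# Hypothetical unit table: token positions of the first and last word of each unit
units <- data.frame(unitSeq = 1:2, firstTok = c(1, 6), lastTok = c(5, 10))

# Hypothetical mentions: an overt pronoun in unit 1 and a zero inserted in unit 2
mentions <- data.frame(
  text     = c("she", "<0>"),
  tokenSeq = c(2, 8),
  unitSeq  = c(1, 2),
  stringsAsFactors = FALSE
)

# Return the position used for each mention under a given zero protocol.
# Only zeroes are moved; overt mentions always keep their literal position.
zero_position <- function(mentions, units,
                          protocol = c("literal", "unitInitial", "unitFinal")) {
  protocol <- match.arg(protocol)
  is_zero <- mentions$text == "<0>"
  u <- units[match(mentions$unitSeq, units$unitSeq), ]
  pos <- mentions$tokenSeq
  if (protocol == "unitInitial") pos[is_zero] <- u$firstTok[is_zero]
  if (protocol == "unitFinal")   pos[is_zero] <- u$lastTok[is_zero]
  pos
}

zero_position(mentions, units, "literal")      # 2 8
zero_position(mentions, units, "unitInitial")  # 2 6
zero_position(mentions, units, "unitFinal")    # 2 10
```

Because the position of a zero within its unit is often arbitrary, the choice of protocol can change token distances noticeably, while unit distances are unaffected.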

The functions unitsToNextMention() and tokensToNextMention() work in the same way, except that they deal with following rather than preceding mentions.

Extracting features from previous mentions

In addition to getting the location of a previous mention, we might also want to extract a property of it:

getPrevMentionField() and getNextMentionField(): Extract a feature of the previous or next mention.

Let's try to extract the subject status (using the Relation field annotated on treeLinks). First, we have to add this Relation field to rez007$trackDF$default, then replace the NA entries with "NonSubj" so that the missing values are treated as meaningful, and not just missing (fieldaccess is changed to "flex" to avoid future reloads messing this up):

rez007$trackDF$default = rez007$trackDF$default %>% addFieldForeign(sourceDF = rez007$treeEntryDF$default, targetForeignKeyName = "treeEntry", targetFieldName = "Relation", sourceFieldName = "Relation", fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>% rez_mutate(Relation = coalesce(Relation, "NonSubj"), fieldaccess = "flex")

The first and obligatory argument of getPrevMentionField() is the name of the column, or feature, you're extracting. The other arguments, tokenOrder and chain, work the same way as before.

rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldLocal(fieldName = "prevRelation",
                expression = getPrevMentionField(Relation),
                fieldaccess = "auto")
head(rez007$trackDF$default) %>% select(id, text, name, Relation, prevRelation)
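Conceptually, extracting a feature of the previous mention amounts to shifting a column down by one position within each chain. The following base R sketch shows this on toy data. The tracks table and the helper prev_in_chain() are hypothetical, the rows are assumed to be in document order, and this is not the actual rezonateR implementation.

```r
# Toy track table (hypothetical data)
tracks <- data.frame(
  chain    = c("A", "B", "A", "B", "A"),
  Relation = c("Subj", "NonSubj", "NonSubj", "Subj", "Subj"),
  stringsAsFactors = FALSE
)

# Shift a column down by one within each chain,
# so that each row receives the value of the previous mention in its chain.
prev_in_chain <- function(x, chain) {
  ave(x, chain, FUN = function(v) c(NA, head(v, -1)))
}

tracks$prevRelation <- prev_in_chain(tracks$Relation, tracks$chain)
print(tracks)
# prevRelation is NA, NA, "Subj", "NonSubj", "NonSubj"
```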

Tallying preceding and following mentions

Apart from looking only at the previous or next mention, we can also count how many mentions of something there were within a window of units before or after a mention, optionally with additional conditions. Here are the relevant functions:

countPrevMentions() and countNextMentions(): Get the number of previous or following mentions within a specified window of units.
countPrevMentionsIf() and countNextMentionsIf(): Get the number of previous or following mentions within a specified window of units, given that they satisfy certain conditions (which do not depend on the current mention).
countPrevMentionsMatch() and countNextMentionsMatch(): Get the number of previous or following mentions within a specified window of units, given that they have the same value as the current mention for some field.

These functions take the following arguments, in order:

windowSize: How many IUs before / after the current one do you want to count?
cond (countPrevMentionsIf only): The condition that mentions must satisfy to be counted.
field (countPrevMentionsMatch only): The field whose value is to be matched.
unitSeq: As before.
chain: As before.

Let's try all three functions. We will count the number of previous mentions in the previous 20 units, the previous subject mentions, and the previous mentions whose subject/nonsubject value agrees with the present mention:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noPrevMentionsIn20 = countPrevMentions(20),
             noPrevSubjMentionsIn20 = countPrevMentionsIf(20, Relation == "Subj"),
             noPrevSubjMatchMentionsIn20 = countPrevMentionsMatch(20, "Relation"))
rez007$trackDF$default %>% select(id, noPrevMentionsIn20, noPrevSubjMentionsIn20, noPrevSubjMatchMentionsIn20)  %>% slice(1:20)

If you don't want a window restriction, you can set the window to Inf. Here's an example where we extract the number of future zero mentions, regardless of how far they are from the current one:

rez007$trackDF$default = rez007$trackDF$default %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, text == "<0>"))
rez007$trackDF$default %>% select(id, noComingZeroes)  %>% slice(1:20)
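The logic of windowed counting can be sketched in base R as follows. This is an illustration under assumptions, not rezonateR's code: the helper count_prev_in_window() and the toy vectors are hypothetical, rows are assumed to be in document order, and the window is assumed to include mentions exactly windowSize units back as well as earlier mentions in the same unit.

```r
# Toy data: chain membership, unit position and subject status of five mentions
chain   <- c("A", "A", "B", "A", "A")
unitSeq <- c(1, 3, 4, 10, 25)
isSubj  <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# For each mention, count earlier mentions of the same chain that fall
# within windowSize units and satisfy an optional condition.
count_prev_in_window <- function(unitSeq, chain, windowSize,
                                 cond = rep(TRUE, length(unitSeq))) {
  vapply(seq_along(unitSeq), function(i) {
    earlier <- seq_len(i - 1)
    sum(chain[earlier] == chain[i] &
          unitSeq[earlier] >= unitSeq[i] - windowSize &
          cond[earlier])
  }, integer(1))
}

count_prev_in_window(unitSeq, chain, 20)          # 0 1 0 2 1
count_prev_in_window(unitSeq, chain, 20, isSubj)  # 0 1 0 1 1
count_prev_in_window(unitSeq, chain, Inf)         # 0 1 0 2 3
```

With windowSize = Inf the lower bound becomes -Inf, so every earlier mention of the chain is counted, which mirrors the unrestricted case described above.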

Counting competitors

We may also want to count competing mentions, that is, recent mentions that are not coreferential with the current mention. The presence of competitors usually suggests that a referential form is more likely to be explicit. countCompetitors() tallies the number of recent competitors. It has the following parameters, all of which are optional:

cond: The condition under which something counts as a competitor (other than being non-coreferential with the present mention). By default, anything goes.
windowSize: How far back (in units) do you want to look? By default, there is no limit.
tokenSeq: As before.
unitSeq: As before.
chain: As before.
between: Do you count only competitors between the current mention and the previous mention in the same trail, or do you also count mentions from before the previous mention?

The function countMatchingCompetitors() is similar, but instead of cond, there is a parameter matchCol, where you should put the name of a field in which competitors must match the current mention in order to be counted.

Here is one example. noCompetitors uses a window of 10 units and, because between = F, may look beyond the previous mention. noMatchingCompetitors uses the same settings, but only counts mentions with matching Relation values:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noCompetitors = countCompetitors(windowSize = 10, between = F),
             noMatchingCompetitors = countMatchingCompetitors(Relation, windowSize = 10, between = F))
rez007$trackDF$default %>% select(id, text, noCompetitors, noMatchingCompetitors)  %>% slice(1:20)
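As a rough illustration of what counting competitors involves, here is a base R sketch on toy data. The helper count_competitors() is hypothetical and simplified: it assumes rows are in document order and ignores the cond and between options described above. It is not the rezonateR implementation.

```r
# Toy data: chain membership and unit position of five mentions
chain   <- c("A", "B", "A", "C", "A")
unitSeq <- c(1, 2, 3, 4, 12)

# For each mention, count earlier mentions that belong to a different chain
# and fall within windowSize units of the current mention.
count_competitors <- function(unitSeq, chain, windowSize = Inf) {
  vapply(seq_along(unitSeq), function(i) {
    earlier <- seq_len(i - 1)
    sum(chain[earlier] != chain[i] &
          unitSeq[earlier] >= unitSeq[i] - windowSize)
  }, integer(1))
}

count_competitors(unitSeq, chain, windowSize = 10)  # 0 1 1 3 2
count_competitors(unitSeq, chain)                   # 0 1 1 3 2
```

In this toy example the two calls agree because every competitor happens to lie within ten units of the mention it competes with.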

Adding verb information to the track table, and vice versa

We often want to connect information about verbs to their arguments. We may either put verb information in a track table, or put track information in a verb table. The former approach can be taken when investigating issues like coreference, and the latter for issues like argument structure.

If we want to add verb information to the track table, we can do this in two steps:

Add a treeParent column to trackDF$default that takes the value of the parent column of treeEntryDF.
Using the treeParent column, find the corresponding verb in the verb table chunkDF$verb through the treeEntry column of chunkDF$verb.

rez007 = rez007 %>%
  addFieldForeign("track", "default", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_left_join(rez007$chunkDF$verb %>% select(id, text, treeEntry),
                by = c(treeParent = "treeEntry"),
                suffix = c("", "_verb"),
                df2Address = "chunkDF/verb",
                fkey = "treeParent",
                df2key = "treeEntry",
                rezrObj = rez007) %>%
  rename(verbID = id_verb, verbText = text_verb)
rez007$trackDF$default %>% select(id, treeParent, verbID, verbText) %>% slice(1:20)

Now with the verbID in place, we can do the reverse process of putting argument information in the verb table too. The reverse process is similar, but a little more dangerous because you could potentially create duplicate rows in chunkDF$verb if you are not careful! Let's say we want to add information about the subject to chunkDF$verb - say, the number of the subject. You would first have to make sure that each verb only has one subject! If you have made a lot of annotations, mistakes are likely to happen, so make sure to check first. In this code, we extract the verb IDs of each subject, and make sure that the verb IDs are unique:

verbsBySubject = rez007$trackDF$default %>%
  filter(Relation == "Subj", !is.na(verbID)) %>%
  pull(verbID)
if(length(verbsBySubject) != length(unique(verbsBySubject))) print("Error found!") else print("You're good!") #Compare the total count with the count of distinct verb IDs
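If the check fails, it helps to know which verbs are affected. The following base R sketch shows one way to list them. The verbIDs vector here is made-up toy data standing in for verbsBySubject.

```r
# Hypothetical verb IDs, one per subject mention; "v2" has two subjects
verbIDs <- c("v1", "v2", "v2", "v3")

# duplicated() flags the second and later occurrences of each value,
# so this yields every verb ID that appears more than once.
dupes <- unique(verbIDs[duplicated(verbIDs)])
print(dupes)
# "v2"
```

The affected annotations can then be inspected and corrected before joining.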

Now we can safely move this information to the verb table:

rez007$chunkDF$verb = rez007$chunkDF$verb %>%
  rez_left_join(rez007$trackDF$default %>% filter(Relation == "Subj") %>% select(id, text, verbID), #Keep subjects only, so that each verb matches at most one row
                by = c(id = "verbID"),
                suffix = c("", "_subj"),
                df2Address = "trackDF/default",
                fkey = "id",
                df2key = "verbID",
                rezrObj = rez007) %>%
  rename(subjID = id_subj, subjText = text_subj)
rez007$chunkDF$verb %>% select(id, text, subjID, subjText)

Bridging

Natively, Rezonator does not yet support bridging annotation. rezonateR handles bridging a bit unusually, since it is difficult to do direct bridging annotation without an annotation interface like Rezonator's.

The first step in doing bridging annotation is to create a frameMatrix, a data frame which is used to enter framing relationships between chains (for example, the car's engine has a part-whole relationship with the car). But before doing that, we must ensure that there are no chains with repeated names. We use the function undupeLayers(), which has three arguments: the rezrObj, the layer you want to undupe, and the field/column you want to undupe. This will add numbers next to duplicated chain names. Then we can call addFrameMatrix() on the rezrObj. The following code does these steps, then shows the first 10 rows and 12 columns of the frameMatrix (the first two columns give the IDs and names of the chains):

rez007 = undupeLayers(rez007, "trail", "name")
rez007 = addFrameMatrix(rez007)
frameMatrix(rez007)[1:10, 1:12]

The second step is to export the frameMatrix and populate it with actual annotations. Although it does not apply to our case, the function reduceFrameMatrix() is often useful for removing rows and columns that do not actually participate in framing relations, or for dividing the matrix into subparts that have no framing relationships with each other, so that we end up with a cleaner CSV to annotate.

We use the function obscureUpper() to obscure the upper triangular matrix of 'repeat' entries so that we don't duplicate our annotation efforts: if we've already annotated that the car's engine and the car have a part-whole relationship in the lower triangular matrix, we don't need to annotate that the car and the car's engine have a whole-part relationship!
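The idea of hiding the redundant half of a square matrix can be sketched in base R with upper.tri(). This is only a conceptual sketch: the toy matrix, the chain names, and the "x" placeholder are assumptions, and the marker that obscureUpper() actually writes may differ.

```r
# Toy square matrix of chains (hypothetical names), initially empty
chains <- c("the car", "the car's engine", "the wheels")
fm <- matrix("", nrow = 3, ncol = 3, dimnames = list(chains, chains))

# Mark every cell above the diagonal so that the annotator skips it;
# "x" is a made-up placeholder for this sketch.
fm[upper.tri(fm)] <- "x"
print(fm)
```

Only the cells below the diagonal remain to be filled in, which halves the number of chain pairs to consider.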

After obscureUpper() and (optionally) reduceFrameMatrix(), the frameMatrix can be exported using rez_write_csv() and edited in an external editor. A spreadsheet program is highly recommended for this; you can use the freeze frame feature to annotate these relations more easily. When entering relationships, use the format '(role of the row entity)-(role of the column entity)'. For example, if the car's engine is the row and the car is the column, type part-whole, not whole-part. Alternatively, if there are only a few of these relations and you already remember which ones they are, you can edit those relations in R directly. You can simply use base R assignment for this.

For this text, we will mainly be annotating individual-group relationships.

rez_write_csv(obscureUpper(frameMatrix(rez007)), "rez007_frame.csv")
newFrame = rez_read_csv("rez007_frame_edited.csv", origDF = frameMatrix(rez007))
newFrame[10:20,c(1,2,12:22)]

Rather than updateFromDF(), we use updateFrameMatrixFromDF() to update the frameMatrix. This function will 'flip' the relationships for the upper triangular matrix. For example, if in the lower triangular matrix you annotated the car's engine as having a part-whole relationship with the car, then 'whole-part' will show up in the upper triangular matrix for the row 'car' and the column the car's engine.

frameMatrix(rez007) = updateFrameMatrixFromDF(frameMatrix(rez007), newFrame)
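The 'flipping' step can be illustrated with a short base R sketch on a toy matrix. The helper flip_relation() and the matrix are hypothetical, and the sketch assumes that role names themselves contain no hyphens. It is meant to convey the idea, not to reproduce updateFrameMatrixFromDF().

```r
# Reverse the two roles in a "rowRole-colRole" label
flip_relation <- function(rel) {
  parts <- strsplit(rel, "-", fixed = TRUE)
  vapply(parts, function(p) paste(rev(p), collapse = "-"), character(1))
}

# Toy matrix with one annotation in the lower triangle
chains <- c("car", "engine")
fm <- matrix("", 2, 2, dimnames = list(chains, chains))
fm["engine", "car"] <- "part-whole"

# Copy each annotated lower-triangle cell to its mirror cell, with roles reversed
filled <- which(lower.tri(fm) & fm != "", arr.ind = TRUE)
for (k in seq_len(nrow(filled))) {
  ri <- filled[k, "row"]
  ci <- filled[k, "col"]
  fm[ci, ri] <- flip_relation(fm[ri, ci])
}
print(fm)
# fm["car", "engine"] is now "whole-part"
```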

After having updated the frame matrix, we can easily extract bridging information using the following functions:

lastBridgeUnit(): Get the location (in unit) of the previous unit with a bridge to this unit.
lastBridgeToken(): Get the location (in tokens) of the bridging expression to this unit.
unitsToLastBridge(): Get the number of units between the closest unit with a bridge to the current unit, and the current unit.
tokensToLastBridge(): Get the number of tokens between the bridging expression and this unit.
countPrevBridges(): Count the number of previous bridging expressions in a specified window.

In practice, we will generally be using unitsToLastBridge() and tokensToLastBridge(). The arguments needed are very similar to what we saw for unitsToLastMention() and tokensToLastMention(). Here are the arguments for unitsToLastBridge():

frameMatrix: The frameMatrix.
unitSeq: As before.
chain: As before.
tokenOrderLast: The token sequence value for the last token in an expression, by default docTokenSeqLast.
tokenOrderFirst: The token sequence value for the first token in an expression, by default docTokenSeqFirst.
inclRelations: Vector of relations that will be counted. This allows you to, for example, count part-whole relations but not whole-part. If left blank, everything will be counted.

The arguments of tokensToLastBridge() are again mostly things we have seen before:

frameMatrix: As before.
firstOrLast: Do you count the first or last token of the previous bridging expression? Either "first" or, by default, "last".
tokenOrderFirst: As before.
tokenOrderLast: As before.
chain: As before.
zeroProtocol: As before.
zeroCond: As before.
unitSeq: As before.
unitDF: As before.
inclRelations: As before.

Let's say you want to extract the units and tokens to the previous bridge, including only the "individual-group" relation (so that it doesn't count as a bridge if the current mention is a group that includes the previous mention, but it does count if the current mention is an individual that is included in the previous mention). Here is the code:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(bridgeDistUnit = unitsToLastBridge(frameMatrix(rez007),
                                                     inclRelations = "individual-group"),
             bridgeDistToken = tokensToLastBridge(frameMatrix(rez007),
                                                     inclRelations = "individual-group"))
rez007$trackDF$default %>% select(id, text, bridgeDistUnit, bridgeDistToken) %>% slice(53:63)

Notice that if there has been no preceding bridge, the value is NA.

A toy analysis

Now that we've come so far, let's try to do a mockup of a real linguistic analysis! Of course, an actual analysis would take far more data than what we have here, as well as a more carefully designed annotation scheme. But what we do here should suffice to demonstrate how an analysis might be done.

Let's try to predict the number of characters inside a referential expression from five variables:

noPrevSubjMentionsIn20: The number of coreferent subject mentions within the 20 previous units.
noPrevNonSubjMentionsIn20: The number of coreferent non-subject mentions within the 20 previous units.
noCompetitors: The number of competitors within the ten previous units.
Relation: Is the current mention a subject?
number: Is the current mention singular or plural?

We will use a linear regression for this prediction. (This probably isn't the best model, but let's keep it simple for this demonstration.) First we need to create noPrevNonSubjMentionsIn20 (which we do in an emancipated rezrDF to avoid clogging up the main table), then we'll convert Relation to a factor, and then we'll use the lm() function in base R to do the prediction:

analysis_df = rez007$trackDF$default %>% rez_mutate(
  noPrevNonSubjMentionsIn20 = noPrevMentionsIn20 - noPrevSubjMentionsIn20,
  Relation = stringToFactor(Relation)
)
lm_nochar = lm(charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 + noCompetitors + Relation + number, data = analysis_df)
lm_nochar
anova(lm_nochar)

As we can see, noPrevSubjMentionsIn20, noPrevNonSubjMentionsIn20 and Relation emerge as very good predictors: entities that were mentioned more often in the previous 20 units, and entities that are subjects, have a much stronger tendency to be light, i.e. to be expressed with fewer characters.

And so our journey ends here.

This concludes our journey through the basic functionality of rezonateR. And of course, for one last time, don't forget to save:

savePath = "rez007.Rdata"
rez_save(rez007, savePath)

johnwdubois/rezonateR documentation built on April 17, 2025, 4:08 p.m.
