knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(rezonateR)

This tutorial discusses the handling of trails and tracks in Rezonator using the EasyTrack series of functions. In more generally accepted linguistic terms, a trail is a coreference chain, and a track is a mention or referential expression within a coreference chain.

We will be using the same Santa Barbara Corpus annotations as before:

library(rezonateR)
path = system.file("extdata", "rez007_track.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)

The file contains coreference annotations for roughly the first fifth of the text, and the .Rdata file imported here has already been processed to include the tree information covered in vignette("trees"). This tutorial will make use of that tree information.

This tutorial will build towards a very simple toy analysis at the end, using all the changes that have been made to the rezrObj so far, to show the capabilities of rezonateR.

Getting information from previous mentions

Anaphoric and cataphoric distance

In studying coreference, we often want to know the distance from the current mention to the previous mention. EasyTrack takes care of this using a family of eight functions.

The first four functions are rarely used in practice, so we will focus on the last four, which build on the first four: unitsToLastMention(), tokensToLastMention(), unitsToNextMention() and tokensToNextMention().

Let's first find out how many units separate us from the previous mention of something using unitsToLastMention(). The result is equivalent to the gapUnits column that Rezonator generates automatically. There are two optional arguments:

The value will be NA if there are no previous mentions:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
rez007$trackDF$default %>% select(id, gapUnits, unitsToLastMention) %>% slice(1:20)
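
To make the computation concrete, here is a minimal base R sketch of the idea behind a units-to-last-mention measure. It is not rezonateR's implementation: the toy table `tracks`, its column names and the helper `units_to_last()` are all hypothetical, and the sketch assumes that the rows are already in document order.

```r
# Toy track table: one row per mention (hypothetical data, not rez007).
tracks <- data.frame(
  id    = c("t1", "t2", "t3", "t4", "t5"),
  chain = c("A", "B", "A", "A", "B"),   # which coreference chain each mention belongs to
  unit  = c(1, 2, 4, 5, 9),             # unit in which each mention occurs
  stringsAsFactors = FALSE
)

# Within each chain, subtract the unit of the preceding mention from the
# unit of the current mention. The first mention of a chain gets NA.
units_to_last <- function(unit, chain) {
  ave(unit, chain, FUN = function(u) c(NA, diff(u)))
}

tracks$unitsToLast <- units_to_last(tracks$unit, tracks$chain)
print(tracks)
```

The first mention of each chain has no antecedent, so its distance is NA, which mirrors the behaviour described above.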

Now let's count the tokens from the last mention using the tokensToLastMention() function instead. This one has a couple of complications, and there are more parameters to fill in this time:

rez007$trackDF$default =  rez007$trackDF$default %>%
  rez_mutate(wordsToLastMention = tokensToLastMention(
    docWordSeqFirst, #What seq to use
    zeroProtocol = "unitFinal", #How to treat zeroes
    zeroCond = (text == "<0>"),
    unitDF = rez007$unitDF,
    unitTokenSeqName = "docWordSeqFirst")) #Additional argument for unitFinal protocol
rez007$trackDF$default %>% select(id, wordsToLastMention) %>% slice(1:20)

The functions unitsToNextMention() and tokensToNextMention() work in the same way, except that they deal with following rather than preceding mentions.

Extracting features from previous mentions

In addition to getting the location of a previous mention, we might also want to extract a property of it:

Let's try to extract the subject status (using the Relation field annotated on treeLinks). First, we have to add this Relation field to rez007$trackDF$default, then replace the NA entries with "NonSubj" so that the missing values are treated as meaningful, and not just missing (fieldaccess is changed to "flex" to avoid future reloads messing this up):

rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldForeign(sourceDF = rez007$treeEntryDF$default,
                  targetForeignKeyName = "treeEntry",
                  targetFieldName = "Relation",
                  sourceFieldName = "Relation",
                  fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(Relation = coalesce(Relation, "NonSubj"), fieldaccess = "flex")

The first and obligatory argument of getPrevMentionField() is the name of the column, or feature, you're extracting. The other arguments, tokenOrder and chain, work the same way as before.

rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldLocal(fieldName = "prevRelation",
                expression = getPrevMentionField(Relation),
                fieldaccess = "auto")
head(rez007$trackDF$default) %>% select(id, text, name, Relation, prevRelation)
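
Conceptually, pulling a feature from the previous mention amounts to shifting a column down by one position within each chain. The following base R sketch illustrates that idea only; `tracks` and `prev_in_chain()` are hypothetical names, the data are invented, and rows are assumed to be in document order.

```r
# Toy track table (hypothetical data, not rez007).
tracks <- data.frame(
  id       = c("t1", "t2", "t3", "t4"),
  chain    = c("A", "B", "A", "A"),
  Relation = c("Subj", "NonSubj", "NonSubj", "Subj"),
  stringsAsFactors = FALSE
)

# Shift a column down by one position within each chain, so that each row
# receives the value found on the preceding mention of the same chain.
prev_in_chain <- function(x, chain) {
  ave(x, chain, FUN = function(v) c(NA, v[-length(v)]))
}

tracks$prevRelation <- prev_in_chain(tracks$Relation, tracks$chain)
print(tracks)
```

As with the distance measures, the first mention of a chain has no previous mention and therefore receives NA.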

Tallying preceding and following mentions

Apart from looking only at the previous or next mention, we can also count how many mentions of something there were within a window of units before or after a mention, optionally with additional conditions. Here are the relevant functions:

These functions have the following fields, in order:

Let's try all three functions. We will count the number of previous mentions in the previous 20 units, the previous subject mentions, and the previous mentions whose subject/nonsubject value agrees with the present mention:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noPrevMentionsIn20 = countPrevMentions(20),
             noPrevSubjMentionsIn20 = countPrevMentionsIf(20, Relation == "Subj"),
             noPrevSubjMatchMentionsIn20 = countPrevMentionsMatch(20, "Relation"))
rez007$trackDF$default %>% select(id, noPrevMentionsIn20, noPrevSubjMentionsIn20, noPrevSubjMatchMentionsIn20)  %>% slice(1:20)
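
The windowed counts can be pictured with a small base R sketch. This is an illustration under stated assumptions rather than the package's algorithm: the table `tracks` and the helper `count_prev()` are hypothetical, rows are assumed to be in document order, and an earlier mention is counted when it lies at most `window` units back (rezonateR's exact boundary conventions may differ).

```r
# Toy track table (hypothetical data, not rez007).
tracks <- data.frame(
  chain    = c("A", "A", "B", "A", "A"),
  unit     = c(1, 3, 4, 10, 30),
  Relation = c("Subj", "NonSubj", "Subj", "Subj", "NonSubj"),
  stringsAsFactors = FALSE
)

# For each mention, count the earlier mentions of the same chain that lie at
# most `window` units back and for which `cond` is TRUE.
count_prev <- function(df, window, cond = rep(TRUE, nrow(df))) {
  vapply(seq_len(nrow(df)), function(i) {
    earlier <- seq_len(i - 1)
    sum(df$chain[earlier] == df$chain[i] &
          df$unit[i] - df$unit[earlier] <= window &
          cond[earlier])
  }, numeric(1))
}

tracks$prevIn20     <- count_prev(tracks, 20)
tracks$prevSubjIn20 <- count_prev(tracks, 20, tracks$Relation == "Subj")
print(tracks)
```

Passing `Inf` as the window in this sketch removes the distance restriction, since every difference in units is then within the window.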

If you don't want a window restriction, you can set the window to Inf. Here's an example where we extract the number of future zero mentions, regardless of how far they are from the current one:

rez007$trackDF$default = rez007$trackDF$default %>% rez_mutate(noComingZeroes = countNextMentionsIf(Inf, text == "<0>"))
rez007$trackDF$default %>% select(id, noComingZeroes)  %>% slice(1:20)

Counting competitors

We may also want to count competing mentions, that is, recent mentions that are not coreferential with the current mention. The presence of competitors usually makes an explicit referential form more likely. countCompetitors() tallies the number of recent competitors. It has the following parameters:

The function countMatchingCompetitors() is similar, but instead of col, there is a field matchCol, where you should put the name of a field on which competitors must match the current mention in order to be counted:

Here is one example. noCompetitors uses a window of 10 units, and may look beyond the previous mention, whereas noMatchingCompetitors is similar, but only looks between the current and previous mention, and only counts mentions with matching Relation values:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noCompetitors = countCompetitors(windowSize = 10, between = FALSE),
             noMatchingCompetitors = countMatchingCompetitors(Relation, windowSize = 10, between = TRUE))
rez007$trackDF$default %>% select(id, text, noCompetitors, noMatchingCompetitors)  %>% slice(1:20)
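
The difference between looking across the whole window and looking only between the current and previous mention can be sketched in base R as follows. Everything here is hypothetical (the toy `tracks` table and the helper `count_competitors()`), rows are assumed to be in document order, and the sketch is meant only to illustrate the two counting modes, not to reproduce rezonateR's code.

```r
# Toy track table (hypothetical data, not rez007).
tracks <- data.frame(
  chain = c("A", "B", "A", "B", "A"),
  unit  = c(1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# A competitor is an earlier mention from a different chain that lies at most
# `window` units back. With between = TRUE, only competitors that occur after
# the previous mention of the current chain are counted.
count_competitors <- function(df, window, between = FALSE) {
  vapply(seq_len(nrow(df)), function(i) {
    earlier <- seq_len(i - 1)
    same <- earlier[df$chain[earlier] == df$chain[i]]
    # Row index of the previous coreferential mention (0 if there is none)
    last_same <- if (length(same) > 0) max(same) else 0
    keep <- df$chain[earlier] != df$chain[i] &
      df$unit[i] - df$unit[earlier] <= window
    if (between) keep <- keep & earlier > last_same
    sum(keep)
  }, numeric(1))
}

tracks$competitors        <- count_competitors(tracks, 10, between = FALSE)
tracks$competitorsBetween <- count_competitors(tracks, 10, between = TRUE)
print(tracks)
```

In this toy data the last two mentions have two competitors in the window but only one since their previous mention, which is the kind of contrast the between setting is meant to capture.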

Adding verb information to the track table, and vice versa

We often want to connect information about verbs to their arguments. We may either put verb information in a track table, or put track information in a verb table. The former approach can be taken when investigating issues like coreference, and the latter for issues like argument structure.

If we want to add verb information to the track table, we can do this in two steps:

  1. Add a treeParent column to trackDF$default that takes the value of the parent column of treeEntryDF.
  2. Using the treeParent column, find the corresponding verb in the verb table chunkDF$verb through the treeEntry column of chunkDF$verb.

rez007 = rez007 %>%
  addFieldForeign("track", "default", "treeEntry", "default", "treeEntry", "treeParent", "parent", fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_left_join(rez007$chunkDF$verb %>% select(id, text, treeEntry),
                by = c(treeParent = "treeEntry"),
                suffix = c("", "_verb"),
                df2Address = "chunkDF/verb",
                fkey = "treeParent",
                df2key = "treeEntry",
                rezrObj = rez007) %>%
  rename(verbID = id_verb, verbText = text_verb)
rez007$trackDF$default %>% select(id, treeParent, verbID, verbText) %>% slice(1:20)
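
Stripped of the rezrDF bookkeeping, the lookup in the second step is an ordinary keyed match. Here is a minimal base R sketch with invented tables (`tracks`, `verbs`) and invented IDs; it only illustrates the matching logic and does not update any of the field access information that rez_left_join() maintains.

```r
# Toy tables (hypothetical data, not rez007).
tracks <- data.frame(
  id         = c("t1", "t2", "t3"),
  treeParent = c("e10", "e11", NA),   # tree entry of each mention's parent
  stringsAsFactors = FALSE
)
verbs <- data.frame(
  id        = c("v1", "v2"),
  text      = c("said", "go"),
  treeEntry = c("e10", "e12"),        # tree entry that each verb corresponds to
  stringsAsFactors = FALSE
)

# Find, for each mention, the row of the verb whose treeEntry equals the
# mention's treeParent. Mentions without a matching verb get NA.
idx <- match(tracks$treeParent, verbs$treeEntry)
tracks$verbID   <- verbs$id[idx]
tracks$verbText <- verbs$text[idx]
print(tracks)
```

Only the first toy mention has a parent that is a verb, so the other two rows keep NA in both new columns.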

Now with the verbID in place, we can do the reverse process of putting argument information in the verb table too. The reverse process is similar, but a little more dangerous because you could potentially create duplicate rows in chunkDF$verb if you are not careful! Let's say we want to add information about the subject to chunkDF$verb - say, the number of the subject. You would first have to make sure that each verb only has one subject! If you have made a lot of annotations, mistakes are likely to happen, so make sure to check first. In this code, we extract the verb IDs of each subject, and make sure that the verb IDs are unique:

verbsBySubject = rez007$trackDF$default %>%
  filter(Relation == "Subj", !is.na(verbID)) %>%
  pull(verbID)
if(length(verbsBySubject) != length(unique(verbsBySubject))) print("Error found!") else print("You're good!")
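
If the check fails, it helps to know which verbs are affected. The base R functions anyDuplicated() and duplicated() can report this; the vector `verb_ids` below is an invented stand-in for the verb IDs pulled from the track table.

```r
# Hypothetical verb IDs, one per subject mention; "v2" has two subjects.
verb_ids <- c("v1", "v2", "v3", "v2")

# anyDuplicated() returns 0 when all values are unique, and otherwise the
# index of the first duplicated value.
has_dupes <- anyDuplicated(verb_ids) > 0

# List each offending verb ID once so the annotations can be inspected.
dupes <- unique(verb_ids[duplicated(verb_ids)])

if (has_dupes) {
  print(paste("Verbs with more than one subject:", paste(dupes, collapse = ", ")))
} else {
  print("You're good!")
}
```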

Now we can safely move this information to the verb table:

rez007$chunkDF$verb = rez007$chunkDF$verb %>%
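  rez_left_join(rez007$trackDF$default %>%
                  filter(Relation == "Subj", !is.na(verbID)) %>% #Keep subjects only, so each verb matches at most one row
                  select(id, text, verbID),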
                by = c(id = "verbID"),
                suffix = c("", "_subj"),
                df2Address = "trackDF/default",
                fkey = "id",
                df2key = "verbID",
                rezrObj = rez007) %>%
  rename(subjID = id_subj, subjText = text_subj)
rez007$chunkDF$verb %>% select(id, text, subjID, subjText)

Bridging

Natively, Rezonator does not yet support bridging annotation. rezonateR handles bridging a bit unusually, since it is difficult to do direct bridging annotation without an annotation interface like Rezonator's.

The first step in doing bridging annotation is to create a frameMatrix, a data frame which is used to enter framing relationships between chains (for example, the car's engine has a part-whole relationship with the car). But before doing that, we must ensure that there are no chains with repeated names. We use the function undupeLayers(), which has three arguments: the rezrObj, the layer you want to undupe, and the field/column you want to undupe. This will add numbers next to duplicated chain names. Then we can call addFrameMatrix() on the rezrObj. The following code does these steps, then shows the first 10 rows and 12 columns of the frameMatrix (the first two columns give the IDs and names of the chains):

rez007 = undupeLayers(rez007, "trail", "name")
rez007 = addFrameMatrix(rez007)
frameMatrix(rez007)[1:10, 1:12]

The second step is to export the frameMatrix and populate it with actual annotations. Although it does not apply to our case, the function reduceFrameMatrix() will often be useful for removing rows and columns that do not actually participate in framing relations, or for dividing the matrix into subparts that don't have framing relationships with each other, so that we end up with a cleaner CSV to annotate.

We use the function obscureUpper() to obscure the upper triangular matrix of 'repeat' entries so that we don't duplicate our annotation efforts: if we've already annotated that the car's engine and the car have a part-whole relationship in the lower triangular matrix, we don't need to annotate that the car and the car's engine have a whole-part relationship!

After obscureUpper() and (optionally) reduceFrameMatrix(), the frameMatrix can be exported using rez_write_csv() and edited in an external editor. A spreadsheet program is highly recommended for this; you can use its freeze panes feature to keep the chain names visible and annotate these relations more easily. When entering relationships, use the format '(role of the row entity)-(role of the column entity)'. For example, if the car's engine is the row and the car is the column, type part-whole, not whole-part. Alternatively, if there are only a few of these relations and you already remember which ones they are, you can edit those relations in R directly, using base R assignment.

For this text, we will mainly be annotating individual-group relationships.

rez_write_csv(obscureUpper(frameMatrix(rez007)), "rez007_frame.csv")
newFrame = rez_read_csv("rez007_frame_edited.csv", origDF = frameMatrix(rez007))
newFrame[10:20,c(1,2,12:22)]

Rather than updateFromDF(), we use updateFrameMatrixFromDF() to update the frameMatrix. This function will 'flip' the relationships for the upper triangular matrix. For example, if in the lower triangular matrix you annotated the car's engine as having a part-whole relationship with the car, then 'whole-part' will show up in the upper triangular matrix for the row 'the car' and the column 'the car's engine'.

frameMatrix(rez007) = updateFrameMatrixFromDF(frameMatrix(rez007), newFrame)
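
The 'flipping' can be illustrated on a plain character matrix. The sketch below is only a picture of the idea, not the behaviour of updateFrameMatrixFromDF() itself: the matrix `m`, its entity names and the helper `flip_relation()` are hypothetical, and a real frameMatrix also carries ID and name columns that are left out here.

```r
# Reverse the two roles in a relation label, e.g. "part-whole" -> "whole-part".
# Empty labels stay empty.
flip_relation <- function(rel) {
  parts <- strsplit(rel, "-", fixed = TRUE)
  vapply(parts, function(p) paste(rev(p), collapse = "-"), character(1))
}

# Toy relation matrix: rows and columns are entities, and only the lower
# triangle has been annotated (row role first, column role second).
entities <- c("car", "engine", "wheel")
m <- matrix("", nrow = 3, ncol = 3, dimnames = list(entities, entities))
m["engine", "car"] <- "part-whole"
m["wheel", "car"]  <- "part-whole"

# Fill the upper triangle with the flipped labels of the mirrored cells.
filled <- m
upper <- upper.tri(m)
filled[upper] <- flip_relation(t(m)[upper])
print(filled)
```

After filling, the row for the car reads whole-part against both the engine and the wheel, while the unannotated engine-wheel pair stays empty in both directions.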

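After updating the frame matrix, we can easily extract information about bridging relations, such as the distance from a mention to its most recent bridge.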

In practice, we will generally be using unitsToLastBridge() and tokensToLastBridge(). The arguments needed are very similar to what we saw for unitsToLastMention() and tokensToLastMention(). Here are the arguments for unitsToLastBridge():

The arguments of tokensToLastBridge() are again mostly things we have seen before:

Let's say you want to extract the units and tokens to the previous bridge, including only the "individual-group" relation (so that it doesn't count as a bridge if the current mention is a group that includes the previous mention, but it does count if the current mention is an individual that is included in the previous mention). Here is the code:

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(bridgeDistUnit = unitsToLastBridge(frameMatrix(rez007),
                                                     inclRelations = "individual-group"),
             bridgeDistToken = tokensToLastBridge(frameMatrix(rez007),
                                                     inclRelations = "individual-group"))
rez007$trackDF$default %>% select(id, text, bridgeDistUnit, bridgeDistToken) %>% slice(53:63)

Notice that if there has been no preceding bridge, the value is NA.

A toy analysis

Now that we've come so far, let's try to do a mockup of a real linguistic analysis! Of course, an actual analysis would take far more data than what we have here, as well as a more carefully designed annotation scheme. But what we do here should suffice to demonstrate how an analysis might be done.

Let's try to predict the number of characters inside a referential expression from three variables:

We will use a linear regression for this prediction. (This probably isn't the best model, but let's keep it simple for this demonstration.) First we need to create noPrevNonSubjMentionsIn20 (which we do in an emancipated rezrDF to avoid clogging up the main table), then we'll convert Relation to a factor, and then we'll use the lm() function in base R to do the prediction:

analysis_df = rez007$trackDF$default %>% rez_mutate(
  noPrevNonSubjMentionsIn20 = noPrevMentionsIn20 - noPrevSubjMentionsIn20,
  Relation = stringToFactor(Relation)
)
lm_nochar = lm(charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 + noCompetitors + Relation + number, data = analysis_df)
lm_nochar
anova(lm_nochar)

As we can see, noPrevSubjMentionsIn20, noPrevNonSubjMentionsIn20 and Relation emerge as very good predictors: entities that were mentioned more often in the previous 20 units, and entities that are subjects, have a much stronger tendency to be expressed with light (that is, short) forms.

And so our journey ends here.

This concludes our journey through the basic functionality of rezonateR. As usual, let's not forget to save, for one last time:

savePath = "rez007.Rdata"
rez_save(rez007, savePath)


johnwdubois/rezonateR documentation built on Nov. 19, 2024, 11:17 p.m.