knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE
)
options(rmarkdown.html_vignette.check_title = FALSE)
library(rezonateR)
This tutorial discusses the handling of trails and tracks in Rezonator using the EasyTrack series of functions. In more generally accepted linguistic terms, a trail is a coreference chain, and a track is a mention or referential expression within a coreference chain.
We will be using the same Santa Barbara Corpus annotations as before:
library(rezonateR)
path = system.file("extdata", "rez007_track.Rdata", package = "rezonateR", mustWork = TRUE)
rez007 = rez_load(path)
The file contains coreference annotations for roughly the first fifth of the text, and the .Rdata file imported here has already been processed to include the tree information covered in vignette("trees"). This tutorial will make use of this feature.
This tutorial will build towards a very simple toy analysis at the end, using all the changes that have been made to the rezrObj so far, to show the capabilities of rezonateR.
In studying coreference, we often want to know the distance from the current mention to the previous mention. EasyTrack takes care of this using a family of functions:

- lastMentionUnit() and nextMentionUnit(): Give you the unit ID of the previous and next mention, respectively.
- lastMentionToken() and nextMentionToken(): Give you the token ID of the previous and next mention, respectively.
- unitsToLastMention() and unitsToNextMention(): Give you the number of units from the current mention to the last mention and to the next mention, respectively.
- tokensToLastMention() and tokensToNextMention(): Give you the number of tokens from the current mention to the last mention and to the next mention, respectively.

The first four functions are rarely used in practice, so we will focus on the last four, which build on the first four.
Let's first find out how many units we are from the previous mention of something using unitsToLastMention(). This is equivalent to the gapUnits column that Rezonator already generates automatically. There are two optional arguments:

- unitSeq: The unit order values where the mentions appeared. Here, we use the unitSeqLast column, which is the default value, though unitSeqFirst is also possible.
- chain: The column that gives the chain that each track belongs to. Typically there is no reason to touch this parameter; just leave it blank, and the column chain will be used.

The value will be NA if there are no previous mentions:
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(unitsToLastMention = unitsToLastMention(unitSeqLast))
rez007$trackDF$default %>%
  select(id, gapUnits, unitsToLastMention) %>%
  slice(1:20)
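To make the idea of a unit gap concrete, here is a minimal base R sketch on a toy table. It is not rezonateR's implementation: the data frame, its values and the helper units_to_last() are hypothetical, and the sketch assumes the rows are already in document order.

```r
# Toy track table: one row per mention, in document order (hypothetical data).
tracks <- data.frame(
  id          = c("t1", "t2", "t3", "t4", "t5"),
  chain       = c("A",  "B",  "A",  "A",  "B"),
  unitSeqLast = c(1,    1,    3,    4,    7)
)

# Hypothetical helper: for each mention, look up the unit of the previous
# mention in the same chain and subtract it from the current unit.
# First mentions of a chain have no predecessor and get NA.
units_to_last <- function(unitSeq, chain) {
  prev_unit <- ave(unitSeq, chain, FUN = function(x) c(NA, head(x, -1)))
  unitSeq - prev_unit
}

tracks$unitsToLastMention <- units_to_last(tracks$unitSeqLast, tracks$chain)
print(tracks)
```

The two first mentions (t1 and t2) receive NA, mirroring the behaviour described above for mentions without a predecessor.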
Now let's count the tokens from the last mention using the tokensToLastMention() function instead. This one has a couple of complications. There are more parameters to fill this time:

- tokenOrder: Similar to unitSeq, but for counting tokens. Common choices are docTokenSeqFirst, docTokenSeqLast, docWordSeqFirst and docWordSeqLast (see vignette("time_seq") for the last two). By default it's docTokenSeqLast.
- chain: As above.
- zeroProtocol: How the positions of zeroes are determined. By default, it is "literal", i.e. the position at which the zero was inserted. If "unitFinal", zeroes will be treated as being located at the end of the unit. If "unitFirst", they will be treated as the first word of the unit.
- zeroCond: A condition for determining whether a token is a zero. Normally, this is (word column) == "<0>" if the default Rezonator zero is used, though others may prefer "<ZERO>" or similar.
- unitSeq: As above. Required when using the "unitFinal" and "unitFirst" protocols.
- unitTokenSeqName: The name of the tokenSeq column to be used in the unitDF.
- unitDF: The rezrDF containing the units.

rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(wordsToLastMention = tokensToLastMention(
    docWordSeqFirst,                        # What seq to use
    zeroProtocol = "unitFinal",             # How to treat zeroes
    zeroCond = (text == "<0>"),
    unitDF = rez007$unitDF,
    unitTokenSeqName = "docWordSeqFirst"))  # Additional argument for unitFinal protocol
rez007$trackDF$default %>%
  select(id, wordsToLastMention) %>%
  slice(1:20)
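The effect of the zero protocol is easiest to see on a small example. The sketch below is only an illustration in base R, not the package's code: the tables, column names and the helper dist_to_prev() are hypothetical, and it assumes a single chain whose mentions are listed in document order.

```r
# Hypothetical mentions of one chain: token position, unit and surface form.
mentions <- data.frame(
  unit     = c(1, 2, 3),
  tokenSeq = c(2, 6, 12),
  text     = c("Mary", "<0>", "she")
)
# Hypothetical unit table: position of the last token of each unit.
units <- data.frame(unit = c(1, 2, 3), lastTokenSeq = c(4, 9, 14))

is_zero <- mentions$text == "<0>"

# "literal": zeroes stay at the position where they were inserted.
pos_literal <- mentions$tokenSeq

# "unitFinal": zeroes are treated as sitting at the end of their unit.
pos_final <- ifelse(is_zero,
                    units$lastTokenSeq[match(mentions$unit, units$unit)],
                    mentions$tokenSeq)

# Distance to the previous mention (NA for the first mention).
dist_to_prev <- function(pos) c(NA, diff(pos))

comparison <- data.frame(
  text      = mentions$text,
  literal   = dist_to_prev(pos_literal),
  unitFinal = dist_to_prev(pos_final)
)
print(comparison)
```

Moving the zero to the end of its unit lengthens its distance from the preceding mention and shortens the distance of the mention that follows it, which is why the choice of protocol matters for token-based measures.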
The functions unitsToNextMention() and tokensToNextMention() work in the same way, except that they deal with following rather than preceding mentions.
In addition to getting the location of a previous mention, we might also want to extract a property of it:

- getPrevMentionField() and getNextMentionField(): Extract a feature of the previous or next mention.

Let's try to extract the subject status (using the Relation field annotated on treeLinks). First, we have to add this Relation field to rez007$trackDF$default, then replace the NA entries with "NonSubj" so that the missing values are treated as meaningful, and not just missing (fieldaccess is changed to "flex" to avoid future reloads messing this up):
rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldForeign(sourceDF = rez007$treeEntryDF$default,
                  targetForeignKeyName = "treeEntry",
                  targetFieldName = "Relation",
                  sourceFieldName = "Relation",
                  fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(Relation = coalesce(Relation, "NonSubj"), fieldaccess = "flex")
The first and obligatory argument of getPrevMentionField() is the name of the column, or feature, you're extracting. The other arguments, tokenOrder and chain, work the same way as before.
rez007$trackDF$default = rez007$trackDF$default %>%
  addFieldLocal(fieldName = "prevRelation",
                expression = getPrevMentionField(Relation),
                fieldaccess = "auto")
head(rez007$trackDF$default) %>%
  select(id, text, name, Relation, prevRelation)
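Conceptually, extracting a field of the previous mention amounts to shifting that field down by one position within each chain. The following base R sketch illustrates this on toy data; the table and the helper prev_field() are hypothetical, and rows are assumed to be in document order.

```r
# Toy track table (hypothetical data), rows in document order.
tracks <- data.frame(
  id       = c("t1", "t2", "t3", "t4", "t5"),
  chain    = c("A", "B", "A", "B", "A"),
  Relation = c("Subj", "NonSubj", "NonSubj", "Subj", "Subj"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within each chain, shift the field down by one row,
# so that each mention sees the value of its predecessor (NA if none).
prev_field <- function(field, chain) {
  ave(field, chain, FUN = function(x) c(NA, head(x, -1)))
}

tracks$prevRelation <- prev_field(tracks$Relation, tracks$chain)
print(tracks)
```

Here t3 inherits "Subj" from t1 because both belong to chain A, even though t2 (chain B) intervenes.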
Apart from looking only at the previous or next mention, we can also count how many mentions of something there were within a window of units before or after a mention, optionally with additional conditions. Here are the relevant functions:

- countPrevMentions() and countNextMentions(): Get the number of previous or following mentions within a specified window of units.
- countPrevMentionsIf() and countNextMentionsIf(): Get the number of previous or following mentions within a specified window of units, given that they satisfy certain conditions (which do not depend on the current mention).
- countPrevMentionsMatch() and countNextMentionsMatch(): Get the number of previous or following mentions within a specified window of units, given that they have the same value as the current mention for some field.

These functions have the following arguments, in order:

- windowSize: How many IUs before / after the current one do you want to count?
- cond (countPrevMentionsIf() and countNextMentionsIf() only): The condition that mentions must satisfy to be counted.
- field (countPrevMentionsMatch() and countNextMentionsMatch() only): The field whose value is to be matched.
- unitSeq: As before.
- chain: As before.

Let's try all three functions. We will count the number of previous mentions in the previous 20 units, the previous subject mentions, and the previous mentions whose subject/nonsubject value agrees with the present mention:
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noPrevMentionsIn20 = countPrevMentions(20),
             noPrevSubjMentionsIn20 = countPrevMentionsIf(20, Relation == "Subj"),
             noPrevSubjMatchMentionsIn20 = countPrevMentionsMatch(20, "Relation"))
rez007$trackDF$default %>%
  select(id, noPrevMentionsIn20, noPrevSubjMentionsIn20, noPrevSubjMatchMentionsIn20) %>%
  slice(1:20)
If you don't want a window restriction, you can set the window to Inf. Here's an example where we extract the number of future zero mentions, regardless of how far they are from the current one:
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noComingZeroes = countNextMentionsIf(Inf, text == "<0>"))
rez007$trackDF$default %>%
  select(id, noComingZeroes) %>%
  slice(1:20)
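The window logic, including the Inf case, can be illustrated with a small base R sketch. This is a simplified stand-in rather than the package's implementation: the toy data and the helper count_prev_mentions() are hypothetical, rows are assumed to be in document order, and how mentions in the same unit are treated is an assumption of the sketch.

```r
# Toy track table (hypothetical data), rows in document order.
tracks <- data.frame(
  chain    = c("A", "B", "A", "A", "A"),
  unitSeq  = c(1, 2, 3, 10, 25),
  Relation = c("Subj", "Subj", "NonSubj", "Subj", "Subj")
)

# Hypothetical helper: for each mention, count earlier mentions of the same
# chain that lie within `windowSize` units and satisfy `cond`.
count_prev_mentions <- function(unitSeq, chain, windowSize = Inf, cond = TRUE) {
  rows <- seq_along(unitSeq)
  cond <- rep_len(cond, length(unitSeq))
  vapply(rows, function(i) {
    earlier <- rows < i &
      chain == chain[i] &
      unitSeq >= unitSeq[i] - windowSize
    sum(earlier & cond)
  }, integer(1))
}

tracks$prevIn20     <- count_prev_mentions(tracks$unitSeq, tracks$chain, 20)
tracks$prevSubjIn20 <- count_prev_mentions(tracks$unitSeq, tracks$chain, 20,
                                           tracks$Relation == "Subj")
tracks$prevAll      <- count_prev_mentions(tracks$unitSeq, tracks$chain, Inf)
print(tracks)
```

With a window of 20, the last mention (unit 25) only sees the mention at unit 10; with an infinite window it sees all three earlier mentions of chain A.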
We may also want to count competing mentions, that is, recent mentions not coreferential to the current mention. The presence of competitors usually suggests that a referential form is more likely to be explicit. countCompetitors() tallies the number of recent competitors. The following parameters are present, all of which are optional:

- cond: The condition under which something counts as a competitor (other than being non-coreferential with the present mention). By default, anything goes.
- windowSize: How far back (in units) do you want to look? By default, there is no limit.
- tokenOrder: As before.
- unitSeq: As before.
- chain: As before.
- between: Do you count only competitors between the current mention and the previous mention in the same trail, or do you also count mentions from before the previous mention?

The function countMatchingCompetitors() is similar, but instead of cond, there is a field matchCol, where you should put the name of a field in which competitors should match the current mention in order to be counted.
Here is one example. noCompetitors uses a window of 10 units, and may look beyond the previous mention, whereas noMatchingCompetitors is similar, but only looks between the current and previous mention, and only counts mentions with matching Relation values:
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(noCompetitors = countCompetitors(windowSize = 10, between = FALSE),
             # between = TRUE: only look between the current and the previous mention
             noMatchingCompetitors = countMatchingCompetitors(Relation, windowSize = 10, between = TRUE))
rez007$trackDF$default %>%
  select(id, text, noCompetitors, noMatchingCompetitors) %>%
  slice(1:20)
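The difference that the between setting makes can be seen in a small base R sketch. It is an illustration under stated assumptions, not rezonateR's code: the data and the helper count_competitors() are hypothetical, rows are in document order, and a first mention (which has no previous coreferent mention) is assumed to count every competitor in the window.

```r
# Toy track table (hypothetical data), rows in document order.
tracks <- data.frame(
  chain   = c("A", "B", "A", "C", "A"),
  unitSeq = c(1, 2, 3, 4, 5)
)

# Hypothetical helper: count earlier mentions of OTHER chains within the
# window. With between = TRUE, only mentions after the previous mention of
# the same chain are counted.
count_competitors <- function(unitSeq, chain, windowSize = Inf, between = FALSE) {
  rows <- seq_along(unitSeq)
  vapply(rows, function(i) {
    before <- rows < i
    candidates <- before &
      chain != chain[i] &
      unitSeq >= unitSeq[i] - windowSize
    if (between) {
      prev_same <- which(before & chain == chain[i])
      if (length(prev_same) > 0) {
        candidates <- candidates & rows > max(prev_same)
      }
    }
    sum(candidates)
  }, integer(1))
}

tracks$compAll     <- count_competitors(tracks$unitSeq, tracks$chain, 10, between = FALSE)
tracks$compBetween <- count_competitors(tracks$unitSeq, tracks$chain, 10, between = TRUE)
print(tracks)
```

For the final mention of chain A, two competitors (B and C) fall inside the window, but only C lies between it and the previous mention of A.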
We often want to connect information about verbs to their arguments. We may either put verb information in a track table, or put track information in a verb table. The former approach can be taken when investigating issues like coreference, and the latter for issues like argument structure.
If we want to add verb information to the track table, we can do this in two steps:

1. Add a treeParent column to trackDF$default that takes the value of the parent column of treeEntryDF.
2. Using the treeParent column, find the corresponding verb in the verb table chunkDF$verb through the treeEntry column of chunkDF$verb.

rez007 = rez007 %>%
  addFieldForeign("track", "default", "treeEntry", "default",
                  "treeEntry", "treeParent", "parent",
                  fieldaccess = "foreign")
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_left_join(rez007$chunkDF$verb %>% select(id, text, treeEntry),
                by = c(treeParent = "treeEntry"),
                suffix = c("", "_verb"),
                df2Address = "chunkDF/verb",
                fkey = "treeParent",
                df2key = "treeEntry",
                rezrObj = rez007) %>%
  rename(verbID = id_verb, verbText = text_verb)
rez007$trackDF$default %>%
  select(id, treeParent, verbID, verbText) %>%
  slice(1:20)
Now with the verbID in place, we can do the reverse process of putting argument information in the verb table too. The reverse process is similar, but a little more dangerous because you could potentially create duplicate rows in chunkDF$verb
if you are not careful! Let's say we want to add information about the subject to chunkDF$verb
- say, the number of the subject. You would first have to make sure that each verb only has one subject! If you have made a lot of annotations, mistakes are likely to happen, so make sure to check first. In this code, we extract the verb IDs of each subject, and make sure that the verb IDs are unique:
verbsBySubject = rez007$trackDF$default %>%
  filter(Relation == "Subj", !is.na(verbID)) %>%
  pull(verbID)
# The IDs are unique if removing duplicates does not shorten the vector
if(length(verbsBySubject) != length(unique(verbsBySubject))) {
  print("Error found!")
} else {
  print("You're good!")
}
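If the check does report a problem, it helps to know which verbs are affected. A minimal base R sketch, using a hypothetical vector of verb IDs in place of the real one:

```r
# Hypothetical verb IDs, one per subject mention; v2 and v3 occur twice.
verbsBySubject <- c("v1", "v2", "v3", "v2", "v4", "v3")

# duplicated() flags every occurrence after the first, so this lists each
# verb ID that has more than one subject attached to it.
dupeVerbs <- unique(verbsBySubject[duplicated(verbsBySubject)])
print(dupeVerbs)
```

The IDs returned this way can then be looked up in the track table to correct the underlying annotations before joining.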
Now we can safely move this information to the verb table:
rez007$chunkDF$verb = rez007$chunkDF$verb %>%
  rez_left_join(rez007$trackDF$default %>%
                  # Keep subjects only, so that each verb matches at most one row
                  filter(Relation == "Subj", !is.na(verbID)) %>%
                  select(id, text, verbID),
                by = c(id = "verbID"),
                suffix = c("", "_subj"),
                df2Address = "trackDF/default",
                fkey = "id",
                df2key = "verbID",
                rezrObj = rez007) %>%
  rename(subjID = id_subj, subjText = text_subj)
rez007$chunkDF$verb %>%
  select(id, text, subjID, subjText)
Natively, Rezonator does not yet support bridging annotation. rezonateR handles bridging a bit unusually, since it is difficult to do direct bridging annotation without an annotation interface like Rezonator's.
The first step in doing bridging annotation is to create a frameMatrix, a data frame which is used to enter framing relationships between chains (for example, the car's engine has a part-whole relationship with the car). But before doing that, we must ensure that there are no chains with repeated names. We use the function undupeLayers(), which has three arguments: the rezrObj, the layer you want to undupe, and the field/column you want to undupe. This will add numbers next to duplicated chain names. Then we can call addFrameMatrix() on the rezrObj. The following code does these steps, then shows the first 10 rows and 12 columns of the frameMatrix (the first two columns give the IDs and names of the chains):
rez007 = undupeLayers(rez007, "trail", "name")
rez007 = addFrameMatrix(rez007)
frameMatrix(rez007)[1:10, 1:12]
The second step is to export the frameMatrix and populate it with actual annotations. Although it does not apply to our case, in many cases the function reduceFrameMatrix() will be useful for removing rows and columns that do not actually participate in framing relations, or for dividing the matrix into subparts that don't have framing relationships with each other, so that we end up with a cleaner CSV to annotate.
We use the function obscureUpper() to obscure the upper triangular matrix of 'repeat' entries so that we don't duplicate our annotation efforts: if we've already annotated that the car's engine and the car have a part-whole relationship in the lower triangular matrix, we don't need to annotate that the car and the car's engine have a whole-part relationship!
After obscureUpper() and (optionally) reduceFrameMatrix(), the frameMatrix can be exported using rez_write_csv() and edited in an external editor. A spreadsheet program is highly recommended for this; you can use the freeze panes feature to keep the row and column labels visible while annotating. When entering relationships, use the format '(role of the row entity)-(role of the column entity)'. For example, if the car's engine is the row and the car is the column, type part-whole, not whole-part. Alternatively, if there are only a few of these relations and you already remember which ones they are, you can edit those relations in R directly. You can simply use base R assignment for this.
For this text, we will mainly be annotating individual-group relationships.
rez_write_csv(obscureUpper(frameMatrix(rez007)), "rez007_frame.csv")
newFrame = rez_read_csv("rez007_frame_edited.csv", origDF = frameMatrix(rez007))
newFrame[10:20, c(1, 2, 12:22)]
Rather than updateFromDF(), we use updateFrameMatrixFromDF() to update the frameMatrix. This function will 'flip' the relationships for the upper triangular matrix. For example, if in the lower triangular matrix you annotated the car's engine as having a part-whole relationship with the car, then 'whole-part' will show up in the upper triangular matrix for the row 'car' and the column 'the car's engine'.
frameMatrix(rez007) = updateFrameMatrixFromDF(frameMatrix(rez007), newFrame)
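The flipping step can be pictured with a plain character matrix. The sketch below is only a base R illustration of the idea, not what updateFrameMatrixFromDF() does internally: the entities, the matrix and the helper flip_relation() are hypothetical, and it assumes role names contain no hyphens themselves.

```r
# Hypothetical frame matrix: rows and columns are chains, and only the
# lower triangle has been annotated, as "(row role)-(column role)".
entities <- c("car", "engine", "wheel")
fm <- matrix("", nrow = 3, ncol = 3, dimnames = list(entities, entities))
fm["engine", "car"] <- "part-whole"
fm["wheel", "car"]  <- "part-whole"

# Hypothetical helper: swap the two roles of a relation label.
# Empty cells stay empty.
flip_relation <- function(rel) {
  vapply(strsplit(rel, "-", fixed = TRUE),
         function(parts) paste(rev(parts), collapse = "-"),
         character(1))
}

# Fill each upper-triangle cell with the flipped label of its mirror cell.
filled <- fm
upper <- upper.tri(filled)
filled[upper] <- flip_relation(t(fm)[upper])
print(filled)
```

After this step, the row for the car carries whole-part in the engine and wheel columns, while unannotated pairs such as engine and wheel remain empty.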
After having updated the frame matrix, we can easily extract bridging information using the following functions:

- lastBridgeUnit(): Get the location (in units) of the previous unit with a bridge to this unit.
- lastBridgeToken(): Get the location (in tokens) of the bridging expression to this unit.
- unitsToLastBridge(): Get the number of units between the closest unit with a bridge to the current unit, and the current unit.
- tokensToLastBridge(): Get the number of tokens between the bridging expression and this unit.
- countPrevBridges(): Count the number of previous bridging expressions in a specified window.

In practice, we will generally be using unitsToLastBridge() and tokensToLastBridge(). The arguments needed are very similar to what we saw for unitsToLastMention() and tokensToLastMention(). Here are the arguments for unitsToLastBridge():

- frameMatrix: The frameMatrix.
- unitSeq: As before.
- chain: As before.
- tokenOrderLast: The token sequence value for the last token in an expression, by default docTokenSeqLast.
- tokenOrderFirst: The token sequence value for the first token in an expression, by default docTokenSeqFirst.
- inclRelations: Vector of relations that will be counted. This allows you to, for example, count part-whole relations but not whole-part. If left blank, everything will be counted.

The arguments of tokensToLastBridge() are again mostly things we have seen before:

- frameMatrix: As before.
- firstOrLast: Do you count the first or last token of the previous bridging expression? Either "first" or, by default, "last".
- tokenOrderFirst: As before.
- tokenOrderLast: As before.
- chain: As before.
- zeroProtocol: As before.
- zeroCond: As before.
- unitSeq: As before.
- unitDF: As before.
- inclRelations: As before.

Let's say you want to extract the units and tokens to the previous bridge, including only the "individual-group" relation (so that it doesn't count as a bridge if the current mention is a group that includes the previous mention, but it does count if the current mention is an individual that is included by the previous mention). Here is the code:
rez007$trackDF$default = rez007$trackDF$default %>%
  rez_mutate(bridgeDistUnit = unitsToLastBridge(frameMatrix(rez007), inclRelations = "individual-group"),
             bridgeDistToken = tokensToLastBridge(frameMatrix(rez007), inclRelations = "individual-group"))
rez007$trackDF$default %>%
  select(id, text, bridgeDistUnit, bridgeDistToken) %>%
  slice(53:63)
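To see how restricting the relations changes the result, consider this base R sketch on toy data. It is a simplified illustration rather than the package's algorithm: the chains, the matrix and the helper units_to_last_bridge() are hypothetical, and rows are assumed to be in document order.

```r
# Hypothetical frame matrix: "(row role)-(column role)".
chains <- c("kids", "Tim")
fm <- matrix("", 2, 2, dimnames = list(chains, chains))
fm["Tim", "kids"] <- "individual-group"
fm["kids", "Tim"] <- "group-individual"

# Toy track table (hypothetical data), rows in document order.
tracks <- data.frame(
  chain   = c("kids", "Tim", "kids", "Tim"),
  unitSeq = c(1, 4, 6, 9),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distance in units to the most recent earlier mention
# whose chain stands in a (selected) frame relation to the current chain.
units_to_last_bridge <- function(unitSeq, chain, fm, inclRelations = NULL) {
  rows <- seq_along(unitSeq)
  vapply(rows, function(i) {
    rel <- fm[chain[i], chain]  # relation of the current chain to each mention's chain
    ok <- rows < i & rel != ""
    if (!is.null(inclRelations)) ok <- ok & rel %in% inclRelations
    if (!any(ok)) return(NA_real_)
    unitSeq[i] - max(unitSeq[ok])
  }, numeric(1))
}

tracks$bridgeAny <- units_to_last_bridge(tracks$unitSeq, tracks$chain, fm)
tracks$bridgeIndGroup <- units_to_last_bridge(tracks$unitSeq, tracks$chain, fm,
                                              inclRelations = "individual-group")
print(tracks)
```

The second mention of the group gets a distance when all relations are counted, but NA once only "individual-group" bridges are included, because from its perspective the relation is "group-individual".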
Notice that if there has been no preceding bridge, the value is NA.
Now that we've come so far, let's try to do a mockup of a real linguistic analysis! Of course, an actual analysis would take far more data than what we have here, as well as a more carefully designed annotation scheme. But what we do here should suffice to demonstrate how an analysis might be done.
Let's try to predict the number of characters inside a referential expression from five variables:

- noPrevSubjMentionsIn20: The number of coreferent subject mentions within the 20 previous units.
- noPrevNonSubjMentionsIn20: The number of coreferent non-subject mentions within the 20 previous units.
- noCompetitors: The number of competitors within the ten previous units.
- Relation: Is the current mention a subject?
- number: Is the current mention singular or plural?

We will use a linear regression for this prediction. (This probably isn't the best model, but let's keep it simple for this demonstration.) First we need to create noPrevNonSubjMentionsIn20 (which we do in an emancipated rezrDF to avoid clogging up the main table), then we'll convert Relation to a factor, and then we'll use the lm() function in base R to do the prediction:
analysis_df = rez007$trackDF$default %>%
  rez_mutate(
    noPrevNonSubjMentionsIn20 = noPrevMentionsIn20 - noPrevSubjMentionsIn20,
    Relation = stringToFactor(Relation)
  )
lm_nochar = lm(charCount ~ noPrevSubjMentionsIn20 + noPrevNonSubjMentionsIn20 + noCompetitors + Relation + number,
               data = analysis_df)
lm_nochar
anova(lm_nochar)
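When reading the output of a model like this, it is useful to know how lm() labels the coefficient of a factor predictor. The sketch below uses a small invented table purely to show the mechanics; the numbers are hypothetical and say nothing about the corpus, and base R's factor() stands in for stringToFactor().

```r
# Invented toy data, for illustrating the mechanics only.
toy <- data.frame(
  charCount      = c(12, 9, 3, 2, 10, 4, 1, 8),
  noPrevMentions = c(0, 1, 3, 4, 0, 2, 5, 1),
  Relation       = c("NonSubj", "NonSubj", "Subj", "Subj",
                     "NonSubj", "Subj", "Subj", "NonSubj")
)

# Levels are sorted alphabetically, so "NonSubj" becomes the reference level.
toy$Relation <- factor(toy$Relation)

fit <- lm(charCount ~ noPrevMentions + Relation, data = toy)

# The factor appears as a single dummy coefficient named after the
# non-reference level: the estimated difference of Subj from NonSubj.
print(coef(fit))
```

The coefficient named RelationSubj is therefore interpreted relative to non-subjects, and the same naming pattern applies to Relation and number in the model above.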
As we can see, noPrevSubjMentionsIn20, noPrevNonSubjMentionsIn20 and Relation emerge as very good predictors: entities that were mentioned more often in the previous 20 units, and that are subjects, have a much stronger tendency to be light.
This concludes our journey through the basic functionality of rezonateR. As usual, let's not forget to save, for one last time:
savePath = "rez007.Rdata"
rez_save(rez007, savePath)