annotate: Annotate a tokenlist based on rsyntax queries


annotate    R Documentation

Annotate a tokenlist based on rsyntax queries

Description

This function has been renamed to annotate_tqueries.
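Since the function was renamed, migrating existing code is usually a matter of swapping the function name. The following is a minimal sketch that reuses the data and one query from the Examples section below. It assumes the rsyntax package is installed and that the call only uses the tokens, column and named-query arguments.

```r
library(rsyntax)

## example data shipped with rsyntax, as used in the Examples section
tokens = tokens_spacy[tokens_spacy$doc_id == 'text3',]
active = tquery(pos = "VERB*", label = "predicate",
                children(relation = c("nsubj", "nsubjpass"), label = "subject"))

## deprecated spelling:
## tokens = annotate(tokens, "clause", act = active)

## current spelling:
tokens = annotate_tqueries(tokens, "clause", act = active)
```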

Usage

annotate(
  tokens,
  column,
  ...,
  block = NULL,
  fill = TRUE,
  overwrite = FALSE,
  block_fill = FALSE,
  copy = TRUE,
  verbose = FALSE
)

Arguments

tokens

A tokenIndex data.table, or any data.frame coercible with as_tokenindex.

column

The name of the column to which the annotations are added. The unique ids are added as column_id (for example, clause_id if column is "clause").

...

One or multiple tqueries, or a list of queries, as created with tquery. Queries can be given a name by using a named argument. This name is used in the annotation_id to keep track of which query was used.

block

Optionally, specify ids (doc_id - sentence - token_id triples) that are blocked from querying and filling (ignoring the id and recursive searches through the id).

fill

Logical. If TRUE (default), also assign the fill nodes (as specified in the tquery). Otherwise these are ignored.

overwrite

If TRUE, the existing column will be overwritten. Otherwise (default), the existing annotations in the column will be blocked, and new annotations will be added. This is identical to using multiple queries.

block_fill

If TRUE (and overwrite is FALSE), the existing fill nodes will also be blocked. In other words, the new annotations will only be added if the

copy

If TRUE (default), the data.table is copied. Otherwise, it is changed by reference. Changing by reference is faster and more memory efficient, but it departs from R's usual copy-on-modify behavior, so it is optional.

verbose

If TRUE, report progress (only useful if multiple queries are given).
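The behavior described for the overwrite argument can be sketched by annotating the same column in two passes. This is an illustrative sketch only: it assumes the rsyntax package is installed, reuses the two queries from the Examples section, and passes overwrite = FALSE explicitly rather than relying on the default.

```r
library(rsyntax)

tokens = tokens_spacy[tokens_spacy$doc_id == 'text3',]
passive = tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("agent"), label = "subject"))
active =  tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("nsubj", "nsubjpass"), label = "subject"))

## first pass: only the passive query
tokens = annotate_tqueries(tokens, "clause", pas = passive)

## second pass on the same column: with overwrite = FALSE the annotations
## from the first pass are kept and blocked, and the active query is only
## applied to the remaining tokens
tokens = annotate_tqueries(tokens, "clause", act = active, overwrite = FALSE)
tokens
```

According to the argument description, this should give the same annotations as passing both queries in a single call, with pas listed before act.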

Details

Apply queries to extract syntax patterns, and add the results as two columns to a tokenlist. One column contains the ids for each hit. The other column contains the annotations. Only nodes that are given a name in the tquery (using the 'label' parameter) will be added as annotation.

Note that while queries only find one node for each labeled component of a pattern (e.g., quote queries have one node for "source" and one node for "quote"), all children of these nodes can be annotated by setting fill to TRUE. If a child has multiple ancestors, only the most direct ancestors are used (see documentation for the fill argument).
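To make the two result columns concrete, the sketch below annotates the example sentence and then selects only the token position columns and the annotation columns. It assumes the rsyntax package is installed. The column names clause and clause_id follow from choosing "clause" as the column argument, and the conversion to a plain data.frame is only there to keep the column selection independent of data.table syntax.

```r
library(rsyntax)

tokens = tokens_spacy[tokens_spacy$doc_id == 'text3',]
passive = tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("agent"), label = "subject"))
active =  tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("nsubj", "nsubjpass"), label = "subject"))

tokens = annotate_tqueries(tokens, "clause", pas = passive, act = active)

## token position (doc_id, sentence, token_id) plus the annotation label
## ("clause") and the id of the hit it belongs to ("clause_id")
cols = c("doc_id", "sentence", "token_id", "clause", "clause_id")
as.data.frame(tokens)[, cols]
```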

Value

The tokenIndex with the annotation columns added.

Examples

library(rsyntax)

## spacy tokens for: Mary loves John, and Mary was loved by John
tokens = tokens_spacy[tokens_spacy$doc_id == 'text3',]

## two simple example tqueries
passive = tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("agent"), label = "subject"))
active =  tquery(pos = "VERB*", label = "predicate",
                 children(relation = c("nsubj", "nsubjpass"), label = "subject"))
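
## apply both queries to the "clause" column; the argument names (pas, act) keep track of which query was used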
tokens = annotate_tqueries(tokens, "clause", pas=passive, act=active)
tokens
if (interactive()) plot_tree(tokens, annotation='clause')


rsyntax documentation built on June 7, 2022, 9:07 a.m.