R functions for working with syntactic structure coded as token lists (e.g. CONLL format)

Installation

You can install directly from github:

library(devtools)
install_github("anon-author/clauses")

Usage

The functions in this module assume that you have a list of tokens in a data frame. A simple example is provided with the module:

library(rsyntax)
data(example_tokens)
tokens
library(rsyntax)
data(example_tokens)
library(knitr)
kable(tokens)

Get the text of a sentence, optionally specifying which column(s) to use:

get_text(tokens)
get_text(tokens, word.column = c("lemma", "pos"))

Plot the syntactic structure of a sentence: (Note: if you have multiple sentences in one token list, you should filter it or provide a sentence= argument)

g = graph_from_sentence(tokens)
plot(g)

Clauses and Sources

You can use the get_quotes function to extract quotes and paraphrases from the sentences. Note that for this, the token ids need to be globally unique. If that is not the case, you can use the unique.ids function to make them unique:

tokens = unique_ids(tokens)

You can get the quotes from the tokens with get_quotes:

quotes = get_quotes(tokens)
quotes
quotes = get_quotes(tokens)
kable(quotes)

A single quote was found, with node 2 ("say") as the key, node 1 ("John") as the sources, and nodes 3 through 6 ("that Mary hit him") as quote.

To find the clauses, you can use the get_clauses function, which takes the quotes as an optional argument to make sure that speech actions are not listed as clauses:

clauses = get_clauses(tokens, quotes=quotes)
clauses
clauses = get_clauses(tokens, quotes=quotes)
kable(clauses)

Finally, you can also provide the quotes and clauses to the graph_from_sentence function. This will fill the clauses in a desaturated rainbow, with the subject as a circle and the predicate as rectangle. Quotes are represented with a bright node for the source, and the border in the same colour for the quote.

g = graph_from_sentence(tokens, quotes = quotes, clauses = clauses)
plot(g)


anon-author/clauses documentation built on May 10, 2019, 11:52 a.m.