README.md

R functions for working with syntactic structure coded as token lists (e.g. CONLL format)

Installation

You can install the package directly from GitHub:

library(devtools)
install_github("vanatteveldt/rsyntax")

Usage

The functions in this package assume that you have a list of tokens in a data frame, with one row per token. A simple example is provided with the package:

library(rsyntax)
data(example_tokens)
head(tokens)
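For orientation, a token list of this kind can also be built by hand. The sketch below is an assumption based on the columns that appear in the annotated output later in this README (id, word, parent, relation and so on); the bundled example data may carry additional columns, and the name toy_tokens is made up for illustration.

```r
# Hand-built token list for "John says that Mary hit him".
# Column names follow the output shown later in this README;
# this is an illustration, not the bundled example data itself.
toy_tokens = data.frame(
  id       = 1:6,
  sentence = 1,
  word     = c("John", "says", "that", "Mary", "hit", "him"),
  lemma    = c("John", "say", "that", "Mary", "hit", "he"),
  pos      = c("NNP", "VBZ", "IN", "NNP", "VBD", "PRP"),
  parent   = c(2, NA, 5, 5, 2, 5),
  relation = c("nsubj", "", "mark", "nsubj", "ccomp", "dobj"),
  stringsAsFactors = FALSE
)

# Each row is one token; `parent` holds the id of the head token,
# and the root of the sentence has no parent (NA).
root = toy_tokens$word[is.na(toy_tokens$parent)]
print(root)
```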

Get the text of a sentence, optionally specifying which column(s) to use:

get_text(tokens)

## [1] "John says that Mary hit him"

get_text(tokens, word.column = c("lemma", "pos"))

## [1] "John/NNP say/VBZ that/IN Mary/NNP hit/VBD he/PRP"

Plot the syntactic structure of a sentence. Note that if the token list contains multiple sentences, you should either filter it down to a single sentence or provide a sentence= argument:

g = graph_from_sentence(tokens)
plot(g)

Syntactic Structure of example sentence
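Filtering a token list down to one sentence can be done with ordinary data frame subsetting. The snippet below is a minimal sketch on made-up data; it assumes a sentence column like the one in the example output further down.

```r
# Toy token list with two sentences (hypothetical data).
toy_tokens = data.frame(
  id       = c(1, 2, 3, 1, 2),
  sentence = c(1, 1, 1, 2, 2),
  word     = c("John", "saw", "Mary", "She", "smiled"),
  stringsAsFactors = FALSE
)

# Keep only the tokens of the second sentence before plotting.
sentence2 = toy_tokens[toy_tokens$sentence == 2, ]
print(sentence2$word)
```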

Clauses and Sources

You can use the get_quotes function to extract quotes and paraphrases from the sentences. This requires the token ids to be globally unique. If they are not, you can use the unique_ids function to make them unique:

tokens = unique_ids(tokens)
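To see what "globally unique" means here, the following base R sketch renumbers token ids across sentences and rewrites the parent references to match. It assumes that ids restart at 1 in every sentence. It is not the package's implementation of unique_ids, and the helper name make_ids_unique is hypothetical.

```r
# Hypothetical helper: renumber ids over the whole token list
# and remap each parent reference to the new numbering.
make_ids_unique = function(tokens) {
  # Identify each token (and each parent) by sentence plus local id.
  key        = paste(tokens$sentence, tokens$id)
  parent_key = paste(tokens$sentence, tokens$parent)
  new_id     = seq_len(nrow(tokens))
  # Look up the new id of each parent; roots (NA parent) stay NA.
  tokens$parent = new_id[match(parent_key, key)]
  tokens$id     = new_id
  tokens
}

# Made-up data: two sentences whose ids both start at 1.
toy_tokens = data.frame(
  sentence = c(1, 1, 1, 2, 2),
  id       = c(1, 2, 3, 1, 2),
  parent   = c(2, NA, 2, 2, NA),
  word     = c("John", "saw", "Mary", "She", "smiled"),
  stringsAsFactors = FALSE
)

fixed = make_ids_unique(toy_tokens)
print(fixed)
```

Keying on the sentence and the local id together is what keeps a parent reference from pointing into another sentence after renumbering.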

You can get the quotes from the tokens with get_quotes:

quotes = get_quotes(tokens)
head(quotes)
id  source  quote
2   1       5

A single quote was found, with node 2 ("say") as the key, node 1 ("John") as the source, and node 5 ("hit") as the head of the quote, which covers nodes 3 through 6 ("that Mary hit him").
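The quote column holds a single node id (5), while the quoted text spans several tokens. One way to recover the full span, sketched here in base R, is to collect that node together with all of its descendants in the dependency tree. This is an assumption about how the ids relate, not the package's own method, and the helper get_descendants is hypothetical.

```r
# Hypothetical helper: return `node` and every token below it in the tree.
get_descendants = function(tokens, node) {
  found    = node
  frontier = node
  while (length(frontier) > 0) {
    # Children are the tokens whose parent is in the current frontier.
    children = tokens$id[tokens$parent %in% frontier]
    frontier = setdiff(children, found)
    found    = c(found, frontier)
  }
  sort(found)
}

# Ids and parents copied from the example sentence in this README.
toy_tokens = data.frame(
  id     = 1:6,
  word   = c("John", "says", "that", "Mary", "hit", "him"),
  parent = c(2, NA, 5, 5, 2, 5),
  stringsAsFactors = FALSE
)

span       = get_descendants(toy_tokens, 5)
quote_text = paste(toy_tokens$word[toy_tokens$id %in% span], collapse = " ")
print(quote_text)
```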

To find the clauses, you can use the get_clauses function, which takes the quotes as an optional argument to make sure that speech actions are not listed as clauses:

clauses = get_clauses(tokens, quotes=quotes)
head(clauses)
clause_id  subject  predicate
1          4        5

You can annotate the original token list with the quotes and clauses to facilitate further processing (e.g. to create a word cloud or topic model of all utterances per source):

tokens = annotate_tokens(tokens, quotes, clauses)
head(tokens)
id  word  parent  sentence  coref  pos  entity  lemma  relation  offset  aid        pos1  attack  quote_id  quote_role  clause_id  clause_role
1   John  2       1         1      NNP  PERSON  John   nsubj     0       156884180  M     FALSE   1         source      NA         NA
2   says  NA      1         NA     VBZ          say              5       156884180  V     FALSE   NA        NA          NA         NA
3   that  5       1         NA     IN           that   mark      10      156884180  P     FALSE   1         quote       1          predicate
4   Mary  5       1         NA     NNP  PERSON  Mary   nsubj     15      156884180  M     FALSE   1         quote       1          subject
5   hit   2       1         NA     VBD          hit    ccomp     20      156884180  V     FALSE   1         quote       1          predicate
6   him   5       1         1      PRP          he     dobj      24      156884180  O     FALSE   1         quote       1          predicate
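As a sketch of that kind of downstream processing, the base R snippet below collects the quoted words per source from annotated tokens. It assumes the quote_id and quote_role columns shown in the output above; the toy data frame reproduces only those columns by hand, and the variable names are illustrative.

```r
# Relevant columns of the annotated example, entered by hand.
annotated = data.frame(
  id         = 1:6,
  word       = c("John", "says", "that", "Mary", "hit", "him"),
  quote_id   = c(1, NA, 1, 1, 1, 1),
  quote_role = c("source", NA, "quote", "quote", "quote", "quote"),
  stringsAsFactors = FALSE
)

# One row per quote with the source word.
# which() drops the rows where quote_role is NA.
sources = annotated[which(annotated$quote_role == "source"), c("quote_id", "word")]
names(sources)[2] = "source"

# One row per quote with the quoted words pasted together.
quoted = annotated[which(annotated$quote_role == "quote"), ]
utterances = aggregate(word ~ quote_id, data = quoted, FUN = paste, collapse = " ")
names(utterances)[2] = "utterance"

per_source = merge(sources, utterances, by = "quote_id")
print(per_source)
```

The resulting data frame has one row per quote, which is a convenient input for a word cloud or a document-term matrix grouped by source.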

Finally, you can also provide the quotes and clauses to the graph_from_sentence function. This will fill the clauses in a desaturated rainbow, with the subject as a circle and the predicate as a rectangle. Quotes are represented with a bright node for the source, and a border in the same colour for the nodes of the quote.

g = graph_from_sentence(tokens)
plot(g)

Syntactic Structure of example sentence with clauses and quotes marked

Use with coreNLP

You can use the coreNLP package to directly parse (English) sentences and create a token list.

First, initialize coreNLP and parse the sentence:

coreNLP::initCoreNLP()

a = coreNLP::annotateString("John told Mary he loves her")

Now, you can create a token list from the coreNLP annotation and use that to compute sources and clauses as normal:

tokens = tokens_from_coreNLP(a)
quotes = get_quotes(tokens)
clauses = get_clauses(tokens, quotes)
tokens = annotate_tokens(tokens, quotes, clauses)
head(tokens)
id  sentence  token  lemma  CharacterOffsetBegin  CharacterOffsetEnd  POS   NER     Speaker  parent  relation   pos1  quote_id  quote_role  clause_id  clause_role
1   1         John   John   0                     4                   NNP   PERSON  PER0     2       nsubj      N     1         source      NA         NA
2   1         told   tell   5                     9                   VBD   O       PER0     NA      root       V     NA        NA          NA         NA
3   1         Mary   Mary   10                    14                  NNP   PERSON  PER0     2       dobj       N     1         quote       NA         NA
4   1         he     he     15                    17                  PRP   O       PER0     5       nsubj      P     1         quote       1          subject
5   1         loves  love   18                    23                  VBZ   O       PER0     3       acl:relcl  V     1         quote       1          predicate
6   1         her    she    24                    27                  PRP$  O       PER0     5       dobj       P     1         quote       1          predicate

And plot the sentence:

plot(graph_from_sentence(tokens))

Syntactic Structure of example sentence from coreNLP



vanatteveldt/rsyntax documentation built on Aug. 7, 2018, 1:31 a.m.