CoNLLUTextDocument: CoNLL-U Text Documents
In NLP: Natural Language Processing Infrastructure

CoNLLUTextDocument

R Documentation

CoNLL-U Text Documents

Description

Create text documents from CoNLL-U format files.

Usage

CoNLLUTextDocument(con, meta = list(), text = NULL)
read_CoNNLU(con)

Arguments

`con`	a connection object or a character string. See `scan()` for details.
`meta`	a named or empty list of document metadata tag-value pairs.
`text`	a character vector giving the text of the CoNLL-U annotation. If `NULL`, the `text` comments of the annotation are used.

Details

The CoNLL-U format (see https://universaldependencies.org/format.html) is a CoNLL-style format for annotated texts popularized and employed by the Universal Dependencies project (see https://universaldependencies.org/). For each “word” in the text, this provides exactly the 10 fields ID, FORM (word form or punctuation symbol), LEMMA (lemma or stem of word form), UPOSTAG (universal part-of-speech tag, see https://universaldependencies.org/u/pos/index.html), XPOSTAG (language-specific part-of-speech tag, may be unavailable), FEATS (list of morphological features), HEAD, DEPREL, DEPS, and MISC.
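
To make the ten-field layout concrete, the following base R sketch splits one word line on tab characters and labels the pieces. The sample line is invented for illustration, and the code only mirrors the field order listed above; it is not how the NLP package itself parses files.

```r
# The ten CoNLL-U field names, in the order given above.
fields <- c("ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# A hypothetical word line: fields are tab-separated, and "_" marks an
# unspecified value.
line <- paste("1", "Dogs", "dog", "NOUN", "NNS", "Number=Plur",
              "2", "nsubj", "_", "_", sep = "\t")

# Split on tabs and attach the field names.
values <- strsplit(line, "\t", fixed = TRUE)[[1]]
record <- setNames(values, fields)

record[["LEMMA"]]
record[["UPOSTAG"]]
```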

read_CoNNLU() reads the lines with these fields and optional comments from the given connection and splits them into fields using scan(). The result is combined with consecutive sentence ids into a data frame inheriting from class "CoNNLU_Annotation", which is used for representing the annotation information.

CoNLLUTextDocument() combines this annotation information with the given metadata (and optionally the original pre-tokenized text) into a CoNLL-U text document inheriting from classes "CoNLLUTextDocument" and "TextDocument".

The complete annotation information data frame can be extracted via content(). CoNLL-U v2 requires providing the complete texts of each sentence (or a reconstruction thereof) in ‘⁠# text =⁠’ comment lines. Where consistently provided, these are made available in the text attribute of the content data frame.
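
As a minimal sketch of the workflow described above, the following writes a small, made-up CoNLL-U file to a temporary location and reads it back. The sentence, the metadata tag `source`, and the helper `conllu_row()` are assumptions introduced for illustration only; the example assumes the NLP package is installed.

```r
library(NLP)

# Hypothetical helper: join the ten fields of one word line with tabs.
conllu_row <- function(...) paste(..., sep = "\t")

# A toy one-sentence annotation with a '# text =' comment line.
lines <- c(
    "# sent_id = 1",
    "# text = Dogs bark.",
    conllu_row("1", "Dogs", "dog", "NOUN", "NNS", "Number=Plur",
               "2", "nsubj", "_", "_"),
    conllu_row("2", "bark", "bark", "VERB", "VBP", "_",
               "0", "root", "_", "SpaceAfter=No"),
    conllu_row("3", ".", ".", "PUNCT", ".", "_",
               "2", "punct", "_", "_"),
    ""
)
path <- tempfile(fileext = ".conllu")
writeLines(lines, path)

# Read only the annotation data frame.
ann <- read_CoNNLU(path)

# Or build the full text document, attaching some metadata.
doc <- CoNLLUTextDocument(path, meta = list(source = "toy example"))

# Inspect the annotation data frame and the sentence texts, if available.
str(content(doc))
attr(content(doc), "text")
```

Passing a file path rather than an open connection keeps the example short; per the Arguments section, a connection object should work as well.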

In addition, there are methods for generics as.character(), words(), sents(), tagged_words(), and tagged_sents() and class "CoNLLUTextDocument", which should be used to access the text in such text document objects.

The CoNLL-U format allows representing both words and (multiword) tokens (see section ‘Words, Tokens and Empty Nodes’ in the format documentation), distinguished by ids that are integers (words) or integer ranges (multiword tokens), with only the words being annotated further. One can use as.character() to extract the tokens; all other viewers listed above use the words. Finally, the viewers incorporating POS tags take a which argument to specify whether to use the universal or the language-specific tags, by giving a substring of "UPOSTAG" (default) or "XPOSTAG".
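
The difference between tokens and words can be explored with a small example containing a multiword token. The sketch below uses an invented German sentence in which the token "zum" (id range 3-4) covers the words "zu" and "dem"; the sentence, its tags, and the helper `conllu_row()` are illustrative assumptions, and the NLP package is assumed to be installed.

```r
library(NLP)

# Hypothetical helper: join the ten fields of one line with tabs.
conllu_row <- function(...) paste(..., sep = "\t")

lines <- c(
    "# text = Wir gehen zum Markt.",
    conllu_row("1", "Wir", "wir", "PRON", "PPER", "_", "2", "nsubj", "_", "_"),
    conllu_row("2", "gehen", "gehen", "VERB", "VVFIN", "_", "0", "root", "_", "_"),
    # Multiword token: the id is a range and the remaining fields are empty.
    conllu_row("3-4", "zum", "_", "_", "_", "_", "_", "_", "_", "_"),
    conllu_row("3", "zu", "zu", "ADP", "APPR", "_", "5", "case", "_", "_"),
    conllu_row("4", "dem", "der", "DET", "ART", "_", "5", "det", "_", "_"),
    conllu_row("5", "Markt", "Markt", "NOUN", "NN", "_", "2", "obl", "_",
               "SpaceAfter=No"),
    conllu_row("6", ".", ".", "PUNCT", "$.", "_", "2", "punct", "_", "_"),
    ""
)
path <- tempfile(fileext = ".conllu")
writeLines(lines, path)
doc <- CoNLLUTextDocument(path)

# Tokens (multiword tokens kept together) versus words.
as.character(doc)
words(doc)
sents(doc)

# POS-tagged views: universal tags by default, language-specific on request.
tagged_words(doc)
tagged_words(doc, which = "XPOSTAG")
tagged_sents(doc)
```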

Value

For CoNLLUTextDocument(), an object inheriting from "CoNLLUTextDocument" and "TextDocument".

For read_CoNNLU(), an object inheriting from "CoNNLU_Annotation" and "data.frame".
