CoNLLTextDocument: CoNLL-Style Text Documents
In NLP: Natural Language Processing Infrastructure

CoNLLTextDocument

R Documentation

CoNLL-Style Text Documents

Description

Create text documents from CoNLL-style files.

Usage

CoNLLTextDocument(con, encoding = "unknown", format = "conll00",
                  meta = list())

Arguments

`con`	a connection object or a character string. See `scan()` for details.
`encoding`	encoding to be assumed for input strings. See `scan()` for details.
`format`	a character vector specifying the format. See Details.
`meta`	a named or empty list of document metadata tag-value pairs.

Details

CoNLL-style files use an extended tabular format in which empty lines separate sentences and each non-empty line consists of whitespace-separated columns giving a word token and its annotations. Such formats were popularized through their use in the shared tasks of CoNLL (Conference on Natural Language Learning), the yearly meeting of the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics (see https://www.signll.org/content/conll/ for more information about CoNLL).

The precise format varies by corpus and must be specified via the format argument, either as a character string naming a pre-defined format, or as a character vector whose elements give the names of the ‘fields’ (columns) and whose names give the field ‘types’. The types ‘WORD’, ‘POS’ and ‘CHUNK’ are to be used for word tokens, POS tags, and chunk tags, respectively. For example,

  c(WORD = "WORD", POS = "POS", CHUNK = "CHUNK")

would be a format specification appropriate for the CoNLL-2000 chunking task, which is also available as the pre-defined "conll00", the default format for reasons of backward compatibility. Other pre-defined formats are "conll01" (for the CoNLL-2001 clause identification task), "conll02" (for the CoNLL-2002 language-independent named entity recognition task), "conllx" (for the CoNLL-X format used in at least the CoNLL-2006 and CoNLL-2007 multilingual dependency parsing tasks), and "conll09" (for the CoNLL-2009 shared task on syntactic and semantic dependencies in multiple languages).
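As a minimal sketch of the two ways of specifying the format, the following writes a small sample in the three-column CoNLL-2000 layout to a temporary file and reads it once with the pre-defined name and once with the equivalent explicit vector. The sample sentences and the variable names (sample_lines, path, doc1, doc2) are made up for illustration and are not part of the package.

```r
library(NLP)

## Made-up sample in the CoNLL-2000 layout: word, POS tag, chunk tag.
## The empty string becomes an empty line separating the two sentences.
sample_lines <- c("He PRP B-NP",
                  "reckons VBZ B-VP",
                  "the DT B-NP",
                  "deficit NN I-NP",
                  ". . O",
                  "",
                  "Prices NNS B-NP",
                  "rose VBD B-VP",
                  ". . O")
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

## Pre-defined format name ...
doc1 <- CoNLLTextDocument(path, format = "conll00")

## ... and the explicit specification: elements are the field names,
## names are the field types.
doc2 <- CoNLLTextDocument(path,
                          format = c(WORD = "WORD", POS = "POS",
                                     CHUNK = "CHUNK"))

class(doc1)
identical(words(doc1), words(doc2))
```

For a corpus with a different column layout, the same pattern should apply with a character vector that lists one element per column in file order.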

The lines are read from the given connection and split into fields using scan(). From this, a suitable representation of the provided information is obtained, and returned as a CoNLL text document object inheriting from classes "CoNLLTextDocument" and "TextDocument".
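The reading step can be approximated in base R as follows. This is only a sketch of how scan() can split such a file into fields while keeping the empty lines as sentence boundaries; the package's internal code and its internal representation may differ, and the sample data and all variable names are assumptions made for illustration.

```r
## Made-up sample: two sentences in a three-column layout.
sample_lines <- c("He PRP B-NP",
                  "reckons VBZ B-VP",
                  "the DT B-NP",
                  "deficit NN I-NP",
                  ". . O",
                  "",
                  "Prices NNS B-NP",
                  "rose VBD B-VP",
                  ". . O")
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

fields <- c("WORD", "POS", "CHUNK")

## One character column per field. fill = TRUE pads short lines, and
## blank.lines.skip = FALSE keeps empty lines as all-empty records.
records <- scan(path, what = rep.int(list(""), length(fields)),
                quote = "", quiet = TRUE, fill = TRUE,
                blank.lines.skip = FALSE)
names(records) <- fields

## An empty first field marks a sentence boundary: number the sentences
## by counting boundaries seen so far, then drop the boundary records.
is_blank <- records$WORD == ""
tab <- data.frame(sent = cumsum(is_blank) + 1L, records,
                  stringsAsFactors = FALSE)[!is_blank, ]

## Word tokens grouped by sentence.
split(tab$WORD, tab$sent)
```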

There are methods for class "CoNLLTextDocument" and generics words(), sents(), tagged_words(), tagged_sents(), and chunked_sents() (as well as as.character()), which should be used to access the text in such text document objects.
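A short sketch of the accessors on a made-up two-sentence sample read with the default "conll00" format. The sample data and variable names are assumptions for illustration, and the printed form of the results may vary with the installed version of the package.

```r
library(NLP)

## Made-up sample: word, POS tag, chunk tag; empty line between sentences.
path <- tempfile(fileext = ".txt")
writeLines(c("He PRP B-NP",
             "reckons VBZ B-VP",
             ". . O",
             "",
             "Prices NNS B-NP",
             "rose VBD B-VP",
             ". . O"),
           path)
doc <- CoNLLTextDocument(path)

words(doc)          # word tokens
sents(doc)          # word tokens grouped by sentence
tagged_words(doc)   # word tokens together with their POS tags
tagged_sents(doc)   # the same, grouped by sentence
chunked_sents(doc)  # chunk structure, one element per sentence
as.character(doc)   # character representation of the text
```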

The methods for generics tagged_words() and tagged_sents() provide a mechanism for mapping POS tags via the map argument; see section Details in the help page for tagged_words() for more information. The POS tagset used will be inferred from the POS_tagset metadata element of the CoNLL-style text document.
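The following sketch illustrates the mapping, assuming (per the help page for tagged_words()) that map may be a named character vector from source tags to target tags, or a named list of such vectors keyed by tagset name. The tiny mapping tag_map, the tagset label "en-ptb" and the sample data are made up for illustration; they are not a mapping supplied by the package.

```r
library(NLP)

## Made-up sample: word, POS tag, chunk tag; empty line between sentences.
path <- tempfile(fileext = ".txt")
writeLines(c("He PRP B-NP",
             "reckons VBZ B-VP",
             ". . O",
             "",
             "Prices NNS B-NP",
             "rose VBD B-VP",
             ". . O"),
           path)

## Record the tagset of the POS column in the document metadata.
doc <- CoNLLTextDocument(path, meta = list(POS_tagset = "en-ptb"))

## Hypothetical mapping covering exactly the tags in the sample.
tag_map <- c(PRP = "PRON", VBZ = "VERB", NNS = "NOUN",
             VBD = "VERB", "." = ".")

## Map with a single named character vector ...
mapped_by_vector <- tagged_words(doc, map = tag_map)

## ... or with a list keyed by tagset name, in which case the element
## matching the POS_tagset metadata is assumed to be the one used.
mapped_by_list <- tagged_words(doc, map = list("en-ptb" = tag_map))

mapped_by_vector
tagged_sents(doc, map = tag_map)
```

The mapping above covers every tag that occurs in the sample; how tags missing from a mapping are treated is not covered here.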

Value

An object inheriting from "CoNLLTextDocument" and "TextDocument".
