CoNLL-Style Text Documents

Description

Create text documents from CoNLL-style files.

Usage

CoNLLTextDocument(con, encoding = "unknown", meta = list())

Arguments

con

a connection object or a character string. See scan() for details.

encoding

encoding to be assumed for input strings. See scan() for details.

meta

a named or empty list of document metadata tag-value pairs.

Details

CoNLL-style files use an extended tabular format in which empty lines separate sentences and each non-empty line consists of whitespace-separated columns giving a word token and its annotations. In principle, these annotations can vary from corpus to corpus; the current version of CoNLLTextDocument() assumes a fixed set of 3 columns giving, respectively, the word token, its POS tag, and its chunk tag.

The lines are read from the given connection and split into fields using scan(). From this, a suitable representation of the provided information is obtained, and returned as a CoNLL text document object inheriting from classes "CoNLLTextDocument" and "TextDocument".
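The reading step can be illustrated with base R alone. The following is a minimal sketch, not the package's actual implementation: the sample lines, the column names, and the `tokens` data frame are assumptions made for illustration. It writes a small three-column file, reads it with scan() while keeping blank lines, and uses the resulting empty records to number the sentences.

```r
# Hypothetical sample in the three-column layout (token, POS tag, chunk tag);
# the empty string is the blank line separating two sentences.
sample_lines <- c(
  "He PRP B-NP",
  "reckons VBZ B-VP",
  "the DT B-NP",
  "deficit NN I-NP",
  ". . O",
  "",
  "Prices NNS B-NP",
  "fell VBD B-VP",
  ". . O"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Read three character fields per line. With fill = TRUE each line is one
# record, and blank.lines.skip = FALSE keeps blank lines as empty records.
fields <- scan(path, what = list(word = "", pos = "", chunk = ""),
               quote = "", quiet = TRUE, fill = TRUE,
               blank.lines.skip = FALSE)

# An empty first field marks a sentence boundary.
is_break <- fields$word == ""

# One row per token, with a running sentence index; boundary rows are dropped.
tokens <- data.frame(
  sent  = cumsum(is_break) + 1L,
  word  = fields$word,
  pos   = fields$pos,
  chunk = fields$chunk,
  stringsAsFactors = FALSE
)[!is_break, ]

print(tokens)
```

Keeping the blank lines as records is what preserves the sentence structure; skipping them would leave a flat token list with no way to recover the boundaries.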

Methods for class "CoNLLTextDocument" are available for the generics words(), sents(), tagged_words(), tagged_sents(), and chunked_sents(), as well as as.character(). Use these to access the text in such text document objects.
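As a short usage sketch of these accessors, assuming package NLP is installed: the sample file contents and the variable names `path` and `doc` are made up for illustration and are not part of the package.

```r
library(NLP)  # assumes the NLP package is installed

# Hypothetical two-sentence sample in the three-column layout.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "He PRP B-NP",
  "reckons VBZ B-VP",
  "the DT B-NP",
  "deficit NN I-NP",
  ". . O",
  "",
  "Prices NNS B-NP",
  "fell VBD B-VP",
  ". . O"
), path)

doc <- CoNLLTextDocument(path)

words(doc)          # word tokens
sents(doc)          # word tokens grouped by sentence
tagged_words(doc)   # word tokens with their POS tags
tagged_sents(doc)   # tagged word tokens grouped by sentence
chunked_sents(doc)  # sentences with chunk structure from the chunk tags
as.character(doc)   # character representation of the text
```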

The methods for generics tagged_words() and tagged_sents() provide a mechanism for mapping POS tags via the map argument; see section Details in the help page for tagged_words() for more information. The POS tagset used is inferred from the POS_tagset metadata element of the CoNLL-style text document.
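A sketch of how the map argument might be used, assuming package NLP is installed. The sample file, the coarse tag names in `coarse`, and the tagset label "en-ptb" given in the metadata are illustrative assumptions; the help page for tagged_words() remains the authority on which forms of map are accepted.

```r
library(NLP)  # assumes the NLP package is installed

# Hypothetical one-sentence sample in the three-column layout.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "He PRP B-NP",
  "reckons VBZ B-VP",
  "the DT B-NP",
  "deficit NN I-NP",
  ". . O"
), path)

# Record the tagset in the metadata so that tagset-specific maps can be
# selected; the label "en-ptb" is an assumption for this sample.
doc <- CoNLLTextDocument(path, meta = list(POS_tagset = "en-ptb"))

# Hypothetical map given as a named character vector:
# names are the tags to map from, values the tags to map to.
coarse <- c(PRP = "PRON", VBZ = "VERB", DT = "DET", NN = "NOUN", "." = ".")
tw_vector <- tagged_words(doc, map = coarse)

# The same map wrapped in a list named by tagset; the element whose name
# matches the POS_tagset metadata element is expected to be used.
tw_list <- tagged_words(doc, map = list("en-ptb" = coarse))

print(tw_vector)
print(tw_list)
```

Storing the tagset in the metadata keeps the choice of map out of each call: one list of maps can then serve documents tagged with different tagsets.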

Value

An object inheriting from "CoNLLTextDocument" and "TextDocument".

See Also

TextDocument for basic information on the text document infrastructure employed by package NLP.

http://ifarm.nl/signll/conll/ for general information about CoNLL (Conference on Natural Language Learning), the yearly meeting of the Special Interest Group on Natural Language Learning of the Association for Computational Linguistics.

http://www.cnts.ua.ac.be/conll2000/chunking/ for the CoNLL 2000 chunking task, and training and test data sets which can be read in using CoNLLTextDocument().