TaggedTextDocument: POS-Tagged Word Text Documents
In NLP: Natural Language Processing Infrastructure

Description

Create text documents from files containing POS-tagged words.

Usage

TaggedTextDocument(con, encoding = "unknown",
                   word_tokenizer = whitespace_tokenizer,
                   sent_tokenizer = Regexp_Tokenizer("\n", invert = TRUE),
                   para_tokenizer = blankline_tokenizer,
                   sep = "/",
                   meta = list())

Arguments

`con`	a connection object or a character string. See `readLines()` for details.
`encoding`	encoding to be assumed for input strings. See `readLines()` for details.
`word_tokenizer`	a function for obtaining the word token spans.
`sent_tokenizer`	a function for obtaining the sentence token spans.
`para_tokenizer`	a function for obtaining the paragraph token spans, or `NULL` in which case no paragraph tokenization is performed.
`sep`	the character string separating the word tokens and their POS tags.
`meta`	a named or empty list of document metadata tag-value pairs.
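
As an illustration of the `sep` and `para_tokenizer` arguments, the following sketch reads a small file whose tokens are written as word_TAG rather than word/TAG, and skips paragraph tokenization. The sample sentences, their tags and the temporary file are made up for this example; it assumes the NLP package is installed.

```r
library(NLP)

# Hypothetical input: one sentence per line, tokens written as word_TAG.
f <- tempfile(fileext = ".txt")
writeLines(c("The_DT cat_NN sleeps_VBZ",
             "It_PRP purrs_VBZ"),
           f)

# sep = "_" splits each token at the underscore;
# para_tokenizer = NULL skips paragraph tokenization.
d <- TaggedTextDocument(f, sep = "_", para_tokenizer = NULL)

words(d)
tagged_words(d)

unlink(f)
```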

Details

TaggedTextDocument() creates documents that represent natural language text as collections of POS-tagged words. It uses readLines() to read the text lines from a connection that provides such a collection.

The text read is split into paragraph, sentence and tagged word tokens using the span tokenizers specified by arguments para_tokenizer, sent_tokenizer and word_tokenizer. By default, paragraphs are assumed to be separated by blank lines, sentences by newlines and tagged word tokens by whitespace. Finally, word tokens and their POS tags are obtained by splitting the tagged word tokens according to sep. From this, a suitable representation of the provided collection of POS-tagged words is obtained, and returned as a tagged text document object inheriting from classes "TaggedTextDocument" and "TextDocument".
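
The default splitting described above can be sketched with a small made-up file: a blank line separates two paragraphs, each line holds one sentence, and whitespace separates the word/TAG tokens. The text and tags are invented for illustration; the NLP package is assumed to be installed.

```r
library(NLP)

# Hypothetical input: two paragraphs, three sentences, nine tagged tokens.
f <- tempfile(fileext = ".txt")
writeLines(c("The/DT cat/NN sleeps/VBZ",
             "It/PRP purrs/VBZ",
             "",
             "Dogs/NNS bark/VBP loudly/RB here/RB"),
           f)

# All tokenizers and sep are left at their defaults.
d <- TaggedTextDocument(f)

length(paras(d))   # number of paragraphs found
length(sents(d))   # number of sentences found
length(words(d))   # number of word tokens found

unlink(f)
```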

Methods for the generics words(), sents(), paras(), tagged_words(), tagged_sents() and tagged_paras() (as well as as.character()) are available for class "TaggedTextDocument", and should be used to access the text in such text document objects.
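
A minimal sketch of these accessors, assuming the NLP package is installed; the input text and its tags are hypothetical. The plain accessors return the words only, while the tagged_*() variants keep each word together with its POS tag.

```r
library(NLP)

# Hypothetical input: one paragraph with two sentences.
f <- tempfile(fileext = ".txt")
writeLines(c("Birds/NNS sing/VBP",
             "They/PRP fly/VBP south/RB"),
           f)
d <- TaggedTextDocument(f)

words(d)          # words without tags
sents(d)          # words grouped by sentence
tagged_words(d)   # words paired with their POS tags
tagged_sents(d)   # tagged words grouped by sentence
as.character(d)   # character representation of the document

unlink(f)
```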

The methods for generics tagged_words(), tagged_sents() and tagged_paras() provide a mechanism for mapping POS tags via the map argument; see section Details in the help page for tagged_words() for more information. The POS tagset used will be inferred from the POS_tagset metadata element of the tagged text document.
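
The sketch below shows one way the map argument and the POS_tagset metadata element might work together. It assumes, following the help page for tagged_words(), that map may be a named list of named character vectors keyed by tagset name. The list my_maps, the tagset label "en-ptb" as used here, and the sample text are all made up for this example; the NLP package is assumed to be installed.

```r
library(NLP)

# Hypothetical input tagged with Penn Treebank style tags.
f <- tempfile(fileext = ".txt")
writeLines("The/DT cat/NN sleeps/VBZ", f)

# Record the tagset in the document metadata.
d <- TaggedTextDocument(f, meta = list(POS_tagset = "en-ptb"))

# Hypothetical mapping: a list keyed by tagset name, each element a
# named character vector from original tags to coarser tags.
my_maps <- list("en-ptb" = c(DT = "DET", NN = "NOUN", VBZ = "VERB"))

meta(d, "POS_tagset")            # the tagset recorded for this document
tagged_words(d)                  # original tags
tagged_words(d, map = my_maps)   # tags mapped via the "en-ptb" element

unlink(f)
```

Tags missing from the chosen mapping are not covered by this sketch, so the example mapping lists every tag that occurs in the sample text.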

Value

A tagged text document object inheriting from "TaggedTextDocument" and "TextDocument".
