Annotated Plain Text Documents

Description

Create annotated plain text documents from plain text and collections of annotations for this text.

Usage

AnnotatedPlainTextDocument(s, annotations, meta = list())
annotations(x)

Arguments

s

a String object, or something coercible to this using as.String() (e.g., a character string with appropriate encoding information).

annotations

an Annotation object with annotations for s, or a list of such objects.

meta

a named or empty list of document metadata tag-value pairs.

x

an object inheriting from class "AnnotatedPlainTextDocument".

Details

Annotated plain text documents combine plain text with collections (“sets”, implemented as lists) of objects with annotations for the text.

A typical workflow is to use annotate() with suitable annotator pipelines to obtain the annotations, and then use AnnotatedPlainTextDocument() to combine these with the text being annotated. This yields an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument", from which the text and collection of annotations can be obtained using, respectively, as.character() and annotations().
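The workflow above can be sketched with the simple regular-expression-based annotators in package NLP, which need no external tools. This is a minimal illustration, not a recommended pipeline: the helper functions sent_tokenizer() and word_tokenizer() are hypothetical and defined here only for the example, and their patterns are deliberately naive.

```r
library(NLP)

s <- as.String(paste("Stanford University is located in California.",
                     "It is a great university."))

## Hypothetical helper: treat every run ending in a period as a sentence.
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

## Hypothetical helper: treat every run of letters as a word.
word_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[[:alpha:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

## Build the annotators and run them as a pipeline over the text.
pipeline <- list(Simple_Sent_Token_Annotator(sent_tokenizer),
                 Simple_Word_Token_Annotator(word_tokenizer))
a <- annotate(s, pipeline)

## Combine the text with its annotations.
doc <- AnnotatedPlainTextDocument(s, a)

## Recover the text and two structured views.
as.character(doc)
sents(doc)
words(doc)
```

Only sentence and word annotations are produced here, so views that need POS tags or parse trees (such as tagged_sents() or parsed_sents()) are not available for this document.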

There are methods for class "AnnotatedPlainTextDocument" and the generics words(), sents(), paras(), tagged_words(), tagged_sents(), tagged_paras(), chunked_sents(), parsed_sents() and parsed_paras(), providing structured views of the text in such documents. These all have an additional argument, which, for specifying the annotation object to use (by default, the first one is taken), and they require the necessary annotations to be available in the annotation object used.

The methods for generics tagged_words(), tagged_sents() and tagged_paras() provide a mechanism for mapping POS tags via the map argument; see section Details in the help page for tagged_words() for more information. The POS tagset used is inferred from the POS_tagset metadata element of the annotation object used.

Value

For AnnotatedPlainTextDocument(), an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument".

For annotations(), a list of Annotation objects.

See Also

TextDocument for basic information on the text document infrastructure employed by package NLP.

Examples

## Use a pre-built annotated plain text document obtained by employing an
## annotator pipeline from package 'StanfordCoreNLP', available from the
## repository at <http://datacube.wu.ac.at>, using the following code:
##   require("StanfordCoreNLP")
##   s <- paste("Stanford University is located in California.",
##              "It is a great university.")
##   p <- StanfordCoreNLP_Pipeline(c("pos", "lemma", "parse"))
##   doc <- AnnotatedPlainTextDocument(s, p(s))

doc <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

doc

## Extract available annotation:
a <- annotations(doc)[[1L]]
a

## Structured views:
sents(doc)
tagged_sents(doc)
tagged_sents(doc, map = Universal_POS_tags_map)
parsed_sents(doc)

## Add (trivial) paragraph annotation:
s <- as.character(doc)
a <- annotate(s, Simple_Para_Token_Annotator(blankline_tokenizer), a)
doc <- AnnotatedPlainTextDocument(s, a)
## Structured view:
paras(doc)