AnnotatedPlainTextDocument: Annotated Plain Text Documents

View source: R/aptd.R

Description

Create annotated plain text documents from plain text and collections of annotations for this text.

Usage

AnnotatedPlainTextDocument(s, annotations, meta = list())
annotations(x)

Arguments

s

a String object, or something coercible to this using as.String() (e.g., a character string with appropriate encoding information).

annotations

an Annotation object with annotations for s, or a list of such objects.

meta

a named or empty list of document metadata tag-value pairs.

x

an object inheriting from class "AnnotatedPlainTextDocument".

Details

Annotated plain text documents combine plain text with collections (“sets”, implemented as lists) of objects with annotations for the text.

A typical workflow is to use annotate() with suitable annotator pipelines to obtain the annotations, and then use AnnotatedPlainTextDocument() to combine these with the text being annotated. This yields an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument", from which the text and collection of annotations can be obtained using, respectively, as.character() and annotations().
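The workflow above can be sketched without external annotators. This is a minimal illustration, not the package's reference example: it assumes the simple annotators shipped with NLP (Simple_Sent_Token_Annotator(), Simple_Word_Token_Annotator() and wordpunct_tokenizer), and the helper sent_tokenizer() is a hypothetical, deliberately naive sentence splitter that treats every period as a sentence end.

```r
library(NLP)

s <- as.String(paste("Stanford University is located in California.",
                     "It is a great university."))

## Hypothetical helper (an assumption for this sketch): split into
## sentences at each period and return the sentence spans.
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

## Annotator pipeline: sentences first, since the word token annotator
## works on the sentence annotations.
pipeline <- list(Simple_Sent_Token_Annotator(sent_tokenizer),
                 Simple_Word_Token_Annotator(wordpunct_tokenizer))
a <- annotate(s, pipeline)

## Combine the text and its annotations into one document.
doc <- AnnotatedPlainTextDocument(s, a)

## Recover the text and look at two structured views.
as.character(doc)
sents(doc)
words(doc)
## In the version documented here, annotations(doc) returns the list of
## Annotation objects stored in the document.
```

Because these simple annotators add only sentence and word tokens, views that need POS tags or parse trees (such as tagged_sents() or parsed_sents()) are not available for this document.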

There are methods for generics words(), sents(), paras(), tagged_words(), tagged_sents(), tagged_paras(), chunked_sents(), parsed_sents() and parsed_paras() and class "AnnotatedPlainTextDocument" providing structured views of the text in such documents. These all have an additional argument named which for specifying the annotation object to use (by default, the first one is taken), and require the necessary annotations to be available in the annotation object used.

The methods for generics tagged_words(), tagged_sents() and tagged_paras() provide a mechanism for mapping POS tags via the map argument; see section Details in the help page for tagged_words() for more information. The POS tagset used is inferred from the POS_tagset metadata element of the annotation object used.
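As a brief illustration of the map argument, the sketch below applies it at the word level to the pre-built document shipped with the package (the same file used in the Examples section). It assumes that this file is present in the installed version of NLP and that its annotations carry POS tags together with the POS_tagset metadata needed to pick the right entry of Universal_POS_tags_map.

```r
library(NLP)

## Pre-built annotated document shipped with package NLP.
doc <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

## POS tags as stored in the annotations.
tagged <- tagged_words(doc)
tagged

## The same tokens with their tags mapped to the universal tagset.
mapped <- tagged_words(doc, map = Universal_POS_tags_map)
mapped
```

Mapping changes only the tags, not the tokens, so both views cover the same words as words(doc).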

Value

For AnnotatedPlainTextDocument(), an object inheriting from "AnnotatedPlainTextDocument" and "TextDocument".

For annotations(), a list of Annotation objects.

See Also

TextDocument for basic information on the text document infrastructure employed by package NLP.

Examples

## Use a pre-built annotated plain text document obtained by employing an
## annotator pipeline from package 'StanfordCoreNLP', available from the
## repository at <https://datacube.wu.ac.at>, using the following code:
##   require("StanfordCoreNLP")
##   s <- paste("Stanford University is located in California.",
##              "It is a great university.")
##   p <- StanfordCoreNLP_Pipeline(c("pos", "lemma", "parse"))
##   doc <- AnnotatedPlainTextDocument(s, p(s))

doc <- readRDS(system.file("texts", "stanford.rds", package = "NLP"))

doc

## Extract available annotation:
a <- annotations(doc)[[1L]]
a

## Structured views:
sents(doc)
tagged_sents(doc)
tagged_sents(doc, map = Universal_POS_tags_map)
parsed_sents(doc)

## Add (trivial) paragraph annotation:
s <- as.character(doc)
a <- annotate(s, Simple_Para_Token_Annotator(blankline_tokenizer), a)
doc <- AnnotatedPlainTextDocument(s, a)
## Structured view:
paras(doc)

Example output

<<AnnotatedPlainTextDocument>>
Metadata:  0
Annotations:  1, length(s): 15
Content:  chars: 71
 id type     start end features
  1 sentence     1  45 constituents=<<integer,7>>, parse=<<character,1>>,
                       basic-dependencies=<<Stanford_typed_dependencies>>,
                       collapsed-dependencies=<<Stanford_typed_dependencies>>,
                       collapsed-ccprocessed-dependencies=<<Stanford_typed_dependencies>>
  2 word         1   8 POS=NNP, lemma=Stanford
  3 word        10  19 POS=NNP, lemma=University
  4 word        21  22 POS=VBZ, lemma=be
  5 word        24  30 POS=JJ, lemma=located
  6 word        32  33 POS=IN, lemma=in
  7 word        35  44 POS=NNP, lemma=California
  8 word        45  45 POS=., lemma=.
  9 sentence    47  71 constituents=<<integer,6>>, parse=<<character,1>>,
                       basic-dependencies=<<Stanford_typed_dependencies>>,
                       collapsed-dependencies=<<Stanford_typed_dependencies>>,
                       collapsed-ccprocessed-dependencies=<<Stanford_typed_dependencies>>
 10 word        47  48 POS=PRP, lemma=it
 11 word        50  51 POS=VBZ, lemma=be
 12 word        53  53 POS=DT, lemma=a
 13 word        55  59 POS=JJ, lemma=great
 14 word        61  70 POS=NN, lemma=university
 15 word        71  71 POS=., lemma=.
[[1]]
[1] "Stanford"   "University" "is"         "located"    "in"        
[6] "California" "."         

[[2]]
[1] "It"         "is"         "a"          "great"      "university"
[6] "."         

[[1]]
Stanford/NNP
University/NNP
is/VBZ
located/JJ
in/IN
California/NNP
./.

[[2]]
It/PRP
is/VBZ
a/DT
great/JJ
university/NN
./.

[[1]]
Stanford/NOUN
University/NOUN
is/VERB
located/ADJ
in/ADP
California/NOUN
./.

[[2]]
It/PRON
is/VERB
a/DET
great/ADJ
university/NOUN
./.

[[1]]
(ROOT
  (S
    (NP (NNP Stanford) (NNP University))
    (VP
      (VBZ is)
      (ADJP (JJ located) (PP (IN in) (NP (NNP California)))))
    (. .)))

[[2]]
(ROOT
  (S
    (NP (PRP It))
    (VP (VBZ is) (NP (DT a) (JJ great) (NN university)))
    (. .)))

[[1]]
[[1]][[1]]
[1] "Stanford"   "University" "is"         "located"    "in"        
[6] "California" "."         

[[1]][[2]]
[1] "It"         "is"         "a"          "great"      "university"
[6] "."         

NLP documentation built on Aug. 15, 2017, 5:01 p.m.