Maxent_POS_Tag_Annotator: Apache OpenNLP based POS tag annotators

Description

Generate an annotator which computes POS tag annotations using the Apache OpenNLP Maxent Part of Speech tagger.

Usage

Maxent_POS_Tag_Annotator(language = "en", probs = FALSE, model = NULL)

Arguments

language

a character string giving the ISO-639 code of the language being processed by the annotator.

probs

a logical indicating whether the computed annotations should provide the token probabilities obtained from the Maxent model as their ‘POS_prob’ feature.

model

a character string giving the path to the Maxent model file to be used, or NULL, in which case a default model file for the given language is used (if available, see Details).
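
As an illustration of the model argument, the sketch below wraps the constructor in a hypothetical helper, make_pos_annotator (not part of openNLP), that checks the path before passing it on. It assumes openNLP is installed; the path in the usage comment is a placeholder, not a file shipped with any package.

```r
## Hypothetical helper, not part of openNLP: build a POS tag annotator
## from an explicit Maxent model file, failing early if the file is missing.
make_pos_annotator <- function(model_path, language = "en", probs = FALSE) {
  if (!file.exists(model_path))
    stop("model file not found: ", model_path)
  openNLP::Maxent_POS_Tag_Annotator(language = language,
                                    probs = probs,
                                    model = model_path)
}

## Usage (placeholder path; replace with a real model file):
## pos_tag_annotator <- make_pos_annotator("path/to/en-pos-maxent.bin")
```

Checking the path up front gives a clear R-level error instead of a failure further down in the Java layer.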

Details

See http://opennlp.sourceforge.net/models-1.5/ for available model files. For languages other than English, these can conveniently be made available to R by installing the respective openNLPmodels.language package from the repository at https://datacube.wu.ac.at. For English, no additional installation is required.
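
The installation step described above might be sketched as follows. The helpers model_package and ensure_model_package are hypothetical names introduced for illustration. The use of type = "source" and the existence of a German ("de") model package are assumptions and should be checked against the repository.

```r
## Hypothetical helpers, not part of openNLP.

## Name of the model package for a non-English language code, e.g. "de".
model_package <- function(language) {
  paste0("openNLPmodels.", language)
}

## Install the model package from the repository named in Details if it
## is not already available. type = "source" is an assumption: the
## repository is taken to provide source packages only.
ensure_model_package <- function(language,
                                 repos = "https://datacube.wu.ac.at") {
  pkg <- model_package(language)
  if (!requireNamespace(pkg, quietly = TRUE))
    install.packages(pkg, repos = repos, type = "source")
  invisible(pkg)
}

## Usage (needs network access; "de" assumes a German model package exists):
## ensure_model_package("de")
## pos_tag_annotator_de <- Maxent_POS_Tag_Annotator(language = "de")
```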

Value

An Annotator object giving the generated POS tag annotator.

See Also

https://opennlp.apache.org for more information about Apache OpenNLP.

Examples

require("NLP")
## The annotator constructors used below are provided by openNLP.
library("openNLP")
## Some text.
s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
             "nonexecutive director Nov. 29.\n",
             "Mr. Vinken is chairman of Elsevier N.V., ",
             "the Dutch publishing group."),
           collapse = "")
s <- as.String(s)

## Need sentence and word token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
a2 <- annotate(s, list(sent_token_annotator, word_token_annotator))

pos_tag_annotator <- Maxent_POS_Tag_Annotator()
pos_tag_annotator
a3 <- annotate(s, pos_tag_annotator, a2)
a3
## Variant with POS tag probabilities as (additional) features.
head(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2))

## Determine the distribution of POS tags for word tokens.
a3w <- subset(a3, type == "word")
tags <- sapply(a3w$features, `[[`, "POS")
tags
table(tags)
## Extract token/POS pairs (all of them): easy.
sprintf("%s/%s", s[a3w], tags)

## Extract pairs of word tokens and POS tags for second sentence:
a3ws2 <- annotations_in_spans(subset(a3, type == "word"),
                              subset(a3, type == "sentence")[2L])[[1L]]
sprintf("%s/%s", s[a3ws2], sapply(a3ws2$features, `[[`, "POS"))

Example output

Loading required package: NLP
An annotator inheriting from classes
  Simple_POS_Tag_Annotator Annotator
with description
  Computes POS tag annotations using the Apache OpenNLP Maxent Part of
  Speech tagger employing the default model for language 'en'
 id type     start end features
  1 sentence     1  84 constituents=<<integer,18>>
  2 sentence    86 153 constituents=<<integer,13>>
  3 word         1   6 POS=NNP
  4 word         8  13 POS=NNP
  5 word        14  14 POS=,
  6 word        16  17 POS=CD
  7 word        19  23 POS=NNS
  8 word        25  27 POS=JJ
  9 word        28  28 POS=,
 10 word        30  33 POS=MD
 11 word        35  38 POS=VB
 12 word        40  42 POS=DT
 13 word        44  48 POS=NN
 14 word        50  51 POS=IN
 15 word        53  53 POS=DT
 16 word        55  66 POS=JJ
 17 word        68  75 POS=NN
 18 word        77  80 POS=NNP
 19 word        82  83 POS=CD
 20 word        84  84 POS=.
 21 word        86  88 POS=NNP
 22 word        90  95 POS=NNP
 23 word        97  98 POS=VBZ
 24 word       100 107 POS=NN
 25 word       109 110 POS=IN
 26 word       112 119 POS=NNP
 27 word       121 124 POS=NNP
 28 word       125 125 POS=,
 29 word       127 129 POS=DT
 30 word       131 135 POS=JJ
 31 word       137 146 POS=NN
 32 word       148 152 POS=NN
 33 word       153 153 POS=.
 id type     start end features
  1 sentence     1  84 constituents=<<integer,18>>
  2 sentence    86 153 constituents=<<integer,13>>
  3 word         1   6 POS=NNP, POS_prob=0.9476405
  4 word         8  13 POS=NNP, POS_prob=0.9692841
  5 word        14  14 POS=,, POS_prob=0.9884445
  6 word        16  17 POS=CD, POS_prob=0.9926943
 [1] "NNP" "NNP" ","   "CD"  "NNS" "JJ"  ","   "MD"  "VB"  "DT"  "NN"  "IN" 
[13] "DT"  "JJ"  "NN"  "NNP" "CD"  "."   "NNP" "NNP" "VBZ" "NN"  "IN"  "NNP"
[25] "NNP" ","   "DT"  "JJ"  "NN"  "NN"  "."  
tags
  ,   .  CD  DT  IN  JJ  MD  NN NNP NNS  VB VBZ 
  3   2   2   3   2   3   1   5   7   1   1   1 
 [1] "Pierre/NNP"      "Vinken/NNP"      ",/,"             "61/CD"          
 [5] "years/NNS"       "old/JJ"          ",/,"             "will/MD"        
 [9] "join/VB"         "the/DT"          "board/NN"        "as/IN"          
[13] "a/DT"            "nonexecutive/JJ" "director/NN"     "Nov./NNP"       
[17] "29/CD"           "./."             "Mr./NNP"         "Vinken/NNP"     
[21] "is/VBZ"          "chairman/NN"     "of/IN"           "Elsevier/NNP"   
[25] "N.V./NNP"        ",/,"             "the/DT"          "Dutch/JJ"       
[29] "publishing/NN"   "group/NN"        "./."            
 [1] "Mr./NNP"       "Vinken/NNP"    "is/VBZ"        "chairman/NN"  
 [5] "of/IN"         "Elsevier/NNP"  "N.V./NNP"      ",/,"          
 [9] "the/DT"        "Dutch/JJ"      "publishing/NN" "group/NN"     
[13] "./."          
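
The ‘POS_prob’ feature requested with probs = TRUE can be collected alongside the tags for further inspection. The following is a minimal sketch, assuming the features of each word annotation form a list with POS and POS_prob entries, as in the example output. The helper pos_features_to_df is hypothetical, the stand-in data is copied from the first four word tokens of that output, and the 0.95 threshold is arbitrary.

```r
## Hypothetical helper, not part of NLP or openNLP: flatten a list of
## per-token feature lists into a data frame of tags and probabilities.
pos_features_to_df <- function(features) {
  data.frame(
    POS = vapply(features, function(f) f[["POS"]], character(1)),
    POS_prob = vapply(features, function(f) f[["POS_prob"]], numeric(1)),
    stringsAsFactors = FALSE
  )
}

## Stand-in for the features of the first four word tokens,
## with values copied from the example output.
features <- list(
  list(POS = "NNP", POS_prob = 0.9476405),
  list(POS = "NNP", POS_prob = 0.9692841),
  list(POS = ",",   POS_prob = 0.9884445),
  list(POS = "CD",  POS_prob = 0.9926943)
)
df <- pos_features_to_df(features)

## Tokens whose tag probability falls below an arbitrary threshold.
df[df$POS_prob < 0.95, ]
```

With real annotations, the same helper could be applied to the features of the word annotations, for example subset(annotate(s, Maxent_POS_Tag_Annotator(probs = TRUE), a2), type == "word")$features.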

openNLP documentation built on Oct. 30, 2019, 11:37 a.m.