Maxent_Word_Token_Annotator: Apache OpenNLP based word token annotators
In openNLP: Apache OpenNLP Tools Interface



Generate an annotator which computes word token annotations using the Apache OpenNLP Maxent tokenizer.

Maxent_Word_Token_Annotator(language = "en", probs = FALSE, model = NULL)

`language`	a character string giving the ISO-639 code of the language being processed by the annotator.
`probs`	a logical indicating whether the computed annotations should provide the token probabilities obtained from the Maxent model as their ‘prob’ feature.
`model`	a character string giving the path to the Maxent model file to be used, or `NULL` indicating to use a default model file for the given language (if available, see Details).

See http://opennlp.sourceforge.net/models-1.5/ for available model files. For languages other than English, these can conveniently be made available to R by installing the respective openNLPmodels.language package (with "language" replaced by the language code) from the repository at https://datacube.wu.ac.at. For English, no additional installation is required.
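The two ways of selecting a model described above might be sketched as follows. This is only an illustration under stated assumptions: the package name `openNLPmodels.de` is assumed from the openNLPmodels.language naming pattern, and `model_path` is a hypothetical location of a downloaded tokenizer model file, not a path shipped with the package.

```r
library("NLP")
library("openNLP")

## Assumption: German models are packaged as 'openNLPmodels.de' in the
## datacube repository. Uncomment to install once, from source.
## install.packages("openNLPmodels.de",
##                  repos = "https://datacube.wu.ac.at/",
##                  type = "source")

## Default model for a non-English language; only attempted when the
## (assumed) model package is actually installed.
if (requireNamespace("openNLPmodels.de", quietly = TRUE)) {
    word_token_annotator_de <- Maxent_Word_Token_Annotator(language = "de")
}

## Explicit model file; this path is hypothetical and must be replaced
## by the location of a real Maxent tokenizer model.
model_path <- "path/to/en-token.bin"
if (file.exists(model_path)) {
    word_token_annotator_custom <-
        Maxent_Word_Token_Annotator(model = model_path)
}
```

Both calls are guarded so the sketch does nothing when the model package or file is missing, rather than stopping with an error.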

An Annotator object giving the generated word token annotator.

See https://opennlp.apache.org for more information about Apache OpenNLP.

## Load the annotation infrastructure and the OpenNLP interface.
library("NLP")
library("openNLP")
## Some text.
s <- paste(c("Pierre Vinken, 61 years old, will join the board as a ",
             "nonexecutive director Nov. 29.\n",
             "Mr. Vinken is chairman of Elsevier N.V., ",
             "the Dutch publishing group."),
           collapse = "")
s <- as.String(s)

## Need sentence token annotations.
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a1 <- annotate(s, sent_token_annotator)

word_token_annotator <- Maxent_Word_Token_Annotator()
word_token_annotator
a2 <- annotate(s, word_token_annotator, a1)
a2
## Variant with word token probabilities as features.
head(annotate(s, Maxent_Word_Token_Annotator(probs = TRUE), a1))

## Can also perform sentence and word token annotations in a pipeline:
a <- annotate(s, list(sent_token_annotator, word_token_annotator))
head(a)

Loading required package: NLP
An annotator inheriting from classes
  Simple_Word_Token_Annotator Annotator
with description
  Computes word token annotations using the Apache OpenNLP Maxent
  tokenizer employing the default model for language 'en'.
 id type     start end features
  1 sentence     1  84 constituents=<<integer,18>>
  2 sentence    86 153 constituents=<<integer,13>>
  3 word         1   6 
  4 word         8  13 
  5 word        14  14 
  6 word        16  17 
  7 word        19  23 
  8 word        25  27 
  9 word        28  28 
 10 word        30  33 
 11 word        35  38 
 12 word        40  42 
 13 word        44  48 
 14 word        50  51 
 15 word        53  53 
 16 word        55  66 
 17 word        68  75 
 18 word        77  80 
 19 word        82  83 
 20 word        84  84 
 21 word        86  88 
 22 word        90  95 
 23 word        97  98 
 24 word       100 107 
 25 word       109 110 
 26 word       112 119 
 27 word       121 124 
 28 word       125 125 
 29 word       127 129 
 30 word       131 135 
 31 word       137 146 
 32 word       148 152 
 33 word       153 153 
 id type     start end features
  1 sentence     1  84 constituents=<<integer,18>>
  2 sentence    86 153 constituents=<<integer,13>>
  3 word         1   6 prob=1
  4 word         8  13 prob=0.9770575
  5 word        14  14 prob=1
  6 word        16  17 prob=1
 id type     start end features
  1 sentence     1  84 constituents=<<integer,18>>
  2 sentence    86 153 constituents=<<integer,13>>
  3 word         1   6 
  4 word         8  13 
  5 word        14  14 
  6 word        16  17 
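The annotations printed above carry only character offsets. A minimal sketch of recovering the token strings and pairing them with the 'prob' feature could look like this; the shortened text and the variable names `words`, `tokens` and `probs` are chosen here for illustration only.

```r
library("NLP")
library("openNLP")

s <- as.String("Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.")

## Sentence annotations first, then word tokens with probabilities.
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator(probs = TRUE)))

## Keep only the word annotations and look up the matching substrings.
words <- subset(a, type == "word")
tokens <- s[words]

## Pull the 'prob' feature out of each word annotation.
probs <- sapply(words$features, `[[`, "prob")

data.frame(token = tokens, prob = probs)
```

Subscripting the String object with an Annotation object returns the substrings spanned by each annotation, so `tokens` and `probs` have the same length.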