annotators                                                R Documentation

Description:

Create annotator objects for composite basic NLP tasks based on
functions performing simple basic tasks.

Usage:
Simple_Para_Token_Annotator(f, meta = list(), classes = NULL)
Simple_Sent_Token_Annotator(f, meta = list(), classes = NULL)
Simple_Word_Token_Annotator(f, meta = list(), classes = NULL)
Simple_POS_Tag_Annotator(f, meta = list(), classes = NULL)
Simple_Entity_Annotator(f, meta = list(), classes = NULL)
Simple_Chunk_Annotator(f, meta = list(), classes = NULL)
Simple_Stem_Annotator(f, meta = list(), classes = NULL)
Arguments:

f: a function performing a “simple” basic NLP task (see Details).

meta: an empty or named list of annotator (pipeline) metadata
tag-value pairs.

classes: a character vector or NULL (default) giving classes to be
used for the annotator object created, in addition to the default
ones.
Details:

The purpose of these functions is to facilitate the creation of
annotators for basic NLP tasks as described below.
Simple_Para_Token_Annotator() creates “simple” paragraph token
annotators. Argument f should be a paragraph tokenizer, which takes a
string s with the whole text to be processed, and returns the spans of
the paragraphs in s, or an annotation object with these spans and
(possibly) additional features. The generated annotator inherits from
the default classes "Simple_Para_Token_Annotator" and "Annotator". It
uses the results of the simple paragraph tokenizer to create and
return annotations with unique ids and type ‘paragraph’.
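For illustration, a trivial paragraph tokenizer and a corresponding
annotator could be written as follows (a sketch only; none of the
objects shown here are provided by the package):

## A very trivial paragraph tokenizer: a paragraph is a maximal run of
## non-empty lines (assumes the NLP package is attached).
para_tokenizer <-
function(s) {
    s <- as.String(s)
    m <- gregexpr("[^\n]+(\n[^\n]+)*", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
para_token_annotator <- Simple_Para_Token_Annotator(para_tokenizer)
## Could then be used as e.g. annotate(some_text, para_token_annotator).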
Simple_Sent_Token_Annotator() creates “simple” sentence token
annotators. Argument f should be a sentence tokenizer, which takes a
string s with the whole text to be processed, and returns the spans of
the sentences in s, or an annotation object with these spans and
(possibly) additional features. The generated annotator inherits from
the default classes "Simple_Sent_Token_Annotator" and "Annotator". It
uses the results of the simple sentence tokenizer to create and return
annotations with unique ids and type ‘sentence’, possibly combined
with sentence constituent features for already available paragraph
annotations.
Simple_Word_Token_Annotator() creates “simple” word token annotators.
Argument f should be a simple word tokenizer, which takes a string s
giving a sentence to be processed, and returns the spans of the word
tokens in s, or an annotation object with these spans and (possibly)
additional features. The generated annotator inherits from the default
classes "Simple_Word_Token_Annotator" and "Annotator". It uses already
available sentence token annotations to extract the sentences and
obtains the results of the word tokenizer for these. It then adds the
sentence character offsets and unique word token ids, and word token
constituent features for the sentences, and returns the word token
annotations combined with the augmented sentence token annotations.
Simple_POS_Tag_Annotator() creates “simple” POS tag annotators.
Argument f should be a simple POS tagger, which takes a character
vector giving the word tokens in a sentence, and returns either a
character vector with the tags, or a list of feature maps with the
tags as ‘POS’ feature and possibly other features. The generated
annotator inherits from the default classes "Simple_POS_Tag_Annotator"
and "Annotator". It uses already available sentence and word token
annotations to extract the word tokens for each sentence, obtains the
results of the simple POS tagger for these, and returns annotations
for the word tokens with the features obtained from the POS tagger.
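A minimal sketch of such an annotator built from a trivial tagger
(illustrative only; a real tagger would inspect the word tokens):

## A trivial POS tagger which tags every word token as "NN".
pos_tag_annotator <-
    Simple_POS_Tag_Annotator(function(x) rep_len("NN", length(x)))
## Needs sentence and word token annotations, e.g.
## annotate(s, pos_tag_annotator, a2) with s and a2 as in the Examples.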
Simple_Entity_Annotator() creates “simple” entity annotators. Argument
f should be a simple entity detector (“named entity recognizer”) which
takes a character vector giving the word tokens in a sentence, and
returns an annotation object with the word token spans, a ‘kind’
feature giving the kind of the entity detected, and possibly other
features. The generated annotator inherits from the default classes
"Simple_Entity_Annotator" and "Annotator". It uses already available
sentence and word token annotations to extract the word tokens for
each sentence, obtains the results of the simple entity detector for
these, transforms word token spans to character spans and adds unique
ids, and returns the combined entity annotations.
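A minimal sketch of an entity detector and the corresponding annotator
(illustrative only; the capitalization heuristic and the "unknown"
kind are made up for this example):

## A trivial entity detector: capitalized word tokens (other than the
## first token of a sentence) become entities of kind "unknown".  No
## ids are provided: the generated annotator adds these itself.
entity_detector <-
function(x) {
    i <- setdiff(which(grepl("^[[:upper:]]", x)), 1L)
    Annotation(NULL,
               rep.int("entity", length(i)),
               ## Start/end positions refer to word tokens here; the
               ## annotator transforms them to character spans.
               i, i,
               lapply(i, function(p) list(kind = "unknown")))
}
entity_annotator <- Simple_Entity_Annotator(entity_detector)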
Simple_Chunk_Annotator() creates “simple” chunk annotators. Argument f
should be a simple chunker, which takes as arguments character vectors
giving the word tokens and the corresponding POS tags, and returns
either a character vector with the chunk tags, or a list of feature
lists with the tags as ‘chunk_tag’ feature and possibly other
features. The generated annotator inherits from the default classes
"Simple_Chunk_Annotator" and "Annotator". It uses already available
annotations to extract the word tokens and POS tags for each sentence,
obtains the results of the simple chunker for these, and returns word
token annotations with the chunk features (only).
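A minimal sketch of a chunk annotator built from a trivial chunker
(illustrative only; the chunk tags used are made up):

## A trivial chunker: word tokens whose POS tag starts with "NN" become
## single-token noun phrase chunks ("B-NP"), everything else is tagged
## as outside any chunk ("O").
chunk_annotator <-
    Simple_Chunk_Annotator(function(tokens, tags)
                           ifelse(grepl("^NN", tags), "B-NP", "O"))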
Simple_Stem_Annotator() creates “simple” stem annotators. Argument f
should be a simple stemmer, which takes as argument a character vector
giving the word tokens, and returns a character vector with the
corresponding word stems. The generated annotator inherits from the
default classes "Simple_Stem_Annotator" and "Annotator". It uses
already available annotations to extract the word tokens, and returns
word token annotations with the corresponding stem features (only).
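A minimal sketch of a stem annotator built from a trivial stemmer
(illustrative only; real applications would typically use a proper
stemming function):

## A trivial stemmer: lower-case the word tokens and strip a final "s".
stem_annotator <-
    Simple_Stem_Annotator(function(x) sub("s$", "", tolower(x)))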
In all cases, if the underlying simple processing function returns
annotation objects, these should not provide their own ids (or use
such in the features), as the generated annotators will necessarily
provide these (the already available annotations are only available at
the annotator level, not at the simple processing level).
Value:

An annotator object inheriting from the given classes and the default
ones.
See Also:

Package openNLP, which provides annotator generators for sentence and
word tokens, POS tags, entities and chunks, using processing functions
based on the respective Apache OpenNLP MaxEnt processing resources.
Examples:

## A simple text.
s <- String(" First sentence. Second sentence. ")
## ****5****0****5****0****5****0****5**
## A very trivial sentence tokenizer.
sent_tokenizer <-
function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
## (Could also use Regexp_Tokenizer() with the above regexp pattern.)
sent_tokenizer(s)
## A simple sentence token annotator based on the sentence tokenizer.
sent_token_annotator <- Simple_Sent_Token_Annotator(sent_tokenizer)
sent_token_annotator
a1 <- annotate(s, sent_token_annotator)
a1
## Extract the sentence tokens.
s[a1]
## A very trivial word tokenizer.
word_tokenizer <-
function(s) {
    s <- as.String(s)
    ## Remove the last character (should be a period when using
    ## sentences determined with the trivial sentence tokenizer).
    s <- substring(s, 1L, nchar(s) - 1L)
    ## Split on whitespace separators.
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
lapply(s[a1], word_tokenizer)
## A simple word token annotator based on the word tokenizer.
word_token_annotator <- Simple_Word_Token_Annotator(word_tokenizer)
word_token_annotator
a2 <- annotate(s, word_token_annotator, a1)
a2
## Extract the word tokens.
s[subset(a2, type == "word")]
## A simple word token annotator based on wordpunct_tokenizer():
word_token_annotator <-
    Simple_Word_Token_Annotator(wordpunct_tokenizer,
                                list(description =
                                     "Based on wordpunct_tokenizer()."))
word_token_annotator
a2 <- annotate(s, word_token_annotator, a1)
a2
## Extract the word tokens.
s[subset(a2, type == "word")]