AnnotationPipeline: AnnotationPipeline Class.

Description

Worker behind the higher-level StanfordCoreNLP class that allows fine-grained configuration of an annotation pipeline; see the documentation of CoreNLP Pipelines. The $annotate() method supports processing annotations in parallel. Unlike the StanfordCoreNLP$process_files() method, which processes the content of files in parallel, it is a very efficient in-memory operation and the fastest option for processing medium-sized corpora. However, annotations consume a lot of memory, so the heap space that can be allocated limits the parallel in-memory processing of larger corpora. If heap space is insufficient, the process may run endlessly without an informative warning or error message. Use the $annotate() method with appropriate care.
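
The heap size of a JVM is fixed once the JVM has started, so it has to be requested before the JVM is initialized. The following is a minimal sketch, assuming that bignlp starts the JVM through rJava when the package is loaded. The 4 GB value is the figure quoted in the Examples section, and the thread count of 2 is an arbitrary illustration, not a recommendation.

```r
# Assumption: bignlp starts the JVM via rJava when the package is loaded, so
# the heap size has to be set beforehand. "-Xmx4g" requests a 4 GB maximum heap.
options(java.parameters = "-Xmx4g")
library(bignlp)

A <- AnnotationPipeline$new()

# Set the number of threads explicitly instead of using all available threads
anno <- A$annotate(
  c("This is a sentence.", "Yet another sentence."),
  threads = 2L
)
result <- anno$as.data.table()
```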

Public fields

pipeline

AnnotationPipeline

Methods

Public methods


Method new()

Initialize AnnotationPipeline

Usage
AnnotationPipeline$new(corenlp_dir = getOption("bignlp.corenlp_dir"))
Arguments
corenlp_dir

Directory where StanfordCoreNLP resides.


Method annotate()

Annotate a list of strings.

Usage
AnnotationPipeline$annotate(x, threads = NULL, verbose = TRUE)
Arguments
x

A list of character vectors to annotate, an AnnotationList class object or an ArrayList with Annotation objects.

threads

If NULL, all available threads are used; otherwise an integer value giving the number of threads to use.

verbose

A logical value, whether to show progress messages.

Returns

A Java object.
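
For corpora that are too large to annotate in one call, one possible approach is to pass the input in consecutive batches. The sketch below is not part of bignlp: annotate_in_batches() is a hypothetical helper, and toupper() merely stands in for the annotation step so that the code runs without CoreNLP. Whether batching actually lowers peak heap usage depends on when R and the JVM release memory, so treat this as an assumption to verify.

```r
# Hypothetical helper (not part of bignlp): apply `fun` to consecutive batches
# of `x` and return the per-batch results as a list.
annotate_in_batches <- function(x, batch_size, fun) {
  stopifnot(batch_size >= 1L)
  if (length(x) == 0L) return(list())
  # Batch index for every element: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(x) / batch_size)
  lapply(unname(split(x, batch_id)), fun)
}

# Stand-in for the annotation step, so the sketch runs without CoreNLP
batches <- annotate_in_batches(letters[1:5], batch_size = 2L, fun = toupper)
```

With a pipeline object A, the fun argument could be something like function(chunk) A$annotate(chunk, threads = 2L)$as.data.table(), with the per-batch tables combined afterwards.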


Method clone()

The objects of this class are cloneable with this method.

Usage
AnnotationPipeline$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.
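
Cloning an R6 object does not duplicate fields that have reference semantics. Assuming the pipeline field holds a reference to a Java object, a clone would presumably share the same underlying pipeline as the original. The sketch below illustrates the general R6 behavior with a hypothetical stand-in class (Holder) whose pipeline field is an environment; it does not use bignlp itself.

```r
library(R6)

# Hypothetical stand-in class: its `pipeline` field holds an environment, which
# has reference semantics much like a pointer to a Java object.
Holder <- R6Class("Holder",
  public = list(
    pipeline = NULL,
    initialize = function() {
      self$pipeline <- new.env()
    }
  )
)

a <- Holder$new()
b <- a$clone(deep = FALSE)

# Both objects point to the same underlying environment
shared <- identical(a$pipeline, b$pipeline)
```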

Examples

A <- AnnotationPipeline$new()
a <- c("This is a sentence.", "Yet another sentence.")
s <- A$annotate(a)
result <- s$as.data.table()

reuters_txt <- readLines(system.file(package = "bignlp", "extdata", "txt", "reuters.txt"))
B <- AnnotationPipeline$new()
r <- B$annotate(reuters_txt)
result <- r$as.data.table()

## Not run: 
# this will NOT work with 512 MB heap space - 4 GB required
library(polmineR)
gparl_by_date <- corpus("GERMAPARL") %>%
  subset(year %in% 1998) %>%
  split(s_attribute = "date") %>% 
  get_token_stream(p_attribute = "word", collapse = " ") %>%
  as.character()
C <- AnnotationPipeline$new()
anno <- C$annotate(gparl_by_date, threads = 4L)
result <- anno$as.data.table()

## End(Not run)

PolMine/bignlp documentation built on Jan. 29, 2021, 1:14 a.m.