StanfordCoreNLP: StanfordCoreNLP Annotator Class.

Description

StanfordCoreNLP Annotator Class.

Details

The StanfordCoreNLP class exposes the StanfordCoreNLP pipeline for processing text. Its main functionality is made available in R by way of an R6 class. This implementation focuses on using the multithreading capabilities of StanfordCoreNLP from R.

The StanfordCoreNLP pipeline uses multithreading (a) by processing files in parallel. This requires that chunks of text are present as files in one directory. The $process_files() method exposes this functionality. The number of threads to be used is controlled by setting the property "threads" accordingly; see the examples and the vignette. This approach is fast and memory efficient, as it effectively allows a line-by-line approach.

The second approach to multithreading is (b) to process sentences in parallel, i.e. after tokenization and sentence segmentation, further annotation tasks such as POS annotation and named entity recognition (NER) are carried out in parallel. Whether this parallelization is used is controlled by setting the properties "pos.nthreads", "ner.nthreads" and the like. See examples.
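The thread-related properties are passed as strings. The following is a minimal sketch of how a thread count could be derived and kept in one place. The helper name thread_count() and its reserve argument are hypothetical and not part of bignlp; only the property keys come from the text above.

```r
# Hypothetical helper (not part of bignlp): choose a thread count and return
# it as a string, because property values are passed as character vectors.
thread_count <- function(reserve = 1L) {
  n <- parallel::detectCores()
  if (is.na(n)) n <- 1L  # detectCores() can return NA on some platforms
  as.character(max(1L, n - reserve))
}

n_threads <- thread_count()

# Property keys named in the documentation above, collected in a plain list.
thread_props <- list(
  "pos.nthreads" = n_threads,
  "ner.nthreads" = n_threads
)
str(thread_props)
```

Each entry could then be set on a properties object with props$put(key, value), as the Examples section does.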

Super class

bignlp::AnnotationPipeline -> StanfordCoreNLP

Public fields

pipeline

Instance of the StanfordCoreNLP class.

outputter

An outputter (JSON, CoNLL, XML) to generate string output from annotations.

output_format

Which output format to use ("json", "xml", "conll").

properties

A Properties Java object to control multithreading.

Methods

Public methods

Method new()

Usage
StanfordCoreNLP$new(
  corenlp_dir = getOption("bignlp.corenlp_dir"),
  properties,
  output_format = c("xml", "json", "conll")
)
Arguments
corenlp_dir

Directory where StanfordCoreNLP resides.

properties

Either the filename of a properties file or a Java properties object.

output_format

Either "json", "xml", "conll".


Method process()

Process a string.

Usage
StanfordCoreNLP$process(txt, purge = TRUE)
Arguments
txt

A (length-one) character vector to process.

purge

A logical value, whether to preprocess input string txt.

doc_id

An ID to prepend.

Returns

If output_format is "json" or "xml", a string is returned; if output_format is "conll", a data.frame is returned.
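A string result usually needs to be parsed before further use. The sketch below shows one possible way to turn a JSON string into a token table with the jsonlite package. The JSON is a hand-written stand-in, not actual pipeline output: the sentences/tokens layout and the field names are assumptions about the shape of the CoreNLP JSON output, and the tag values are illustrative only.

```r
library(jsonlite)

# Hand-written stand-in for the string returned by process() when
# output_format = "json". ASSUMPTION: a "sentences" array whose elements
# carry a "tokens" array; field names and tag values are illustrative.
json_string <- '{"sentences": [
  {"index": 0, "tokens": [
    {"index": 1, "word": "Das", "pos": "PDS"},
    {"index": 2, "word": "ist", "pos": "VAFIN"}
  ]},
  {"index": 1, "tokens": [
    {"index": 1, "word": "Und", "pos": "KON"}
  ]}
]}'

# Keep the nested list structure so each sentence can be handled separately.
parsed <- fromJSON(json_string, simplifyVector = FALSE)

# Build one data.frame per sentence and stack them.
tokens <- do.call(rbind, lapply(parsed$sentences, function(s) {
  data.frame(
    sentence = s$index,
    token_id = as.integer(vapply(s$tokens, function(tok) tok$index, numeric(1))),
    word = vapply(s$tokens, function(tok) tok$word, character(1)),
    pos = vapply(s$tokens, function(tok) tok$pos, character(1)),
    stringsAsFactors = FALSE
  )
}))
print(tokens)
```

With output_format = "conll" this step is unnecessary, since a data.frame is returned directly.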


Method process_files()

Process all files in the stated directory (argument dir). Parallel processing is possible if the properties object defines a 'threads' key that sets the number of cores to use.

Usage
StanfordCoreNLP$process_files(dir)
Arguments
dir

Directory with files to process (in parallel).

Returns

The method returns (invisibly) the files expected to result from the tagging exercise.
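A rough sketch of file-based processing follows. It assumes that one plain-text file per chunk is an acceptable input layout and that "json" is a usable output format for this method; neither is confirmed by this page. The file naming scheme is hypothetical. The pipeline part only runs if bignlp is installed and a CoreNLP directory is configured.

```r
# Write each text chunk to its own file in a fresh directory.
# ASSUMPTION: one plain-text file per chunk; file names are hypothetical.
docs <- c("This is a first document.", "This is a second document.")
chunk_dir <- file.path(tempdir(), "bignlp_chunks")
dir.create(chunk_dir, showWarnings = FALSE)
chunk_files <- file.path(chunk_dir, sprintf("doc_%03d.txt", seq_along(docs)))
for (i in seq_along(docs)) writeLines(docs[i], con = chunk_files[i])

# Run the pipeline only if bignlp and CoreNLP are available.
if (requireNamespace("bignlp", quietly = TRUE)) {
  corenlp_dir <- getOption("bignlp.corenlp_dir")
  if (!is.null(corenlp_dir) && nzchar(corenlp_dir)) {
    library(bignlp)
    props <- properties(corenlp_get_properties_file(lang = "en", fast = TRUE))
    # "threads" is the key named in the documentation; the value is a string.
    props$put("threads", "2")
    CNLP <- StanfordCoreNLP$new(output_format = "json", properties = props)
    # The return value is invisible, so assign it to inspect the file names.
    result_files <- CNLP$process_files(dir = chunk_dir)
    print(result_files)
  }
}
```

Assigning the result is useful because the method returns the expected output files invisibly.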


Method verbose()

Set whether calls of the class shall be verbose.

Usage
StanfordCoreNLP$verbose(x)
Arguments
x

A logical value. If TRUE, all status messages are shown; if FALSE, only error messages are displayed.

Returns

The class is returned invisibly.


Method clone()

The objects of this class are cloneable with this method.

Usage
StanfordCoreNLP$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples

library(bignlp)

# Install CoreNLP (German models) if no CoreNLP directory is configured
if (getOption("bignlp.corenlp_dir") == "") corenlp_install(lang = "de")

txt <- "Das ist ein Satz. Und das ist ein zweiter Satz."

props_file <- corenlp_get_properties_file(lang = "de")

# JSON output: a string is returned
CNLP <- StanfordCoreNLP$new(output_format = "json", properties = props_file)
j <- CNLP$process(txt = txt)

# XML output: a string is returned
CNLP <- StanfordCoreNLP$new(output_format = "xml", properties = props_file)
x <- CNLP$process(txt = txt)

# CoNLL output: a data.frame is returned
# (named conll to avoid masking base::c)
CNLP <- StanfordCoreNLP$new(output_format = "conll", properties = props_file)
conll <- CNLP$process(txt = txt)


# Java parallelization - processing sentences in parallel

library(data.table)
reuters_txt <- readLines(system.file(package = "bignlp", "extdata", "txt", "reuters.txt"))
dt <- data.table(doc_id = seq_along(reuters_txt), text = reuters_txt)

# Note: java.parameters only takes effect if set before the JVM is initialized
options(java.parameters = "-Xmx4g")

# Use all but one core; property values are passed as strings
n_cores <- as.character(parallel::detectCores() - 1L)
properties_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(properties_file)
props$put("pos.nthreads", n_cores)
props$put("ner.nthreads", n_cores)

CNLP <- StanfordCoreNLP$new(output_format = "conll", properties = props)

# Process the text of the first document
y <- CNLP$process(dt[1][["text"]])

PolMine/bignlp documentation built on Jan. 29, 2021, 1:14 a.m.