StanfordCoreNLP Annotator Class.
The StanfordCoreNLP class exposes the pipeline of StanfordCoreNLP for processing text. Its main functionality is exposed to R by way of an R6 class. The special focus of this implementation is to use the multithreading capacities of StanfordCoreNLP from R.
The StanfordCoreNLP pipeline uses multithreading (a) by processing files in parallel. This requires that chunks of text are present as files in one directory. The $process_files() method exposes this functionality. The number of threads to be used is controlled by setting the property "threads" accordingly, see examples and vignette. This approach is fast and memory efficient, as it effectively allows a line-by-line approach.
The second approach to multithreading is (b) to process sentences in parallel, i.e. after tokenization and sentence segmentation, further annotation tasks such as POS annotation and NER are carried out in parallel. Whether this parallelization is used is controlled by setting the properties "pos.nthreads", "ner.nthreads" and the like. See examples.
bignlp::AnnotationPipeline -> StanfordCoreNLP
pipeline
Instance of the StanfordCoreNLP class.
outputter
An outputter (JSON, CoNLL, XML) to generate string output from annotations.
output_format
Which output format to use ("json", "xml", "conll").
properties
A Properties Java object to control multithreading.
new()
StanfordCoreNLP$new( corenlp_dir = getOption("bignlp.corenlp_dir"), properties, output_format = c("xml", "json", "conll") )
corenlp_dir
Directory where StanfordCoreNLP resides.
properties
Either the filename of a properties file or a Java properties object.
output_format
Either "json", "xml", "conll".
process()
Process a string.
StanfordCoreNLP$process(txt, purge = TRUE)
txt
A (length-one) character vector to process.
purge
A logical value, whether to preprocess input string txt.
doc_id
An ID to prepend.
If output_format is "json" or "xml", a string is returned; if output_format is "conll", a data.frame is returned.
process_files()
Process all files in the stated directory (argument dir). Parallel processing is possible if a "threads" key is defined in the properties object and sets the number of cores to use.
StanfordCoreNLP$process_files(dir)
dir
Directory with files to process (in parallel).
The method returns (invisibly) the files expected to result from the tagging exercise.
verbose()
Set whether calls of the class shall be verbose.
StanfordCoreNLP$verbose(x)
x
A logical value. If TRUE, all status messages are shown; if FALSE, only error messages are displayed.
The class is returned invisibly.
clone()
The objects of this class are cloneable with this method.
StanfordCoreNLP$clone(deep = FALSE)
deep
Whether to make a deep clone.
if (getOption("bignlp.corenlp_dir") == "") corenlp_install(lang = "de")
txt <- "Das ist ein Satz. Und das ist ein zweiter Satz."
props_file <- corenlp_get_properties_file(lang = "de")
CNLP <- StanfordCoreNLP$new(output_format = "json", properties = props_file)
j <- CNLP$process(txt = txt)
CNLP <- StanfordCoreNLP$new(output_format = "xml", properties = props_file)
x <- CNLP$process(txt = txt)
CNLP <- StanfordCoreNLP$new(output_format = "conll", properties = props_file)
c <- CNLP$process(txt = txt)
# Java parallelization - processing sentences in parallel
library(data.table)
reuters_txt <- readLines(system.file(package = "bignlp", "extdata", "txt", "reuters.txt"))
dt <- data.table(doc_id = 1L:length(reuters_txt), text = reuters_txt)
options(java.parameters = "-Xmx4g")
n_cores <- as.character(parallel::detectCores() - 1L)
properties_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(properties_file)
props$put("pos.nthreads", n_cores) # n_cores is already a character string
props$put("ner.nthreads", n_cores)
CNLP <- StanfordCoreNLP$new(output_format = "conll", properties = props)
y <- CNLP$process(dt[1][["text"]])
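# Java parallelization - processing files in parallel
# A minimal sketch, not taken from the package vignette. It assumes that
# process_files() picks up every file in `dir` and that the "threads" key
# is read from the properties object, as described for process_files().
# The directory name, the file naming scheme and the choice of "json" as
# output format are illustrative assumptions.
chunk_dir <- file.path(tempdir(), "bignlp_chunks")
dir.create(chunk_dir, showWarnings = FALSE)
for (i in seq_along(reuters_txt)) {
  # one document per file, so that files can be processed in parallel
  writeLines(reuters_txt[i], con = file.path(chunk_dir, sprintf("chunk_%03d.txt", i)))
}
props <- properties(properties_file)
props$put("threads", n_cores)
CNLP <- StanfordCoreNLP$new(output_format = "json", properties = props)
# returns (invisibly) the files expected to result from the annotation
outfiles <- CNLP$process_files(dir = chunk_dir)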