As Natural Language Processing (NLP) has evolved, various software tools for standard tasks such as tokenization, sentence segmentation, part-of-speech annotation, named entity recognition or dependency parsing have been developed. Well-established and powerful tools are typically implemented in Java or Python, yet several R wrapper packages such as OpenNLP, cleanNLP, coreNLP, spacyr or udpipe expose this functionality to the R community.
So why yet another NLP R package? As more data becomes available and the texts to process get bigger, one limitation of the state of affairs in the R domain is that the existing packages do not cope well with large volumes of text. The thrust of the bignlp package is to offer a fast and memory-efficient workflow based on CoreNLP, an established, high-quality standard tool.
The default workflow envisaged by the bignlp package pursues a split-apply-combine strategy that entails
(a) splitting up a corpus into segments of text that are saved to disk,
(b) the (parallel) processing of these segments using CoreNLP, and
(c) parsing and combining the results into an R output format, such as a data.table.
Input data and output data (the latter can be considerably larger than the input, as annotation layers are added) need not be kept in memory at the same time, making this approach memory-efficient. And as this approach makes full use of the ability of CoreNLP to process documents in parallel, it is fast and capable of processing big data efficiently.
The machine should have at least 8 GB of RAM, as we recommend allocating 4 GB of memory to the Java Virtual Machine (JVM). You can do this as follows.
```r
options(java.parameters = "-Xmx4g")
```
There needs to be sufficient free disk space. Depending on the output format you opt for, the space required for temporary files can be extensive. If you save these files to a temporary directory, make sure that the volume or partition hosting the R temporary directory has enough free space.
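As a quick check (a sketch, assuming a Unix-like system where the df utility is available), you can inspect where R writes temporary files and how much space is free on that volume:

```r
# Location of the R temporary directory and free space on its volume
# (df is available on Linux and macOS; on Windows, check the drive instead).
tempdir()
system(paste("df -h", shQuote(tempdir())))
```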
The pipeline will work on a single core, but multithreading is fully supported: using more cores speeds things up. If you run the pipeline within a virtual machine (VM), check the number of cores available to the VM. A common approach is to use all but one of the available cores for multithreaded operations.
```r
no_cores <- parallel::detectCores() - 1L
no_cores
```
CoreNLP runs within a JVM which ensures portability. There is no known limitation to using the pipeline on Linux, macOS and Windows machines.
CoreNLP is implemented in Java and runs in a Java Virtual Machine (JVM), so Java needs to be installed. More specifically, Java 8 (Oracle Java) is flagged as a prerequisite of the CoreNLP version used (v4.2.0); see the information on the CoreNLP download page. Download Java from the Oracle website or install an OpenJDK distribution.
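Once a Java runtime is installed, a quick shell check (here issued from R) shows which runtime is on the PATH; this is a generic sanity check, not bignlp functionality:

```r
# Print the version of the Java runtime found on the PATH
# (java -version writes its output to stderr).
system2("java", "-version")
```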
The rJava package is the interface for using Java code from R. Configuring rJava can be painful. Very often, running `R CMD javareconf` in the terminal solves issues. Once Java is installed and working, you can load the bignlp package.
```r
library(bignlp)
```
Upon loading the bignlp package, a JVM is initialized. The package includes auxiliary functionality to check which Java implementation runs the JVM and which Java version is used.
```r
jvm_name()
jvm_version()
```
As CoreNLP is based on Oracle Java 8, you would ideally see "Java(TM) SE Runtime Environment". If you see "OpenJDK Runtime Environment", this indicates that you are using OpenJDK, not Oracle Java as recommended by CoreNLP. However, our experience is that CoreNLP works nicely with OpenJDK too. If the licensing conditions of Oracle Java are a concern for you, using OpenJDK does not appear to be a limitation.
The CoreNLP code jar and models for specific languages can be downloaded from the CoreNLP website without restrictions. Check the presence of the CoreNLP code jar as follows, and perform the download if necessary. In this vignette, we avoid downloading a language model (argument lang is NULL).
```r
library(bignlp)
if (getOption("bignlp.corenlp_dir") == "") corenlp_install(lang = NULL)
```
Another way to download Stanford CoreNLP is to (ab)use the installation mechanism included in the cleanNLP package.
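As a sketch, this is how that download helper is called; it assumes cleanNLP 2.x, which still shipped a CoreNLP backend (later releases dropped it, so treat this as version-dependent):

```r
# Download CoreNLP via cleanNLP's installer - available in cleanNLP 2.x,
# removed in later versions of the package.
cleanNLP::cnlp_download_corenlp()
```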
Apart from the directory with the code jars, the location of a properties file to configure the annotators is required. Again, use the system.file() function to find out where that is.
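For instance (illustrative; the sub-directory shown is the one used for the sample data later in this vignette):

```r
# system.file() resolves paths inside an installed package, e.g. the
# extdata directory shipped with bignlp.
system.file(package = "bignlp", "extdata")
```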
Note that it is not necessary to use cleanNLP for downloading Stanford CoreNLP. CoreNLP and properties files can be stored anywhere on your system, and functions of the bignlp package take the paths as input.
In our example, we will use an excerpt from the Reuters corpus. The articles of the corpus are included as sample data in the package. We generate a data.table with the columns "doc_id" (integer ids) and "text" that will serve as input for the subsequent steps.
```r
library(data.table)
reuters_txt <- readLines(system.file(package = "bignlp", "extdata", "txt", "reuters.txt"))
reuters_tab <- data.table(doc_id = 1L:length(reuters_txt), text = reuters_txt)
```
Running CoreNLP may require substantial memory. With 4 GB, you are on the safe side. Set the memory limit for the JVM before doing anything else; note that setting the memory allocated to the JVM was the very first code executed in this vignette. It is necessary to do this before rJava and/or bignlp are loaded.
Please note: Other packages (such as openNLP) that interface with Java may already have instantiated a JVM, possibly with less memory than necessary for running CoreNLP. Once a JVM has been initialized with insufficient memory, the allocation cannot be increased afterwards.
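A minimal sketch to verify that the heap setting took effect, using the standard Java Runtime API via rJava:

```r
# Query the maximum heap size of the running JVM (in GB);
# with -Xmx4g this should report a value close to 4.
runtime <- rJava::J("java.lang.Runtime")$getRuntime()
runtime$maxMemory() / 1024^3
```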
The default approach of CoreNLP is to read the configuration of the annotators from a properties file. There is much more flexibility if we generate a Java Properties object that we can manipulate from R. This is what the properties() function does.
```r
props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(x = props_file)
```
As we want to use parallelization, we set the number of threads in the properties object.
```r
properties_set_threads(props, no_cores)
```
Our first step is to instantiate the pipeline for processing files. At this point, we define the output format (argument output_format) we want to use. Here, we use the CoNLL output format, which is closest to the tabular data formats that can be processed quickly by R.
```r
Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")
```
While it is good to see the messages issued on the initialization of the different annotators, there would be too many status messages when files are processed. Therefore, we switch off the verbosity of the annotator.
```r
Pipe$verbose(FALSE)
```
Step 1 - Split: Create directories with text segments.
```r
segdirs <- segment(x = reuters_tab, dir = (nlpdir <- tempdir()), chunksize = 10L)
```
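To verify the split, you can peek into the first segment directory (the exact file layout produced by segment() is an assumption here):

```r
# Each segment directory should contain the text chunks written to disk.
list.files(segdirs[[1]])
```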
Step 2 - Apply: Process the files.
```r
conll_files <- lapply(segdirs, Pipe$process_files)
```
Step 3 - Combine: Parse and combine the CoNLL output.
```r
Sys.sleep(0.5) # Java may still be working while R is moving on - then files are missing
dt <- corenlp_parse_conll(conll_files, progress = FALSE)
```
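A more robust alternative to the fixed sleep is to poll until all expected output files exist; wait_for_files() below is a hypothetical helper sketched for illustration, not part of bignlp:

```r
# Poll until all expected CoNLL files have been written (with a timeout),
# instead of hoping that 0.5 seconds is always enough.
wait_for_files <- function(files, timeout = 30) {
  t0 <- Sys.time()
  while (!all(file.exists(files))) {
    if (difftime(Sys.time(), t0, units = "secs") > timeout)
      stop("timed out waiting for CoreNLP output files")
    Sys.sleep(0.1)
  }
  invisible(files)
}
wait_for_files(unlist(conll_files))
```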
```r
DT::datatable(dt[1:1000,])
```
The three steps can also be chained concisely using the magrittr pipe:

```r
library(magrittr)
dt <- segment(x = reuters_tab, dir = tempdir(), chunksize = 10L) %>%
  lapply(Pipe$process_files) %>%
  corenlp_parse_conll()
DT::datatable(dt[1:100,])
```
A second workflow envisaged by CoreNLP for multithreaded annotation works without saving files to disk. After splitting a segment of text into tokens and sentences, the sentences are processed in parallel. So only part of the annotators (POS annotation, named entity recognition etc.) is parallelized. This in-memory operation is less parsimonious concerning memory usage - the corpus and the result are kept in memory throughout - but may have its merits because saving, reloading and parsing data is not necessary.
As before, we start by instantiating a Properties object with a basic (and fast) NLP annotation configuration for English text.
```r
props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(props_file)
```
To inform the annotator that steps subsequent to sentence segmentation shall be parallelized, we add respective settings for the POS annotation and lemmatization steps.
props$put("pos.nthreads", as.character(no_cores)) props$put("lemma.nthreads", as.character(no_cores))
Now we can run the annotation. The data to be processed needs to be a data.table that contains the columns 'doc_id' (unique integer values at this stage) and 'text' (character).
```r
reuters_annotated <- corenlp_annotate(reuters_tab, properties = props, progress = FALSE)
```
... and this is the resulting data.table.
```r
DT::datatable(reuters_annotated[1:1000,])
```
A third approach processes documents of text in parallel from the outset. It takes a list of annotation objects as input.
```r
alist <- AnnotationList$new(reuters_tab[["text"]])
```
We then call the $annotate() method of the annotation pipeline on this object.
```r
Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")
Pipe$annotate(alist)
```
There are two important issues to note here:
(a) The annotate() method is inherited from the AnnotationPipeline superclass of the StanfordCoreNLP class.
(b) The annotation list is modified in place; there is no return value of the $annotate() method. This may feel somewhat unusual in the R context, but the in-place modification of the annotation object contributes to memory efficiency.
The AnnotationList object is an R6 class with an $as.data.table() method that derives a data.table from the annotations that have been generated.
```r
dt3 <- alist$as.data.table()
```
Inspecting the result, we see it is the same result as with the two other workflows.
```r
DT::datatable(dt3)
```
Note that method chaining is possible with R6 class objects:
```r
dt4 <- StanfordCoreNLP$new(properties = props, output_format = "conll")$
  annotate(reuters_tab[["text"]])$
  as.data.table()
```
Properties are the most important instrument for defining the annotation pipeline and controlling its behavior. The conventional format is a properties file, as follows.
```r
props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
readLines(props_file)
```
A quick way to customize properties is to define a list, see the following example.
```r
propslist <- list(
  "annotators" = "tokenize, ssplit, pos, lemma, ner",
  "tokenize.language" = "de",
  "tokenize.postProcessor" = "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor",
  "pos.model" = "edu/stanford/nlp/models/pos-tagger/german-ud.tagger",
  "pos.nthreads" = as.character(parallel::detectCores() - 1L), # parallelize POS tagging
  "ner.model" = "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz",
  "ner.applyNumericClassifiers" = "false",
  "ner.applyFineGrained" = "false",
  "ner.useSUTime" = "false",
  "ner.nthreads" = as.character(parallel::detectCores() - 1L) # parallelize NER
)
properties(propslist)
```
The pipeline can also emit JSON, which corenlp_parse_json() turns into a data.frame:

```r
J <- StanfordCoreNLP$new(properties = props_file, output_format = "json")
reuters_json <- J$process(reuters_tab[1][["text"]])
df <- corenlp_parse_json(reuters_json)
```
XML output is handled analogously:

```r
X <- StanfordCoreNLP$new(properties = props_file, output_format = "xml")
reuters_xml <- X$process(reuters_tab[1][["text"]])
y <- corenlp_parse_xml(reuters_xml)
```
And the same for the CoNLL output format:

```r
C <- StanfordCoreNLP$new(properties = props_file, output_format = "conll")
reuters_conll <- C$process(reuters_tab[1][["text"]])
```
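Since the CoNLL result is plain tab-separated text, it can also be read into a table directly; note that the columns you get depend on the configured annotators (sketch):

```r
# Read the tab-separated CoNLL string into a data.table; sentences are
# separated by blank lines, which fread skips here.
conll_dt <- data.table::fread(text = reuters_conll, sep = "\t",
                              header = FALSE, blank.lines.skip = TRUE)
```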
[to be written]
A somewhat advanced yet common scenario is to add an annotation layer to a corpus that has already been tokenized. The corpus might be stored in some kind of database or corpus management system. It is essential not to change the sequence of tokens. Based on the previous building blocks, we present two workflows that maintain the tokenization and the sequence of tokens as is.
The first workflow builds on the CoreNLP whitespace tokenizer. Using this most basic tokenizer is very useful if the data has already been tokenized.
To walk through this approach, we use the tokenized version of the REUTERS corpus that is included as sample data in the package. Your data may look different, but if you work with tokenized data, it will be easy to generate this tabular data format.
```r
DT::datatable(reuters_dt)
```
To prepare adding an annotation layer, we generate whitespace-separated strings, one for each document in the corpus. We split the table based on the document id (column 'doc_id') and concatenate the tokens (column 'word') of each resulting table into one string.
```r
ts_ws <- lapply(
  split(reuters_dt, f = reuters_dt[["doc_id"]]),
  function(tab) paste(tab[["word"]], collapse = " ")
)
```
To keep things simple (and fast) here, we will just add a sentence annotation. In real-life scenarios, you might want to add annotations "further down the NLP road" such as POS, NER etc.
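For illustration, a properties list for such a richer pipeline might look as follows (a sketch; the annotator names are the standard CoreNLP ones used earlier in this vignette):

```r
# Whitespace tokenization combined with POS tagging, lemmatization and NER.
properties_list_full <- list(
  "annotators" = "tokenize, ssplit, pos, lemma, ner",
  "tokenize.whitespace" = "true"
)
```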
The crucial step is to ensure that a plain and simple whitespace tokenizer is used. We can do so using the properties that configure the NLP pipeline: setting the property "tokenize.whitespace" to "true" will do the job. Then we instantiate the pipeline with these properties.
```r
properties_list <- list(
  "annotators" = "tokenize, ssplit",
  "tokenize.whitespace" = "true"
)
Pipe <- StanfordCoreNLP$new(properties = properties_list, output_format = "conll")
```
The rest is not new. We turn the vector of whitespace-separated documents into an AnnotationList, run the pipeline on the Annotation objects (in parallel), and convert the annotated data into a data.table.
```r
annoli <- AnnotationList$new(ts_ws)
Pipe$annotate(annoli)
reuters_dt_v2 <- annoli$as.data.table()
```
We now have a table with the same number of rows (i.e. the same number of tokens) as the (tokenized) input data. It should only be a technicality to write the annotation layers that have been added back to your data source (i.e. database or corpus management system).
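A minimal sketch of such a write-back, assuming the annotated table contains a 'sentence' column (a hypothetical column name) and that row order matches the input:

```r
# Row order is preserved, so added annotation columns can be copied back
# by position onto the original tokenized table.
stopifnot(nrow(reuters_dt) == nrow(reuters_dt_v2))
reuters_dt[, sentence := reuters_dt_v2[["sentence"]]]
```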
There is a second workflow for pre-tokenized data that avoids generating a concatenated string altogether. The string-building step may appear somewhat inefficient: wouldn't it be smarter to generate the Java Annotation objects directly?
Indeed, skipping the generation of a concatenated whitespace-delimited string could potentially speed things up. However, the current implementation of instantiating Java Annotation objects requires going back and forth between R and Java for every single token. This is very inefficient and degrades performance dramatically. Thus, we present this approach as a proof of concept rather than for real-life scenarios.
```r
anno_objs <- lapply(
  split(reuters_dt, by = "doc_id"),
  function(dt_sub){
    df <- data.frame(word = as.data.frame(dt_sub)[, "word"])
    as.Annotation(df)
  }
)
anno_list <- AnnotationList$new()
anno_list$obj <- rJava::.jnew("java.util.Arrays")$asList(rJava::.jarray(anno_objs))
Pipe$annotate(anno_list)
reuters_dt_v3 <- anno_list$as.data.table()
```
To explain: We generate Annotation objects from tabular input data and turn the list of Annotation objects into a proper Java ArrayList that is assigned to a new R AnnotationList object. This is annotated using the pipeline we instantiated for approach #1.
This code is not run when preparing the vignette because it is slow. If you run it, you will see that it is slow indeed, so we recommend using the first approach.
Enjoy!
A good and conventional place for installing a tool such as CoreNLP on Linux and macOS is the /opt directory. So from a terminal, create a directory for CoreNLP, go into it, download the zip archive, unzip it, and remove the zip file. Note that sudo rights may be necessary to write to the /opt directory.
```{sh download_corenlp, eval = FALSE}
mkdir /opt/stanford-corenlp
cd /opt/stanford-corenlp
wget http://nlp.stanford.edu/software/stanford-corenlp-4.2.0.zip
unzip stanford-corenlp-4.2.0.zip
rm stanford-corenlp-4.2.0.zip
```
### Install language model for English

```{sh download_english_model, eval = FALSE}
wget http://nlp.stanford.edu/software/stanford-corenlp-4.2.0-models-english.jar
```
### Install language model for German

We illustrate downloading the model for a specific language using German as an example. We go into the CoreNLP directory and download the model to this place.
```{sh download_german_model, eval = FALSE}
cd stanford-corenlp-4.2.0
wget http://nlp.stanford.edu/software/stanford-corenlp-4.2.0-models-german.jar
```
The jar file with the model includes a default properties file for processing German data (StanfordCoreNLP-german.properties). You can see this by displaying the content of the jar as follows.

```{sh inspect_jar, eval = FALSE}
jar tf stanford-corenlp-4.2.0-models-german.jar | grep "properties" # list properties files in the jar
```
The inclusion of this properties file in the jar can become a problem: if you want to configure the parser yourself, the properties file included in the jar may override any other properties you want to use.
The solution we found to work is to (a) extract the properties file from the jar and (b) remove it from the jar.
```{sh modify_model_jar, eval = FALSE}
unzip stanford-corenlp-4.2.0-models-german.jar StanfordCoreNLP-german.properties
zip -d stanford-corenlp-4.2.0-models-german.jar StanfordCoreNLP-german.properties
```
Note that the bignlp package already includes a properties file that has been edited for annotating large amounts of data quickly. You will find it as follows:

```r
options(bignlp.properties_file = corenlp_get_properties_file(lang = "de", fast = TRUE))
```