Rationale of the bignlp-package

As Natural Language Processing (NLP) has evolved, various software tools for standard tasks such as tokenization, sentence segmentation, part-of-speech annotation, named entity recognition or dependency parsing have been developed. Well-established and powerful tools are typically implemented in Java or Python, but R wrapper packages such as openNLP, cleanNLP, coreNLP, spacyr or udpipe expose this functionality to the R community.

So why yet another NLP R package? As more and more text data becomes available and the corpora to be processed grow, a limitation of the existing R packages is that they are not designed to process large volumes of text efficiently. The thrust of the bignlp package is to offer a fast and memory-efficient workflow based on CoreNLP, an established, high-quality standard tool.

The default workflow envisaged by the bignlp package pursues a split-apply-combine strategy that entails:

(a) splitting up a corpus into segments of text that are saved to disk,

(b) processing these segments (in parallel) using CoreNLP, and

(c) parsing and combining the results into an R output format such as a data.table.

Input and output data (the latter can be considerably larger than the input, as annotation layers are added) need not be kept in memory at the same time, which makes this approach memory-efficient. And as the approach makes full use of the ability of CoreNLP to process documents in parallel, it is fast and capable of processing big data efficiently.

Prerequisites

Hardware Requirements

The machine should have at least 8 GB of RAM, as we recommend allocating 4 GB of memory to the Java Virtual Machine (JVM). You can do this as follows.

options(java.parameters = "-Xmx4g")

There needs to be sufficient free disk space. Depending on the output format you opt for, the space required for temporary files can be substantial. If you save these files to a temporary directory, make sure that the volume or partition holding the R temporary directory is sufficiently sized.
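
To see where R writes its temporary files, inspect tempdir(). Redirecting temporary files to a larger volume via the TMPDIR environment variable is a standard R mechanism, not specific to bignlp.

tempdir() # the volume/partition holding this path needs sufficient free space
# To use a different location, set TMPDIR (TMP/TEMP on Windows) before starting R,
# e.g. in ~/.Renviron: TMPDIR=/data/tmp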

The pipeline will work on a single core, but multithreading is actively supported and using more cores speeds things up. If you run the pipeline within a virtual machine (VM), check the number of cores available to the VM. A common approach is to use all but one of the available cores for multithreaded operations.

no_cores <- parallel::detectCores() - 1L
no_cores

System Requirements

Operating System

CoreNLP runs within a JVM which ensures portability. There is no known limitation to using the pipeline on Linux, macOS and Windows machines.

Java and rJava

CoreNLP is implemented in Java, so Java needs to be installed, as CoreNLP runs in a Java Virtual Machine (JVM). More specifically, Java 8 (Oracle Java) is flagged as a prerequisite of the CoreNLP version used (v4.2.0); see the information on the CoreNLP download page. Java can be obtained from the Oracle website or installed as OpenJDK via the package manager of your operating system.

The rJava package is the interface for using Java code from R. Configuring rJava can be painful; very often, running R CMD javareconf in a terminal solves issues. Once Java is installed and working, you can load the bignlp package.

library(bignlp)

Upon loading the bignlp package, a JVM is initialized. The package includes auxiliary functions to check which Java implementation is used to run the JVM and which Java version is available.

jvm_name()
jvm_version()

As CoreNLP is based on Oracle Java 8, you would ideally see "Java(TM) SE Runtime Environment". If you see "OpenJDK Runtime Environment" instead, you are using OpenJDK rather than the Oracle Java recommended for CoreNLP. However, our experience is that CoreNLP works nicely with OpenJDK too. So if the licensing conditions of Oracle Java are an issue for you, using OpenJDK does not seem to be a limitation.

CoreNLP

The CoreNLP code jar and models for specific languages can be downloaded from the CoreNLP website without restrictions. Check the presence of the CoreNLP code jar as follows and perform the download if necessary. In this vignette, we avoid downloading a language model (argument lang is NULL).

library(bignlp)
if (getOption("bignlp.corenlp_dir") == "") corenlp_install(lang = NULL)

Another way to download Stanford CoreNLP is to (ab)use the installation mechanism included in the cleanNLP package.

Apart from the directory with the code jars, the location of a properties file to configure the annotators is required. Use the system.file() function to find out where that is.
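
A minimal sketch of such a lookup, assuming the properties files ship in the package's extdata directory (the exact sub-path is an assumption; the corenlp_get_properties_file() function used below wraps this kind of lookup):

# List properties files shipped with bignlp; the "extdata" sub-directory is an assumption.
list.files(
  system.file(package = "bignlp", "extdata"),
  pattern = "\\.properties$", recursive = TRUE, full.names = TRUE
)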

Note that it is not necessary to use cleanNLP for downloading Stanford CoreNLP. CoreNLP and properties files can be stored anywhere on your system, and functions of the bignlp package take the paths as input.

Workflow 1: A split-apply-combine approach

Preparations

Sample data

In our example, we will use an excerpt from the Reuters corpus. The articles of the corpus are included as sample data in the package. We generate a data.table with the columns "doc_id" (integer ids) and "text" that will serve as input for the subsequent steps.

library(data.table)
reuters_txt <- readLines(system.file(package = "bignlp", "extdata", "txt", "reuters.txt"))
reuters_tab <- data.table(doc_id = seq_along(reuters_txt), text = reuters_txt)

Allocate sufficient memory

Running CoreNLP may require substantial memory. With 4 GB, you are on the safe side. Set the memory limit for the JVM before doing anything else; note that setting the memory allocated to the JVM was the very first code executed in this vignette. This needs to happen before rJava and/or bignlp are loaded.

Please note: Other packages interfacing with Java (such as openNLP) may already have instantiated a JVM with less memory than is necessary for running CoreNLP. Once a JVM has been initialized, the memory allocated to it cannot be increased afterwards.
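
If you are unsure whether the running JVM has enough heap space, you can query it via rJava. This is a quick check, assuming rJava is loaded and a JVM has been initialized.

# Maximum heap size of the running JVM, in GB.
runtime <- rJava::J("java.lang.Runtime")$getRuntime()
runtime$maxMemory() / 1024^3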

Configure properties

The default approach of CoreNLP is to read the configuration of annotators from a properties file. We gain much more flexibility by generating a Java Properties object that we can manipulate from R. This is what the properties() function does.

props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(x = props_file)

As we want to use parallelization, we set the number of threads in the properties object.

properties_set_threads(props, no_cores)

Instantiate pipeline

Our first step is to instantiate the pipeline for processing files. At this point, we define the output format (argument output_format) we want to use. Here, we use the CoNLL output format, which is closest to the tabular data formats that can be processed quickly by R.

Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")

While it is instructive to see the messages issued when the different annotators are initialized, there would be far too many status messages once files are processed. Therefore, we switch off the verbosity of the annotator.

Pipe$verbose(FALSE)

Step 1 - Split: Create directories with text segments

segdirs <- segment(x = reuters_tab, dir = (nlpdir <- tempdir()), chunksize = 10L)
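
A quick sanity check of the split step (the exact naming of the files within the segment directories is left to segment()):

length(segdirs) # number of segment directories created
head(list.files(segdirs[[1]])) # peek at the files in the first segment directory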

Step 2 - Apply: Process files

conll_files <- lapply(segdirs, Pipe$process_files)

Step 3 - Combine: Parse and combine CoNLL output

Sys.sleep(0.5) # Java may still be writing files while R moves on; otherwise files could be missing
dt <- corenlp_parse_conll(conll_files, progress = FALSE)

Inspecting the result

DT::datatable(dt[1:1000,])

Piping

library(magrittr)
dt <- segment(x = reuters_tab, dir = tempdir(), chunksize = 10L) %>%
  lapply(Pipe$process_files) %>%
  corenlp_parse_conll()
DT::datatable(dt[1:100,])

Workflow 2: In-memory multithreading of sentences

A second workflow envisaged by CoreNLP for multithreaded annotation works without saving files to disk. After a segment of text has been split into tokens and sentences, the sentences are processed in parallel. So only a subset of the annotators (POS annotation, named entity recognition etc.) is parallelized. This in-memory operation is less parsimonious with regard to memory usage - the corpus and the result are kept in memory throughout - but it may have its merits, because saving, reloading and parsing the data is not necessary.

As before, we start by instantiating a Properties object with a basic (and fast) NLP annotation configuration for English text.

props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
props <- properties(props_file)

To inform the annotator that steps subsequent to sentence segmentation shall be parallelized, we add respective settings for the POS annotation and lemmatization steps.

props$put("pos.nthreads", as.character(no_cores))
props$put("lemma.nthreads", as.character(no_cores))

Now we can run the annotation. The data to be processed needs to be a data.table that contains the columns 'doc_id' (unique integer values at this stage) and 'text' (character).

reuters_annotated <- corenlp_annotate(reuters_tab, properties = props, progress = FALSE)

... and this is the resulting data.table.

DT::datatable(reuters_annotated[1:1000,])

Workflow 3: Processing annotations in parallel

A third approach processes documents in parallel from the outset. It takes a list of annotation objects as input.

alist <- AnnotationList$new(reuters_tab[["text"]])

We then call the $annotate() method of the annotation pipeline on this object.

Pipe <- StanfordCoreNLP$new(properties = props, output_format = "conll")
Pipe$annotate(alist)

There are two important issues to note here:

(a) The annotate() method is inherited from the AnnotationPipeline superclass of the StanfordCoreNLP class.

(b) The annotation list is modified in place; there is no return value of the $annotate() method. This may feel somewhat unusual in the R context, but the in-place modification of the annotation object contributes to memory efficiency.

The AnnotationList object is an R6 class with an $as.data.table() method that derives a data.table from the annotations that have been generated.

dt3 <- alist$as.data.table()

Inspecting the result, we see it is the same result as with the two other workflows.

DT::datatable(dt3)

Note that method chaining is possible with R6 class objects:

dt4 <- StanfordCoreNLP$new(properties = props, output_format = "conll")$
  annotate(reuters_tab[["text"]])$
  as.data.table()

Digging deeper

Understanding properties

Properties are the most important instrument for defining the annotation pipeline and controlling its behavior. The conventional format is a properties file, which looks as follows.

props_file <- corenlp_get_properties_file(lang = "en", fast = TRUE)
readLines(props_file)

A quick way to customize properties is to define a list, see the following example.

propslist <- list(
  "annotators" = "tokenize, ssplit, pos, lemma, ner",
  "tokenize.language" = "de",
  "tokenize.postProcessor" = "edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor",
  "pos.model" = "edu/stanford/nlp/models/pos-tagger/german-ud.tagger",
  "pos.nthreads" = as.character(parallel::detectCores() - 1L), # THIS
  "ner.model" = "edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz",
  "ner.applyNumericClassifiers" = "false",
  "ner.applyFineGrained" = "false",
  "ner.useSUTime" = "false",
  "ner.nthreads" = as.character(parallel::detectCores() - 1L)
)
properties(propslist)

Output formats

JSON

J <- StanfordCoreNLP$new(properties = props_file, output_format = "json")
reuters_json <- J$process(reuters_tab[1][["text"]])
df <- corenlp_parse_json(reuters_json)

XML

X <- StanfordCoreNLP$new(properties = props_file, output_format = "xml")
reuters_xml <- X$process(reuters_tab[1][["text"]])
y <- corenlp_parse_xml(reuters_xml)

CoNLL

C <- StanfordCoreNLP$new(properties = props_file, output_format = "conll")
reuters_conll <- C$process(reuters_tab[1][["text"]])

Customized AnnotationPipelines

[to be written]

Scenario: Adding annotation layers to tokenized text

A somewhat advanced yet common scenario is to add an annotation layer to a corpus that has already been tokenized. The corpus might be stored in some kind of database or corpus management system, and it is essential not to change the sequence of tokens. Based on the previous building blocks, we present two workflows that maintain the tokenization and the sequence of tokens as is.

Option 1: Whitespace tokenization

The first workflow builds on the CoreNLP whitespace tokenizer. Using this most basic of tokenizers is very useful if the data has already been tokenized.

To walk through this approach, we use the tokenized version of the REUTERS corpus that is included as sample data in the package. Your data may look different, but if you work with tokenized data, it will be easy to generate this tabular data format.

DT::datatable(reuters_dt)

To prepare adding an annotation layer, we generate whitespace-separated strings, one for each document in the corpus. We split the table based on the document id (column 'doc_id') and concatenate the tokens (column 'word') in each table in the resulting list to one string.

ts_ws <- lapply(
  split(reuters_dt, f = reuters_dt[["doc_id"]]),
  function(tab) paste(tab[["word"]], collapse = " ")
)

To keep things simple (and fast), we will just add a sentence annotation here. In real-life scenarios, you might want to add annotations further down the NLP road such as POS tagging, NER etc.

The crucial step is to ensure that a plain and simple whitespace tokenizer is used. We can do so using the properties that configure the NLP pipeline: setting the property "tokenize.whitespace" to "true" will do the job. We then instantiate the pipeline with these properties.

properties_list <- list(
  "annotators" = "tokenize, ssplit",
  "tokenize.whitespace" = "true"
)
Pipe <- StanfordCoreNLP$new(properties = properties_list, output_format = "conll")

The rest is not new: we turn the vector of whitespace-separated documents into an AnnotationList, run the pipeline on the Annotation objects (in parallel) and convert the annotated data into a data.table.

annoli <- AnnotationList$new(ts_ws)
Pipe$annotate(annoli)
reuters_dt_v2 <- annoli$as.data.table()

We now have a table with the same number of rows (i.e. the same number of tokens) as the (tokenized) input data. It should only be a technicality to write the annotation layers that have been added back to your data source (i.e. the database or corpus management system).
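
As a sketch of that final step: because the number and order of tokens is unchanged, added columns can simply be attached to the original table. The column name "sentence" used here is hypothetical and depends on what $as.data.table() actually returns.

stopifnot(nrow(reuters_dt) == nrow(reuters_dt_v2)) # same number of tokens
# Attach a newly added column back to the original table; "sentence" is a
# hypothetical column name for the sentence index added by the pipeline.
reuters_dt[, sentence := reuters_dt_v2[["sentence"]]]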

Option 2: Turn tabular data into Annotation object

There is a second workflow for working with pre-tokenized data that avoids generating a whitespace-concatenated string. That intermediate step may appear somewhat inefficient: wouldn't it be smart to generate the Java Annotation objects directly?

Indeed, skipping the generation of a concatenated, whitespace-delimited string could potentially speed things up. However, the current implementation of instantiating Java Annotation objects requires going back and forth between R and Java for every single token. This is very inefficient and degrades performance dramatically. We therefore present this approach as a proof of concept rather than something for real-life scenarios.

anno_objs <- lapply(
  split(reuters_dt, by = "doc_id"),
  function(dt_sub){
    df <- data.frame(word = as.data.frame(dt_sub)[, "word"])
    as.Annotation(df)
  }
)
anno_list <- AnnotationList$new()
anno_list$obj <- rJava::.jnew("java.util.Arrays")$asList(rJava::.jarray(anno_objs))
Pipe$annotate(anno_list)
reuters_dt_v3 <- anno_list$as.data.table()

To explain: we generate Annotation objects from the tabular input data and turn the list of Annotation objects into a proper Java List, which is assigned to a new R AnnotationList object. This object is then annotated using the pipeline we instantiated for the first option.

This code is not run when the vignette is prepared because it is slow. If you run it yourself, you will see that it is indeed slow. So we recommend using the first approach.

Discussion

Enjoy!

Appendix: Installing Dependencies

Installing CoreNLP

A good and conventional place to install a tool such as CoreNLP on Linux and macOS is the /opt directory. From a terminal, create a directory for CoreNLP, change into it, download the zipped jar files, unzip the archive, and remove the zip file. Note that sudo rights may be necessary to write into the /opt directory.

```{sh download_corenlp, eval = FALSE}
mkdir /opt/stanford-corenlp
cd /opt/stanford-corenlp
wget http://nlp.stanford.edu/software/stanford-corenlp-4.2.0.zip
unzip stanford-corenlp-4.2.0.zip
rm stanford-corenlp-4.2.0.zip
```

Install language model for English

```{sh download_english_model, eval = FALSE}
wget http://nlp.stanford.edu/software/stanford-corenlp-4.2.0-models-english.jar
```

Install further language models

We illustrate how to get models for a specific language using German as an example. We change into the CoreNLP directory and download the model to this place.

```{sh download_german_model, eval = FALSE}
cd stanford-corenlp-4.2.0
wget http://nlp.stanford.edu/software/stanford-corenlp-4.2.0-models-german.jar
```

The jar file with the model includes a default properties file for processing German data (StanfordCoreNLP-german.properties). You can see this by displaying the content of the jar as follows.

```{sh inspect_jar, eval = FALSE}
jar tf stanford-corenlp-4.2.0-models-german.jar | grep "properties" # see content of jar
```

The inclusion of this properties file in the jar may become a problem: if you want to configure the parser yourself, the properties file included in the jar may override any other properties you want to use.

The solution we found to work is to (a) extract the properties file from the jar and (b) remove it from the jar.

```{sh modify_model_jar, eval = FALSE}
# extract StanfordCoreNLP-german.properties from the jar
unzip stanford-corenlp-4.2.0-models-german.jar StanfordCoreNLP-german.properties

# remove StanfordCoreNLP-german.properties from the jar
zip -d stanford-corenlp-4.2.0-models-german.jar StanfordCoreNLP-german.properties
```

Note that the bignlp package already includes a properties file that has been edited for annotating large amounts of data quickly. You will find it as follows:

options(bignlp.properties_file = corenlp_get_properties_file(lang = "de", fast = TRUE))

