Writing performance code with RcppCWB

Rationale

The RcppCWB package exposes the functionality of the Corpus Workbench (CWB) to R, so that R users can benefit from the performance of the C code of the CWB. Ease of use and performance should be great most of the time. But there are scenarios when the interface between R and C/C++ is a bottleneck for achieving sufficient performance. In this case, using CWB functionality in C++ functions exposed to R using Rcpp::cppFunction() or Rcpp::sourceCpp() may solve issues with performance and memory limitations.

Basics

Writing C++ functions that use CWB functionality requires loading the Rcpp and RcppCWB package.

library(Rcpp)
library(RcppCWB)

We need to be aware that the default functions for accessing the CWB functionality involve passing length-one character vectors used for looking up the C representation of structural or positional attributes for corpora that have been loaded. It is more efficient to perform this lookup only once. Following this rationale, a set of functions exposes CWB functionality closer to the C logic, passing pointers to attributes that have been looked up.

This functionality can also be used from R. For instance, we look up the p-attribute "word" of the "REUTERS" corpus as follows.

p_attr_word <- p_attr(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry()
)

And we use cpos_to_str() to decode the first words of the corpus.

cpos_to_str(p_attr_word, 0:10)

While this may also be useful when writing R code, this lower-level functionality is particularly well-suited for writing high-performance C++ code exposed to R.

Inline C++ functions

We start with a first simple scenario, which uses cppFunction() to source an inline C++ function in an R session.

cppFunction(
  'Rcpp::StringVector get_str(SEXP corpus, SEXP p_attribute, SEXP registry, Rcpp::IntegerVector cpos){
     SEXP attr;
     Rcpp::StringVector result;
     attr = RcppCWB::p_attr(corpus, p_attribute, registry);
     result = RcppCWB::cpos_to_str(attr, cpos);
     return(result);
  }',
  depends = "RcppCWB"
)

This is not a very interesting example, but using the function works:

get_str("REUTERS", "word", RcppCWB::get_tmp_registry(), 0:50)

Source C++ function

To provide a more interesting real-life example, we demonstrate a solution to the following scenario: It may be necessary to decode an entire corpus, and to write the tokens of corpus regions to a file in a line-by-line manner. Computing word embeddings may require this input format, for instance.

But if the corpus is really large, decoding the corpus entirely and then writing everything to disk may hit memory limitations. Decoding the tokens of the corpus successively and writing content to the output file on the spot is an obvious solution, but moving data between the R/C++/C interface for every single token is excessively slow. A pure C++ implementation will be much more effective.

The following C++ file that relies on CWB functions as exposed by RcppCWB addresses the scenario.

```{Rcpp, file = system.file(package = "RcppCWB", "cpp", "fastdecode.cpp"), eval = FALSE}

This code can be sourced, compiled and exposed to R using `sourceCpp()`.

```r
sourceCpp(file = system.file(package = "RcppCWB", "cpp", "fastdecode.cpp"))

We exemplify that everything works as intended using the (smallish) REUTERS corpus. So we create the output ...

outfile <- tempfile(fileext = ".txt")

write_token_stream(
  corpus = "REUTERS",
  p_attribute = "word", 
  s_attribute = "id",
  attribute_type = "s",
  registry = RcppCWB::get_tmp_registry(),
  filename = outfile
)

... and read it (showing the content selectively) to convey that the corpus data has been exported as intended.

readLines(outfile) |>
  lapply(substr, 1, 75) |>
  unlist()

Moving ahead

Writing C++ functions is obviously more demanding than writing R code. But using CWB functionality as exposed by RcppCWB in C++ functions that can be used from R may be a great solution to performance and memory issues. Rcpp brings writing C++ code much closer to what R users are acquainted with, making writing high-performance C++ close much easier. So we encourage considering this option when pure R solutions are not fast enough.



Try the RcppCWB package in your browser

Any scripts or data that you put into this service are public.

RcppCWB documentation built on July 9, 2023, 7:40 p.m.