Tools to Create, Modify and Manage CWB Corpora.


The Corpus Workbench (CWB) offers a classic approach to working with large, linguistically and structurally annotated corpora. Its design keeps memory use low and makes queries fast (Evert and Hardie 2011). Technically, the CWB (Christ 1994) implements the indexing and compression of corpora suggested by Witten et al. (1999).
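The indexing idea mentioned above can be illustrated with a minimal sketch. It is a conceptual illustration only, not the CWB's actual binary file format, and it leaves out compression entirely. The function names `encode_tokens` and `positions_of` are hypothetical and are not part of the CWB, RcppCWB, or cwbtools. The sketch stores each distinct token once in a lexicon, represents the corpus as a sequence of integer ids, and keeps a list of corpus positions per id so that lookups do not need to scan the text.

```python
from collections import defaultdict


def encode_tokens(tokens):
    """Encode a token stream as integer ids, a lexicon and a positional index.

    Hypothetical helper for illustration; not the CWB's on-disk format.
    """
    lexicon = {}               # token string -> integer id
    ids = []                   # the corpus as a sequence of integer ids
    index = defaultdict(list)  # integer id -> corpus positions of that token
    for position, token in enumerate(tokens):
        # New tokens receive the next free id; known tokens reuse their id.
        token_id = lexicon.setdefault(token, len(lexicon))
        ids.append(token_id)
        index[token_id].append(position)
    return lexicon, ids, dict(index)


def positions_of(token, lexicon, index):
    """Return all corpus positions of a token using the index only."""
    token_id = lexicon.get(token)
    return [] if token_id is None else index[token_id]


tokens = "the cat sat on the mat".split()
lexicon, ids, index = encode_tokens(tokens)
print(ids)                                  # [0, 1, 2, 3, 0, 4]
print(positions_of("the", lexicon, index))  # [0, 4]
```

Because repeated tokens are stored as small integers and looked up through the index, this kind of layout is what makes compact storage and fast queries possible in principle.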

The C implementation of the CWB is mature and efficient. However, the convenience and flexibility of the traditional CWB command line tools are limited. These tools are not portable across platforms, which undermines the ideal of reproducible research.

The 'cwbtools' package combines portable, pure R tools for creating indexed corpus files with convenience wrappers for the original C implementation of the CWB as exposed by the RcppCWB package. Additional functionality to add and modify annotations of corpora from within R makes working with CWB indexed corpora much more flexible. "Pure R" workflows that enrich corpora with annotations, whether produced by standard NLP tools or created manually, can be implemented seamlessly and conveniently.

The cwbtools package is a companion of the RcppCWB and polmineR packages. It is a building block of an infrastructure that supports combining quantitative and qualitative approaches when working with textual data.


Andreas Blaette


Christ, Oliver (1994): "A Modular and Flexible Architecture for an Integrated Corpus Query System". In: Proceedings of COMPLEX'94, pp. 23-32.

Evert, Stefan and Andrew Hardie (2011): "Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium". In: Proceedings of the Corpus Linguistics 2011 conference, University of Birmingham, UK.

Witten, Ian H., Alistair Moffat and Timothy C. Bell (1999): Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd edition. San Francisco et al.: Morgan Kaufmann.

