R/RcppCWB_package.R

#' Rcpp Bindings for the Corpus Workbench (CWB).
#' 
#' @description 
#' The \code{RcppCWB} package is a wrapper library to expose core functions of
#' the \code{Open Corpus Workbench} (CWB). This includes the low-level
#' functionality of the \code{Corpus Library} (CL) as well as capacities to use
#' the query syntax of the \code{Corpus Query Processor} (CQP).
#' 
#' @section The Idea Behind RcppCWB:
#' 
#'   The \code{Open Corpus Workbench} (CWB) is an indexing and querying engine
#'   popular in corpus-assisted research. Its core aim is to support working
#'   efficiently with large, structurally and linguistically annotated corpora.
#'   First of all, the CWB includes tools to index and compress corpora. Second,
#'   the \code{Corpus Library} (CL) offers low-level functionality to retrieve
#'   information from CWB indexed corpora. Third, the \code{Corpus Query
#'   Processor} (CQP) offers a syntax for running anything from simple to
#'   complex queries that draw on the different annotation layers of corpora.
#' 
#'   The CWB is a long-established tool that has inspired a number of related
#'   developments. A persisting advantage of the CWB is its mature, open source
#'   code base, which is actively maintained by a community of developers. It
#'   serves as a robust and efficient backend for widely used tools such as TXM
#'   (\url{https://txm.gitpages.huma-num.fr/textometrie/}) and CQPweb
#'   (\url{https://cwb.sourceforge.io/cqpweb.php}). Its C implementation is
#'   fast and, at the same time, well suited for integration with R.
#'
#'   The package \code{RcppCWB} is a follow-up to the \code{rcqp} package,
#'   which pioneered exposing CWB functionality from within R. Indeed, the
#'   \code{rcqp} package, published on CRAN in 2015, offers robust access to CWB
#'   functionality. However, the "pure C" implementation of the \code{rcqp}
#'   package makes it difficult to port the package to Windows. The primary
#'   purpose of the \code{RcppCWB} package is to reimplement a wrapper library
#'   for the CWB using a design that makes it easier to achieve cross-platform
#'   portability.
#'   
#'   Even though \code{RcppCWB} functions may be used directly, the package is
#'   designed to serve as an interface to CWB indexed corpora for packages with
#'   higher-level functionality. In this regard, \code{RcppCWB} is the backend
#'   of the \code{polmineR} package, but it is deliberately open to use in other
#'   contexts. The package may encourage the use of linguistically annotated,
#'   indexed and compressed corpora on all platforms, and the paradigm of
#'   working with text as linguistic data may benefit from \code{RcppCWB}.
#' 
#' @section Implementation:
#' 
#'   When building the package, the first step is to compile the relevant parts
#'   of the CWB on Linux and macOS machines. On Windows, cross-compiled binaries
#'   are downloaded from a GitHub repository of the PolMine Project
#'   (\url{https://github.com/PolMine/libcl}). Second, \code{Rcpp} wrappers are
#'   compiled and make the relevant functions of the Corpus Library and CQP
#'   accessible. In addition to genuine CWB functions, \code{RcppCWB} offers a
#'   set of higher-level functions implemented using \code{Rcpp} for common
#'   performance-critical tasks.
#'
#' 
#' @section Getting Started with RcppCWB:
#' 
#'   To understand the data storage model of the CWB, in particular the notions
#'   of positional and structural attributes (p- and s-attributes), the vignette
#'   of the \code{rcqp} package is a very good starting point (see references).
#'
#'   The CWB 'Corpus Encoding Tutorial' explains how to create your own corpus,
#'   the 'CQP Query Language Tutorial' introduces the syntax of CQP (see
#'   references).
#'
#'   The \code{RcppCWB} package includes a sample corpus (REUTERS, the data also
#'   included in the \code{tm} package). The examples in the documentation
#'   of the functions may be a good starting point to understand how to use
#'   \code{RcppCWB}.
#'   
#' @section Digging Deeper: 
#' 
#'   The original paper of Christ (1994) explains the design choices of the CWB.
#'   The indexing and compression techniques of the CWB (Huffman coding) are
#'   explained in Witten et al. (1999).
#'   
#' @section Acknowledgements:
#' 
#'   The work of all the developers of the CWB is gratefully acknowledged. There
#'   is a particular intellectual debt to Bernard Desgraupes and Sylvain
#'   Loiseau, and the \code{rcqp} package they developed as the original R
#'   wrapper to expose the functionality of the CWB.
#' 
#' @references 
#' Christ, O. 1994. "A modular and flexible architecture for an integrated
#' corpus query system", in: Proceedings of COMPLEX '94, pp. 23-32. Budapest.
#' Available online at \url{https://cwb.sourceforge.io/files/Christ1994.pdf}
#' 
#' Desgraupes, B.; Loiseau, S. 2012. Introduction to the rcqp package.
#' Vignette of the rcqp package. Available at the CRAN archive at
#' \url{https://cran.r-project.org/src/contrib/Archive/rcqp/}
#' 
#' Evert, S. 2005. The CQP Query Language Tutorial. Available online at
#' \url{https://cwb.sourceforge.io/files/CQP_Tutorial.pdf}
#' 
#' Evert, S. 2005. The IMS Open Corpus Workbench (CWB). Corpus Encoding
#' Tutorial. Available online at
#' \url{https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial.pdf}
#' 
#' Open Corpus Workbench (\url{https://cwb.sourceforge.io})
#' 
#' Witten, I.H.; Moffat, A.; Bell, T.C. 1999. Managing Gigabytes. Morgan
#' Kaufmann Publishing, San Francisco, 2nd edition.
#' 
#' 
#' @keywords package
#' @docType package
#' @rdname RcppCWB-package
#' @aliases RcppCWB RcppCWB-package
#' @name RcppCWB-package
#' @useDynLib RcppCWB, .registration = TRUE
#' @importFrom Rcpp evalCpp
#' @exportPattern "^[[:alpha:]]+"
#' @author Andreas Blaette (andreas.blaette@@uni-due.de)
#' @examples 
#' # Functions of the Corpus Library (names starting with cl_) provide
#' # low-level access to CWB indexed corpora
#' 
#' ids <- cl_cpos2id("REUTERS", cpos = 1:20, p_attribute = "word", registry = get_tmp_registry())
#' tokens <- cl_id2str("REUTERS", id = ids, p_attribute = "word", registry = get_tmp_registry())
#' print(paste(tokens, collapse = " "))
#' 
#' # To use the Corpus Query Processor (CQP) and its syntax, CQP needs to be
#' # initialized first (example: get concordances of 'oil')
#' 
#' if (!cqp_is_initialized()) cqp_initialize(registry = get_tmp_registry())
#' cqp_query("REUTERS", query = '[]{5} "oil" []{5}')
#' cpos_matrix <- cqp_dump_subcorpus("REUTERS")
#' concordances_oil <- apply(
#'   cpos_matrix, 1,
#'   function(row){
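#'     # decode the token ids for the corpus positions of one match
#'     ids <- cl_cpos2id(
#'       "REUTERS", p_attribute = "word", cpos = row[1]:row[2],
#'       registry = get_tmp_registry()
#'     )
#'     # turn the token ids into strings
#'     tokens <- cl_id2str(
#'       "REUTERS", p_attribute = "word", id = ids,
#'       registry = get_tmp_registry()
#'     )
#'     paste(tokens, collapse = " ")
#'   }
#' )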
NULL

RcppCWB documentation built on Sept. 24, 2024, 1:08 a.m.