corpus_utils: Install and manage corpora.

corpus_install    R Documentation

Install and manage corpora.

Description


Utility functions to assist the installation and management of indexed CWB corpora.

Usage


corpus_install(
  pkg = NULL,
  repo = "",
  tarball = NULL,
  doi = NULL,
  checksum = NULL,
  lib = .libPaths()[1],
  registry_dir,
  corpus_dir,
  ask = interactive(),
  load = TRUE,
  verbose = TRUE,
  user = NULL,
  password = NULL,
  ...
)


corpus_packages()

corpus_rename(
  old,
  new,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

corpus_remove(corpus, registry_dir, ask = interactive(), verbose = TRUE)

corpus_as_tarball(corpus, registry_dir, data_dir, tarfile, verbose = TRUE)

corpus_copy(
  corpus,
  registry_dir,
  data_dir = NULL,
  registry_dir_new = file.path(normalizePath(tempdir(), winslash = "/"), "cwb",
    "registry", fsep = "/"),
  data_dir_new = file.path(normalizePath(tempdir(), winslash = "/"), "cwb",
    "indexed_corpora", tolower(corpus), fsep = "/"),
  remove = FALSE,
  verbose = interactive(),
  progress = TRUE
)

corpus_recode(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  data_dir = registry_file_parse(corpus, registry_dir)[["home"]],
  skip = character(),
  to = c("latin1", "UTF-8"),
  verbose = TRUE
)

corpus_testload(
  corpus,
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  verbose = TRUE
)

corpus_get_version(corpus, registry_dir = Sys.getenv("CORPUS_REGISTRY"))



Arguments

pkg
  Name of the data package.

repo
  URL of the repository.

tarball
  URL, S3-URI or local filename of a tarball with a CWB indexed corpus. If NULL (default) and argument doi is stated, the location of the corpus tarball will be derived from the DOI.

doi
  The DOI (Digital Object Identifier) of a corpus deposited at Zenodo (e.g. "10.5281/zenodo.3748858").

checksum
  A length-one character vector with an MD5 checksum used to check the integrity of a downloaded tarball. If the tarball is downloaded from Zenodo by stating a DOI (argument doi), the checksum included in the metadata for the record is used for the check.

lib
  Directory for R packages, defaults to .libPaths()[1].

registry_dir
  The corpus registry directory. If missing, the result of cwb_registry_dir().

corpus_dir
  The directory that contains the data directories of indexed corpora. If missing, the value of cwb_corpus_dir() will be used.

ask
  A logical value, whether to ask the user for confirmation before removing a corpus.

load
  A logical value, whether to load the corpus after installation.

verbose
  Logical, whether to be verbose.

user
  A user name that can be specified to download a corpus from a password-protected site.

password
  A password that can be specified to download a corpus from a password-protected site.

...
  Further parameters that will be passed into install.packages if argument tarball is NULL, or into download.file if tarball is specified.

old
  Name of the (old) corpus.

new
  Name of the (new) corpus.

corpus
  The ID of a CWB indexed corpus (in upper case).

data_dir
  The data directory where the files of the CWB corpus live.

tarfile
  Filename of the tarball.

registry_dir_new
  Target directory for (new) registry files.

data_dir_new
  Target directory for corpus files.

remove
  A logical value, whether to remove the original files after having created the copy.

progress
  Logical, whether to show a progress bar.

skip
  A character vector with s-attributes to skip.

to
  Character string describing the target encoding of the corpus.
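
The integrity check behind the checksum argument can be illustrated with base R alone. The following is a minimal sketch, not the cwbtools implementation: the helper verify_md5() and the stand-in file are hypothetical, and only tools::md5sum() is assumed to be available.

```r
# Hypothetical helper, not part of cwbtools: compare the MD5 digest of a
# file with an expected value before the file is unpacked.
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Stand-in for a downloaded tarball (assumption: any local file will do
# for the purpose of this illustration).
demo_file <- tempfile(fileext = ".tar.gz")
writeBin(charToRaw("corpus data"), demo_file)

# A matching digest passes, an arbitrary digest does not.
expected <- unname(tools::md5sum(demo_file))
md5_ok <- verify_md5(demo_file, expected)
md5_bad <- verify_md5(demo_file, "00000000000000000000000000000000")
```

A check of this kind would typically run after the download and before extraction, so that a corrupted tarball never reaches the registry or data directories.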


Details

A CWB corpus consists of a set of binary files with corpus data, kept together in a data directory, and a registry file, which is a plain text file that details the corpus id, corpus properties, and structural and positional attributes. The registry file also specifies the path to the corpus data directory. Typically, the registry directory and a corpus directory with the data directories for individual corpora are within one parent folder (which might be called "cwb" by default). See the following stylized directory structure.

  cwb/
  |- registry/
  |  |- corpus1
  |  +- corpus2
  +- indexed_corpora/
     |- corpus1/
     |  |- file1
     |  |- file2
     |  +- file3
     +- corpus2/
        |- file1
        |- file2
        +- file3
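
To make the layout concrete, the following sketch builds the same structure in a temporary directory using base R only. The folder name "cwb_demo" and the corpus id "mycorpus" are assumptions chosen for illustration, and the two lines written to the registry file are a placeholder rather than a complete registry file.

```r
# Assumed names: "cwb_demo" as parent folder, "mycorpus" as corpus id.
cwb_dir <- file.path(
  normalizePath(tempdir(), winslash = "/"), "cwb_demo", fsep = "/"
)
registry_dir <- file.path(cwb_dir, "registry", fsep = "/")
data_dir <- file.path(cwb_dir, "indexed_corpora", "mycorpus", fsep = "/")

dir.create(registry_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

# The registry file is plain text and is named after the corpus (lower
# case). Among other things, it records the path to the data directory.
# Placeholder content only, not a complete registry file.
registry_file <- file.path(registry_dir, "mycorpus", fsep = "/")
writeLines(c("ID mycorpus", paste("HOME", data_dir)), registry_file)

# Relative paths of everything below the parent folder
layout <- list.files(cwb_dir, recursive = TRUE, include.dirs = TRUE)
```

Keeping the registry and the data directories under one parent folder means that a corpus can be moved or archived as a unit, as long as the path recorded in the registry file is updated.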

The corpus_install function will assist the installation of a corpus. The following scenarios are offered:

  • If argument tarball is a local tarball, the tarball will be extracted and files will be moved.

  • If tarball is a URL, the tarball will be downloaded from the online location. It is possible to state user credentials using the arguments user and password. Then the aforementioned installation (scenario 1) is executed. If argument pkg is the name of an installed package, corpus files will be moved into this package.

  • If argument doi is a Digital Object Identifier (DOI), the URL from which a corpus tarball can be downloaded is derived from the information available at that location. The tarball is downloaded and the corpus installed. If argument pkg is defined, files will be moved into an R package; otherwise the system registry and corpus directories are used. Note that at this stage, it is assumed that the DOI has been awarded by Zenodo.

  • If argument pkg is provided and specifies an R package (and tarball is NULL), the corpus package available at a CRAN-style repository specified by argument repo will be installed. Internally, the install.packages function is called and further arguments can be passed into this function call. This can be used to pass user credentials, e.g. by adding method = "wget", extra = "--user donald --password duck".
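
The four scenarios can be read as a decision on the arguments pkg, tarball and doi. The sketch below mirrors that reading with a hypothetical helper, install_scenario(), which is not part of cwbtools. The order of the checks and the URL pattern are assumptions made for this illustration, and "mycorpuspkg" and the example.org URL are placeholder values.

```r
# Hypothetical helper, not part of cwbtools: name the installation
# scenario that a combination of arguments would fall into.
# Assumption: tarball takes precedence over doi, and doi over pkg.
install_scenario <- function(pkg = NULL, tarball = NULL, doi = NULL) {
  if (!is.null(tarball)) {
    # Assumption: remote locations are recognised by their URL scheme.
    if (grepl("^(https?|ftp|s3)://", tarball)) {
      "remote tarball"
    } else {
      "local tarball"
    }
  } else if (!is.null(doi)) {
    "doi"
  } else if (!is.null(pkg)) {
    "package from repository"
  } else {
    stop("one of 'pkg', 'tarball' or 'doi' is required")
  }
}

s_local <- install_scenario(tarball = "/path/to/corpus.tar.gz")
s_remote <- install_scenario(tarball = "https://example.org/corpus.tar.gz")
s_doi <- install_scenario(doi = "10.5281/zenodo.3748858")
s_pkg <- install_scenario(pkg = "mycorpuspkg")
```

In the scenarios that involve a tarball or a DOI, pkg additionally decides where the files end up, so the helper classifies the source of the corpus rather than its destination.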

If the corpus to be installed is already available, a dialogue will ask the user whether an existing corpus shall be deleted and installed anew, if argument ask is TRUE.

corpus_packages will detect the packages that include CWB corpora. Note that the directory structure of all installed packages is evaluated, which may be slow on network-mounted file systems.
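
A scan of this kind can be approximated with base R. The sketch below is a stand-in, not the code used by corpus_packages. It assumes that a corpus package keeps its registry under "extdata/cwb/registry", as the RcppCWB sample data referenced in the examples does.

```r
# Names of all installed packages across the library paths
pkgs <- unique(rownames(utils::installed.packages()))

# system.file() returns "" when the requested path does not exist, so
# nzchar() flags the packages that ship a CWB registry directory.
has_registry <- vapply(
  pkgs,
  function(p) nzchar(system.file("extdata", "cwb", "registry", package = p)),
  logical(1)
)
corpus_pkgs <- pkgs[has_registry]
```

Every installed package triggers at least one file system lookup, which is why a scan like this gets slow when libraries sit on a network drive.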

corpus_rename will rename a corpus, affecting the name of the registry file, the corpus id, and the name of the directory where data files reside.

corpus_remove() can be used to delete a corpus.

corpus_as_tarball will create a tarball (.tar.gz-file) with two subdirectories. The 'registry' subdirectory will host the registry file for the tarred corpus. The data files will be put in a subdirectory with the corpus name in the 'indexed_corpora' subdirectory.
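
The tarball layout described above can be reproduced with utils::tar(). This is a sketch under assumptions and not the implementation of corpus_as_tarball: the folder "tar_demo", the corpus id "mycorpus" and the file contents are placeholders.

```r
# Assumed names: "tar_demo" as working folder, "mycorpus" as corpus id.
tmp <- normalizePath(tempdir(), winslash = "/")
src <- file.path(tmp, "tar_demo", fsep = "/")
dir.create(file.path(src, "registry"), recursive = TRUE, showWarnings = FALSE)
dir.create(
  file.path(src, "indexed_corpora", "mycorpus"),
  recursive = TRUE, showWarnings = FALSE
)
writeLines("ID mycorpus", file.path(src, "registry", "mycorpus"))
writeLines("placeholder", file.path(src, "indexed_corpora", "mycorpus", "file1"))

tarfile <- file.path(tmp, "mycorpus.tar.gz", fsep = "/")

# Archive relative paths so that 'registry' and 'indexed_corpora' sit at
# the top level of the tarball.
old_wd <- setwd(src)
utils::tar(
  tarfile,
  files = c("registry", "indexed_corpora"),
  compression = "gzip", tar = "internal"
)
setwd(old_wd)

# List the archive members without extracting them
entries <- utils::untar(tarfile, list = TRUE, tar = "internal")
```

Using R's internal tar implementation avoids a dependency on an external tar binary, at the cost of the limitations documented for utils::tar().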

corpus_copy will create a copy of a corpus (useful for experimental modifications, for instance).

corpus_get_version parses the registry file and derives the corpus version number from the corpus properties. The return value is an object of class numeric_version. The corpus version is expected to follow semantic versioning (three numeric components, e.g. '0.8.1'). If the corpus version has another format or if it is not available, the return value is NA.
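
The behaviour described for corpus_get_version can be sketched with base R. The helper parse_corpus_version() is hypothetical and not part of cwbtools. It assumes that the version is stored as a corpus property line of the form ##:: version = "0.8.1" in the registry file.

```r
# Hypothetical helper, not part of cwbtools: extract a semantic version
# from the lines of a registry file, or return NA.
parse_corpus_version <- function(registry_lines) {
  # Assumption: corpus properties are lines starting with "##::".
  hit <- grep("^##::\\s*version\\s*=", registry_lines, value = TRUE)
  if (length(hit) == 0L) return(NA)
  value <- sub('^##::\\s*version\\s*=\\s*"?([^"]*)"?\\s*$', "\\1", hit[1])
  # Accept only three numeric components separated by dots
  if (!grepl("^[0-9]+\\.[0-9]+\\.[0-9]+$", value)) return(NA)
  numeric_version(value)
}

v_ok <- parse_corpus_version(c("ID reuters", '##:: version = "0.8.1"'))
v_missing <- parse_corpus_version(c("ID reuters"))
v_odd <- parse_corpus_version(c('##:: version = "v1"'))
```

Returning a numeric_version object makes comparisons such as v_ok >= "0.8.0" work as expected, which plain string comparison would not guarantee.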


Value

Logical value TRUE if installation has been successful, or FALSE if not.

See Also

For managing registry files (for instance when switching to a packaged corpus), see registry_file_parse.


Examples

# Copy the REUTERS sample corpus shipped with RcppCWB to the default
# target directories below tempdir()
registry_file_new <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "registry", "reuters", fsep = "/"
)
if (file.exists(registry_file_new)) file.remove(registry_file_new)
corpus_copy(
  corpus = "REUTERS",
  registry_dir = system.file(package = "RcppCWB", "extdata", "cwb", "registry"),
  data_dir = system.file(
    package = "RcppCWB",
    "extdata", "cwb", "indexed_corpora", "reuters"
  )
)
# Remove the copy again
unlink(file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", fsep = "/"),
  recursive = TRUE)
corpus <- "REUTERS"
pkg <- "RcppCWB"
s_attr <- "places"
Q <- '"oil"'

registry_dir_src <- system.file(package = pkg, "extdata", "cwb", "registry")
data_dir_src <- system.file(package = pkg, "extdata", "cwb", "indexed_corpora", tolower(corpus))

registry_dir_tmp <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "registry", fsep = "/"
)
registry_file_tmp <- file.path(registry_dir_tmp, tolower(corpus), fsep = "/")
data_dir_tmp <- file.path(
  normalizePath(tempdir(), winslash = "/"),
  "cwb", "indexed_corpora", tolower(corpus), fsep = "/"
)

if (file.exists(registry_file_tmp)) file.remove(registry_file_tmp)
if (!dir.exists(data_dir_tmp)){
   dir.create(data_dir_tmp, recursive = TRUE)
} else {
  if (length(list.files(data_dir_tmp)) > 0L)
    file.remove(list.files(data_dir_tmp, full.names = TRUE))
}

# Copy the corpus to the temporary registry and data directories
corpus_copy(
  corpus = corpus,
  registry_dir = registry_dir_src,
  data_dir = data_dir_src,
  registry_dir_new = registry_dir_tmp,
  data_dir_new = data_dir_tmp
)

RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)

# Recode the copy to UTF-8
corpus_recode(
  corpus = corpus,
  registry_dir = registry_dir_tmp,
  data_dir = data_dir_tmp,
  to = "UTF-8"
)

RcppCWB::cl_delete_corpus(corpus = corpus, registry = registry_dir_tmp)
RcppCWB::cl_charset_name(corpus = corpus, registry = registry_dir_tmp)

n_strucs <- RcppCWB::cl_attribute_size(
  corpus = corpus, attribute = s_attr, attribute_type = "s", registry = registry_dir_tmp
)
strucs <- 0L:(n_strucs - 1L)
struc_values <- RcppCWB::cl_struc2str(
  corpus = corpus, s_attribute = s_attr, struc = strucs, registry = registry_dir_tmp
)
speakers <- unique(struc_values)

Sys.setenv("CORPUS_REGISTRY" = registry_dir_tmp)
if (RcppCWB::cqp_is_initialized()) RcppCWB::cqp_reset_registry() else RcppCWB::cqp_initialize()
RcppCWB::cqp_query(corpus = corpus, query = Q)
cpos <- RcppCWB::cqp_dump_subcorpus(corpus = corpus)
ids <- RcppCWB::cl_cpos2id(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, cpos = cpos
)
str <- RcppCWB::cl_id2str(
  corpus = corpus, p_attribute = "word", registry = registry_dir_tmp, id = ids
)

unlink(file.path(normalizePath(tempdir(), winslash = "/"), "cwb", fsep = "/"), recursive = TRUE)

cwbtools documentation built on May 15, 2022, 1:06 a.m.