cl_lexicon_size: Get Lexicon Size.
In RcppCWB: 'Rcpp' Bindings for the 'Corpus Workbench' ('CWB')

cl_lexicon_size

R Documentation

Get Lexicon Size.

Description

Get the number of unique tokens (and hence of token ids) in the lexicon of a positional attribute. Note that token ids are zero-based: when iterating through token ids, start at 0; the largest valid id is cl_lexicon_size() minus 1.

Usage

cl_lexicon_size(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))

Arguments

`corpus`	name of a CWB corpus (upper case)
`p_attribute`	name of positional attribute
`registry`	path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY
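
Because the default for `registry` is read from the environment, an unset CORPUS_REGISTRY yields an empty string. The following is a minimal sketch of a guard that checks the value before it is used, written in base R only. The helper name `resolve_registry` is hypothetical and not part of RcppCWB.

```r
# Hypothetical helper (not part of RcppCWB): validate the registry path
# before passing it on to functions such as cl_lexicon_size().
resolve_registry <- function(registry = Sys.getenv("CORPUS_REGISTRY")) {
  # Sys.getenv() returns "" when the variable is unset.
  if (!nzchar(registry)) {
    stop("CORPUS_REGISTRY is not set; pass `registry` explicitly.")
  }
  if (!dir.exists(registry)) {
    stop("Registry directory does not exist: ", registry)
  }
  registry
}
```

The returned path could then be supplied as the `registry` argument.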

Examples

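library(RcppCWB)

# Number of unique tokens in the lexicon of the "word" attribute
lexicon_size <- cl_lexicon_size(
  "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry()
)

# Token ids are zero-based, so they run from 0 to lexicon_size - 1
token_ids <- seq.int(from = 0, to = lexicon_size - 1)

# Decode every id to its string, i.e. retrieve the full lexicon
cl_id2str(
  "REUTERS",
  p_attribute = "word",
  id = token_ids,
  registry = get_tmp_registry()
)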
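
R vectors are one-based while token ids are zero-based, so when strings are stored in id order, the string for id i sits at position i + 1. The sketch below illustrates that offset with a made-up lexicon so that it runs without a corpus. The vector `lexicon` and the helper `id_to_str` are hypothetical stand-ins, not part of RcppCWB.

```r
# Hypothetical stand-in for the strings of a lexicon, in id order (ids 0..n-1).
lexicon <- c("oil", "price", "barrel")
lexicon_size <- length(lexicon)

# seq_len() yields an empty vector for an empty lexicon,
# whereas seq.int(0, -1) would count down and return c(0, -1).
token_ids <- seq_len(lexicon_size) - 1L

# Hypothetical helper: look up strings by zero-based id in a one-based R vector.
id_to_str <- function(ids) lexicon[ids + 1L]

id_to_str(0L)                 # string for the first id
id_to_str(lexicon_size - 1L)  # string for the largest valid id
```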

RcppCWB documentation built on April 11, 2025, 5:48 p.m.