cl_lexicon_size: Get Lexicon Size.

cl_lexicon_size    R Documentation

Get Lexicon Size.

Description

Get the number of unique tokens, and hence of token ids, in the lexicon of a positional attribute. Token ids are zero-based: when iterating over them, start at 0; the highest valid id is cl_lexicon_size() minus 1.

Usage

cl_lexicon_size(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))

Arguments

corpus

name of a CWB corpus (upper case)

p_attribute

name of positional attribute

registry

path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY

Examples

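library(RcppCWB)

# Temporary registry directory used by RcppCWB
registry <- get_tmp_registry()

# Number of unique tokens in the "word" attribute of REUTERS
lexicon_size <- cl_lexicon_size(
  "REUTERS",
  p_attribute = "word",
  registry = registry
)

# Token ids are zero-based: valid ids are 0, ..., lexicon_size - 1
token_ids <- seq.int(from = 0, to = lexicon_size - 1)

# Decode every id to its string, i.e. retrieve the complete lexicon
cl_id2str(
  "REUTERS",
  p_attribute = "word",
  id = token_ids,
  registry = get_tmp_registry()
)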
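
The zero-based id range used above can be wrapped in a small helper. The following is only a sketch in base R: lexicon_ids() is a hypothetical name, not part of RcppCWB, and the size 5L is a stand-in for a value returned by cl_lexicon_size(). It relies on seq_len(), which returns an empty vector for a size of 0, whereas seq.int(from = 0, to = -1) would count down and return 0 and -1.

```r
# Hypothetical helper (not part of RcppCWB): all valid zero-based token ids
# for a lexicon of the given size.
lexicon_ids <- function(lexicon_size) {
  stopifnot(length(lexicon_size) == 1L, lexicon_size >= 0)
  # seq_len(n) yields 1, ..., n (or nothing for n = 0); shift to 0, ..., n - 1
  seq_len(lexicon_size) - 1L
}

# 5L stands in for a value returned by cl_lexicon_size()
ids <- lexicon_ids(5L)

# An empty lexicon yields no ids at all
ids_empty <- lexicon_ids(0L)
```

The resulting vector could be passed as the id argument of cl_id2str() in the same way as token_ids above.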

RcppCWB documentation built on Sept. 24, 2024, 1:08 a.m.