s_attributes: Using Structural Attributes.


Description

Structural attributes store the metadata of texts in a CWB corpus and/or any kind of annotation of a region of text. The fundamental units are so-called strucs, i.e. indices of regions, each identified by a left and a right corpus position. The corpus library (CL) offers a set of functions to translate between corpus positions (cpos) and strucs (struc).
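The relation between corpus positions and strucs can be illustrated with a toy region table in base R. This is only a sketch of the idea, not how RcppCWB or the corpus library store or look up regions: the names regions, toy_cpos2struc and toy_struc2cpos are hypothetical, the zero-based numbering is chosen to mirror CWB conventions, and returning NA for positions outside any region is a choice made for this toy only.

```r
# Toy region table: each row is one region (struc) with its left and
# right corpus position. Illustration only; not RcppCWB internals.
regions <- data.frame(
  struc = 0:2,
  left  = c(0L, 5L, 9L),
  right = c(4L, 8L, 15L)
)

# cpos -> struc: find the region that contains each corpus position.
# Positions outside all regions yield NA (an assumption of this toy).
toy_cpos2struc <- function(cpos) {
  vapply(cpos, function(p) {
    hit <- which(regions$left <= p & p <= regions$right)
    if (length(hit) == 1L) regions$struc[hit] else NA_integer_
  }, integer(1))
}

# struc -> cpos: return the left and right boundary of one region.
toy_struc2cpos <- function(struc) {
  row <- regions[regions$struc == struc, ]
  c(row$left, row$right)
}

toy_cpos2struc(c(0L, 6L, 15L))  # strucs 0, 1, 2
toy_struc2cpos(1L)              # boundaries 5, 8
```

In this picture, cl_cpos2struc and cl_struc2cpos play the roles of the two toy functions, while cl_cpos2lbound and cl_cpos2rbound return one boundary of the region that contains a given corpus position.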

Usage

cl_cpos2struc(corpus, s_attribute, cpos,
  registry = Sys.getenv("CORPUS_REGISTRY"))

cl_struc2cpos(corpus, s_attribute, registry = Sys.getenv("CORPUS_REGISTRY"),
  struc)

cl_struc2str(corpus, s_attribute, struc,
  registry = Sys.getenv("CORPUS_REGISTRY"))

cl_cpos2lbound(corpus, s_attribute, cpos,
  registry = Sys.getenv("CORPUS_REGISTRY"))

cl_cpos2rbound(corpus, s_attribute, cpos,
  registry = Sys.getenv("CORPUS_REGISTRY"))

Arguments

corpus

name of a CWB corpus (upper case)

s_attribute

name of structural attribute (character vector)

cpos

corpus positions (integer vector)

registry

path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY

struc

a struc identifying a region
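Because registry defaults to Sys.getenv("CORPUS_REGISTRY"), the functions pick up whatever that environment variable holds when they are called. A minimal base R sketch of this mechanism follows; the path "/path/to/registry" is a hypothetical placeholder, not a location shipped with the package.

```r
# Remember the current value so it can be restored afterwards
old <- Sys.getenv("CORPUS_REGISTRY")

# Hypothetical registry directory, for illustration only
Sys.setenv(CORPUS_REGISTRY = "/path/to/registry")

# This is the value the 'registry' argument falls back to by default
default_registry <- Sys.getenv("CORPUS_REGISTRY")
default_registry

# Restore the previous setting
Sys.setenv(CORPUS_REGISTRY = old)
```

Passing registry explicitly avoids depending on the state of the environment variable.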

Examples

library(RcppCWB)

# use the registry shipped with the package, or a temporary copy of it
registry <- if (!check_pkg_registry_files()) use_tmp_registry() else get_pkg_registry()

# get metadata for matches of token
# scenario: id of the texts with occurrence of 'oil'
token_to_get <- "oil"
token_id <- cl_str2id("REUTERS", p_attribute = "word", str = token_to_get)
token_cpos <- cl_id2cpos("REUTERS", p_attribute = "word", id = token_id)
strucs <- cl_cpos2struc("REUTERS", s_attribute = "id", cpos = token_cpos)
strucs_unique <- unique(strucs)
text_ids <- cl_struc2str("REUTERS", s_attribute = "id", struc = strucs_unique)

# get the full text of the first text with match for 'oil'
left_cpos <- cl_cpos2lbound("REUTERS", s_attribute = "id", cpos = min(token_cpos))
right_cpos <- cl_cpos2rbound("REUTERS", s_attribute = "id", cpos = min(token_cpos))
txt <- cl_cpos2str("REUTERS", p_attribute = "word", cpos = left_cpos:right_cpos)
fulltext <- paste(txt, collapse = " ")

# alternative approach to achieve the same result
first_struc_match_oil <- cl_cpos2struc("REUTERS", s_attribute = "id", cpos = min(token_cpos))
cpos_struc <- cl_struc2cpos("REUTERS", s_attribute = "id", struc = first_struc_match_oil)
txt <- cl_cpos2str("REUTERS", p_attribute = "word", cpos = cpos_struc[1]:cpos_struc[2])
fulltext <- paste(txt, collapse = " ")

RcppCWB documentation built on Oct. 22, 2018, 5:08 p.m.