region_matrix_ops: Get IDs and Counts for Region Matrices.
In RcppCWB: 'Rcpp' Bindings for the 'Corpus Workbench' ('CWB')

region_matrix_ops

R Documentation

Description

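Get token IDs and token counts for the corpus positions covered by a region matrix, expand the regions of a matrix by a left and right context, or turn ranges into individual corpus positions.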

Usage

region_matrix_to_ids(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix
)

region_matrix_to_count_matrix(
  corpus,
  p_attribute,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix
)

region_matrix_context(
  corpus,
  registry = Sys.getenv("CORPUS_REGISTRY"),
  matrix,
  p_attribute,
  s_attribute,
  boundary,
  left,
  right
)

ranges_to_cpos(ranges)

Arguments

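`corpus`	ID of a CWB corpus (length-one `character` vector), e.g. "REUTERS".
`p_attribute`	A positional attribute (length-one `character` vector), e.g. "word".
`registry`	Path to the registry directory. Defaults to the value of the environment variable `CORPUS_REGISTRY`.
`matrix`	A regions matrix, as returned by `get_region_matrix()`.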
`s_attribute`	If not `NULL`, a structural attribute (length-one `character` vector), typically indicating a sentence ("s").
`boundary`	Structural attribute (length-one `character` vector) that serves as a boundary and that shall not be transgressed.
`left`	An `integer` value, number of strucs to move to the left.
`right`	An `integer` value, number of strucs to move to the right.
`ranges`	A two-column integer `matrix` of ranges (left and right corpus positions in first and second column, respectively).

Details

`ranges_to_cpos()` turns a matrix of ranges into an integer vector of the individual corpus positions covered by the ranges. For example, the ranges 0 to 2 and 5 to 6 expand to the corpus positions 0, 1, 2, 5, 6.
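
The expansion that `ranges_to_cpos()` performs can be illustrated in base R without a corpus. The following is a minimal sketch of the idea only, not the package's implementation; `expand_ranges()` is a hypothetical helper introduced for illustration and is not part of RcppCWB.

```r
# Hypothetical base-R helper (not part of RcppCWB): expand a two-column
# matrix of ranges into the individual positions covered by each range.
expand_ranges <- function(ranges) {
  stopifnot(is.matrix(ranges), ncol(ranges) == 2L)
  # One integer sequence per row, from the left to the right position
  seqs <- mapply(seq.int, ranges[, 1], ranges[, 2], SIMPLIFY = FALSE)
  # Flatten to a plain integer vector (integer(0) if there are no rows)
  as.integer(unlist(seqs, use.names = FALSE))
}

# Two ranges: positions 0 to 2 and positions 5 to 6
ranges <- matrix(c(0L, 2L, 5L, 6L), ncol = 2, byrow = TRUE)
cpos <- expand_ranges(ranges)
cpos
```

With a corpus at hand, the same kind of matrix (for instance one returned by `get_region_matrix()`) would be passed to `ranges_to_cpos()` directly.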

Examples

# Scenario 1: Get full text for a subcorpus defined by regions
m <- get_region_matrix(
  corpus = "REUTERS", s_attribute = "places",
  strucs = 4L:5L, registry = get_tmp_registry()
  )
ids <- region_matrix_to_ids(
  corpus = "REUTERS", p_attribute = "word",
  registry = get_tmp_registry(), matrix = m
  )
tokenstream <- cl_id2str(
  corpus = "REUTERS", p_attribute = "word",
  registry = get_tmp_registry(), id = ids
  )
txt <- paste(tokenstream, collapse = " ")
txt

# Scenario 2: Get data.frame with counts for region matrix
y <- region_matrix_to_count_matrix(
  corpus = "REUTERS", p_attribute = "word",
  registry = get_tmp_registry(), matrix = m
  )
df <- as.data.frame(y)
colnames(df) <- c("token_id", "count")
df[["token"]] <- cl_id2str(
  "REUTERS", p_attribute = "word",
  registry = get_tmp_registry(), id = df[["token_id"]]
  )
# Sort by descending count, then show the most frequent tokens
df <- df[order(df[["count"]], decreasing = TRUE), ]
head(df)

RcppCWB documentation built on April 11, 2025, 5:48 p.m.