regions_class: Regions of a CWB corpus.

regions    R Documentation

Regions of a CWB corpus.

Description

Class to store and process the regions of a corpus. Regions are defined by start and end corpus positions and correspond to a set of tokens surrounded by start and end XML tags.

Usage

regions(x, s_attribute)

## S4 method for signature 'corpus'
regions(x, s_attribute)

## S4 method for signature 'subcorpus'
regions(x, s_attribute)

as.regions(x, ...)

## S3 method for class 'regions'
as.data.table(x, keep.rownames, values = NULL, ...)

Arguments

x

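The input object: a corpus or subcorpus object for regions(), an object to be coerced for as.regions(), and a regions object for as.data.table().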

s_attribute

A length-one character vector naming the s-attribute for which regions shall be derived.

...

Further arguments.

keep.rownames

Required argument to safeguard consistency with the S3 method definition in the data.table package. Unused in this context.

values

Values to assign to a column that will be added to the resulting data.table.

Details

The regions class is a minimal representation of regions. It does not include the "strucs" (region IDs) that are used internally to obtain values of s-attributes, nor does it record which combination of conditions on s-attributes has been used to obtain the regions. That information is left to the subcorpus class. Whereas the subcorpus class assumes that a set of regions forms a meaningful sub-unit of a corpus, the regions class focuses on the individual sequences of tokens defined by a structural attribute (such as paragraphs, sentences or named entities).

Information on regions is maintained in the cpos slot of the regions S4 class: A two-column matrix with begin and end corpus positions (first and second column, respectively). All other slots are inherited from the corpus class.
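The layout of such a matrix can be illustrated without a corpus at hand. The following base R sketch uses made-up corpus positions (not taken from any real corpus) and assumes, in line with the CQP manual's definition of regions, that both the begin and the end position are inclusive.

```r
# Hypothetical corpus positions for three regions (illustrative values only).
# Column 1 holds the begin position, column 2 the end position of each region.
cpos <- matrix(
  c( 0L,  9L,
    10L, 24L,
    40L, 44L),
  ncol = 2, byrow = TRUE
)

# Number of tokens per region: both boundaries are inclusive, hence the + 1.
region_sizes <- cpos[, 2] - cpos[, 1] + 1L

# Expand all regions into the full vector of corpus positions they cover.
all_cpos <- unlist(
  lapply(seq_len(nrow(cpos)), function(i) cpos[i, 1]:cpos[i, 2])
)
```

With a real regions object, the same operations would presumably be applied to the matrix in the cpos slot.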

The understanding of "regions" is modelled on the usage of terms by CWB developers. As it is put in the CQP Interface and Query Language Manual: "Matching pairs of XML start and end tags are encoded as token regions, identified by the corpus positions of the first token (immediately following the start tag) and the last token (immediately preceding the end tag) of the region." (p. 6)
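To make the quoted definition concrete, here is a minimal base R sketch that derives regions from a toy token stream. The stream, the tag name "s" and all variable names are assumptions made for illustration; this is not how CWB or polmineR encode a corpus internally. Corpus positions are counted from zero, and tags themselves do not occupy a corpus position.

```r
# Toy token stream with XML start and end tags (hypothetical data).
stream <- c("<s>", "Hello", "world", "</s>", "<s>", "Bye", "</s>")

is_start <- stream == "<s>"
is_end   <- stream == "</s>"
is_tag   <- is_start | is_end

# Zero-based corpus position of the most recent token at each stream index.
# Tags are not tokens, so they do not advance the counter.
last_token_cpos <- cumsum(!is_tag) - 1L

# A region starts at the first token after the start tag and ends at the
# last token before the end tag.
regions_matrix <- cbind(
  last_token_cpos[is_start] + 1L,
  last_token_cpos[is_end]
)
```

For this stream, the sketch yields two regions: one covering corpus positions 0 to 1 ("Hello world") and one covering only position 2 ("Bye").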

The as.regions method coerces objects to a regions object.

The as.data.table method returns the matrix with corpus positions in the slot cpos as a data.table.
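The conversion can be approximated with the data.table package alone. The sketch below is not the polmineR method itself: the matrix values, the column names cpos_left and cpos_right, and the name of the added values column are assumptions chosen for illustration.

```r
library(data.table)

# Hypothetical two-column matrix of begin and end corpus positions.
cpos <- matrix(c(0L, 9L, 10L, 24L, 40L, 44L), ncol = 2, byrow = TRUE)
colnames(cpos) <- c("cpos_left", "cpos_right")  # assumed column names

# Convert the matrix to a data.table; column names are carried over.
dt <- as.data.table(cpos)

# Add one value per region as an extra column, analogous to what the
# 'values' argument is described to do.
dt[, values := c("doc_1", "doc_2", "doc_3")]
```

The vector passed as values needs one element per region, i.e. per row of the matrix.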

Slots

cpos

A two-column matrix with start and end corpus positions (first and second column, respectively).

See Also

Other classes to manage corpora: corpus-class, phrases-class, ranges-class, subcorpus

Examples

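library(polmineR)

# Activate the sample corpora included in the polmineR package
use("polmineR")

# Coerce a partition to a regions object
P <- partition("GERMAPARLMINI", date = "2009-11-12", speaker = "Jens Spahn")
R <- as.regions(P)

# Activate the REUTERS corpus included in the RcppCWB package
use(pkg = "RcppCWB", corpus = "REUTERS")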

# Get regions matrix as data.table, without / with values
sc <- corpus("REUTERS") %>% subset(grep("saudi-arabia", places))
regions_dt <- as.data.table(sc)
regions_dt <- as.data.table(
  sc,
  values = s_attributes(sc, "id", unique = FALSE)
)

PolMine/polmineR documentation built on Nov. 9, 2023, 8:07 a.m.