polmineR: Verbs and Nouns for Corpus Analysis

Cooccurrences-class

R Documentation

Cooccurrences class for corpus/partition.

Description

The Cooccurrences-class stores the information for all cooccurrences in a corpus. As this data can be bulky, the data.table in the stat-slot of a Cooccurrences-object is modified in place wherever possible, to avoid copying potentially large objects. The class inherits from the textstat-class, so that methods for textstat-objects are inherited (see examples).

Usage

## S4 method for signature 'Cooccurrences'
as.simple_triplet_matrix(x)

## S4 method for signature 'Cooccurrences'
as_igraph(
  x,
  edge_attributes = c("ll", "ab_count", "rank_ll"),
  vertex_attributes = "count",
  as.undirected = TRUE,
  drop = getOption("polmineR.villainChars")
)

## S4 method for signature 'Cooccurrences'
subset(x, ..., by)

## S4 method for signature 'Cooccurrences'
decode(.Object)

## S4 method for signature 'Cooccurrences'
kwic(
  .Object,
  left = getOption("polmineR.left"),
  right = getOption("polmineR.right"),
  verbose = TRUE,
  progress = TRUE
)

## S4 method for signature 'Cooccurrences'
as.sparseMatrix(x, col = "ab_count", ...)

## S4 method for signature 'Cooccurrences'
enrich(.Object)

Arguments

`x`	A `Cooccurrences` class object.
`edge_attributes`	Attributes from stat `data.table` in x to add to edges.
`vertex_attributes`	Vertex attributes to add to nodes.
`as.undirected`	Logical, whether to return an undirected graph (`TRUE`) or a directed graph (`FALSE`).
`drop`	A character vector indicating names of nodes to drop from `igraph` object that is prepared.
`...`	Further arguments passed into a further call of `subset`.
`by`	A `features`-class object.
`.Object`	A `Cooccurrences`-class object.
`left`	Number of tokens to the left of the node.
`right`	Number of tokens to the right of the node.
`verbose`	Logical, whether to output messages.
`progress`	Logical, whether to show progress bar.
`col`	A column to extract.

Details

The as.simple_triplet_matrix-method will transform a Cooccurrences object into a sparse matrix. For reasons of memory efficiency, decoding token ids is performed within the method as late as possible. It is NOT necessary that decoded tokens are present in the table in the Cooccurrences object.

The as_igraph-method can be used to turn an object of the Cooccurrences-class into an igraph-object.
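A minimal sketch of a possible workflow, assuming the igraph package is installed and the REUTERS corpus shipped with RcppCWB is available. The thresholds in the subset() call are illustrative. The calls to ll() and to enrich() on the partition are assumed to supply the default edge attributes ("ll", "rank_ll") and the vertex attribute ("count").

```r
library(polmineR)
use(pkg = "RcppCWB", corpus = "REUTERS")

cooc <- Cooccurrences("REUTERS", p_attribute = "word", left = 5L, right = 5L)
ll(cooc)      # assumption: adds the log-likelihood statistic used as edge attribute
decode(cooc)  # replace integer ids by tokens (in-place)

# Keep only statistically salient pairs so that the graph stays small;
# the cut-off values are illustrative, not recommendations.
cooc_min <- subset(cooc, ll > 11.83 & ab_count >= 5)

if (requireNamespace("igraph", quietly = TRUE)) {
  # assumption: token counts in the partition provide the vertex attribute "count"
  cooc_min@partition <- enrich(cooc_min@partition, p_attribute = "word")
  g <- as_igraph(cooc_min, as.undirected = TRUE)
  plot(g)
}
```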

The subset method, as a particular feature, allows a Cooccurrences-object to be subsetted by a features-object resulting from a feature extraction that compares two Cooccurrences objects.

For reasons of memory efficiency, the initial data.table in the slot stat of a Cooccurrences-object will identify tokens by an integer id, not by the string of the token. The decode()-method will replace these integer columns with human-readable character vectors. Due to the reference logic of the data.table object, this is an in-place operation, performed without copying the table. The modified object is returned invisibly; usually it will not be necessary to catch the return value.
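The reference logic of data.table can be illustrated independently of polmineR. The following sketch uses a toy table and a hypothetical id-to-token lookup (`vocab`); the column names are illustrative and not those of an actual Cooccurrences object.

```r
library(data.table)

# Toy stand-in for the table in the stat slot; column names are illustrative.
stat <- data.table(
  a_id = c(1L, 2L, 3L),
  b_id = c(2L, 3L, 1L),
  ab_count = c(5L, 2L, 7L)
)
vocab <- c("oil", "price", "barrel")  # hypothetical id-to-token lookup

same_table <- stat  # no copy: both names refer to the same data.table

# `:=` adds columns by reference, so the table itself is not copied.
stat[, a_token := vocab[a_id]]
stat[, b_token := vocab[b_id]]

# The new columns are visible through the second name as well.
"a_token" %in% colnames(same_table)
```

This is why the return value of an in-place method usually does not need to be assigned: the object that was passed in has already been changed.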

The kwic-method will add a column with the concordances that are behind a statistical finding to the data.table in the stat-slot of the Cooccurrences-object, and to the data.table in the stat-slot of the partition in the slot partition. It is an in-place operation.

The as.sparseMatrix-method returns a sparseMatrix based on the counts of term cooccurrences. At this stage, it is required that decoded tokens are present.
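To illustrate why decoded tokens are needed for a matrix with readable dimnames, here is a sketch of how a table of cooccurrence counts maps onto a sparse matrix. It uses the Matrix package directly on toy data; the vectors `a`, `b` and `ab_count` are illustrative, and this is not the package's internal implementation.

```r
library(Matrix)

# Toy cooccurrence counts with decoded tokens; names are illustrative only.
a <- c("oil", "oil", "price")
b <- c("price", "barrel", "barrel")
ab_count <- c(5, 2, 7)

tokens <- sort(unique(c(a, b)))  # shared vocabulary for rows and columns

m <- sparseMatrix(
  i = match(a, tokens),  # row index of the node token
  j = match(b, tokens),  # column index of the cooccurring token
  x = ab_count,
  dims = c(length(tokens), length(tokens)),
  dimnames = list(tokens, tokens)
)

m["oil", "price"]
```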

The enrich()-method will add columns 'a_count' and 'b_count' to the data.table in the 'stat' slot of the Cooccurrences object. If the count for the subcorpus/partition from which the cooccurrences are derived is not yet present, the count is performed first.

Slots

left: Single integer value, number of tokens to the left of the node.
right: Single integer value, number of tokens to the right of the node.
p_attribute: A character vector, the p-attribute(s) the evaluation of the corpus is based on.
corpus: Length-one character vector, the CWB corpus used.
stat: A data.table with the statistical analysis of cooccurrences.
encoding: Length-one character vector, the encoding of the corpus.
partition: The partition that is the basis for computations.
window_sizes: A data.table linking each token, identified by id, to the number of tokens in its context.
minimized: Logical, whether the object has been minimized.

Examples

## Not run: 
# takes too much time on CRAN test machines
use(pkg = "RcppCWB", corpus = "REUTERS")
X <- Cooccurrences("REUTERS", p_attribute = "word", left = 2L, right = 2L)
m <- as.simple_triplet_matrix(X)

## End(Not run)

use(pkg = "RcppCWB", corpus = "REUTERS")

X <- Cooccurrences("REUTERS", p_attribute = "word", left = 5L, right = 5L)
decode(X)
sm <- as.sparseMatrix(X)
stm <- as.simple_triplet_matrix(X)

PolMine/polmineR documentation built on Nov. 9, 2023, 8:07 a.m.