In wordspace: Distributional Semantic Models in R

dsm	R Documentation

Create DSM Object Representing a Distributional Semantic Model (wordspace)

Description

This is the constructor function for dsm objects representing distributional semantic models, i.e. a co-occurrence matrix together with additional information on target terms (rows) and features (columns). A new DSM can be initialised with a dense or sparse co-occurrence matrix, or with a triplet representation of a sparse matrix.

Usage


dsm(M = NULL, target = NULL, feature = NULL, score = NULL,
    rowinfo = NULL, colinfo = NULL, N = NA,
    globals = list(), raw.freq = FALSE, sort = FALSE, verbose = FALSE)

Arguments

`M`	a dense or sparse co-occurrence matrix. A sparse matrix must be a subclass of `sparseMatrix` from the `Matrix` package. See "Details" below.
`target`	a character vector of target terms (see "Details" below)
`feature`	a character vector of feature terms (see "Details" below)
`score`	a numeric vector of co-occurrence frequencies or weighted/transformed scores (see "Details" below)
`rowinfo`	a data frame containing information about the rows of the co-occurrence matrix, corresponding to target terms. The data frame must include a column `term` with the target term labels. If unspecified, a minimal `rowinfo` table is compiled automatically (see "Details" below).
`colinfo`	a data frame containing information about the columns of the co-occurrence matrix, corresponding to feature terms. The data frame must include a column `term` with the feature term labels. If unspecified, a minimal `colinfo` table is compiled automatically (see "Details" below).
`N`	a single numeric value specifying the effective sample size of the co-occurrence matrix. This value may be determined automatically if `raw.freq=TRUE`.
`globals`	a list of global variables, which are included in the `globals` field of the DSM object. May contain an entry for the sample size N, which can be overridden by an explicitly specified value in the argument `N`.
`raw.freq`	if `TRUE`, entries of the co-occurrence matrix are interpreted as raw frequency counts. By default, it is assumed that some weighting/transformation has already been applied.
`sort`	if `TRUE`, sort rows and columns of a co-occurrence matrix specified in triplet form alphabetically. If the matrix is given directly (in argument `M`), rows and columns are never reordered.
`verbose`	if `TRUE`, a few progress and information messages are shown

Details

The co-occurrence matrix forming the core of the distributional semantic model (DSM) can be specified in two different ways:

As a dense or sparse matrix in argument `M`. A sparse matrix must be a subclass of `dMatrix` (from the `Matrix` package) and is automatically converted to the canonical storage mode used by the wordspace package. Row and column labels may be specified with arguments `target` and `feature`, which must be character vectors of suitable length; otherwise `dimnames(M)` are used.
As a triplet representation in arguments `target` (row label), `feature` (column label) and `score` (co-occurrence frequency or pre-computed score). The three arguments must be vectors of the same length; each set of corresponding elements specifies a non-zero cell of the co-occurrence matrix. If multiple entries for the same cell are given, their frequency or score values are added up.
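The two input forms can be compared on a toy model. The following is a minimal sketch, assuming the wordspace package is installed; the terms and counts are made up for illustration, and the object names `dsm_matrix` and `dsm_triplet` are hypothetical.

```r
library(wordspace)

# Toy co-occurrence counts (illustrative values only)
counts <- matrix(c(1, 3, 0,
                   0, 0, 2,
                   1, 0, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("boat", "cat", "dog"),
                                 c("buy", "use", "feed")))

# Form 1: pass the matrix directly; labels are taken from dimnames(counts)
dsm_matrix <- dsm(counts, raw.freq = TRUE)

# Form 2: triplet representation; the (boat, use) cell is listed twice,
# so its two values (1 and 2) are added up
dsm_triplet <- dsm(
  target  = c("boat", "boat", "boat", "cat",  "dog", "dog"),
  feature = c("buy",  "use",  "use",  "feed", "buy", "feed"),
  score   = c(1,      1,      2,      2,      1,     1),
  raw.freq = TRUE
)

# Look cells up by label rather than by position, so the comparison
# does not depend on the row and column order of either object
dsm_matrix$M["boat", "use"]
dsm_triplet$M["boat", "use"]
```

Looking cells up by label keeps the comparison valid whether or not `sort=TRUE` is used for the triplet form.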

The optional arguments `rowinfo` and `colinfo` are data frames with additional information about target and feature terms. If they are specified, they must contain a column `term` matching the row or column labels of the co-occurrence matrix. Marginal frequencies and nonzero or document counts can be given in columns `f` and `nnzero`; any further columns are interpreted as meta-information on the target or feature terms. The rows of each data frame are automatically reordered to match the rows or columns of the co-occurrence matrix. Target or feature terms that do not appear in the co-occurrence matrix are silently discarded.
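As an illustration of the reordering and discarding behaviour described above, here is a minimal sketch, assuming the wordspace package is installed. The `category` column and the extra term "fish" are hypothetical and serve only to show meta-information and an unused term.

```r
library(wordspace)

# Hypothetical row metadata, deliberately in a different order than the
# matrix rows; "fish" does not occur in the co-occurrence data
info <- data.frame(
  term     = c("dog", "fish", "cat", "boat"),
  category = c("animal", "animal", "animal", "vehicle"),
  stringsAsFactors = FALSE
)

model <- dsm(
  target  = c("boat", "boat", "cat",  "dog", "dog"),
  feature = c("buy",  "use",  "feed", "buy", "feed"),
  score   = c(1,      3,      2,      1,     1),
  rowinfo = info,
  raw.freq = TRUE
)

# Row information is aligned with the matrix rows; "fish" has been dropped
print(model$rows)
print(rownames(model$M))
```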

Counts of nonzero cells for each row and column are computed automatically, unless they are already present in the `rowinfo` and `colinfo` data frames. If the co-occurrence matrix contains raw frequency values, marginal frequencies for the target and feature terms are also computed automatically unless given in `rowinfo` and `colinfo`; the same holds for the effective sample size `N`.
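The interplay between supplied and automatically computed statistics can be sketched as follows, assuming the wordspace package is installed. The marginal frequencies in `row_stats` and the sample size of 1000 are invented values, not real corpus data.

```r
library(wordspace)

# Hypothetical corpus-level marginal frequencies for the target terms
row_stats <- data.frame(
  term = c("boat", "cat", "dog"),
  f    = c(120, 80, 95),
  stringsAsFactors = FALSE
)

model <- dsm(
  target  = c("boat", "boat", "cat",  "dog", "dog"),
  feature = c("buy",  "use",  "feed", "buy", "feed"),
  score   = c(1,      3,      2,      1,     1),
  rowinfo = row_stats,  # marginals given here are used as supplied
  N = 1000,             # explicit sample size instead of an automatic value
  raw.freq = TRUE
)

print(model$rows)       # f from row_stats, nnzero computed from the matrix
print(model$cols)       # no colinfo given, so both f and nnzero are computed
print(model$globals$N)  # the explicitly specified sample size
```

Supplying marginals explicitly is useful when the matrix covers only a subset of the corpus, so that sums over matrix cells would underestimate the true term frequencies.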

If `raw.freq=TRUE`, all matrix entries must be non-negative; fractional frequency counts are allowed, however.

Value

An object of class dsm, a list with the following components:

`M`	A co-occurrence matrix of raw frequency counts in canonical format (see `dsm.canonical.matrix`).
`S`	A weighted and transformed co-occurrence matrix ("score" matrix) in canonical format (see `dsm.canonical.matrix`). Either `M` or `S` or both may be present. The object returned by `dsm()` will include `M` if `raw.freq=TRUE` and `S` otherwise.
`rows`	A data frame with information about the target terms, corresponding to the rows of the co-occurrence matrix. The data frame usually has at least three columns:
	`rows$term`	the target term (row label)
	`rows$f`	marginal frequency of the target term; must be present if the DSM object contains a raw co-occurrence matrix `M`
	`rows$nnzero`	number of nonzero entries in the corresponding row of the co-occurrence matrix
	Further columns may provide additional information.
`cols`	A data frame with information about the feature terms, corresponding to the columns of the co-occurrence matrix, in the same format as `rows`.
`globals`	A list of global variables. The following variables have a special meaning:
	`globals$N`	effective sample size of the underlying corpus; may be `NA` if raw co-occurrence counts are not available
	`globals$locked`	if `TRUE`, the marginal frequencies are no longer valid due to a `merge`, `rbind` or `cbind` operation; in this case, association scores cannot be computed from the co-occurrence frequencies `M`

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

Examples


MyDSM <- dsm(
  target =  c("boat", "boat", "cat",  "dog", "dog"),
  feature = c("buy",  "use",  "feed", "buy", "feed"),
  score =   c(1,      3,      2,      1,     1),
  raw.freq = TRUE
)

print(MyDSM)   # 3 x 3 matrix with 5 out of 9 nonzero cells
print(MyDSM$M) # the actual co-occurrence matrix

print(MyDSM$rows) # row information
print(MyDSM$cols) # column information

wordspace documentation built on Aug. 23, 2022, 1:06 a.m.