dsm | R Documentation |
This is the constructor function for dsm objects representing distributional semantic models, i.e. a co-occurrence matrix together with additional information on target terms (rows) and features (columns).
A new DSM can be initialised with a dense or sparse co-occurrence matrix, or with a triplet representation of a sparse matrix.
dsm(M = NULL, target = NULL, feature = NULL, score = NULL, rowinfo = NULL, colinfo = NULL, N = NA, globals = list(), raw.freq = FALSE, sort = FALSE, verbose = FALSE)
M |
a dense or sparse co-occurrence matrix. A sparse matrix must be a subclass of dMatrix (from the Matrix package) |
target |
a character vector of target terms (see "Details" below) |
feature |
a character vector of feature terms (see "Details" below) |
score |
a numeric vector of co-occurrence frequencies or weighted/transformed scores (see "Details" below) |
rowinfo |
a data frame containing information about the rows of the co-occurrence matrix, corresponding to target terms. The data frame must include a column term matching the row labels of the matrix (see "Details" below) |
colinfo |
a data frame containing information about the columns of the co-occurrence matrix, corresponding to feature terms. The data frame must include a column term matching the column labels of the matrix (see "Details" below) |
N |
a single numeric value specifying the effective sample size of the co-occurrence matrix. This value may be determined automatically if raw.freq=TRUE (see "Details" below) |
globals |
a list of global variables, which are included in the globals component of the returned object |
raw.freq |
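if TRUE, the entries of the co-occurrence matrix are interpreted as raw frequency counts (see "Details" below) |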
sort |
if |
verbose |
if |
The co-occurrence matrix forming the core of the distributional semantic model (DSM) can be specified in two different ways:
As a dense or sparse matrix in argument M. A sparse matrix must be a subclass of dMatrix (from the Matrix package) and is automatically converted to the canonical storage mode used by the wordspace package. Row and column labels may be specified with arguments target and feature, which must be character vectors of suitable length; otherwise dimnames(M) are used.
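The matrix form of initialisation can be sketched as follows. This is a minimal illustration, assuming the wordspace and Matrix packages are installed; the variable names (dense.M, sparse.M, dense.model, sparse.model) and the toy counts are made up for this example.

```r
library(wordspace)
library(Matrix)

# Dense matrix with row and column labels taken from dimnames(M)
dense.M <- matrix(
  c(1, 0, 1,   # column "buy"
    3, 0, 0,   # column "use"
    0, 2, 1),  # column "feed"
  nrow = 3,
  dimnames = list(c("boat", "cat", "dog"), c("buy", "use", "feed"))
)
dense.model <- dsm(dense.M, raw.freq = TRUE)

# Sparse matrix without dimnames; labels are supplied via target= and feature=
sparse.M <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 2, 3, 1, 3),
  x = c(1, 3, 2, 1, 1),
  dims = c(3, 3)
)
sparse.model <- dsm(
  sparse.M,
  target = c("boat", "cat", "dog"),
  feature = c("buy", "use", "feed"),
  raw.freq = TRUE
)

print(dense.model$M)   # dense co-occurrence matrix
print(sparse.model$M)  # sparse co-occurrence matrix with the same cells
```

Both calls describe the same toy co-occurrence counts; only the storage format of the input differs.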
As a triplet representation in arguments target (row label), feature (column label) and score (co-occurrence frequency or pre-computed score). The three arguments must be vectors of the same length; each set of corresponding elements specifies a non-zero cell of the co-occurrence matrix. If multiple entries for the same cell are given, their frequency or score values are added up.
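The handling of repeated triplets can be checked with a small sketch like the one below, assuming the wordspace package is installed. The name dup.model and the toy data are chosen for illustration only.

```r
library(wordspace)

# The cell ("cat", "feed") is listed twice, with counts 2 and 3
dup.model <- dsm(
  target  = c("cat", "cat", "dog"),
  feature = c("feed", "feed", "feed"),
  score   = c(2, 3, 1),
  raw.freq = TRUE
)

# According to the description above, the two entries are added up (2 + 3)
print(dup.model$M["cat", "feed"])
print(dim(dup.model$M))  # two target terms, one feature term
```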
The optional arguments rowinfo and colinfo are data frames with additional information about target and feature terms. If they are specified, they must contain a column $term matching the row or column labels of the co-occurrence matrix. Marginal frequencies and nonzero or document counts can be given in columns $f and $nnzero; any further columns are interpreted as meta-information on the target or feature terms. The rows of each data frame are automatically reordered to match the rows or columns of the co-occurrence matrix. Target or feature terms that do not appear in the co-occurrence matrix are silently discarded.
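A possible use of rowinfo is sketched below, assuming the wordspace package is installed. The meta-information column animate and the extra term "fish" are hypothetical and serve only to show the reordering and discarding behaviour described above.

```r
library(wordspace)

# Row information in a different order than the triplets, with one extra term
target.info <- data.frame(
  term    = c("dog", "fish", "boat", "cat"),
  animate = c(TRUE, TRUE, FALSE, TRUE),  # hypothetical meta-information
  stringsAsFactors = FALSE
)

info.model <- dsm(
  target  = c("boat", "boat", "cat", "dog", "dog"),
  feature = c("buy", "use", "feed", "buy", "feed"),
  score   = c(1, 3, 2, 1, 1),
  rowinfo = target.info,
  raw.freq = TRUE
)

# Rows of the data frame should now line up with the rows of the matrix,
# and "fish" should be gone because it has no row in the matrix
print(info.model$rows)
print(rownames(info.model$M))
```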
Counts of nonzero cells for each row and column are computed automatically, unless they are already present in the rowinfo and colinfo data frames. If the co-occurrence matrix contains raw frequency values, marginal frequencies for the target and feature terms are also computed automatically unless given in rowinfo and colinfo; the same holds for the effective sample size N.
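The automatically computed values can be inspected as in the following sketch, assuming the wordspace package is installed. The names auto.model and fixed.model are illustrative, the sample size of 1000 is an arbitrary value, and the assumption that N ends up in the globals component follows the "Value" section rather than anything stated here.

```r
library(wordspace)

auto.model <- dsm(
  target  = c("boat", "boat", "cat", "dog", "dog"),
  feature = c("buy", "use", "feed", "buy", "feed"),
  score   = c(1, 3, 2, 1, 1),
  raw.freq = TRUE
)

# nnzero and f are filled in because no rowinfo / colinfo was supplied
print(auto.model$rows)
print(auto.model$cols)

# Compare the marginal frequencies with plain row sums of the matrix
print(rowSums(as.matrix(auto.model$M)))

# An explicit sample size can be passed instead of relying on the default
fixed.model <- dsm(
  target  = c("boat", "boat", "cat", "dog", "dog"),
  feature = c("buy", "use", "feed", "buy", "feed"),
  score   = c(1, 3, 2, 1, 1),
  N = 1000,
  raw.freq = TRUE
)
print(fixed.model$globals$N)
```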
If raw.freq=TRUE, all matrix entries must be non-negative; fractional frequency counts are allowed, however.
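Fractional counts might arise, for instance, from distance-weighted co-occurrence. The sketch below assumes the wordspace package is installed and uses made-up values; frac.model is an illustrative name.

```r
library(wordspace)

# Non-negative but fractional counts are accepted with raw.freq=TRUE
frac.model <- dsm(
  target  = c("boat", "boat", "cat"),
  feature = c("buy", "use", "feed"),
  score   = c(0.5, 1.5, 2),
  raw.freq = TRUE
)

print(frac.model$M)
print(frac.model$rows)  # marginal frequencies are fractional sums as well
```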
An object of class dsm, a list with the following components:
M |
A co-occurrence matrix of raw frequency counts in canonical format (see dsm.canonical.matrix) |
S |
A weighted and transformed co-occurrence matrix ("score" matrix) in canonical format (see dsm.canonical.matrix) |
rows |
A data frame with information about the target terms, corresponding to the rows of the co-occurrence matrix. The data frame usually has at least three columns: term (the target term), nnzero (the number of nonzero cells) and f (the marginal frequency). Further columns may provide additional information. |
cols |
A data frame with information about the feature terms, corresponding to the columns of the co-occurrence matrix, in the same format as rows |
globals |
A list of global variables. The following variables have a special meaning:
|
Stephanie Evert (https://purl.org/stephanie.evert)
See dsm.canonical.matrix for a description of the canonical matrix formats. DSM objects are usually loaded directly from a disk file in UCS (read.dsm.ucs) or triplet (read.dsm.triplet) format.
MyDSM <- dsm(
  target = c("boat", "boat", "cat", "dog", "dog"),
  feature = c("buy", "use", "feed", "buy", "feed"),
  score = c(1, 3, 2, 1, 1),
  raw.freq = TRUE
)

print(MyDSM)      # 3 x 3 matrix with 5 out of 9 nonzero cells
print(MyDSM$M)    # the actual co-occurrence matrix

print(MyDSM$rows) # row information
print(MyDSM$cols) # column information