term_matrices: Term-document and term-cooccurrence matrices


Description

Compute and update various term-counts of a corpus with various output types.

Usage

dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("row", "triplet",
  "column", "df"), nthreads = mlvocab_nthreads())

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("column", "triplet",
  "row", "df"), nthreads = mlvocab_nthreads())

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric",
  "right", "left"), ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column",
  "row", "df"))

Arguments

corpus

text corpus; see [vocab()].

vocab

a data.frame produced by an earlier call to vocab(). When vocab is NULL and nbuckets is NULL or 0, the vocabulary is first computed from corpus. When nbuckets > 0 and vocab is NULL, the resulting matrix consists of buckets only.
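A minimal usage sketch of the workflow described above: compute the vocabulary once, then reuse it for all three matrices. The toy corpus and the assumption that corpus may be a named list of tokenized character vectors are illustrative; see vocab() for the accepted formats. The calls are guarded so the snippet does nothing if mlvocab is not installed.

```r
# Toy tokenized corpus: a named list of character vectors (assumed input format).
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog", "the", "fox")
)

if (requireNamespace("mlvocab", quietly = TRUE)) {
  # Compute the vocabulary once and reuse it for every matrix.
  v <- mlvocab::vocab(corpus)
  m_dtm <- mlvocab::dtm(corpus, v)                   # documents in rows
  m_tdm <- mlvocab::tdm(corpus, v)                   # terms in rows
  m_tcm <- mlvocab::tcm(corpus, v, window_size = 3)  # term-by-term co-occurrence
  print(dim(m_dtm))
}
```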

ngram

an integer vector of the form [ngram_min, ngram_max]. Defaults to the ngram settings used during the creation of vocab. Explicitly providing this parameter should rarely be needed.

nbuckets

number of unknown buckets

output

one of "triplet", "column", "row", "df" or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrices from the Matrix package; "df" returns a triplet data.frame.

nthreads

Number of OMP threads to use for computation, 0 for maximum number of threads available on the machine. The value is picked from options("mlvocab.nthreads"), or if that is unset from the environment variable MLVOCAB_NTHREADS, and defaults to 0 if neither is set.
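The lookup order described above can be sketched in base R. The helper resolve_nthreads() is hypothetical and only mirrors the documented precedence (option, then environment variable, then 0); it is not the package's own mlvocab_nthreads().

```r
# Hypothetical helper mirroring the documented lookup order;
# it is not the package's own mlvocab_nthreads().
resolve_nthreads <- function() {
  opt <- getOption("mlvocab.nthreads")
  if (!is.null(opt)) return(as.integer(opt))
  env <- Sys.getenv("MLVOCAB_NTHREADS", unset = NA)
  if (!is.na(env) && nzchar(env)) return(as.integer(env))
  0L  # 0 means "use all threads available on the machine"
}

# Set the thread count for the session through the option, then restore it.
old <- options(mlvocab.nthreads = 2L)
n_from_option <- resolve_nthreads()
options(old)
```

Setting the option in an .Rprofile, or exporting MLVOCAB_NTHREADS in the shell, avoids passing nthreads to every call.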

The default output corresponds to the most efficient option in terms of CPU and memory usage ("row" for dtm, "column" for tdm and "triplet" for tcm), but the benefits are marginal unless the matrices barely fit into memory. If you plan to perform further matrix algebra on these matrices, it is recommended to choose the "column" type because the Matrix package supports it much better.
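To illustrate the three sparse layouts and why the column type is convenient for algebra, here is a small sketch using only the Matrix package. The triplet data and its column names (i, j, x) are made up for the example and are not claimed to match the "df" output.

```r
library(Matrix)

# Hypothetical triplets (row index, column index, count); the column names
# are an assumption made for this sketch.
trip <- data.frame(i = c(1L, 1L, 2L), j = c(1L, 3L, 2L), x = c(2, 1, 4))

# Column-compressed layout, comparable to output = "column"
m_column <- sparseMatrix(i = trip$i, j = trip$j, x = trip$x, dims = c(2, 3))

# The same data in triplet and row-compressed layouts
m_triplet <- as(m_column, "TsparseMatrix")  # comparable to output = "triplet"
m_row     <- as(m_column, "RsparseMatrix")  # comparable to output = "row"

# Matrix algebra on the column-compressed form: t(m) %*% m
gram <- crossprod(m_column)
```

A matrix obtained in another layout can be converted the same way with as(m, "CsparseMatrix") before doing algebra on it.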

window_size

sliding window size used for co-occurrence computation. In this implementation the window includes the context word; thus, window_size == 1 results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of ngram_min and ngram_max.

window_weights

vector of weights which are superimposed on the sliding window. First element is a weight for distance 0 (aka context word itself), second for distance 1 etc. First weight is ignored for ngram_max == 1, see details. window_weights is recycled to length window_size if needed. It can be a string naming a function or a function which accepts one argument, window_size, and returns a window_weights vector. Defaults to [1, 1/2, ..., 1/window_size].
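The accepted forms of window_weights can be sketched in base R. The name linear_decay is hypothetical, and rep_len() is used here only to mimic the recycling described above.

```r
window_size <- 5L

# Default weights: 1, 1/2, ..., 1/window_size
w_default <- 1 / seq.int(window_size)

# A shorter vector recycled to length window_size
w_recycled <- rep_len(c(1, 0.5), window_size)

# A function of window_size that returns the weights
# (hypothetical linear decay from 1 down to 1/window_size)
linear_decay <- function(window_size) {
  (window_size - seq_len(window_size) + 1) / window_size
}
w_linear <- linear_decay(window_size)
```

Under the documented interface, either the function itself or the string "linear_decay" could then be supplied as window_weights.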

context

when "symmetric", matrix entries (i, j) and (j, i) are the same and represent the co-occurrence of terms i and j within window_size. When "right", entry (i, j) represents the co-occurrence of term j on the right side of term i. When "left", entry (i, j) represents the co-occurrence of term j on the left side of term i.
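The window and context conventions can be illustrated with a toy base-R implementation for unigrams. toy_tcm() is a hypothetical helper written for this sketch, not mlvocab's code; in particular, treating "symmetric" as the sum of the right and left counts is an assumption.

```r
# Hypothetical toy implementation for unigrams; not mlvocab's code.
toy_tcm <- function(tokens, window_size = 5,
                    window_weights = 1 / seq.int(window_size),
                    context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  terms <- unique(tokens)
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n <- length(tokens)
  for (p in seq_len(n)) {
    # The window includes the context word, so distances run 1..window_size - 1.
    for (d in seq_len(window_size - 1)) {
      q <- p + d
      if (q > n) break
      # Element d + 1 is the weight for distance d (element 1 is distance 0).
      m[tokens[p], tokens[q]] <- m[tokens[p], tokens[q]] + window_weights[d + 1]
    }
  }
  switch(context,
         right = m,             # (i, j): j occurs to the right of i
         left = t(m),           # (i, j): j occurs to the left of i
         symmetric = m + t(m))  # assumed: both directions summed
}

tokens <- c("a", "b", "c", "a")
right <- toy_tcm(tokens, window_size = 2, context = "right")
left  <- toy_tcm(tokens, window_size = 2, context = "left")
sym   <- toy_tcm(tokens, window_size = 2, context = "symmetric")
empty <- toy_tcm(tokens, window_size = 1)  # window holds only the context word
```

With window_size = 2 only immediate neighbours are counted, each with the distance-1 weight of 1/2, and window_size = 1 leaves every entry at zero.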

Details

For ngram_max > 1 the weights vector is automatically extended to match the "imaginary" sliding window over the ngrams. The proximity weight attached to an n-gram is the average of the weights of its constituents in the original sequence. This scheme results in consistent weighting across different values of ngram_min and ngram_max, and it is the reason why the first element of window_weights is the proximity to the context word itself (i.e. distance 0). For example:
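One hedged reading of the averaging rule, sketched in base R: an n-gram's weight is the mean of the window weights at the distances its constituents occupy. The helper ngram_weight() and the chosen distances are hypothetical, and the package's exact extension of the weights vector may differ.

```r
w <- 1 / seq.int(5)  # weights for distances 0, 1, 2, 3, 4

# Hypothetical helper: average the weights of an n-gram's constituents,
# where `distances` are the constituents' distances from the context word.
ngram_weight <- function(w, distances) mean(w[distances + 1])

w_unigram    <- ngram_weight(w, 1)        # unigram at distance 1
w_bigram     <- ngram_weight(w, c(1, 2))  # bigram covering distances 1 and 2
w_bigram_ctx <- ngram_weight(w, c(0, 1))  # bigram containing the context word
```

The last case is the only one that uses the distance-0 weight, which is consistent with that element being ignored when ngram_max == 1.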


vspinu/mlvocab documentation built on June 11, 2021, 7:37 a.m.