term_matrices: Term-document and term-cooccurrence matrices
In mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines


These functions compute or update various term counts of a corpus, with a flexible output specification.

dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("row", "triplet",
  "column", "df"))

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("column", "triplet",
  "row", "df"))

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric",
  "right", "left"), ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column",
  "row", "df"))

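A minimal sketch of how these calls might be combined, following the signatures above. The toy corpus and its document names are made up for illustration, and the sketch assumes that a named list of character vectors is accepted as a pre-tokenized corpus, as described in the documentation of `vocab()`.

```r
# Sketch only: needs the mlvocab package and is skipped when it is not installed.
if (requireNamespace("mlvocab", quietly = TRUE)) {
  library(mlvocab)

  # Hypothetical toy corpus: a named list of pre-tokenized documents.
  corpus <- list(
    doc1 = c("the", "quick", "brown", "fox"),
    doc2 = c("the", "lazy", "dog", "sleeps")
  )

  # Build the vocabulary once and reuse it for every matrix.
  v <- vocab(corpus)

  d  <- dtm(corpus, v)                                  # documents in rows, default "row" output
  td <- tdm(corpus, v, output = "column")               # terms in rows, column-compressed
  tc <- tcm(corpus, v, window_size = 3, output = "df")  # co-occurrences as a triplet data.frame
}
```
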
`corpus`	text corpus; see `vocab()`.
`vocab`	a `data.frame` produced by an earlier call to `vocab()`. When `vocab` is `NULL` and `nbuckets` is `NULL` or `0`, the vocabulary is first computed from `corpus`. When `nbuckets` > `0` and `vocab` is `NULL`, the resulting matrix will consist of buckets only.
`ngram`	an integer vector of the form `[ngram_min, ngram_max]`. Defaults to the `ngram` settings used during the creation of `vocab`. Explicitly providing this parameter should rarely be needed.
`nbuckets`	number of buckets used for unknown terms, that is, terms not present in `vocab`.
`output`	one of "triplet", "column", "row", "df" or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrices from the Matrix package; "df" returns a triplet `data.frame`. The default output type corresponds to the most efficient computation in terms of CPU and memory usage ("row" for `dtm`, "column" for `tdm` and "triplet" for `tcm`), but the benefits are marginal unless your matrices are so big that they barely fit into memory. If you plan to perform further matrix algebra on these matrices, it is a good idea to choose the "column" type because of its much better support in the Matrix package.
`window_size`	sliding window size used for co-occurrence computation. In this implementation the window includes the context word; thus, `window_size == 1` results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of `ngram_min` and `ngram_max`.
`window_weights`	vector of weights superimposed on the sliding window. The first element is the weight for distance 0 (the context word itself), the second for distance 1, and so on. The first weight is ignored when `ngram_max` == 1; see the details below. `window_weights` is recycled to length `window_size` if needed. It can also be a function, or a string naming a function, which accepts one argument, `window_size`, and returns a `window_weights` vector. Defaults to `[1, 1/2, ..., 1/window_size]`.
`context`	when "symmetric", matrix entries `(i, j)` and `(j, i)` are the same and represent the co-occurrence of terms `i` and `j` within `window_size`. When "right", entry `(i, j)` represents the co-occurrence of term `j` on the right side of term `i`. When "left", entry `(i, j)` represents the co-occurrence of term `j` on the left side of term `i`.
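
The window conventions described for `window_size`, `window_weights` and `context` can be illustrated with a toy counter in base R. This is only a sketch for the unigram case, not the package's implementation: `toy_tcm` is a hypothetical helper, and treating "symmetric" as the sum of the "right" and "left" matrices is an assumption made for illustration.

```r
# Toy co-occurrence counter in base R (unigrams only); NOT mlvocab's implementation.
toy_tcm <- function(tokens, window_size = 5,
                    window_weights = 1 / seq_len(window_size),
                    context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  # Accept a function (or the name of one) that maps window_size to a weights vector.
  if (is.function(window_weights) || is.character(window_weights)) {
    window_weights <- match.fun(window_weights)(window_size)
  }
  # Recycle to one weight per distance 0 .. window_size - 1.
  window_weights <- rep_len(window_weights, window_size)

  terms <- unique(tokens)
  right <- matrix(0, length(terms), length(terms),
                  dimnames = list(terms, terms))
  for (pos in seq_along(tokens)) {
    # Distance 0 (the context word itself) is skipped, so window_size = 1
    # leaves the matrix at zero.
    for (d in seq_len(window_size - 1)) {
      if (pos + d > length(tokens)) break
      i <- tokens[pos]
      j <- tokens[pos + d]
      # Term j occurs d positions to the right of term i.
      right[i, j] <- right[i, j] + window_weights[d + 1]
    }
  }
  switch(context,
         right = right,
         left = t(right),
         # Assumption: symmetric counts are the sum of both directions.
         symmetric = right + t(right))
}

tokens <- c("a", "b", "c", "d", "e")
toy_tcm(tokens, window_size = 3, context = "right")
toy_tcm(tokens, window_size = 1)
```

In this sketch the "left" matrix is simply the transpose of the "right" one, which mirrors the description of entry `(i, j)` above.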

For `ngram_max` > 1 the weights vector is automatically extended to match the "imaginary" sliding window over the n-grams. The proximity weight attached to an n-gram is the average of the weights of its constituents in the original sequence. This scheme results in consistent weighting across different values of `ngram_min` and `ngram_max`, and it is the reason why the first element of `window_weights` is the proximity to the context word itself (i.e. distance 0). For example:

default weights for the context window ["a" "b" "c" "d" "e"]

   a    b    c    d    e
1.00 0.50 0.33 0.25 0.20

for ngram=c(1L, 3L)

   a  a_b a_b_c    b  b_c b_c_d    c  c_d c_d_e    d  d_e    e
1.00 0.75  0.61 0.50 0.42  0.36 0.33 0.29  0.26 0.25 0.22 0.20

for ngram=c(2L, 3L)

 a_b a_b_c  b_c b_c_d  c_d c_d_e  d_e
0.75  0.61 0.42  0.36 0.29  0.26 0.22
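
The averaging rule can be reproduced in a few lines of base R. The helper `ngram_weights` below is hypothetical and written only to illustrate the arithmetic; it is not part of the package and says nothing about how the package computes these values internally.

```r
# Default weights for a window of five tokens, named by token for readability.
window_weights <- 1 / seq_len(5)
names(window_weights) <- c("a", "b", "c", "d", "e")

# Hypothetical helper: the weight of each n-gram is the mean of the
# weights of the unigrams it is built from.
ngram_weights <- function(w, ngram_min, ngram_max) {
  out <- numeric(0)
  for (start in seq_along(w)) {
    for (n in ngram_min:ngram_max) {
      end <- start + n - 1
      if (end > length(w)) break          # n-gram would run past the window
      idx <- start:end
      out[paste(names(w)[idx], collapse = "_")] <- mean(w[idx])
    }
  }
  out
}

round(ngram_weights(window_weights, 1L, 3L), 2)
round(ngram_weights(window_weights, 2L, 3L), 2)
```

For instance, the weight of `a_b_c` is `(1 + 1/2 + 1/3) / 3`, which is about 0.61, matching the listing above.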