term_matrices: Term-document and term-cooccurrence matrices
In vspinu/mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines


Compute and update term counts of a corpus, returning the result in one of several output types.

dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("row", "triplet",
  "column", "df"), nthreads = mlvocab_nthreads())

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("column", "triplet",
  "row", "df"), nthreads = mlvocab_nthreads())

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric",
  "right", "left"), ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column",
  "row", "df"))

`corpus`	text corpus; see `vocab()`.
`vocab`	a `data.frame` produced by an earlier call to `vocab()`. When `vocab` is `NULL` and `nbuckets` is `NULL` or `0`, the vocabulary is first computed from `corpus`. When `nbuckets` > `0` and `vocab` is `NULL`, the resulting matrix consists of buckets only.
`ngram`	an integer vector of the form `[ngram_min, ngram_max]`. Defaults to the `ngram` settings used during the creation of `vocab`. Explicitly providing this parameter should rarely be needed.
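`nbuckets`	number of buckets for unknown terms, that is, terms not found in `vocab`. Defaults to the `nbuckets` setting stored in `vocab`.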
`output`	one of "triplet", "column", "row", "df", or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrix from the Matrix package; "df" returns a triplet `data.frame`. The default output is the most efficient option in terms of CPU and memory usage ("row" for `dtm`, "column" for `tdm` and "triplet" for `tcm`), but the benefits are marginal unless the matrices barely fit into memory. If you plan to perform further matrix algebra on these matrices, it is recommended to choose the "column" type because it is much better supported in the Matrix package.
`nthreads`	number of OMP threads to use for computation; 0 means the maximum number of threads available on the machine. The default is taken from `options("mlvocab.nthreads")` or, if that is unset, from the environment variable `MLVOCAB_NTHREADS`; it is 0 if neither is set.
`window_size`	sliding window size used for co-occurrence computation. In this implementation the window includes the context word; thus, `window_size == 1` results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of `ngram_min` and `ngram_max`.
`window_weights`	vector of weights superimposed on the sliding window. The first element is the weight for distance 0 (the context word itself), the second for distance 1, and so on. The first weight is ignored when `ngram_max == 1`; see Details. `window_weights` is recycled to length `window_size` if needed. It can also be a function, or a string naming a function, which accepts one argument, `window_size`, and returns a `window_weights` vector. Defaults to `[1, 1/2, ..., 1/window_size]`.
`context`	when "symmetric", matrix entries `(i, j)` and `(j, i)` are the same and represent the co-occurrence of terms `i` and `j` within `window_size`. When "right", entry `(i, j)` represents the co-occurrence of term `j` on the right side of term `i`. When "left", entry `(i, j)` represents the co-occurrence of term `j` on the left side of term `i`.
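
The window and context conventions described above can be illustrated with a small base-R sketch. It is not the package's implementation: `toy_tcm` is a hypothetical helper, it handles single tokens only (`ngram_max == 1`), it returns a dense matrix, and it assumes that "symmetric" is simply the sum of the "right" and "left" counts.

```r
# Toy co-occurrence counter in base R (hypothetical helper, NOT mlvocab code).
toy_tcm <- function(tokens, window_size = 5,
                    window_weights = 1 / seq.int(window_size),
                    context = c("symmetric", "right", "left")) {
  context <- match.arg(context)
  # A function (or a string naming one) is called with window_size.
  if (!is.numeric(window_weights))
    window_weights <- match.fun(window_weights)(window_size)
  # Recycle to window_size; element d + 1 is the weight for distance d.
  w <- rep_len(window_weights, window_size)
  terms <- sort(unique(tokens))
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n <- length(tokens)
  for (pos in seq_len(n)) {
    # Distance 0 is the context word itself, so pairs start at distance 1.
    for (d in seq_len(window_size - 1L)) {
      if (pos + d > n) break
      i <- tokens[pos]
      j <- tokens[pos + d]  # j occurs d positions to the right of i
      if (context != "left")  m[i, j] <- m[i, j] + w[d + 1L]  # j right of i
      if (context != "right") m[j, i] <- m[j, i] + w[d + 1L]  # i left of j
    }
  }
  m
}

toy_tcm(c("a", "b", "c", "a"), window_size = 3, context = "right")
```

With `window_size = 1` the inner loop never runs, so the result is all zeros, which matches the convention stated for `window_size`.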

For `ngram_max > 1` the weights vector is automatically extended to match the "imaginary" sliding window over the n-grams. The proximity weight attached to an n-gram is the average of the weights of its constituents in the original sequence. Such a scheme results in consistent weighting across different values of `ngram_min` and `ngram_max`, and it is the reason why the first element of `window_weights` is the proximity to the context word itself (i.e. distance 0). For example:

Default weights for the context window ["a" "b" "c" "d" "e"]:

   a    b    c    d    e
1.00 0.50 0.33 0.25 0.20

For ngram = c(1L, 3L):

   a  a_b a_b_c    b  b_c b_c_d    c  c_d c_d_e    d  d_e    e
1.00 0.75  0.61 0.50 0.42  0.36 0.33 0.29  0.26 0.25 0.22 0.20

For ngram = c(2L, 3L):

 a_b a_b_c  b_c b_c_d  c_d c_d_e  d_e
0.75  0.61 0.42  0.36 0.29  0.26 0.22
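
The averaging rule can be checked with a few lines of base R. This is only a sketch of the rule as described above, not the package's code; `ngram_weights` is a hypothetical helper, and the context word is assumed to be the first token of the window.

```r
# Average the per-token weights over each n-gram (hypothetical helper).
ngram_weights <- function(tokens, weights, ngram = c(1L, 1L)) {
  out <- numeric(0)
  for (start in seq_along(tokens)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(tokens)) break  # n-gram would run past the window
      name <- paste(tokens[start:end], collapse = "_")
      # Weight of the n-gram is the mean weight of its constituent tokens.
      out[name] <- mean(weights[start:end])
    }
  }
  out
}

tokens  <- c("a", "b", "c", "d", "e")
weights <- 1 / seq.int(5)  # default weights for window_size = 5

round(ngram_weights(tokens, weights, c(1L, 3L)), 2)
round(ngram_weights(tokens, weights, c(2L, 3L)), 2)
```

For instance, `a_b` gets (1 + 1/2) / 2 = 0.75 and `a_b_c` gets (1 + 1/2 + 1/3) / 3, roughly 0.61. The printed values should agree with the tables above, except possibly in the last digit for boundary cases such as `d_e` (exactly 0.225 before rounding).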

vspinu/mlvocab documentation built on June 11, 2021, 7:37 a.m.
