
These functions compute or update various term-counts of a corpus with flexible output specification.

```
dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("row", "triplet",
  "column", "df"))

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("column", "triplet",
  "row", "df"))

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric",
  "right", "left"), ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column",
  "row", "df"))
```

`corpus` |
text corpus; see |

`vocab` |
a |

`ngram` |
an integer vector of the form |

`nbuckets` |
number of unknown buckets |

`output` |
one of "triplet", "column", "row", "df" or an unambiguous
abbreviation thereof. The first three options return the corresponding sparse
matrices from the Matrix package, "df" results in a triplet
The default output type corresponds to the most efficient computation in
terms of CPU and memory usage ("row" for |

`window_size` |
sliding window size used for co-occurrence
computation. In this implementation the window includes the context word;
thus, `window_size == 1` will result in an all-zero co-occurrence matrix. This
convention allows for consistent weighting schemes across different values
of |

`window_weights` |
vector of weights which are superimposed on the
sliding |

`context` |
when "symmetric", matrix entries |
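
The `window_size` and `window_weights` conventions described above can be illustrated with a small base R sketch. This is a toy re-implementation, not mlvocab's actual code: the function name `toy_tcm` is hypothetical, and it assumes that a neighbour at distance `d` receives `window_weights[d + 1]` and that the symmetric context adds each pair in both directions.

```r
# Toy co-occurrence counter in base R (hypothetical helper, not part of mlvocab).
# Assumption: window_weights[1] belongs to distance 0 (the context word itself),
# so a neighbour at distance d is weighted by window_weights[d + 1].
toy_tcm <- function(tokens, window_size = 5,
                    window_weights = 1 / seq.int(window_size)) {
  terms <- unique(tokens)
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n <- length(tokens)
  for (i in seq_len(n)) {
    # Distances 1 .. window_size - 1; empty when window_size == 1.
    for (d in seq_len(window_size - 1L)) {
      j <- i + d
      if (j > n) break
      w <- window_weights[d + 1L]
      # Assumed symmetric context: record the pair in both directions.
      m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + w
      m[tokens[j], tokens[i]] <- m[tokens[j], tokens[i]] + w
    }
  }
  m
}

tokens <- c("a", "b", "c")
toy_tcm(tokens, window_size = 1)  # window holds only the context word
toy_tcm(tokens, window_size = 2)  # only adjacent words co-occur
```

With `window_size = 1` the inner loop never runs, which is why the resulting matrix contains only zeros.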

For `ngram_max > 1` the weights vector is automatically extended to match
the "imaginary" sliding window over the ngrams. The proximity weight attached
to an n-gram is the average of the weights of its constituents in
the original sequence. Such a scheme results in consistent weighting across
different values of `ngram_min` and `ngram_max`, and it is the reason why
the first element of `window_weights` is the proximity to the context word
itself (i.e. distance `0`). For example:

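Default weights for the context window `["a" "b" "c" "d" "e"]`:

| a | b | c | d | e |
|---|---|---|---|---|
| 1.00 | 0.50 | 0.33 | 0.25 | 0.20 |

For `ngram=c(1L, 3L)`:

| a | a_b | a_b_c | b | b_c | b_c_d | c | c_d | c_d_e | d | d_e | e |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.00 | 0.75 | 0.61 | 0.50 | 0.42 | 0.36 | 0.33 | 0.29 | 0.26 | 0.25 | 0.22 | 0.20 |

For `ngram=c(2L, 3L)`:

| a_b | a_b_c | b_c | b_c_d | c_d | c_d_e | d_e |
|---|---|---|---|---|---|---|
| 0.75 | 0.61 | 0.42 | 0.36 | 0.29 | 0.26 | 0.22 |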
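
The averaging rule behind these numbers can be checked with a few lines of base R. This is only a sketch of the rule as stated in the text, not the package's implementation; `ngram_weights` is a hypothetical helper name, and the underscore separator is taken from the n-gram labels shown above.

```r
# Weight of an n-gram = mean of the weights of its constituent words
# (hypothetical helper illustrating the rule; not part of mlvocab).
ngram_weights <- function(words, weights, ngram = c(1L, 1L)) {
  out <- numeric(0)
  for (start in seq_along(words)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(words)) break  # n-gram would run past the window
      idx <- start:end
      out[paste(words[idx], collapse = "_")] <- mean(weights[idx])
    }
  }
  out
}

words <- c("a", "b", "c", "d", "e")
w <- 1 / seq_along(words)  # default weights: 1, 1/2, 1/3, 1/4, 1/5

round(ngram_weights(words, w, c(1L, 3L)), 2)
round(ngram_weights(words, w, c(2L, 3L)), 2)
```

For instance, `a_b` averages 1 and 1/2 to 0.75, and `a_b_c` averages 1, 1/2 and 1/3 to about 0.61. The printed values should agree with the figures above, possibly differing in the last rounded digit.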

mlvocab documentation built on Sept. 21, 2018, 6:35 p.m.
