Compute and update term-count matrices of a corpus (document-term, term-document and term co-occurrence), with a choice of output types.
dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("row", "triplet",
  "column", "df"), nthreads = mlvocab_nthreads())

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("column", "triplet",
  "row", "df"), nthreads = mlvocab_nthreads())

tcm(corpus, vocab = NULL, window_size = 5,
  window_weights = 1/seq.int(window_size), context = c("symmetric",
  "right", "left"), ngram = attr(vocab, "ngram"),
  nbuckets = attr(vocab, "nbuckets"), output = c("triplet", "column",
  "row", "df"))
corpus |
text corpus; see |
vocab |
a |
ngram |
an integer vector of the form |
nbuckets |
number of unknown buckets |
output |
one of "triplet", "column", "row", "df" or an unambiguous
abbreviation thereof. First three options return the corresponding sparse
matrices from Matrix package, "df" results in a triplet |
nthreads |
Number of OMP threads to use for computation, 0 for maximum
number of threads available on the machine. The value is picked from
The default output corresponds to the most efficient option in terms of
CPU and memory usage ("row" for |
window_size |
sliding window size used for co-occurrence
computation. In this implementation the window includes the context word;
thus, window_size == 1 results in an all-zero co-occurrence matrix. This
convention allows for consistent weighting schemes across different values
of |
window_weights |
vector of weights which are superimposed on the
sliding |
context |
when "symmetric", matrix entries |
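The window convention described for window_size and window_weights can be illustrated with a small base-R sketch. This is not mlvocab's implementation: cooccur_right is a hypothetical helper written for this illustration, it handles a single tokenized document, it only looks at words to the right, and it places the focus word in the row, which is an assumption rather than something stated here.

```r
# Hypothetical helper (not part of mlvocab): weighted co-occurrence counts
# for one tokenized document, looking only at words to the right.
cooccur_right <- function(tokens, window_size = 5,
                          window_weights = 1 / seq.int(window_size)) {
  terms <- unique(tokens)
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  for (i in seq_along(tokens)) {
    # Position 1 of the window is the word itself (distance 0), so only
    # positions 2..window_size can contribute co-occurrences.
    for (pos in seq_len(window_size)[-1]) {
      j <- i + pos - 1L
      if (j > length(tokens)) break
      m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + window_weights[pos]
    }
  }
  m
}

tokens <- c("a", "b", "c", "a")
m1 <- cooccur_right(tokens, window_size = 1)  # window holds only the word itself
m3 <- cooccur_right(tokens, window_size = 3)  # neighbours at distance 1 and 2
```

Under these assumptions m1 contains only zeros, and in m3 a neighbour at distance 1 receives the second element of window_weights, because the first element is reserved for distance 0.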
For ngram_max > 1 the weights vector is automatically extended to match
the "imaginary" sliding window over the ngrams. The proximity weight attached
to an n-gram is the average of the weights of its constituents in the
original sequence. This scheme results in consistent weighting across
different values of ngram_min and ngram_max, and it is the reason why the
first element of window_weights is the proximity to the context word
itself (i.e. distance 0). For example:
default weights for the context window ["a" "b" "c" "d" "e"]
a | b | c | d | e |
1.00 | 0.50 | 0.33 | 0.25 | 0.20 |
for ngram=c(1L, 3L)
a | a_b | a_b_c | b | b_c | b_c_d | c | c_d | c_d_e | d | d_e | e |
1.00 | 0.75 | 0.61 | 0.50 | 0.42 | 0.36 | 0.33 | 0.29 | 0.26 | 0.25 | 0.22 | 0.20 |
for ngram=c(2L, 3L)
a_b | a_b_c | b_c | b_c_d | c_d | c_d_e | d_e |
0.75 | 0.61 | 0.42 | 0.36 | 0.29 | 0.26 | 0.22 |
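The averaging rule behind these tables can be reproduced with a few lines of base R. The sketch below is an illustration only: ngram_weights is a hypothetical helper, not an mlvocab function, and it assumes that n-grams are enumerated by start position and then by length, which matches the column order shown above.

```r
# Default proximity weights for a window of five tokens.
tokens <- c("a", "b", "c", "d", "e")
window_weights <- 1 / seq.int(length(tokens))

# Hypothetical helper (not part of mlvocab): average the unigram weights
# over every run of n consecutive tokens, for n in ngram[1]..ngram[2].
ngram_weights <- function(tokens, weights, ngram = c(1L, 1L)) {
  out <- numeric(0)
  for (start in seq_along(tokens)) {
    for (n in seq.int(ngram[1], ngram[2])) {
      end <- start + n - 1L
      if (end > length(tokens)) break  # n-gram would run past the window
      w <- mean(weights[start:end])
      names(w) <- paste(tokens[start:end], collapse = "_")
      out <- c(out, w)
    }
  }
  out
}

w13 <- ngram_weights(tokens, window_weights, ngram = c(1L, 3L))
w23 <- ngram_weights(tokens, window_weights, ngram = c(2L, 3L))
print(signif(w13, 3))
print(signif(w23, 3))
```

Because each n-gram weight depends only on its constituents, the values for ngram=c(2L, 3L) are the same as the corresponding entries for ngram=c(1L, 3L) with the unigrams left out.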