These functions compute or update various term-counts of a corpus with flexible output specification.
dtm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
    nbuckets = attr(vocab, "nbuckets"),
    output = c("row", "triplet", "column", "df"))

tdm(corpus, vocab = NULL, ngram = attr(vocab, "ngram"),
    nbuckets = attr(vocab, "nbuckets"),
    output = c("column", "triplet", "row", "df"))

tcm(corpus, vocab = NULL, window_size = 5,
    window_weights = 1/seq.int(window_size),
    context = c("symmetric", "right", "left"),
    ngram = attr(vocab, "ngram"), nbuckets = attr(vocab, "nbuckets"),
    output = c("triplet", "column", "row", "df"))
corpus | text corpus; see
vocab | a
ngram | an integer vector of the form
nbuckets | number of unknown buckets
output | one of "triplet", "column", "row", "df" or an unambiguous abbreviation thereof. The first three options return the corresponding sparse matrices from the Matrix package; "df" results in a triplet The default output type corresponds to the most efficient computation in terms of CPU and memory usage ("row" for
window_size | sliding window size used for co-occurrence computation. In this implementation the window includes the context word; thus, window_size == 1 results in an all-zero co-occurrence matrix. This convention allows for consistent weighting schemes across different values of
window_weights | vector of weights which are superimposed on the sliding
context | when "symmetric", matrix entries
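The window convention described for window_size can be illustrated with a minimal base-R sketch. It does not call the package: right_cooc is a hypothetical helper, it handles only a "right" context over a single tokenized document, and the rule that a word at distance d from the context word receives weight window_weights[d + 1] is an assumption inferred from the descriptions above, not taken from the implementation.

```r
# Hypothetical helper, not part of the package: weighted co-occurrence
# counts for a "right" context over one tokenized document.
right_cooc <- function(tokens, window_size = 5,
                       window_weights = 1 / seq.int(window_size)) {
  terms <- unique(tokens)
  m <- matrix(0, length(terms), length(terms),
              dimnames = list(terms, terms))
  n <- length(tokens)
  for (i in seq_len(n)) {
    # Distance 0 is the context word itself, so neighbours start at distance 1
    # and the window reaches at most window_size - 1 positions to the right.
    for (d in seq_len(window_size - 1L)) {
      j <- i + d
      if (j > n) break
      # Assumption: weight for distance d is window_weights[d + 1].
      m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + window_weights[d + 1L]
    }
  }
  m
}

tokens <- c("a", "b", "c", "d", "e")
m5 <- right_cooc(tokens, window_size = 5)  # neighbours at distances 1..4
m1 <- right_cooc(tokens, window_size = 1)  # window holds only the context word
```

Under these assumptions m1 contains only zeros, which matches the statement that window_size == 1 yields an empty co-occurrence matrix.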
For ngram_max > 1 the weights vector is automatically extended to match the "imaginary" sliding window over the n-grams. The proximity weight attached to an n-gram is the average of the weights of its constituents in the original sequence. Such a scheme results in consistent weighting across different values of ngram_min and ngram_max, and it is the reason why the first element of window_weights is the proximity to the context word itself (i.e. distance 0). For example:
default weights for the context window ["a" "b" "c" "d" "e"]
a | b | c | d | e |
1.00 | 0.50 | 0.33 | 0.25 | 0.20 |
for ngram=c(1L, 3L)
a | a_b | a_b_c | b | b_c | b_c_d | c | c_d | c_d_e | d | d_e | e |
1.00 | 0.75 | 0.61 | 0.50 | 0.42 | 0.36 | 0.33 | 0.29 | 0.26 | 0.25 | 0.22 | 0.20 |
for ngram=c(2L, 3L)
a_b | a_b_c | b_c | b_c_d | c_d | c_d_e | d_e |
0.75 | 0.61 | 0.42 | 0.36 | 0.29 | 0.26 | 0.22 |
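The averaging rule behind these tables can be reproduced with a short base-R sketch. It is illustrative only and does not use the package: ngram_weights is a hypothetical helper that enumerates the n-grams of the window in the same order as the tables and averages the weights of their constituents.

```r
# Hypothetical helper, not part of the package: average the per-token
# weights over every n-gram of length ngram[1]..ngram[2].
ngram_weights <- function(tokens, weights, ngram = c(1L, 3L)) {
  out <- numeric(0)
  n <- length(tokens)
  for (start in seq_len(n)) {
    for (len in seq.int(ngram[1], ngram[2])) {
      end <- start + len - 1L
      if (end > n) break
      idx <- start:end
      # The n-gram's weight is the mean weight of its constituent tokens.
      out[paste(tokens[idx], collapse = "_")] <- mean(weights[idx])
    }
  }
  out
}

tokens <- c("a", "b", "c", "d", "e")
w <- 1 / seq.int(length(tokens))   # default weights: 1, 1/2, 1/3, 1/4, 1/5

w13 <- ngram_weights(tokens, w, ngram = c(1L, 3L))
w23 <- ngram_weights(tokens, w, ngram = c(2L, 3L))
print(round(w13, 2))
print(round(w23, 2))
```

For instance, the weight of a_b_c is the mean of 1, 1/2 and 1/3, which is about 0.61, as in the second table.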