Description

The calculations are done with the word2vec package.
Usage

word2vec(
text,
tokenizer = text2vec::space_tokenizer,
dim = 50,
type = c("cbow", "skip-gram"),
window = 5L,
min_count = 5L,
loss = c("ns", "hs"),
negative = 5L,
n_iter = 5L,
lr = 0.05,
sample = 0.001,
stopwords = character(),
threads = 1L,
collapse_character = "\t",
composition = c("tibble", "data.frame", "matrix")
)
Arguments

text: Character string.

tokenizer: Function used to perform tokenization. Defaults to text2vec::space_tokenizer.

dim: Dimension of the word vectors. Defaults to 50.

type: The type of algorithm to use, either "cbow" or "skip-gram". Defaults to "cbow".

window: Skip length between words. Defaults to 5.

min_count: Integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

loss: Character, choice of loss function; must be one of "ns" or "hs". See Details for more information. Defaults to "ns".

negative: Integer, the number of negative samples. Only used when loss is set to "ns".

n_iter: Integer, number of training iterations. Defaults to 5.

lr: Initial learning rate, also known as alpha. Defaults to 0.05.

sample: Threshold for occurrence of words. Defaults to 0.001.

stopwords: A character vector of stopwords to exclude from training.

threads: Number of CPU threads to use. Defaults to 1.

collapse_character: Character vector with length 1. Character used to glue together tokens after tokenizing. See Details for more information. Defaults to "\t".

composition: Character, either "tibble", "matrix", or "data.frame" for the format of the resulting word vectors.
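The choice-vector defaults for type, loss, and composition follow the common R idiom where the first element of the vector is the default, presumably resolved with match.arg() (an assumption; this page does not show the implementation). A minimal sketch:

```r
# Assumption: choice arguments are resolved via match.arg(), so the first
# element of the vector shown in Usage ("cbow", "ns", "tibble") is the default.
pick_type <- function(type = c("cbow", "skip-gram")) {
  match.arg(type)
}

pick_type()             # no argument: falls back to the first choice, "cbow"
pick_type("skip-gram")  # explicit choice is validated against the vector
```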
Details

A trade-off has been made to allow for an arbitrary tokenizing function. The text is first passed through the tokenizer. It is then collapsed back together into strings using collapse_character as the separator. You need to pick collapse_character to be a character that will not appear in any of the tokens after tokenizing is done. The default value is a "tab" character. If you pick a character that is present in the tokens, those words will be split.
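The collapse-then-resplit round trip described above can be sketched in plain R (an illustration of the idea, not the package's internal code):

```r
# Tokenize, then glue tokens back together with collapse_character ("\t" here).
tokens <- list(c("once", "upon", "a", "time"))
collapsed <- vapply(tokens, paste, character(1), collapse = "\t")

# Splitting on the same character recovers the original tokens, which is why
# collapse_character must never occur inside a token.
strsplit(collapsed, "\t")[[1]]
```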
The choice of loss function is one of:

"ns": negative sampling
"hs": hierarchical softmax
Value

A tibble, data.frame or matrix containing the tokens in the first column and the word vectors in the remaining columns.
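One common follow-up on the returned object is a similarity query between two word vectors. A hedged sketch, using a tiny made-up embedding in place of a real word2vec() result (emb only mirrors the returned layout: tokens first, vectors in the remaining columns):

```r
# Made-up two-row embedding standing in for the data.frame returned by
# word2vec(); only the column layout matches the real output.
emb <- data.frame(
  tokens = c("king", "queen"),
  V1 = c(0.5, 0.4),
  V2 = c(0.1, 0.2),
  V3 = c(-0.3, -0.2)
)

# Drop the token column and index rows by token.
vecs <- as.matrix(emb[, -1])
rownames(vecs) <- emb$tokens

# Cosine similarity between two word vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(vecs["king", ], vecs["queen", ])  # ~ 0.966 for this toy data
```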
References

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. "Distributed Representations of Words and Phrases and their Compositionality."
Examples

word2vec(fairy_tales)
# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
# A tibble: 452 x 51
tokens V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 closed -0.837 -0.336 -0.731 -1.15 0.741 0.689 0.963 0.390 1.02 0.747 0.161
2 same -0.880 -0.389 -0.737 -1.16 0.762 0.679 0.947 0.357 1.04 0.726 0.216
3 filled -0.896 -0.377 -0.719 -1.11 0.761 0.649 0.927 0.354 0.972 0.748 0.143
4 garden -0.881 -0.392 -0.747 -1.17 0.767 0.679 1.01 0.368 0.996 0.710 0.198
5 own -0.853 -0.378 -0.750 -1.14 0.787 0.677 1.00 0.396 1.01 0.717 0.162
6 money, -0.857 -0.379 -0.686 -1.17 0.808 0.718 0.992 0.351 1.07 0.722 0.219
7 recei… -0.858 -0.374 -0.738 -1.15 0.764 0.687 0.982 0.371 1.06 0.704 0.222
8 bring -0.855 -0.328 -0.727 -1.23 0.765 0.724 1.03 0.429 1.07 0.721 0.223
9 - -0.876 -0.375 -0.688 -1.21 0.747 0.741 0.985 0.384 1.01 0.753 0.139
10 then, -0.892 -0.372 -0.692 -1.14 0.764 0.714 0.996 0.398 1.04 0.723 0.199
# … with 442 more rows, and 39 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>,
# V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>,
# V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>,
# V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>,
# V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>,
# V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>,
# V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>
# A tibble: 489 x 51
tokens V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 soldi… 1.49 0.878 0.662 0.122 1.39 1.37 0.0630 -1.67 0.298 -0.289 1.59
2 buy 1.49 0.869 0.678 0.140 1.36 1.43 0.112 -1.66 0.274 -0.263 1.54
3 heads 1.52 0.851 0.697 0.0780 1.38 1.39 0.0662 -1.73 0.280 -0.298 1.54
4 roof 1.45 0.831 0.717 0.115 1.35 1.43 0.0366 -1.68 0.219 -0.319 1.62
5 life 1.47 0.893 0.674 0.153 1.33 1.38 0.0601 -1.63 0.233 -0.284 1.62
6 says 1.52 0.857 0.624 0.126 1.37 1.41 0.114 -1.67 0.302 -0.266 1.59
7 excla… 1.50 0.857 0.657 0.155 1.37 1.37 0.0537 -1.64 0.281 -0.337 1.59
8 hurt 1.49 0.827 0.637 0.124 1.36 1.40 0.0895 -1.69 0.292 -0.317 1.58
9 nice 1.50 0.845 0.680 0.0776 1.33 1.37 0.0925 -1.72 0.276 -0.253 1.61
10 bottom 1.48 0.876 0.667 0.124 1.34 1.37 0.0481 -1.69 0.290 -0.276 1.55
# … with 479 more rows, and 39 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>,
# V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>,
# V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>,
# V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>,
# V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>,
# V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>,
# V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>