word2vec: Extract word vectors from word2vec word embedding


View source: R/word2vec.R

Description

Trains a word2vec word embedding on the input text and returns the fitted word vectors. The calculations are done with the word2vec package.

Usage

word2vec(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 50,
  type = c("cbow", "skip-gram"),
  window = 5L,
  min_count = 5L,
  loss = c("ns", "hs"),
  negative = 5L,
  n_iter = 5L,
  lr = 0.05,
  sample = 0.001,
  stopwords = character(),
  threads = 1L,
  collapse_character = "\t",
  composition = c("tibble", "data.frame", "matrix")
)

Arguments

text

Character string.

tokenizer

Function used to perform tokenization. Defaults to text2vec::space_tokenizer.

dim

Dimension of the word vectors. Defaults to 50.

type

The type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'.

window

Skip length between words. Defaults to 5.

min_count

Integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

loss

Character, choice of loss function; must be one of "ns" or "hs". See details for more information. Defaults to "ns".

negative

Integer with the number of negative samples. Only used when loss is "ns". Defaults to 5.

n_iter

Integer, number of training iterations. Defaults to 5.

lr

Initial learning rate, also known as alpha. Defaults to 0.05.

sample

Threshold for the occurrence of words; words that appear with higher frequency in the training data are randomly down-sampled. Defaults to 0.001.

stopwords

A character vector of stopwords to exclude from training.

threads

Number of CPU threads to use. Defaults to 1.

collapse_character

Character vector with length 1. Character used to glue together tokens after tokenizing. See details for more information. Defaults to "\t".

composition

Character, either "tibble", "data.frame", or "matrix", giving the format of the resulting word vectors. Defaults to "tibble".

Details

A trade-off has been made to allow for an arbitrary tokenizing function. The text is first passed through the tokenizer and then collapsed back together into strings, using collapse_character as the separator. You need to pick a collapse_character that will not appear in any of the tokens after tokenization is done. The default value is a "tab" character. If you pick a character that is present in the tokens, then those words will be split.
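
As a minimal sketch of this round trip (using only text2vec and a toy sentence, not wordsalad itself), the default tab separator survives tokenization because none of the tokens contain a tab:

tokens <- text2vec::space_tokenizer("she sells sea shells")
# Glue the tokens back together with the default separator:
collapsed <- vapply(tokens, paste, character(1), collapse = "\t")
# Splitting on the same separator recovers the tokens unchanged:
strsplit(collapsed, "\t", fixed = TRUE)

If collapse_character were instead a character that occurs inside the tokens produced by your tokenizer, the second split would cut those tokens apart.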

The choice of loss function is one of:

"ns": negative sampling, which approximates the softmax by contrasting each word with a small number of sampled noise words (see the negative argument).

"hs": hierarchical softmax, which replaces the softmax with a binary tree over the vocabulary.
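
A sketch of both settings, assuming the fairy_tales data used in the examples below:

# Negative sampling (the default); `negative` controls the number of
# noise words drawn per update:
word2vec(fairy_tales, loss = "ns", negative = 5L)

# Hierarchical softmax; the `negative` argument is not used here:
word2vec(fairy_tales, loss = "hs")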

Value

A tibble, data.frame, or matrix containing the tokens in the first column and the word vectors in the remaining columns.
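
The token column makes it straightforward to index the vectors by word. A sketch, assuming the fairy_tales example data and base R only (tokens and values will vary from run to run):

emb <- word2vec(fairy_tales)
vecs <- as.matrix(emb[, -1])    # drop the token column
rownames(vecs) <- emb$tokens    # index rows by token
# Cosine similarity between two tokens in the vocabulary:
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cosine(vecs["garden", ], vecs["own", ])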

Source

https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

References

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems 26.

Examples

word2vec(fairy_tales)

# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
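
# A further sketch: skip-gram with a couple of non-default settings,
# returned as a matrix instead of a tibble
word2vec(
  fairy_tales,
  type = "skip-gram",
  dim = 100,
  window = 10L,
  composition = "matrix"
)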

Example output

# A tibble: 452 x 51
   tokens     V1     V2     V3    V4    V5    V6    V7    V8    V9   V10   V11
   <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 closed -0.837 -0.336 -0.731 -1.15 0.741 0.689 0.963 0.390 1.02  0.747 0.161
 2 same   -0.880 -0.389 -0.737 -1.16 0.762 0.679 0.947 0.357 1.04  0.726 0.216
 3 filled -0.896 -0.377 -0.719 -1.11 0.761 0.649 0.927 0.354 0.972 0.748 0.143
 4 garden -0.881 -0.392 -0.747 -1.17 0.767 0.679 1.01  0.368 0.996 0.710 0.198
 5 own    -0.853 -0.378 -0.750 -1.14 0.787 0.677 1.00  0.396 1.01  0.717 0.162
 6 money, -0.857 -0.379 -0.686 -1.17 0.808 0.718 0.992 0.351 1.07  0.722 0.219
 7 recei… -0.858 -0.374 -0.738 -1.15 0.764 0.687 0.982 0.371 1.06  0.704 0.222
 8 bring  -0.855 -0.328 -0.727 -1.23 0.765 0.724 1.03  0.429 1.07  0.721 0.223
 9 -      -0.876 -0.375 -0.688 -1.21 0.747 0.741 0.985 0.384 1.01  0.753 0.139
10 then,  -0.892 -0.372 -0.692 -1.14 0.764 0.714 0.996 0.398 1.04  0.723 0.199
# … with 442 more rows, and 39 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>,
#   V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>,
#   V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>,
#   V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>,
#   V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>,
#   V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>,
#   V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>
# A tibble: 489 x 51
   tokens    V1    V2    V3     V4    V5    V6     V7    V8    V9    V10   V11
   <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>
 1 soldi…  1.49 0.878 0.662 0.122   1.39  1.37 0.0630 -1.67 0.298 -0.289  1.59
 2 buy     1.49 0.869 0.678 0.140   1.36  1.43 0.112  -1.66 0.274 -0.263  1.54
 3 heads   1.52 0.851 0.697 0.0780  1.38  1.39 0.0662 -1.73 0.280 -0.298  1.54
 4 roof    1.45 0.831 0.717 0.115   1.35  1.43 0.0366 -1.68 0.219 -0.319  1.62
 5 life    1.47 0.893 0.674 0.153   1.33  1.38 0.0601 -1.63 0.233 -0.284  1.62
 6 says    1.52 0.857 0.624 0.126   1.37  1.41 0.114  -1.67 0.302 -0.266  1.59
 7 excla…  1.50 0.857 0.657 0.155   1.37  1.37 0.0537 -1.64 0.281 -0.337  1.59
 8 hurt    1.49 0.827 0.637 0.124   1.36  1.40 0.0895 -1.69 0.292 -0.317  1.58
 9 nice    1.50 0.845 0.680 0.0776  1.33  1.37 0.0925 -1.72 0.276 -0.253  1.61
10 bottom  1.48 0.876 0.667 0.124   1.34  1.37 0.0481 -1.69 0.290 -0.276  1.55
# … with 479 more rows, and 39 more variables: V12 <dbl>, V13 <dbl>, V14 <dbl>,
#   V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>, V20 <dbl>,
#   V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>, V25 <dbl>, V26 <dbl>,
#   V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>, V31 <dbl>, V32 <dbl>,
#   V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>, V37 <dbl>, V38 <dbl>,
#   V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>, V43 <dbl>, V44 <dbl>,
#   V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>, V49 <dbl>, V50 <dbl>
