fasttext: Extract word vectors from fasttext word embedding

Description

The calculations are done with the fastTextR package.

Usage

fasttext(
  text,
  tokenizer = text2vec::space_tokenizer,
  dim = 10L,
  type = c("skip-gram", "cbow"),
  window = 5L,
  loss = "hs",
  negative = 5L,
  n_iter = 5L,
  min_count = 5L,
  threads = 1L,
  composition = c("tibble", "data.frame", "matrix"),
  verbose = FALSE
)

Arguments

text

Character string.

tokenizer

Function used to perform tokenization. Defaults to text2vec::space_tokenizer.

dim

Integer, number of dimensions of the resulting word vectors. Defaults to 10.

type

Character, the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'skip-gram'.

window

Integer, skip length between words. Defaults to 5.

loss

Character, choice of loss function. Must be one of "ns", "hs", or "softmax". See Details for more. Defaults to "hs".

negative

Integer, number of negative samples. Only used when loss = "ns". Defaults to 5.

n_iter

Integer, number of training iterations. Defaults to 5.

min_count

Integer, number of times a token should appear to be considered in the model. Defaults to 5.

threads

Integer, number of CPU threads to use. Defaults to 1.

composition

Character, either "tibble", "matrix", or "data.frame", giving the format of the resulting word vectors. Defaults to "tibble".

verbose

Logical, controls whether progress is reported as operations are executed.

Details

The loss function is one of "ns" (negative sampling), "hs" (hierarchical softmax), or "softmax" (full softmax).

Value

A tibble, data.frame or matrix containing the token in the first column and word vectors in the remaining columns.
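
As a rough illustration of working with output in this shape, the sketch below compares word vectors by cosine similarity. It does not call fasttext(): the data.frame is a hand-made stand-in, and its column names (tokens, V1, V2, V3) and values are assumptions chosen for illustration, not output of the package. The helper cosine_similarity is likewise hypothetical.

```r
# Toy stand-in for a returned data.frame: token in the first column,
# vector components in the remaining columns. Names and values are made up.
vectors <- data.frame(
  tokens = c("king", "queen", "apple"),
  V1 = c(0.9, 0.8, -0.1),
  V2 = c(0.1, 0.2, 0.9),
  V3 = c(0.3, 0.4, -0.5),
  stringsAsFactors = FALSE
)

# Move the tokens into row names so vectors can be looked up by word
embedding <- as.matrix(vectors[, -1])
rownames(embedding) <- vectors[[1]]

# Cosine similarity between two numeric vectors (hypothetical helper)
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_king_queen <- cosine_similarity(embedding["king", ], embedding["queen", ])
sim_king_apple <- cosine_similarity(embedding["king", ], embedding["apple", ])
```

With real output, the same steps should apply as long as the first column holds the tokens and the remaining columns are numeric.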

Source

https://fasttext.cc/

References

Enriching Word Vectors with Subword Information, 2016, P. Bojanowski, E. Grave, A. Joulin, T. Mikolov.

Examples

fasttext(fairy_tales, n_iter = 2)

# Custom tokenizer that splits on non-alphanumeric characters
fasttext(fairy_tales,
         n_iter = 2,
         tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
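
To see what a custom tokenizer like the one above hands to the model, it can help to run it on a couple of strings first. The sketch below uses only base R; the sample sentences and the name tokenize_alnum are made up for illustration, and the plain split on spaces is a simplified stand-in for a whitespace tokenizer, not the text2vec implementation.

```r
# Same splitting rule as the custom tokenizer in the example above
tokenize_alnum <- function(x) strsplit(x, "[^[:alnum:]]+")

sample_text <- c("Once upon a time, there was a king.", "The end!")

# One character vector of tokens per input string; punctuation is dropped
tokens <- tokenize_alnum(sample_text)

# Simplified whitespace split for comparison: punctuation stays attached
space_tokens <- strsplit(sample_text, " ", fixed = TRUE)

print(tokens[[1]])
print(space_tokens[[1]])
```

The comparison shows why the choice matters with min_count: "time," and "time" count as different tokens under a plain whitespace split.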

wordsalad documentation built on Oct. 23, 2020, 7:56 p.m.