starspace: Interface to Starspace for training a Starspace model

View source: R/embed-all-the-things.R

starspace: R Documentation

Interface to Starspace for training a Starspace model

Description

Interface to Starspace for training a Starspace model, providing raw access to the C++ functionality.

Usage

starspace(
  model = "textspace.bin",
  file,
  trainMode = 0,
  fileFormat = c("fastText", "labelDoc"),
  label = "__label__",
  dim = 100,
  epoch = 5,
  lr = 0.01,
  loss = c("hinge", "softmax"),
  margin = 0.05,
  similarity = c("cosine", "dot"),
  negSearchLimit = 50,
  adagrad = TRUE,
  ws = 5,
  minCount = 1,
  minCountLabel = 1,
  ngrams = 1,
  thread = 1,
  ...
)

Arguments

model

the full path to where the model file will be saved. Defaults to 'textspace.bin'.

file

the full path to the file on disk which will be used for training.

trainMode

integer with the training mode. Possible values are 0, 1, 2, 3, 4 or 5. Defaults to 0. The use cases are

  • 0: tagspace (classification tasks) and search tasks

  • 1: pagespace & docspace (interest-based or content-based recommendation)

  • 2: articlespace (sentences within document)

  • 3: sentence embeddings and entity similarity

  • 4: multi-relational graphs

  • 5: word embeddings

fileFormat

either one of 'fastText' or 'labelDoc'. See the documentation of StarSpace.
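As a rough sketch of what the two layouts look like (the file contents below are invented toy data, not from the package): in 'fastText' format each line holds the tokens of one example followed by its label(s) carrying the label prefix, while in 'labelDoc' format each line holds tab-separated sentences. Base R is enough to write and inspect such training files:

```r
# Toy examples of the two StarSpace input layouts (illustrative data only).
# 'fastText' format: tokens followed by __label__-prefixed label(s).
fasttext_lines <- c(
  "what a wonderful match __label__sports",
  "parliament votes on the budget __label__politics"
)
# 'labelDoc' format: tab-separated sentences on each line.
labeldoc_lines <- c(
  "what a wonderful match\tthe game ended in a draw",
  "parliament votes on the budget\tthe minister defended the proposal"
)
writeLines(fasttext_lines, "train_fasttext.txt")
writeLines(labeldoc_lines, "train_labeldoc.txt")
readLines("train_fasttext.txt")
## clean up
file.remove(c("train_fasttext.txt", "train_labeldoc.txt"))
```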

label

labels prefix (character string identifying how a label is prefixed, defaults to '__label__')

dim

the size of the embedding vectors (integer, defaults to 100)

epoch

number of epochs (integer, defaults to 5)

lr

learning rate (numeric, defaults to 0.01)

loss

loss function (either 'hinge' or 'softmax')

margin

margin parameter in case of hinge loss (numeric, defaults to 0.05)

similarity

cosine or dot product similarity in case of hinge loss (character, defaults to 'cosine')

negSearchLimit

number of negatives sampled (integer, defaults to 50)

adagrad

whether to use adagrad in training (logical, defaults to TRUE)

ws

the size of the context window for word-level training; only used in trainMode 5 (integer, defaults to 5)

minCount

minimal number of word occurrences for being part of the dictionary (integer, defaults to 1, keeping all words)

minCountLabel

minimal number of label occurrences for being part of the dictionary (integer, defaults to 1, keeping all labels)

ngrams

max length of word ngram (integer, defaults to 1, using only unigrams)

thread

integer with the number of threads to use. Defaults to 1.

...

arguments passed on to ruimtehol:::textspace. See the details below.

Value

an object of class textspace which is a list with elements

  • model: a Rcpp pointer to the model

  • args: a list with elements

    1. file: the binary file of the model saved on disk

    2. dim: the dimension of the embedding

    3. data: data-specific Starspace training parameters

    4. param: algorithm-specific Starspace training parameters

    5. dictionary: parameters which define the dictionary of words and labels in Starspace

    6. options: parameters specific to duration of training, the text preparation and the training batch size

    7. test: parameters specific to model testing

  • iter: a list with elements epoch, lr, error and error_validation showing the error after each epoch
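The iter element can be used, for example, to pick the epoch with the lowest validation error. The sketch below mocks up a list mirroring the documented structure (the values are invented for illustration) and extracts the best epoch with base R:

```r
# Mock of the documented 'iter' element; the numbers are invented
# illustrative values, not output of an actual training run.
iter <- list(
  epoch            = 1:3,
  lr               = c(0.010, 0.007, 0.003),
  error            = c(0.42, 0.31, 0.28),
  error_validation = c(0.45, 0.36, 0.37)
)
# Epoch with the lowest validation error
best <- iter$epoch[which.min(iter$error_validation)]
best   # 2
```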

Note

The function starspace is a tiny wrapper around the internal function ruimtehol:::textspace, which gives direct access to the C++ code that runs Starspace.
The following arguments are available in that functionality when you do the training. Default settings are shown next to each definition. Some of these arguments are set directly in the starspace function; others can be passed on through ... .

Arguments which define how the training is done:

  • dim: size of embedding vectors [100]

  • epoch: number of epochs [5]

  • lr: learning rate [0.01]

  • loss: loss function hinge, softmax [hinge]

  • margin: margin parameter in hinge loss. It's only effective if hinge loss is used. [0.05]

  • similarity: takes value in [cosine, dot]. Whether to use cosine or dot product as similarity function in hinge loss. It's only effective if hinge loss is used. [cosine]

  • negSearchLimit: number of negatives sampled [50]

  • maxNegSamples: max number of negatives in a batch update [10]

  • p: normalization parameter: normalize the sum of embeddings by dividing by Size^p [0.5]

  • adagrad: whether to use adagrad in training [1]

  • ws: only used in trainMode 5, the size of the context window for word level training. [5]

  • dropoutLHS: dropout probability for LHS features. [0]

  • dropoutRHS: dropout probability for RHS features. [0]

  • shareEmb: whether to use the same embedding matrix for LHS and RHS. [1]

  • initRandSd: initial values of embeddings are randomly generated from normal distribution with mean=0, standard deviation=initRandSd. [0.001]
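The similarity option above chooses between the dot product and the cosine of the angle between two embedding vectors; cosine is simply the dot product of the L2-normalised vectors. A base-R sketch with two toy vectors:

```r
# Toy vectors to illustrate the two similarity options used with hinge loss.
a <- c(1, 2, 3)
b <- c(2, 0, 1)
# Dot product similarity
dot_sim <- sum(a * b)
# Cosine similarity: dot product scaled by the vector norms
cosine_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
dot_sim                  # 5
round(cosine_sim, 3)     # 0.598
```

Cosine is insensitive to the length of the vectors, only to their direction, which is why it is the default.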

Arguments specific to the dictionary of words and labels:

  • minCount: minimal number of word occurrences [1]

  • minCountLabel: minimal number of label occurrences [1]

  • ngrams: max length of word ngram [1]

  • bucket: number of buckets [100000]

  • label: labels prefix [__label__]
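The effect of a minCount threshold can be mimicked in base R: only tokens occurring at least minCount times survive into the dictionary. A sketch with toy tokens:

```r
# Toy illustration of dictionary pruning by minCount (base R only).
tokens <- c("the", "cat", "sat", "the", "mat", "the", "cat")
minCount <- 2
counts <- table(tokens)            # frequency of each token
kept <- names(counts[counts >= minCount])
sort(kept)   # "cat" "the"
```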

Arguments which define early stopping or proceeding of model building:

  • initModel: if not empty, loads a previously trained model from -initModel and carries on training.

  • validationFile: validation file path

  • validationPatience: number of validation iterations without improvement before training is stopped [10]

  • saveEveryEpoch: save intermediate models after each epoch [0]

  • saveTempModel: save intermediate models after each epoch with a unique name including the epoch number [0]

  • maxTrainTime: max train time (secs) [8640000]

Other:

  • trainWord: whether to train at the word level together with other tasks (for multi-tasking) [0]

  • wordWeight: if trainWord is true, wordWeight specifies example weight for word level training examples. [0.5]

  • useWeight: whether the input file contains weights [0]

References

https://github.com/facebookresearch/StarSpace

Examples

## Not run: 
data(dekamer, package = "ruimtehol")
x <- strsplit(dekamer$question, "\\W")
x <- lapply(x, FUN = function(x) x[x != ""])
x <- sapply(x, FUN = function(x) paste(x, collapse = " "))

idx <- sample.int(n = nrow(dekamer), size = round(nrow(dekamer) * 0.7))
writeLines(x[idx], con = "traindata.txt")
writeLines(x[-idx], con = "validationdata.txt")

set.seed(123456789)
m <- starspace(file = "traindata.txt", validationFile = "validationdata.txt", 
               trainMode = 5, dim = 10, 
               loss = "softmax", lr = 0.01, ngrams = 2, minCount = 5,
               similarity = "cosine", adagrad = TRUE, ws = 7, epoch = 3,
               maxTrainTime = 10)
str(starspace_dictionary(m))              
wordvectors <- as.matrix(m)
wv <- starspace_embedding(m, 
                          x = c("Nationale Loterij", "migranten", "pensioen"),
                          type = "ngram")
wv
mostsimilar <- embedding_similarity(wordvectors, wv["pensioen", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
starspace_knn(m, "koning")

## clean up for cran
file.remove(c("traindata.txt", "validationdata.txt"))

## End(Not run)

ruimtehol documentation built on Jan. 7, 2023, 1:25 a.m.