starspace | R Documentation |
Interface to Starspace for training a Starspace model, providing raw access to the C++ functionality.
starspace(
  model = "textspace.bin",
  file,
  trainMode = 0,
  fileFormat = c("fastText", "labelDoc"),
  label = "__label__",
  dim = 100,
  epoch = 5,
  lr = 0.01,
  loss = c("hinge", "softmax"),
  margin = 0.05,
  similarity = c("cosine", "dot"),
  negSearchLimit = 50,
  adagrad = TRUE,
  ws = 5,
  minCount = 1,
  minCountLabel = 1,
  ngrams = 1,
  thread = 1,
  ...
)
model |
the full path to where the model file will be saved. Defaults to 'textspace.bin'. |
file |
the full path to the file on disk which will be used for training. |
trainMode |
integer with the training mode. Possible values are 0, 1, 2, 3, 4 or 5. Defaults to 0. The use cases are
|
fileFormat |
either one of 'fastText' or 'labelDoc'. See the documentation of StarSpace |
label |
labels prefix (character string identifying how a label is prefixed, defaults to '__label__') |
dim |
the size of the embedding vectors (integer, defaults to 100) |
epoch |
number of epochs (integer, defaults to 5) |
lr |
learning rate (numeric, defaults to 0.01) |
loss |
loss function (either 'hinge' or 'softmax') |
margin |
margin parameter in case of hinge loss (numeric, defaults to 0.05) |
similarity |
cosine or dot product similarity in case of hinge loss (character, defaults to 'cosine') |
negSearchLimit |
number of negatives sampled (integer, defaults to 50) |
adagrad |
whether to use adagrad in training (logical) |
ws |
the size of the context window for word level training - only used in trainMode 5 (integer, defaults to 5) |
minCount |
minimal number of word occurrences for being part of the dictionary (integer, defaults to 1 keeping all words) |
minCountLabel |
minimal number of label occurrences for being part of the dictionary (integer, defaults to 1 keeping all labels) |
ngrams |
max length of word ngram (integer, defaults to 1, using only unigrams) |
thread |
integer with the number of threads to use. Defaults to 1. |
... |
arguments passed on to ruimtehol:::textspace. See the details below. |
an object of class textspace which is a list with elements
model: an Rcpp pointer to the model
args: a list with elements
file: the binary file of the model saved on disk
dim: the dimension of the embedding
data: data-specific Starspace training parameters
param: algorithm-specific Starspace training parameters
dictionary: parameters which define the dictionary of words and labels in Starspace
options: parameters specific to duration of training, the text preparation and the training batch size
test: parameters specific to model testing
iter: a list with elements epoch, lr, error and error_validation showing the error after each epoch
The function starspace is a tiny wrapper over the internal function ruimtehol:::textspace which allows direct access to the C++ code in order to run Starspace.
The following arguments are available in that functionality when you do the training. Default settings are shown next to the definition. Some of these arguments are directly set in the starspace function, others can be passed on via the ... argument.
Arguments which define how the training is done:
dim: size of embedding vectors [100]
epoch: number of epochs [5]
lr: learning rate [0.01]
loss: loss function: hinge, softmax [hinge]
margin: margin parameter in hinge loss. It's only effective if hinge loss is used. [0.05]
similarity: takes value in [cosine, dot]. Whether to use cosine or dot product as similarity function in hinge loss. It's only effective if hinge loss is used. [cosine]
negSearchLimit: number of negatives sampled [50]
maxNegSamples: max number of negatives in a batch update [10]
p: normalization parameter: normalize sum of embeddings by dividing by Size^p [0.5]
adagrad: whether to use adagrad in training [1]
ws: only used in trainMode 5, the size of the context window for word level training. [5]
dropoutLHS: dropout probability for LHS features. [0]
dropoutRHS: dropout probability for RHS features. [0]
shareEmb: whether to use the same embedding matrix for LHS and RHS. [1]
initRandSd: initial values of embeddings are randomly generated from normal distribution with mean=0, standard deviation=initRandSd. [0.001]
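To make the interplay of loss, margin and similarity more concrete, the following base R sketch computes a margin ranking (hinge) loss for one positive and a few sampled negatives. It is an illustration only and not the C++ code used by Starspace: the formula max(0, margin - sim(positive) + sim(negative)) summed over the negatives is an assumption about the general form of that loss, and the helper names cosine_sim, dot_sim and hinge_loss as well as the toy vectors are hypothetical.

```r
# Similarity between two embedding vectors (hypothetical helpers).
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
dot_sim    <- function(a, b) sum(a * b)

# Margin ranking (hinge) loss for one positive and several sampled negatives.
# Assumption: loss = sum over negatives of max(0, margin - sim_pos + sim_neg).
hinge_loss <- function(lhs, pos, negs, margin = 0.05, sim = cosine_sim) {
  sim_pos <- sim(lhs, pos)
  sim_neg <- vapply(negs, function(n) sim(lhs, n), numeric(1))
  sum(pmax(0, margin - sim_pos + sim_neg))
}

# Toy embeddings: the second negative points almost in the same direction as lhs.
lhs  <- c(1, 0)
pos  <- c(2, 0)
negs <- list(c(0, 1), c(1, 0.1))

loss_cosine <- hinge_loss(lhs, pos, negs, margin = 0.05, sim = cosine_sim)
loss_dot    <- hinge_loss(lhs, pos, negs, margin = 0.05, sim = dot_sim)
print(c(cosine = loss_cosine, dot = loss_dot))
```

With cosine similarity only the direction of the vectors matters, so the near-parallel negative violates the margin. With the dot product the larger norm of the positive keeps it ahead of both negatives and the loss is zero.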
Arguments specific to the dictionary of words and labels:
minCount: minimal number of word occurrences [1]
minCountLabel: minimal number of label occurrences [1]
ngrams: max length of word ngram [1]
bucket: number of buckets [100000]
label: labels prefix [__label__]
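As an illustration of how the label prefix is used, the sketch below writes a small training file in base R. It assumes the 'fastText' file format means one example per line, with space-separated tokens and every label token starting with the label prefix. The example texts, the labels and the file name are made up for the illustration.

```r
# Hypothetical toy data: one text and one label per row.
docs <- data.frame(
  text  = c("how do pensions work", "lottery prize money rules"),
  label = c("social affairs", "finance"),
  stringsAsFactors = FALSE
)

label_prefix <- "__label__"

# Whitespace inside a label would split it into several tokens,
# so it is replaced by a hyphen before the prefix is added.
labels <- paste0(label_prefix, gsub("[[:space:]]+", "-", docs$label))

# One example per line: the words followed by the prefixed label.
lines <- paste(docs$text, labels)

train_file <- file.path(tempdir(), "train_fasttext.txt")
writeLines(lines, con = train_file)
readLines(train_file)
```

A file like this could then be given to the file argument, with label left at its default of '__label__'.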
Arguments which define early stopping or proceeding of model building:
initModel: if not empty, it loads the previously trained model given in initModel and carries on training.
validationFile: validation file path
validationPatience: number of validation iterations without improvement before training stops [10]
saveEveryEpoch: save intermediate models after each epoch [0]
saveTempModel: save intermediate models after each epoch with a unique name including epoch number [0]
maxTrainTime: max train time (secs) [8640000]
Other:
trainWord: whether to train word level together with other tasks (for multi-tasking). [0]
wordWeight: if trainWord is true, wordWeight specifies example weight for word level training examples. [0.5]
useWeight: whether input file contains weights [0]
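A minimal sketch of how arguments which are not part of the starspace signature can be passed on through ... is given below. The toy training data, the file names and the chosen values are hypothetical, and the call is only executed if the ruimtehol package is installed. The elements args and iter are the ones described in the return value above.

```r
# Hypothetical toy training file in fastText format, written to a temporary folder.
train_file <- file.path(tempdir(), "toy_train.txt")
writeLines(rep(c("pension reform vote __label__social",
                 "lottery prize payout __label__finance"), 50),
           con = train_file)

if (requireNamespace("ruimtehol", quietly = TRUE)) {
  m <- ruimtehol::starspace(
    model = file.path(tempdir(), "toy_model.bin"),
    file = train_file,
    trainMode = 0, dim = 10, epoch = 3,
    # the arguments below are not in the signature and are passed on through ...
    maxNegSamples = 5,
    initRandSd = 0.01,
    maxTrainTime = 10
  )
  # algorithm-specific training parameters as stored in the model object
  str(m$args$param)
  # training error after each epoch
  m$iter$error
}
```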
https://github.com/facebookresearch
## Not run:
library(ruimtehol)
data(dekamer, package = "ruimtehol")
## tokenise the questions on non-word characters and drop empty tokens
x <- strsplit(dekamer$question, "\\W")
x <- lapply(x, FUN = function(x) x[x != ""])
x <- sapply(x, FUN = function(x) paste(x, collapse = " "))
## split into 70% training data and 30% validation data
## the seed is set before sampling so that the split is reproducible
set.seed(123456789)
idx <- sample.int(n = nrow(dekamer), size = round(nrow(dekamer) * 0.7))
writeLines(x[idx], con = "traindata.txt")
writeLines(x[-idx], con = "validationdata.txt")
## word level training (trainMode 5)
## validationFile and maxTrainTime are passed on through ...
m <- starspace(file = "traindata.txt", validationFile = "validationdata.txt",
               trainMode = 5, dim = 10,
               loss = "softmax", lr = 0.01, ngrams = 2, minCount = 5,
               similarity = "cosine", adagrad = TRUE, ws = 7, epoch = 3,
               maxTrainTime = 10)
## inspect the dictionary and extract the embeddings
str(starspace_dictionary(m))
wordvectors <- as.matrix(m)
wv <- starspace_embedding(m,
                          x = c("Nationale Loterij", "migranten", "pensioen"),
                          type = "ngram")
wv
## terms most similar to 'pensioen'
mostsimilar <- embedding_similarity(wordvectors, wv["pensioen", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
starspace_knn(m, "koning")
## clean up for cran
file.remove(c("traindata.txt", "validationdata.txt"))
## End(Not run)