View source: R/embedallthethings.R
starspace  R Documentation 
Interface to Starspace for training a Starspace model, providing raw access to the C++ functionality.
starspace(
model = "textspace.bin",
file,
trainMode = 0,
fileFormat = c("fastText", "labelDoc"),
label = "__label__",
dim = 100,
epoch = 5,
lr = 0.01,
loss = c("hinge", "softmax"),
margin = 0.05,
similarity = c("cosine", "dot"),
negSearchLimit = 50,
adagrad = TRUE,
ws = 5,
minCount = 1,
minCountLabel = 1,
ngrams = 1,
thread = 1,
...
)
model 
the full path to where the model file will be saved. Defaults to 'textspace.bin'. 
file 
the full path to the file on disk which will be used for training. 
trainMode 
integer with the training mode. Possible values are 0, 1, 2, 3, 4 or 5. Defaults to 0. The use cases are

fileFormat 
either one of 'fastText' or 'labelDoc'. See the documentation of StarSpace 
label 
labels prefix (character string identifying how a label is prefixed, defaults to '__label__') 
dim 
the size of the embedding vectors (integer, defaults to 100) 
epoch 
number of epochs (integer, defaults to 5) 
lr 
learning rate (numeric, defaults to 0.01) 
loss 
loss function (either 'hinge' or 'softmax') 
margin 
margin parameter in case of hinge loss (numeric, defaults to 0.05) 
similarity 
cosine or dot product similarity in case of hinge loss (character, defaults to 'cosine') 
negSearchLimit 
number of negatives sampled (integer, defaults to 50) 
adagrad 
whether to use adagrad in training (logical) 
ws 
the size of the context window for word level training, only used in trainMode 5 (integer, defaults to 5) 
minCount 
minimal number of word occurrences for being part of the dictionary (integer, defaults to 1 keeping all words) 
minCountLabel 
minimal number of label occurrences for being part of the dictionary (integer, defaults to 1 keeping all labels) 
ngrams 
max length of word ngram (integer, defaults to 1, using only unigrams) 
thread 
integer with the number of threads to use. Defaults to 1. 
... 
arguments passed on to ruimtehol:::textspace. See the details below. 
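As an illustration of the file, fileFormat and label arguments, the following is a minimal sketch in base R of how a training file in the 'fastText' format could be written. It assumes that each line holds the words of one document followed by its labels, each label carrying the prefix given in label. The toy texts, the labels and the helper make_fasttext_line are hypothetical and not part of ruimtehol.

```r
# Hypothetical toy data: two documents with one label each
docs <- data.frame(
  text  = c("the budget was approved", "new rules for pensions"),
  label = c("finance", "social affairs"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: words first, then the label with its prefix.
# Whitespace inside a label is replaced so that the label stays one token.
make_fasttext_line <- function(text, label, prefix = "__label__") {
  label <- gsub("[[:space:]]+", "_", label)
  paste(text, paste0(prefix, label))
}

train_lines <- make_fasttext_line(docs$text, docs$label)
trainfile <- tempfile(fileext = ".txt")
writeLines(train_lines, con = trainfile)
readLines(trainfile)
```

The resulting path in trainfile could then be given as the file argument, with label left at its default of '__label__'.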
an object of class textspace which is a list with elements
model: a Rcpp pointer to the model
args: a list with elements
file: the binary file of the model saved on disk
dim: the dimension of the embedding
data: data-specific Starspace training parameters
param: algorithm-specific Starspace training parameters
dictionary: parameters which define the dictionary of words and labels in Starspace
options: parameters specific to duration of training, the text preparation and the training batch size
test: parameters specific to model testing
iter: a list with element epoch, lr, error and error_validation showing the error after each epoch
The function starspace is a tiny wrapper over the internal function ruimtehol:::textspace which allows direct access to the C++ code in order to run Starspace.
The following arguments are available in that functionality when you do the training. Default settings are shown in square brackets next to the definition. Some of these arguments are directly set in the starspace function, others can be passed on with ... .
Arguments which define how the training is done:
dim: size of embedding vectors [100]
epoch: number of epochs [5]
lr: learning rate [0.01]
loss: loss function: hinge, softmax [hinge]
margin: margin parameter in hinge loss. It's only effective if hinge loss is used. [0.05]
similarity: takes value in [cosine, dot]. Whether to use cosine or dot product as similarity function in hinge loss. It's only effective if hinge loss is used. [cosine]
negSearchLimit: number of negatives sampled [50]
maxNegSamples: max number of negatives in a batch update [10]
p: normalization parameter: normalize sum of embeddings by dividing Size^p [0.5]
adagrad: whether to use adagrad in training [1]
ws: only used in trainMode 5, the size of the context window for word level training. [5]
dropoutLHS: dropout probability for LHS features. [0]
dropoutRHS: dropout probability for RHS features. [0]
shareEmb: whether to use the same embedding matrix for LHS and RHS. [1]
initRandSd: initial values of embeddings are randomly generated from normal distribution with mean=0, standard deviation=initRandSd. [0.001]
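To make the p, similarity and margin arguments more concrete, here is a small numeric sketch in base R. It takes the description of p literally (the sum of the embeddings divided by Size^p) and assumes that the hinge loss has the usual margin ranking form max(0, margin - sim(positive) + sim(negative)). That form is an assumption made for illustration, not a transcription of the Starspace C++ code, and all vectors are made up.

```r
# Made-up 3-dimensional embeddings of the 4 words of one input
word_emb <- rbind(c(1, 0, 2),
                  c(0, 1, 0),
                  c(1, 1, 0),
                  c(2, 0, 2))

# Sum of the embeddings divided by Size^p; with p = 0.5 this divides by sqrt(4) = 2
embed_input <- function(emb, p = 0.5) colSums(emb) / nrow(emb)^p
lhs <- embed_input(word_emb, p = 0.5)

dot    <- function(a, b) sum(a * b)
cosine <- function(a, b) dot(a, b) / sqrt(dot(a, a) * dot(b, b))

# Assumed margin ranking (hinge) loss for one positive and one negative label
hinge <- function(sim_pos, sim_neg, margin = 0.05) {
  max(0, margin - sim_pos + sim_neg)
}

pos <- c(2, 1, 2)    # made-up embedding of the correct label
neg <- c(-1, 0, 1)   # made-up embedding of a sampled negative label

hinge(cosine(lhs, pos), cosine(lhs, neg))   # similarity = "cosine"
hinge(dot(lhs, pos), dot(lhs, neg))         # similarity = "dot"
```

Under these assumptions the loss is zero as soon as the positive label is more similar to the input than the negative label by at least the margin, which is why margin only matters when the hinge loss is used.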
Arguments specific to the dictionary of words and labels:
minCount: minimal number of word occurrences [1]
minCountLabel: minimal number of label occurrences [1]
ngrams: max length of word ngram [1]
bucket: number of buckets [100000]
label: labels prefix [__label__]
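The effect of ngrams and minCount can be illustrated outside of Starspace. The sketch below, in base R, builds all word n-grams up to a maximum length and keeps the words that occur at least minCount times. The helper word_ngrams and the toy sentences are hypothetical, and the sketch does not reproduce how Starspace hashes n-grams into buckets.

```r
# Hypothetical helper: all word n-grams of length 1..n from a vector of tokens
word_ngrams <- function(tokens, n = 1) {
  out <- character(0)
  for (k in seq_len(n)) {
    if (length(tokens) < k) break
    starts <- seq_len(length(tokens) - k + 1)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# ngrams = 2 yields the unigrams plus the bigrams
bigrams <- word_ngrams(c("nationale", "loterij", "winst"), n = 2)
bigrams

# Words occurring at least minCount times are kept
sentences <- c("de koning spreekt", "de koning zwijgt", "de minister spreekt")
freq <- table(unlist(strsplit(sentences, " ")))
minCount <- 2
kept <- names(freq)[freq >= minCount]
kept
```

Raising minCount shrinks the dictionary, while raising ngrams adds multi-word features on top of the single words.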
Arguments which define early stopping or proceeding of model building:
initModel: if not empty, it loads a previously trained model from initModel and carries on training.
validationFile: validation file path
validationPatience: number of validation iterations without improvement before training is stopped [10]
saveEveryEpoch: save intermediate models after each epoch [0]
saveTempModel: save intermediate models after each epoch with a unique name including the epoch number [0]
maxTrainTime: max train time (secs) [8640000]
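The interplay of validationFile and validationPatience can be pictured as a patience counter: training stops once the validation error has failed to improve for a given number of consecutive validation rounds. The base R function below only illustrates that idea under this assumption. The name stop_round is hypothetical, the error values are made up, and the Starspace C++ implementation may differ in its details.

```r
# Hypothetical helper: returns the validation round at which training would
# stop, or NA if the patience is never exhausted
stop_round <- function(validation_error, patience = 10) {
  best <- Inf
  waited <- 0
  for (i in seq_along(validation_error)) {
    if (validation_error[i] < best) {
      # improvement: remember the new best error and reset the counter
      best <- validation_error[i]
      waited <- 0
    } else {
      waited <- waited + 1
      if (waited >= patience) return(i)
    }
  }
  NA_integer_
}

errors <- c(0.40, 0.31, 0.29, 0.30, 0.29, 0.32, 0.28)  # made-up values
stop_round(errors, patience = 3)
stop_round(errors, patience = 10)
```

With a patience of 3 the sketch stops before the last, better value is reached, which shows the trade-off behind choosing a small validationPatience.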
Other:
trainWord: whether to train word level together with other tasks (for multitasking). [0]
wordWeight: if trainWord is true, wordWeight specifies example weight for word level training examples. [0.5]
useWeight: whether the input file contains weights [0]
https://github.com/facebookresearch
## Not run:
data(dekamer, package = "ruimtehol")
x <- strsplit(dekamer$question, "\\W")
x <- lapply(x, FUN = function(x) x[x != ""])
x <- sapply(x, FUN = function(x) paste(x, collapse = " "))
## use 70 percent of the questions for training, the rest for validation
idx <- sample.int(n = nrow(dekamer), size = round(nrow(dekamer) * 0.7))
writeLines(x[idx], con = "traindata.txt")
writeLines(x[-idx], con = "validationdata.txt")
set.seed(123456789)
m <- starspace(file = "traindata.txt", validationFile = "validationdata.txt",
trainMode = 5, dim = 10,
loss = "softmax", lr = 0.01, ngrams = 2, minCount = 5,
similarity = "cosine", adagrad = TRUE, ws = 7, epoch = 3,
maxTrainTime = 10)
str(starspace_dictionary(m))
wordvectors <- as.matrix(m)
wv <- starspace_embedding(m,
x = c("Nationale Loterij", "migranten", "pensioen"),
type = "ngram")
wv
mostsimilar <- embedding_similarity(wordvectors, wv["pensioen", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
starspace_knn(m, "koning")
## clean up for cran
file.remove(c("traindata.txt", "validationdata.txt"))
## End(Not run)