embed_wordspace: Build a Starspace model which calculates word embeddings

View source: R/r-all-the-things.R

embed_wordspaceR Documentation

Build a Starspace model which calculates word embeddings

Description

Build a Starspace model which calculates word embeddings

Usage

embed_wordspace(
  x,
  model = "wordspace.bin",
  early_stopping = 0.75,
  useBytes = FALSE,
  ...
)

Arguments

x

a character vector of text where tokens are separated by spaces

model

name of the model which will be saved, passed on to starspace

early_stopping

the percentage of the data that will be used as training data. If set to a value smaller than 1, 1-early_stopping percentage of the data which will be used as the validation set and early stopping will be executed. Defaults to 0.75.

useBytes

set to TRUE to avoid re-encoding when writing out train and/or test files. See writeLines for details

...

further arguments passed on to starspace except file, trainMode and fileFormat

Value

an object of class textspace as returned by starspace.

Examples


library(udpipe)
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- strsplit(x$feedback, "\\W")
x <- lapply(x, FUN = function(x) x[x != ""])
x <- sapply(x, FUN = function(x) paste(x, collapse = " "))
x <- tolower(x)

set.seed(123456789)
model <- embed_wordspace(x, early_stopping = 0.9,
                         dim = 15, ws = 7, epoch = 10, minCount = 5, ngrams = 1,
                         maxTrainTime = 2) ## maxTrainTime only set for CRAN
plot(model)
wordvectors <- as.matrix(model)

mostsimilar <- embedding_similarity(wordvectors, wordvectors["weekend", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
mostsimilar <- embedding_similarity(wordvectors, wordvectors["vriendelijk", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
mostsimilar <- embedding_similarity(wordvectors, wordvectors["grote", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)


ruimtehol documentation built on May 29, 2024, 5:26 a.m.