knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) reticulate::use_virtualenv("../env")
First we preprocess the corpus using example data, a tiny corpus of 9 documents. Reproducing the tutorial on corpora and vector spaces.
library(gensimr) set.seed(42) # rerproducability # sample data data(corpus, package = "gensimr") # preprocess corpus docs <- prepare_documents(corpus)
Word2vec works somewhat differently. The example below is a reproduction of the Kaggle Gensim Word2Vec Tutorial.
# initialise word2vec <- model_word2vec(size = 100L, window = 5L, min_count = 1L) word2vec$build_vocab(docs) word2vec$train(docs, total_examples = word2vec$corpus_count, epochs = 20L) word2vec$init_sims(replace = TRUE)
Now we can explore the model.
word2vec$wv$most_similar(positive = c("interface"))
We expect "trees" to be the odd one out, it is a term that was in a different topic (#2) whereas other terms were in topics #1.
word2vec$wv$doesnt_match(c("human", "interface", "trees"))
Test similarity between words.
word2vec$wv$similarity("human", "trees") word2vec$wv$similarity("eps", "survey")
Automatically detect common phrases – multi-word expressions / word n-grams – from a stream of sentences.
Here we use and example dataset. The idea is that it is saved to a file (on disk) thereby allowing gensim to stream its content which is much more efficient than loading everything in memory before runnig the model.
Let's look at the content of the example file.
file <- datapath('testcorpus.txt') # example dataset readLines(file) # just to show you what it looks like
We observe that it is very similar to the output of prepare_documents(corpus)
(the docs
) object in this document. We can now scan the file to build a corpus with text8corpus
sentences <- text8corpus(file) phrases <- phrases(docs, min_count = 1L, threshold = 1L)
That simple, now we can apply the model to new sentences.
sentence <- list('trees', 'graph', 'minors') wrap(phrases, sentence)
We can add vocabulary to an already trained model with.
phrases$add_vocab(list(list("hello", "world"), list("meow")))
We can create a faster model with.
bigram <- phraser(phrases) wrap(bigram, sentence)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.