knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

reticulate::use_virtualenv("../env")

Preprocessing

First, we preprocess the example data, a tiny corpus of 9 documents, reproducing the tutorial on corpora and vector spaces.

library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")

# preprocess corpus
docs <- prepare_documents(corpus)

Model

Word2vec works somewhat differently. The example below is a reproduction of the Kaggle Gensim Word2Vec Tutorial.

# initialise
word2vec <- model_word2vec(size = 100L, window = 5L, min_count = 1L)
word2vec$build_vocab(docs) 
word2vec$train(docs, total_examples = word2vec$corpus_count, epochs = 20L)
word2vec$init_sims(replace = TRUE)

Now we can explore the model.

word2vec$wv$most_similar(positive = c("interface"))

We expect "trees" to be the odd one out: it belongs to a different topic (#2), whereas the other terms belong to topic #1.

word2vec$wv$doesnt_match(c("human", "interface", "trees"))
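To make the idea behind doesnt_match concrete, here is a minimal base R sketch of the usual approach: normalise each word vector, average them, and pick the word least similar to that average. The 2-d vectors are made-up values for illustration, not output of the model trained above, and the code does not call gensim.

```r
# Toy 2-d "word vectors"; hypothetical values, not taken from the trained model
vectors <- rbind(
  human     = c(0.9, 0.1),
  interface = c(0.8, 0.2),
  trees     = c(0.1, 0.9)
)

# Scale every row to unit length
unit_rows <- vectors / sqrt(rowSums(vectors^2))

# Unit-length mean of the normalised vectors
centre <- colMeans(unit_rows)
centre <- centre / sqrt(sum(centre^2))

# Cosine similarity of each word to the mean; the lowest one is the odd one out
sims <- drop(unit_rows %*% centre)
odd_one_out <- names(which.min(sims))
odd_one_out
```

With only 9 documents the vectors of a real model are noisy, so the actual result may not match the expectation as cleanly as this toy example.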

Test similarity between words.

word2vec$wv$similarity("human", "trees")
word2vec$wv$similarity("eps", "survey")
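The similarity between two words is the cosine similarity of their vectors. The sketch below computes it in base R; cosine_similarity is a helper defined here for illustration, not a gensimr function, and the vectors are hypothetical.

```r
# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical vectors, for illustration only
u <- c(1, 2, 3)
v <- c(2, 4, 6)   # same direction as u
w <- c(-3, 0, 1)  # orthogonal to u

same_direction <- cosine_similarity(u, v)
orthogonal <- cosine_similarity(u, w)
```

Values close to 1 indicate vectors pointing in the same direction, values close to 0 indicate unrelated directions.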

Phrases

Automatically detect common phrases – multi-word expressions / word n-grams – from a stream of sentences.

Here we use an example dataset. It is saved to a file on disk, which allows gensim to stream its content. This is much more efficient than loading everything into memory before running the model.
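As a rough illustration of what streaming means, the base R sketch below reads a file one line at a time, so only the current line is held in memory. It uses a throwaway temporary file with made-up content and is not how gensim implements its reader.

```r
# Write a small throwaway file so the example is self-contained
path <- tempfile(fileext = ".txt")
writeLines(c("human interface computer", "graph trees", "graph minors survey"), path)

con <- file(path, open = "r")
n_tokens <- 0L

# Read one line at a time; earlier lines are not kept around
while (length(line <- readLines(con, n = 1L)) > 0L) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  n_tokens <- n_tokens + length(tokens)
}
close(con)

n_tokens
```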

Let's look at the content of the example file.

file <- datapath('testcorpus.txt') # example dataset
readLines(file) # just to show you what it looks like

We observe that it is very similar to the output of prepare_documents(corpus), the docs object in this document. We can now scan the file to build a corpus with text8corpus.

sentences <- text8corpus(file)
phrases <- phrases(sentences, min_count = 1L, threshold = 1L)

It is that simple. We can now apply the model to new sentences.

sentence <- list('trees', 'graph', 'minors')
wrap(phrases, sentence)
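Phrase detection scores each pair of adjacent words and joins the pairs whose score exceeds threshold. The base R sketch below uses the count-based score (bigram count - min_count) / (count of first word * count of second word) * vocabulary size. The toy sentences, the helper names, and the choice of the number of distinct words as vocabulary size are assumptions made for illustration, so the numbers will not match gensim's exactly.

```r
# Toy tokenised sentences; hypothetical data
toy_sentences <- list(
  c("graph", "minors", "survey"),
  c("graph", "minors", "trees"),
  c("graph", "trees")
)

# Count occurrences and return a plain named vector
count_items <- function(x) {
  tab <- table(x)
  setNames(as.vector(tab), names(tab))
}

unigrams <- count_items(unlist(toy_sentences))

# Adjacent word pairs within each sentence, joined with "_"
bigrams <- count_items(unlist(lapply(toy_sentences, function(s) {
  paste(head(s, -1), tail(s, -1), sep = "_")
})))

# Count-based score; vocabulary size is taken as the number of distinct words
score_bigram <- function(a, b, min_count = 1) {
  ab <- paste(a, b, sep = "_")
  n_ab <- if (ab %in% names(bigrams)) bigrams[[ab]] else 0
  (n_ab - min_count) / (unigrams[[a]] * unigrams[[b]]) * length(unigrams)
}

score_graph_minors <- score_bigram("graph", "minors")
score_graph_trees <- score_bigram("graph", "trees")
```

A pair that occurs no more often than min_count scores zero or less, which is why rare co-occurrences are never merged.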

We can add vocabulary to an already trained model with add_vocab.

phrases$add_vocab(list(list("hello", "world"), list("meow")))

We can create a faster model with phraser.

bigram <- phraser(phrases)
wrap(bigram, sentence)


news-r/gensimr documentation built on Jan. 9, 2021, 5:55 a.m.