read_word_embeddings: Read word embeddings and format for input to 'scale_text'

Description Usage Arguments Details See Also Examples

View source: R/read_word_embeddings.R

Description

read_word_embeddings reads specified words from word embedding files quickly and without using much memory. It formats its output for the scale_text function. The rows of the output are words and the columns are the dimensions from the word embeddings. Correspondingly, the row names are the vocabulary and the column names are the names of the dimensions.
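To illustrate the output shape, here is a minimal sketch of a hypothetical call; the vocabulary and file path are placeholders, and the matrix-like structure is inferred from the description above:

```r
## Not run:
embeddings <- read_word_embeddings(
    in_vocab = c("economy", "policy"),
    ovefile = "path/to/O2M_overlap.txt"
    )
dim(embeddings)       # one row per word found, one column per embedding dimension
rownames(embeddings)  # the vocabulary words
colnames(embeddings)  # the names of the embedding dimensions
## End(Not run)
```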

Usage

read_word_embeddings(in_vocab, ovefile = NA, ovefile2 = NA,
  wikfile = NA, twifile = NA)

Arguments

in_vocab

Character vector. This is the vocabulary to look for in the word embeddings.

ovefile

A character scalar (filename). Use this for O2M_overlap.txt from the meta embeddings. This is a meta-analysis of many pre-trained word embeddings. Recommended.

ovefile2

A character scalar (filename). Use this for O2M_oov.txt from the meta embeddings. These are the rare words for the meta-analysis of many pre-trained word embeddings.

wikfile

A character scalar (filename). Use this for glove.6B.300d.txt from the Wikipedia embeddings. These word embeddings are trained on Wikipedia entries only.

twifile

A character scalar (filename). Use this for glove.twitter.27B.200d.txt from the Twitter embeddings. These word embeddings are trained on Twitter data only.

Details

This function reads one or more of the pre-trained word embeddings listed above. You need to download these files and unzip them on your computer before you can use them. Remember to include the full file path when you specify a file name in this function.

Meta embeddings: http://www.cis.uni-muenchen.de/~wenpeng/renamed-meta-emb.tar.gz

Wikipedia embeddings: http://nlp.stanford.edu/data/glove.6B.zip

Twitter embeddings: http://nlp.stanford.edu/data/glove.twitter.27B.zip

You can specify one or more pre-trained word embedding files. I recommend the meta embeddings. The full meta embeddings are contained in two files: one for ordinary words and one for rare words and/or misspellings (that appeared in only a subset of the different text sources).
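The download and extraction steps can also be done from within R; this is a sketch using base R utilities, where the extracted file names are taken from the argument descriptions above (the archives are large, so this may take a while):

```r
## Not run:
## download and unpack the meta embeddings
download.file(
    "http://www.cis.uni-muenchen.de/~wenpeng/renamed-meta-emb.tar.gz",
    destfile = "renamed-meta-emb.tar.gz"
    )
untar("renamed-meta-emb.tar.gz")  # should extract O2M_overlap.txt and O2M_oov.txt

## download and unpack the Wikipedia embeddings
download.file(
    "http://nlp.stanford.edu/data/glove.6B.zip",
    destfile = "glove.6B.zip"
    )
unzip("glove.6B.zip")  # glove.6B.300d.txt is among the extracted files
## End(Not run)
```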

See Also

scale_text, doc_to_tdm, get_keywords, plot_keywords, score_documents

Examples

## Not run: 
# download and extract embeddings data first

embeddings <- read_word_embeddings(
    in_vocab = out$vocab,
    # must add location on your computer "path/to/O2M_overlap.txt"
    ovefile = "O2M_overlap.txt",
    ovefile2 = "O2M_oov.txt" # very rare words and misspellings
    ## available here:
    ## http://www.cis.uni-muenchen.de/~wenpeng/renamed-meta-emb.tar.gz
    ## must unpack and replace "path/to/" with location on your computer
    )

## End(Not run)

wilryh/parrot documentation built on Jan. 9, 2020, 2:16 p.m.