
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE)

This vignette demonstrates some advanced features of the text2vec package: how to read large collections of text stored on disk rather than in memory, and how to let text2vec functions use multiple cores.

Working with files

In many cases, you will have a corpus of texts which is too large to fit in memory. This section demonstrates how to use text2vec to vectorize large collections of text stored in files.

Imagine we have a collection of movie reviews stored in multiple text files on disk. For this vignette, we will create files on disk using the movie_review dataset:

library(text2vec)
library(magrittr)
data("movie_review")

# remove all internal EOLs to simplify reading
movie_review$review = gsub(pattern = '\n', replacement = ' ', 
                           x = movie_review$review, fixed = TRUE)
N_FILES = 10
CHUNK_LEN = nrow(movie_review) / N_FILES
files = sapply(1:N_FILES, function(x) tempfile())
chunks = split(movie_review, rep(1:N_FILES, each = CHUNK_LEN))
for (i in 1:N_FILES) {
  write.table(chunks[[i]], files[[i]], quote = TRUE, row.names = FALSE,
              col.names = TRUE, sep = '|')
}

# Note what the movie review data looks like
str(movie_review, strict.width = 'cut')
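
To double-check what was written to disk, we can peek at the first lines of one file with base R's readLines(); the header and '|' separators come from the write.table() call above:

readLines(files[[1]], n = 2)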

The text2vec package provides functions that make it easy to work with files. You need to follow a few steps.

  1. Construct an iterator over the files with the ifiles() function.
  2. Provide a reader() function to ifiles() that can read those files. You can use a function from base R or any other package to read plain text, XML, or other formats and convert them to text; the text2vec package doesn't handle the reading itself. The reader function should return a NAMED character vector:
    • elements of the character vector will be treated as documents
    • names of the elements will be treated as document ids
    • if the user doesn't provide a named character vector, text2vec will generate document ids of the form filename + line_number (assuming that each line is a separate document)
  3. Construct a tokens iterator from the files iterator using the itoken() function.

Let's see how it works:

library(data.table)
reader = function(x, ...) {
  # read the file
  chunk = data.table::fread(x, header = TRUE, sep = '|')
  # select the column with the review text
  res = chunk$review
  # assign ids to reviews
  names(res) = chunk$id
  res
}
# create iterator over files
it_files  = ifiles(files, reader = reader)
# create iterator over tokens from files iterator
it_tokens = itoken(it_files, preprocessor = tolower, tokenizer = word_tokenizer, progressbar = FALSE)

vocab = create_vocabulary(it_tokens)
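
The vocabulary can optionally be pruned before vectorization with prune_vocabulary(); the thresholds below are purely illustrative:

# a minimal sketch: drop very rare and very frequent terms
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,
                                doc_proportion_max = 0.5)

We keep the unpruned vocab for the rest of this vignette.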

Now we are able to construct the DTM:

dtm = create_dtm(it_tokens, vectorizer = vocab_vectorizer(vocab))
str(dtm, list.len = 5)

Note that the DTM has document ids, inherited from the document names we assigned in the reader function. This is a convenient way to assign document ids when working with files.
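
We can verify this by inspecting the row names of the DTM, which should correspond to the id column of the original data frame (a quick check using base R):

head(rownames(dtm))
# every row name should be one of the original review ids
all(rownames(dtm) %in% movie_review$id)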

Falling back to auto-generated ids. Let's see how text2vec handles the case when the user doesn't provide document ids:

for (i in 1:N_FILES) {
  write.table(chunks[[i]][["review"]], files[[i]], quote = TRUE, row.names = FALSE,
              col.names = TRUE, sep = '|')
}
# read with default reader - readLines
it_files  = ifiles(files)
# create iterator over tokens from files iterator
it_tokens = itoken(it_files, preprocessor = tolower,  tokenizer = word_tokenizer, progressbar = FALSE)
dtm = create_dtm(it_tokens, vectorizer = hash_vectorizer())
str(dtm, list.len = 5)
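
The row names now contain ids generated by text2vec itself, following the filename + line_number convention described above:

# auto-generated document ids
head(rownames(dtm))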

Multicore machines

For many tasks text2vec allows you to take advantage of multicore machines. The functions create_dtm(), create_tcm(), and create_vocabulary() are good examples. In contrast to GloVe fitting, which uses low-level thread parallelism via RcppParallel, these functions use the standard high-level R parallelization provided by the foreach package. They are flexible and can use different parallel backends, such as doParallel or doRedis. But keep in mind that such high-level parallelism might involve significant overhead.

The user must do two things manually to take advantage of a multicore machine:

  1. Register a parallel backend.
  2. Use ifiles_parallel and itoken_parallel iterators.

Data in memory

N_WORKERS = 2
library(doParallel)
# register parallel backend
registerDoParallel(N_WORKERS)

# note that we can control the level of granularity with the `n_chunks` argument
it_token_par = itoken_parallel(movie_review$review, preprocessor = tolower, 
                               tokenizer = word_tokenizer, ids = movie_review$id, 
                               n_chunks = 8)

vocab = create_vocabulary(it_token_par)
v_vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it_token_par, vectorizer = v_vectorizer)
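
As a quick sanity check, the DTM built in parallel should have the same shape as one built with a sequential itoken() iterator; this comparison is a sketch, not part of the core workflow:

# build the same DTM sequentially and compare dimensions
it_token_seq = itoken(movie_review$review, preprocessor = tolower,
                      tokenizer = word_tokenizer, ids = movie_review$id,
                      progressbar = FALSE)
dtm_seq = create_dtm(it_token_seq, vectorizer = v_vectorizer)
identical(dim(dtm), dim(dtm_seq))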

Data on disk

Processing files from disk is also easy with ifiles_parallel and itoken_parallel:

N_WORKERS = 2
library(doParallel)
# register parallel backend
registerDoParallel(N_WORKERS)

it_files_par = ifiles_parallel(file_paths = files)
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)

vocab = create_vocabulary(it_token_par)
# DTM vocabulary vectorization
v_vectorizer = vocab_vectorizer(vocab)
dtm_v = create_dtm(it_token_par, vectorizer = v_vectorizer)

# DTM hash vectorization
h_vectorizer = hash_vectorizer()
dtm_h = create_dtm(it_token_par, vectorizer = h_vectorizer)

# co-occurrence statistics
tcm = create_tcm(it_token_par, vectorizer = v_vectorizer, skip_grams_window = 5)
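
Finally, we can inspect the result and, optionally, release the workers. stopImplicitCluster() from the doParallel package unregisters the backend created by registerDoParallel():

str(tcm, list.len = 5)
# unregister the parallel backend
stopImplicitCluster()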

