create_tcm: Term-co-occurrence matrix construction

View source: R/tcm.R

Description

This function constructs a term-co-occurrence matrix (TCM). A TCM is typically used with the GloVe word embedding model.

Usage

create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)

## S3 method for class 'itoken'
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)

## S3 method for class 'itoken_parallel'
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)

Arguments

it

list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.

vectorizer

function: a vectorizer function. See vectorizers.

skip_grams_window

integer window for term-co-occurrence matrix construction. skip_grams_window should be > 0 if you plan to use the vectorizer in the create_tcm function. A value of 0L means the TCM is not constructed.

skip_grams_window_context

one of c("symmetric", "right", "left"): which context words to use when counting co-occurrence statistics. "symmetric" by default, which takes into account skip_grams_window words to both the left and the right.

weights

weights for context words by distance, used when calculating co-occurrence statistics. By default, weight = 1 / distance_from_current_word. Should have length equal to skip_grams_window.

binary_cooccurence

FALSE by default. If set to TRUE, the function counts only the first appearance of the context word and ignores the remaining occurrences. Useful when creating a TCM for evaluating the coherence of topic models.

...

arguments passed to the foreach function, which is used to iterate over it.
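
To make the interaction of skip_grams_window and weights concrete, here is a minimal base-R sketch that counts distance-weighted co-occurrences on a toy token sequence. It is an illustration only, not text2vec's implementation: toy_tcm is a hypothetical helper, and it counts only the words that follow each term (a right-hand context). In this toy, adding the transpose gives counts for both directions.

```r
# Toy illustration only; NOT text2vec's internal implementation.
# toy_tcm is a hypothetical helper introduced for this sketch.
toy_tcm = function(tokens, window = 2L, weights = 1 / seq_len(window)) {
  # weights must supply one value per distance 1..window
  stopifnot(length(weights) == window)
  terms = sort(unique(tokens))
  tcm = matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n = length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j = i + d
      if (j > n) break
      # a context word at distance d contributes weights[d]
      tcm[tokens[i], tokens[j]] = tcm[tokens[i], tokens[j]] + weights[d]
    }
  }
  tcm
}

tokens = c("the", "cat", "sat", "on", "the", "mat")
m = toy_tcm(tokens, window = 2L)   # default weights are c(1, 0.5)
m_both = m + t(m)                  # counts in both directions for this toy
print(m["the", "cat"])             # adjacent pair, distance 1
print(m["the", "sat"])             # distance 2
```

With the default weights, an adjacent pair adds 1 and a pair two positions apart adds 0.5, so nearby words dominate the counts.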

Details

If a parallel backend is registered, the TCM is constructed in parallel. In that case, split the data yourself and provide a list of itoken iterators. Each element of it is handled by a separate worker, and the results are combined at the end of processing.

Value

dgTMatrix: the term-co-occurrence matrix.
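
For readers unfamiliar with the dgTMatrix class from the Matrix package, the sketch below builds a small stand-in matrix by hand and inspects it. The terms and values are invented for illustration and are not output of create_tcm. It shows the triplet slots and a conversion to column-compressed form, which some downstream code may expect.

```r
library(Matrix)

# Hypothetical stand-in for a TCM; terms and values are made up.
terms = c("cat", "mat", "sat")
m = sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3), x = c(1, 0.5, 1),
                 dims = c(3, 3), dimnames = list(terms, terms))
m = as(m, "TsparseMatrix")   # triplet (i, j, x) representation

class(m)        # triplet-form sparse matrix class
length(m@x)     # number of stored non-zero entries
m@i; m@j        # zero-based row and column indices of those entries
m["cat", "sat"] # look up a single co-occurrence weight by term

# Convert to column-compressed form if needed
m_csc = as(m, "CsparseMatrix")
```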

See Also

itoken create_dtm

Examples

## Not run: 
library(text2vec)
data("movie_review")

# single thread

tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
# build the vocabulary from the iterator created above
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)

# parallel version

# set to number of cores on your machine
N_WORKERS = 1
if(require(doParallel)) registerDoParallel(N_WORKERS)
splits = split_into(movie_review$review, N_WORKERS)
jobs = lapply(splits, itoken, tolower, word_tokenizer)
v = create_vocabulary(jobs)
vectorizer = vocab_vectorizer(v)
jobs = lapply(splits, itoken, tolower, word_tokenizer)

tcm = create_tcm(jobs, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")

## End(Not run)

dselivanov/text2vec documentation built on Sept. 23, 2018, 1:57 a.m.