create_tcm: Term-co-occurrence matrix construction
In dselivanov/text2vec: Modern Text Mining Framework for R

create_tcm

R Documentation

Term-co-occurrence matrix construction

Description

Constructs a term-co-occurrence matrix (TCM). A TCM is typically used as input to the GloVe word embedding model.

Usage

create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
  ...)

## S3 method for class 'itoken'
create_tcm(it, vectorizer, skip_grams_window = 5L,
  skip_grams_window_context = c("symmetric", "right", "left"),
  weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
  ...)

## S3 method for class 'itoken_parallel'
create_tcm(it, vectorizer,
  skip_grams_window = 5L, skip_grams_window_context = c("symmetric",
  "right", "left"), weights = 1/seq_len(skip_grams_window),
  binary_cooccurence = FALSE, ...)

Arguments

`it`	`list` of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.
`vectorizer`	`function` vectorizer function. See vectorizers.
`skip_grams_window`	`integer` window size for term-co-occurrence matrix construction. `skip_grams_window` should be > 0 if you plan to use `vectorizer` in the `create_tcm` function. A value of `0L` means the TCM is not constructed.
`skip_grams_window_context`	one of `c("symmetric", "right", "left")` - which context words to use when counting co-occurrence statistics. `"symmetric"` by default - takes into account `skip_grams_window` words to the left and right.
`weights`	weights for context/distant words during co-occurrence statistics calculation. By default, `weights = 1 / distance_from_current_word`. Should have length equal to `skip_grams_window`.
`binary_cooccurence`	`FALSE` by default. If set to `TRUE`, the function counts only the first appearance of the context word and ignores the remaining occurrences. Useful when creating a TCM for evaluating the coherence of topic models.
`...`	placeholder for additional arguments (not used at the moment).
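
To make `skip_grams_window` and `weights` concrete, the following base R sketch counts weighted co-occurrences for a single token sequence. It is a toy illustration under stated assumptions, not text2vec's implementation: the helper `toy_cooccurrence()` is hypothetical, it looks only at right-hand context, and it returns a dense matrix rather than a sparse one.

```r
# Default weighting scheme from the Usage section: 1 / distance.
default_weights = 1 / seq_len(2L)

# Hypothetical helper (not part of text2vec): for every token, add
# weights[d] to the cell (token, neighbor) for each neighbor that is
# d positions to the right, with d = 1..window.
toy_cooccurrence = function(tokens, window = 2L,
                            weights = 1 / seq_len(window)) {
  stopifnot(length(weights) == window)
  terms = sort(unique(tokens))
  m = matrix(0, length(terms), length(terms),
             dimnames = list(terms, terms))
  n = length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j = i + d
      if (j > n) break  # no more neighbors to the right
      m[tokens[i], tokens[j]] = m[tokens[i], tokens[j]] + weights[d]
    }
  }
  m
}

tokens = c("the", "cat", "sat", "on", "the", "mat")
m = toy_cooccurrence(tokens, window = 2L)

# One simple way to count both directions in this toy setting.
both = m + t(m)
```

With these weights, a neighbor at distance 1 contributes 1 and a neighbor at distance 2 contributes 0.5, so closer context words count more. Passing `weights = rep(1, window)` to the toy helper would weight all distances equally.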

Details

If a parallel backend is registered, the TCM is constructed in multiple threads. Keep in mind that you must split the data yourself and provide a list of itoken iterators. Each element of `it` is handled in a separate thread, and the results are combined at the end of processing.
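
The split-then-combine idea can be sketched in base R. This is only an illustration under stated assumptions, not how text2vec does it internally: `count_chunk()` is a hypothetical helper that counts adjacent token pairs, the chunks are processed sequentially with `lapply()` where a parallel backend would distribute them, and the matrices are dense.

```r
# Toy corpus: each element is one tokenized document.
docs = list(c("a", "b", "a"), c("b", "c"), c("a", "c", "c"), c("b", "b"))

# All chunks must share one vocabulary so the partial matrices align.
vocab = sort(unique(unlist(docs)))

# Split the documents into chunks (here: round-robin into 2 chunks).
n_chunks = 2L
chunk_id = rep(seq_len(n_chunks), length.out = length(docs))
chunks = split(docs, chunk_id)

# Hypothetical helper: count adjacent pairs (window of 1, right context)
# for every document in a chunk.
count_chunk = function(chunk) {
  m = matrix(0, length(vocab), length(vocab),
             dimnames = list(vocab, vocab))
  for (tokens in chunk) {
    if (length(tokens) < 2L) next
    for (i in seq_len(length(tokens) - 1L)) {
      m[tokens[i], tokens[i + 1L]] = m[tokens[i], tokens[i + 1L]] + 1
    }
  }
  m
}

# Each chunk is processed independently, then the partial results
# are summed element-wise.
partial = lapply(chunks, count_chunk)
combined = Reduce(`+`, partial)

# Processing everything in one pass gives the same counts.
all_at_once = count_chunk(docs)
```

Because co-occurrence counts are additive across documents, summing the per-chunk matrices gives the same result as a single pass, which is why the work can be split across workers.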

Value

A `TsparseMatrix` containing the TCM.

Examples

## Not run: 
data("movie_review")

# single thread

tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
# build the vocabulary from the iterator created above
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)

# parallel version

# register a parallel backend first (setup depends on your text2vec version),
# sized to the number of cores on your machine
N = 100L  # number of reviews to process (illustrative value)
it = itoken_parallel(movie_review$review[1:N], preprocessor = tolower,
  tokenizer = word_tokenizer, ids = movie_review$id[1:N])
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = 'TsparseMatrix')
tcm = create_tcm(it, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")

## End(Not run)

dselivanov/text2vec documentation built on Aug. 20, 2024, 11:58 p.m.