collapse.bigrams: Replace specified bigrams with terms representing the bigrams

Description Usage Arguments Value

Description

After tokenization, use this function to replace all occurrences of a given bigram with a single token representing the bigram, and 'delete' the occurrences of the two individual tokens that comprised the bigram (so that it is still a generative model for text).

Usage

1
2
collapse.bigrams(bigrams = character(), doc.id = integer(),
  term.id = integer(), vocab = character())

Arguments

bigrams

A character vector, each element of which is a bigram represented by two terms separated by a hyphen, such as 'term1-term2'. Every consecutive occurrence of 'term1' and 'term2' in the data will be replaced by a single token representing this bigram.

doc.id

an interger vector containing the document ID number of every token in the corpus. Should take values between 1 and D, where D is the total number of documents in the corpus.

term.id

an integer vector containing the term ID number of every token in the corpus. Should take values between 1 and W, where W is the number of terms in the vocabulary.

vocab

a character vector of length W, containing the terms in the vocabulary. This vector must align with term.id, such that a term.id of 1 indicates the first element of vocab, a term.id of 2 indicates the second element of vocab, etc.

Value

Returns a list of length three. The first element, new.vocab, is a character vector containing the new vocabulary. The second element, new.term.id is the new vector of term ID numbers for all tokens in the data, taking integer values from 1 to the length of the new vocabulary. The third element is new.doc.id, which is the new version of the document id vector. If any of the specified bigrams were present in the data, then new.term.id and new.doc.id will be shorter vectors than the original term.id and doc.id vectors.


kshirley/LDAtools documentation built on May 20, 2019, 7:03 p.m.