bigram.table: Compute table of bigrams

Description

This function counts the bigrams in the data. It operates on the vectors of term IDs and document IDs; that is, the vocabulary has already been established, and this function simply counts occurrences of consecutive terms in the data.

Usage

bigram.table(term.id = integer(), doc.id = integer(), vocab = character(),
  n = integer())

Arguments

term.id

an integer vector containing the term ID number of every token in the corpus. Should take values between 1 and W, where W is the number of terms in the vocabulary.

doc.id

an integer vector containing the document ID number of every token in the corpus. Should take values between 1 and D, where D is the total number of documents in the corpus.

vocab

a character vector of length W, containing the terms in the vocabulary. This vector must align with term.id, such that a term.id of 1 indicates the first element of vocab, a term.id of 2 indicates the second element of vocab, etc.

n

an integer specifying how large the bigram table should be. The function will return the n most frequent bigrams. This argument exists because the number of distinct bigrams can be as large as W^2.

Value

a data frame with three columns and n rows, containing the rank of each bigram in decreasing order of frequency (column 1), the bigrams (column 2), and their frequencies (column 3). The table is sorted by default in decreasing order of frequency.
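
The following base R sketch illustrates the kind of inputs and output described above. It is not the package's implementation: the toy corpus, the column names (rank, bigram, frequency), the space used to join the two terms, and the choice to skip pairs that straddle two documents are all assumptions made for illustration.

```r
# Hypothetical toy corpus: two documents, already tokenized.
docs <- list(c("the", "cat", "sat", "on", "the", "cat"),
             c("the", "cat", "ran"))

# Build inputs in the shape described under Arguments.
tokens  <- unlist(docs)
vocab   <- sort(unique(tokens))                 # character vector of length W
term.id <- match(tokens, vocab)                 # values between 1 and W
doc.id  <- rep(seq_along(docs), lengths(docs))  # values between 1 and D

# Pair every token with its successor, keeping only pairs that fall
# inside one document (assumption: bigrams do not span documents).
N      <- length(term.id)
keep   <- doc.id[-N] == doc.id[-1]
first  <- vocab[term.id[-N][keep]]
second <- vocab[term.id[-1][keep]]
bigram <- paste(first, second)                  # separator is an assumption

# Count, sort in decreasing order of frequency, and keep the top n.
n      <- 3L
counts <- sort(table(bigram), decreasing = TRUE)
top    <- counts[seq_len(min(n, length(counts)))]

# Assemble a three-column table: rank, bigram, frequency.
sketch <- data.frame(rank      = seq_along(top),
                     bigram    = names(top),
                     frequency = as.integer(top),
                     stringsAsFactors = FALSE)
print(sketch)
```

With the real function, the same term.id, doc.id, and vocab vectors would be passed as bigram.table(term.id, doc.id, vocab, n); the exact column names and bigram formatting of its result may differ from this sketch.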


kshirley/LDAtools documentation built on May 20, 2019, 7:03 p.m.