count_words: A function to efficiently form aggregate word counts and a...
In matthewjdenny/SpeedReader: High Performance Text Analysis

A function to efficiently form aggregate word counts and a common vocabulary vector from an unordered list of document term vectors.

count_words(document_term_vector_list, maximum_vocabulary_size = -1,
  existing_vocabulary = NULL, existing_word_counts = NULL,
  document_term_count_list = NULL)

`document_term_vector_list`	A list of string vectors (or a single string vector) from which we wish to find a unique vocabulary and counts for all unique words.
`maximum_vocabulary_size`	A number larger than the maximum vocabulary size we expect to find. Defaults to 1,000,000 but can be lowered to conserve memory, or raised if more unique words are expected. This number is specified beforehand because all word count vectors are pre-allocated, which is faster than growing a vector.
`existing_vocabulary`	An existing vocabulary vector we wish to add to. Defaults to NULL, in which case a new word count and vocabulary are generated.
`existing_word_counts`	A vector of existing word counts that must also be provided if existing_vocabulary is specified. Defaults to NULL, in which case a new word count and vocabulary are generated.
`document_term_count_list`	An optional list of vectors of word counts, in which case we will aggregate over them. This can be useful if we wish to store documents in a memory-efficient way. Defaults to NULL.
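
To illustrate the idea behind `document_term_count_list`, the following is a minimal base-R sketch of aggregating over documents stored as distinct terms plus parallel count vectors. It does not call SpeedReader and is not the package's implementation; the variable names `term_vectors`, `term_counts`, and `totals` are hypothetical, and the assumption that each count vector lines up element by element with its term vector is ours.

```r
# Hypothetical compact storage: each document keeps its distinct terms once,
# with a parallel vector of per-document counts (assumed layout).
term_vectors <- list(c("the", "cat", "sat"), c("the", "dog"))
term_counts  <- list(c(2L, 1L, 1L), c(3L, 1L))

# Sum the counts for each term across all documents.
# This is a base-R stand-in for the aggregation, not SpeedReader's code.
totals <- vapply(
  split(unlist(term_counts), unlist(term_vectors)),
  sum,
  integer(1)
)

# Order terms by descending aggregate frequency.
totals <- totals[order(totals, decreasing = TRUE)]
print(totals)
```

Storing each term once per document, with its count alongside, avoids repeating frequent tokens, which is presumably the memory saving the argument description refers to.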

A list object with a unique_words field containing a vector of all unique word types, in descending order of their frequency, as well as a word_counts field containing word counts for each of those words, in the same order, and a total_unique_words field – the size of the vocabulary.
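
The shape of the returned list can be sketched in base R as follows. This is an illustration built from the description above, not output from SpeedReader: the toy input `docs` is hypothetical, and the tie-breaking order among equally frequent words in the real function is not documented here. With the package installed, the equivalent call would presumably be `count_words(docs)`.

```r
# Hypothetical toy input: two tokenized documents.
docs <- list(
  c("the", "cat", "sat", "on", "the", "mat"),
  c("the", "dog", "sat")
)

# Base-R stand-in for the aggregation (not SpeedReader's code):
# pool all tokens, count each word type, and sort by descending frequency.
all_tokens <- unlist(docs)
counts <- sort(table(all_tokens), decreasing = TRUE)

# Mirror the three documented fields of the return value.
result <- list(
  unique_words = names(counts),
  word_counts = as.integer(counts),
  total_unique_words = length(counts)
)
str(result)
```

Because `unique_words` and `word_counts` share the same order, `result$word_counts[i]` is the count for `result$unique_words[i]`.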

matthewjdenny/SpeedReader documentation built on March 25, 2020, 5:32 p.m.