README.md
In vspinu/mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines

The following two-step abstraction is provided by the package:

First, the vocabulary object is built from the entire corpus using the vocab(), update_vocab() and prune_vocab() functions.
Then, the vocabulary is passed alongside the corpus to a variety of corpus pre-processing functions. Most of the mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
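
The two steps can be illustrated with a minimal base R sketch. It does not call mlvocab and is not the package's implementation; the toy corpus, the column names term and term_count, the pruning threshold and the use of index 0 for unknown terms are all assumptions made for illustration. Hashing via nbuckets is not shown.

```r
# Toy corpus: a named list of pre-tokenized documents (hypothetical data).
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog", "barks")
)

# Step 1: build a vocabulary from the entire corpus.
# Terms are ordered by decreasing count, ties broken alphabetically.
counts <- table(unlist(corpus, use.names = FALSE))
ord <- order(-as.integer(counts), names(counts))
vocab_df <- data.frame(
  term = names(counts)[ord],
  term_count = as.integer(counts)[ord],
  stringsAsFactors = FALSE
)

# Prune rare terms (assumed threshold: keep terms seen at least twice).
pruned <- vocab_df[vocab_df$term_count >= 2, ]

# Step 2: pass the vocabulary alongside the corpus to a pre-processing step.
# Each document becomes a sequence of term indices; terms that are not in
# the pruned vocabulary are mapped to 0 (an assumption of this sketch).
seqs <- lapply(corpus, function(doc) match(doc, pruned$term, nomatch = 0L))

print(pruned)
print(seqs)
```

Because the vocabulary is built once and then reused, every pre-processing step sees the same term-to-index mapping, which is the point of separating the two steps.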

Current functionality includes:

term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.
term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices respectively.
subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
tf-idf weighting: tfidf() computes various versions of term frequency-inverse document frequency weighting of dtm and tdm matrices.
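
To make the term-matrix and tf-idf items concrete, here is a minimal base R sketch of a dense document-term matrix and one common tf-idf variant (row-normalized term frequency times log(N / document frequency)). It does not call mlvocab; the package's own dtm() and tfidf() may use different output types, normalizations or smoothing, and the toy corpus is hypothetical.

```r
# Toy corpus: a named list of pre-tokenized documents (hypothetical data).
corpus <- list(
  a = c("the", "quick", "brown", "fox"),
  b = c("the", "lazy", "dog"),
  c = c("the", "quick", "dog", "barks")
)

# All distinct terms, in alphabetical order, define the matrix columns.
terms <- sort(unique(unlist(corpus, use.names = FALSE)))

# Document-term matrix: one row per document, one column per term.
dtm_counts <- t(vapply(
  corpus,
  function(doc) as.numeric(table(factor(doc, levels = terms))),
  numeric(length(terms))
))
colnames(dtm_counts) <- terms

# Term frequency: counts divided by the document length (row sums).
tf <- dtm_counts / rowSums(dtm_counts)

# Inverse document frequency: log(number of docs / docs containing the term).
doc_freq <- colSums(dtm_counts > 0)
idf <- log(nrow(dtm_counts) / doc_freq)

# tf-idf: scale each column of tf by the idf of its term.
tfidf_mat <- sweep(tf, 2, idf, `*`)

print(round(tfidf_mat, 3))
```

In this variant a term that occurs in every document, such as "the", gets an idf of zero and therefore a zero weight everywhere. The term-document matrix is simply the transpose, t(dtm_counts).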

The package is in alpha state. API changes are likely.

vspinu/mlvocab documentation built on June 11, 2021, 7:37 a.m.
