mlvocab: Vocabulary and Corpus Preprocessing for Natural Language Pipelines

The package provides the following two-step abstraction:

1. The vocabulary object is first built from the entire corpus using the vocab(), update_vocab() and prune_vocab() functions.
2. The vocabulary is then passed alongside the corpus to a variety of corpus pre-processing functions.

Most of the mlvocab functions accept an nbuckets argument for partial or full hashing of the corpus.
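The two steps above can be illustrated with a language-neutral sketch. The Python below is not mlvocab's API or implementation: the function names (build_vocab, prune, index_sequences), the 1-based indexing, and the use of CRC32 for hashing are all assumptions made for illustration. It builds term counts from a corpus, prunes rare terms, and then maps each document to integer indices, optionally hashing out-of-vocabulary terms into a fixed number of extra buckets.

```python
import zlib
from collections import Counter


def build_vocab(corpus):
    """Step 1a: count whitespace-separated terms over the whole corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc.split())
    return counts


def prune(counts, min_count=1, max_terms=None):
    """Step 1b: keep the most frequent terms and assign 1-based indices."""
    kept = [t for t, c in counts.most_common(max_terms) if c >= min_count]
    return {term: i + 1 for i, term in enumerate(kept)}


def index_sequences(corpus, index, nbuckets=0):
    """Step 2: map each document to a list of integer term indices.

    Unknown terms are dropped when nbuckets == 0; otherwise they are
    hashed into one of nbuckets extra indices placed after the vocabulary.
    (Hypothetical scheme; mlvocab's actual hashing may differ.)
    """
    sequences = []
    for doc in corpus:
        seq = []
        for term in doc.split():
            if term in index:
                seq.append(index[term])
            elif nbuckets > 0:
                bucket = zlib.crc32(term.encode("utf-8")) % nbuckets
                seq.append(len(index) + 1 + bucket)
        sequences.append(seq)
    return sequences


corpus = ["the cat sat", "the dog sat", "the bird flew"]
index = prune(build_vocab(corpus), min_count=2)  # {'the': 1, 'sat': 2}
print(index_sequences(corpus, index))            # [[1, 2], [1, 2], [1]]
print(index_sequences(corpus, index, nbuckets=4))
```

With nbuckets set, every document keeps its full length because unknown terms share a small set of bucket indices instead of being discarded. Hashing every term this way, with an empty vocabulary, would correspond to the "full hashing" case.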

Current functionality includes:

- Term index sequences: tix_seq(), tix_mat() and tix_df() produce integer sequences suitable for direct consumption by various sequence models.
- Term matrices: dtm(), tdm() and tcm() create document-term, term-document and term-co-occurrence matrices, respectively.
- Subsetting embedding matrices: given pre-trained word vectors, prune_embeddings() creates smaller embedding matrices, handling missing and unknown vocabulary terms gracefully.
- Tf-idf weighting: tfidf() computes various versions of term frequency-inverse document frequency weighting of dtm and tdm matrices.
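To make the term-matrix and tf-idf items above concrete, here is a small conceptual sketch in Python. It does not use mlvocab; the helper names are hypothetical, the matrices are plain dense lists rather than sparse matrices, and the weighting shown (term count divided by document length, times log(N / document frequency)) is only one common tf-idf variant, chosen as an assumption since the variants the package offers are not listed here.

```python
import math
from collections import Counter


def doc_term_matrix(corpus, index):
    """One row per document, one column per vocabulary term (raw counts)."""
    rows = []
    for doc in corpus:
        counts = Counter(t for t in doc.split() if t in index)
        rows.append([counts.get(term, 0) for term in index])
    return rows


def term_doc_matrix(corpus, index):
    """Transpose of the document-term matrix."""
    return [list(col) for col in zip(*doc_term_matrix(corpus, index))]


def tfidf_weight(matrix):
    """Apply tf * idf to a document-term matrix (assumed variant, see above)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    weighted = []
    for row in matrix:
        total = sum(row)
        weighted.append(
            [(c / total) * w if total else 0.0 for c, w in zip(row, idf)]
        )
    return weighted


index = {"the": 0, "cat": 1, "dog": 2}   # column order follows dict order
corpus = ["the cat", "the dog dog"]
m = doc_term_matrix(corpus, index)        # [[1, 1, 0], [1, 0, 2]]
print(m)
print(term_doc_matrix(corpus, index))     # [[1, 1], [1, 0], [0, 2]]
print(tfidf_weight(m))
```

In this variant a term that appears in every document, such as "the" here, receives a weight of zero, which is why tf-idf is often used to suppress very common terms.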

The package is in alpha state, and API changes are likely.

mlvocab documentation built on Sept. 21, 2018, 6:35 p.m.
