API

Goals we aimed to achieve in developing text2vec:

Conceptually, the API can be divided into several pieces:

Vectorization

See Vectorization section for details.

The create_* family of functions, together with vocab_vectorizer() and hash_vectorizer(), creates vocabularies, document-term matrices (DTM) and term co-occurrence matrices (TCM). Put simply, this family of functions is in charge of converting text into numeric form. The main functions are create_vocabulary(), create_dtm(), create_tcm(), vocab_vectorizer() and hash_vectorizer().
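As a rough illustration of how these pieces fit together, the sketch below builds a vocabulary, a DTM, a TCM and a hashed DTM from a tiny made-up corpus. The documents, the window size and the hash size are assumptions chosen only for the example.

```r
library(text2vec)

# Tiny illustrative corpus (made up for this sketch)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs")

# Iterator over tokens: lower-case the text, then split it into words
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
             ids = names(docs), n_chunks = 1, progressbar = FALSE)

# Vocabulary-based vectorization
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)                          # documents x terms
tcm <- create_tcm(it, vectorizer, skip_grams_window = 2L)  # terms x terms

# Hash-based vectorization needs no vocabulary pass
h_vectorizer <- hash_vectorizer(hash_size = 2^10)
dtm_hashed <- create_dtm(it, h_vectorizer)                 # documents x hash buckets
```

The same iterator is reused for every call, so the text source is described once and each create_* function decides how to consume it.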

I/O handling

All functions from the create_* family take iterators over tokens as input. Good examples of functions for creating such iterators are itoken() and, for reading from disk, ifiles() and idir().

If a custom source is needed (for example, a data stream from an RDBMS), the user only needs to create a suitable iterator over tokens.
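One simple way to plug in a custom source, sketched here under assumptions, is to pull the text yourself, tokenize it, and hand the resulting list of tokens to itoken(). The fetch_rows() helper is hypothetical and merely stands in for a real database query.

```r
library(text2vec)

# Hypothetical stand-in for a database query; in practice this might wrap
# a call such as DBI::dbGetQuery() or a streaming cursor.
fetch_rows <- function() {
  data.frame(id = c("r1", "r2"),
             text = c("First record from the source", "Second record"),
             stringsAsFactors = FALSE)
}

rows <- fetch_rows()

# Tokenize outside of itoken() and pass the list of token vectors directly
tokens <- word_tokenizer(tolower(rows$text))
it_custom <- itoken(tokens, ids = rows$id, n_chunks = 1, progressbar = FALSE)

# From here on the custom source behaves like any other token iterator
vocab_custom <- create_vocabulary(it_custom)
dtm_custom <- create_dtm(it_custom, vocab_vectorizer(vocab_custom))
```

This version loads everything into memory first; for a source too large for that, the same idea would have to be applied chunk by chunk.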

Easy parallel processing

text2vec also provides convenient functions for easy parallel processing of text (many of the tasks are embarrassingly parallel).

Parallel itoken iterators can be used in the create_dtm() and create_tcm() functions in exactly the same way as their sequential counterparts.

Models

text2vec provides a unified interface for models, inspired by the scikit-learn interface. Models in text2vec are mostly transformers and decompositions: they either transform a document-term matrix or decompose it into two low-rank matrices.

Models include Tf-idf, LSA, LDA and GloVe.

All text2vec models are mutable! This means that the fit() and fit_transform() methods change the model that was provided as an argument.
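The following sketch illustrates mutability with the TfIdf model, using a made-up corpus. Fitting stores the learned weights inside the model object itself, so nothing has to be reassigned before the model is reused.

```r
library(text2vec)

# Tiny illustrative corpus (made up for this sketch)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs")

it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
             ids = names(docs), n_chunks = 1, progressbar = FALSE)
dtm <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

tfidf <- TfIdf$new()

# fit_transform() returns the transformed matrix AND modifies tfidf in place
dtm_tfidf <- tfidf$fit_transform(dtm)

# The already-fitted model can now be applied without refitting
dtm_again <- tfidf$transform(dtm)
```

Because the model is modified in place, create a fresh object with TfIdf$new() whenever an independent model is needed.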

Important verbs

All models share a unified interface. The user only needs to remember a few verbs for manipulating models: fit(), fit_transform() and transform().

Decomposition models decompose a matrix into two low-rank matrices $X$ and $Y$. $X$ corresponds to item embeddings and $Y$ to feature embeddings. For example, for LDA $X$ holds the document-topic assignments and $Y$ the topic-word assignments. While the fit_transform() and transform() methods return $X$, the second matrix $Y$ is available through the read-only components field: model$components. Examples of decomposition models in text2vec are LDA, LSA and GloVe. Check the documentation of these classes for additional information.
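A minimal sketch of the $X$ / $Y$ split, using LSA on a made-up corpus. The number of topics is an arbitrary assumption, and on a corpus this small the embeddings carry no real meaning; the point is only where each matrix comes from.

```r
library(text2vec)

# Tiny illustrative corpus (made up for this sketch)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat on the log",
          d3 = "cats and dogs are pets",
          d4 = "the mat and the log")

it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
             ids = names(docs), n_chunks = 1, progressbar = FALSE)
dtm <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))

lsa <- LSA$new(n_topics = 2)

# X: one row per document, one column per topic
doc_embeddings <- lsa$fit_transform(dtm)

# Y: one row per topic, one column per term (read-only field)
term_loadings <- lsa$components
```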

Distances

See Distances section for details.

The text2vec package provides two sets of functions for measuring various distances/similarities in a unified way. All methods are written with special attention to computational performance and memory efficiency.

  1. sim2(x, y, method) - calculates the similarity between each row of matrix x and each row of matrix y using the given method.
  2. psim2(x, y, method) - calculates the "parallel" similarity between each row of matrix x and the corresponding row of matrix y using the given method.
  3. dist2(x, y, method) - calculates the distance/dissimilarity between each row of matrix x and each row of matrix y using the given method.
  4. pdist2(x, y, method) - calculates the "parallel" distance/dissimilarity between each row of matrix x and the corresponding row of matrix y using the given method.
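The difference between the all-pairs and the "parallel" (row-by-row) variants can be seen in a small sketch with two made-up dense matrices; the cosine method and L2 normalization are choices made for the example.

```r
library(text2vec)

# Two small dense matrices with the same shape (values made up)
x <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1))
y <- rbind(p = c(1, 0), q = c(1, 1), r = c(0, 2))

# All pairs: one entry for every (row of x, row of y) combination
s <- sim2(x, y, method = "cosine", norm = "l2")
d <- dist2(x, y, method = "cosine", norm = "l2")

# Row-by-row: row i of x is compared only with row i of y
ps <- psim2(x, y, method = "cosine", norm = "l2")
pd <- pdist2(x, y, method = "cosine", norm = "l2")
```

The parallel variants require x and y to have the same dimensions and return one value per row, which matches the diagonal of the corresponding all-pairs result.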

Distances/similarities implemented at the moment:



dselivanov/text2vec documentation built on Nov. 16, 2023, 6:37 p.m.