model_tfidf: Tf-idf Model


View source: R/models.R

Description

Initialise a model based on the document frequencies of all its features.
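As a rough illustration of weighting by document frequency, the base R sketch below computes an inverse document frequency for each term and scales raw term counts by it. It does not call gensimr. The toy matrix and the names `counts`, `df`, `idf` and `tfidf` are invented for illustration, and the idf variant (log base 2 of N / df) is an assumption rather than a statement of what the package computes.

```r
# Toy document-term count matrix: 3 documents (rows) x 3 terms (columns).
counts <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("graph", "tree", "minor"))
)

n_docs <- nrow(counts)
df <- colSums(counts > 0)   # number of documents containing each term
idf <- log2(n_docs / df)    # assumed idf variant: log base 2 of N / df

# Scale each column (term) of raw counts by that term's idf.
tfidf <- sweep(counts, 2, idf, `*`)
```

A term that occurs in every document ("graph" here) gets an idf of zero, so it contributes nothing to any document vector.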

Usage

model_tfidf(mm, normalize = FALSE, smart = "nfc", pivot = NULL,
  slope = 0.25, ...)

load_tfidf(file)

Arguments

mm

A matrix market as returned by mmcorpus_serialize.

normalize

Normalize document vectors to unit Euclidean length? You can also supply your own function as normalize.
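For reference, scaling a vector to unit Euclidean length means dividing it by the square root of its sum of squares. The base R sketch below shows only that arithmetic. The helper name `l2_normalize` is hypothetical, and the shape of the input a custom normalize function actually receives from the underlying model is not specified here.

```r
# Scale a numeric vector of term weights to unit Euclidean (L2) length.
l2_normalize <- function(weights) {
  norm <- sqrt(sum(weights^2))
  if (norm == 0) return(weights)  # leave all-zero vectors unchanged
  weights / norm
}

doc_vector <- c(3, 4)
unit_vector <- l2_normalize(doc_vector)
```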

smart

SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. A combination of weights is written as a three-letter mnemonic XYZ, for example nfc or bpn, where the letters denote, in order, the term frequency weighting, the document frequency weighting and the document normalization applied to the document vector. See the SMART section.

pivot

You can either set the pivot by hand, or let Gensim determine it automatically in two steps: 1) set either the u or b document normalization in the smart argument (Gensim's smartirs parameter); 2) set either the corpus or dictionary parameter. The pivot is then derived from the properties of the corpus or dictionary. If pivot is NULL and you do not follow both steps, pivoted document length normalization is disabled.

slope

Setting the slope to 0.0 uses only the pivot as the norm, and setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results.
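The two limiting cases described above are consistent with the usual pivoted normalization formula, in which the new norm is a weighted mix of the pivot and the document's original norm. The sketch below assumes that formula; `pivoted_norm` and its argument names are hypothetical and are not part of gensimr.

```r
# Pivoted norm: interpolate between the pivot and the document's own norm.
# Assumed formula: (1 - slope) * pivot + slope * old_norm
pivoted_norm <- function(old_norm, pivot, slope = 0.25) {
  (1 - slope) * pivot + slope * old_norm
}

only_pivot <- pivoted_norm(old_norm = 20, pivot = 10, slope = 0)    # pivot alone
unchanged  <- pivoted_norm(old_norm = 20, pivot = 10, slope = 1)    # original norm
mixed      <- pivoted_norm(old_norm = 20, pivot = 10, slope = 0.25)
```

With a slope of 0.25, a document whose norm is twice the pivot is divided by 12.5 rather than 20, so long documents are penalised less than under plain normalization.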

...

Any other options passed on to the underlying Gensim model; see the official Gensim documentation.

file

Path to a saved model.

SMART

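Term frequency weighting (as listed in the Gensim documentation):

b - binary
t or n - raw
a - augmented
l - logarithm
d - double logarithm
L - log average

Document frequency weighting:

x or n - none
f - idf
t - zero-corrected idf
p - probabilistic idf

Document normalization:

x or n - none
c - cosine
u - pivoted unique
b - pivoted character length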
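A SMART mnemonic carries one letter per component, in the order of the three headings above. The base R sketch below only splits a mnemonic into those three positions; `parse_smart` is a hypothetical helper, not a gensimr function, and it does not check that each letter is a valid code.

```r
# Split a three-letter SMART mnemonic into its three components.
parse_smart <- function(smart) {
  stopifnot(is.character(smart), length(smart) == 1, nchar(smart) == 3)
  parts <- strsplit(smart, "")[[1]]
  list(
    term_frequency     = parts[1],  # first letter
    document_frequency = parts[2],  # second letter
    normalization      = parts[3]   # third letter
  )
}

parsed <- parse_smart("nfc")
```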

Examples

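library(gensimr)

# `corpus` is assumed to be a collection of raw text documents already in scope
docs <- prepare_documents(corpus)         # preprocess the raw documents
dictionary <- corpora_dictionary(docs)    # map each token to an integer id
corpora <- doc2bow(dictionary, docs)      # convert documents to bag-of-words

# fit model
tfidf <- model_tfidf(corpora)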

news-r/gensimr documentation built on Jan. 9, 2021, 5:55 a.m.