TfIdf: TfIdf

Description

Creates a TF-IDF (term frequency-inverse document frequency) model. The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))
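
As a rough illustration of the IDF definition above, the following base R sketch computes the weights by hand for a small made-up matrix. It assumes the formula is read as the logarithm of the ratio, log(N / (df + 1)); the matrix `dtm_toy` and its term names are invented for the example and are not part of text2vec.

```r
# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
# Hypothetical data, used only to illustrate the formula.
dtm_toy = matrix(c(2, 0, 1,
                   1, 0, 0,
                   0, 3, 0,
                   1, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4), c("cat", "dog", "fish")))

n_docs = nrow(dtm_toy)            # number of documents in the corpus
doc_freq = colSums(dtm_toy > 0)   # number of documents where each term appears

# Assumed reading of the definition: log(N / (df + 1))
idf = log(n_docs / (doc_freq + 1))
print(round(idf, 3))
# cat = 0.000, dog = 0.693, fish = 0.693
```

The term "cat" appears in three of the four documents, so its weight drops to zero, while the rarer terms keep a positive weight.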

Format

R6Class object.

Details

Term Frequency Inverse Document Frequency

Usage

For usage details see Methods, Arguments and Examples sections.

tfidf = TfIdf$new(smooth_idf = TRUE, norm = c('l1', 'l2', 'none'), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)

Methods

$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)

Creates a TF-IDF model.

$fit_transform(x)

Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.

$transform(x)

Transforms new data x using the IDF weights learned from the training data.
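
The methods above are typically used as a pair: fit on training documents, then reuse the fitted weights on new documents. The sketch below shows one way this might look. The 80/20 split and the helper `prep` are arbitrary choices made for illustration, and a hash vectorizer is used so that both matrices share the same columns.

```r
library(text2vec)

data("movie_review")
train_ids = 1:80    # arbitrary split, chosen only for illustration
test_ids = 81:100

# Hypothetical helper: lowercase and tokenize a character vector
prep = function(txt) word_tokenizer(tolower(txt))

# Hashing maps terms to a fixed column space, so train and test
# matrices have identical columns without building a vocabulary
vectorizer = hash_vectorizer()
dtm_train = create_dtm(itoken(prep(movie_review$review[train_ids])), vectorizer)
dtm_test = create_dtm(itoken(prep(movie_review$review[test_ids])), vectorizer)

tfidf = TfIdf$new()
dtm_train_tfidf = tfidf$fit_transform(dtm_train)  # learns IDF weights and applies them
dtm_test_tfidf = tfidf$transform(dtm_test)        # reuses the weights learned above
```

Calling `fit_transform` on the test matrix instead would derive IDF weights from the test documents, which is usually not what is wanted when the data is meant to be held out.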

Arguments

tfidf

A TfIdf object

x

An input sparse document-term matrix (documents in rows, terms in columns), preferably in dgCMatrix format.

smooth_idf

Default TRUE. Smooths IDF weights by adding one to document frequencies, as if an extra document were seen containing every term in the collection exactly once. This prevents division by zero.

norm

One of "l1", "l2" or "none". Type of normalization applied to term vectors. Defaults to "l1", i.e., term counts are scaled by the number of words in the document.

sublinear_tf

Default FALSE. Applies sublinear term-frequency scaling, i.e., replaces the term frequency TF with 1 + log(TF).
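
To make the effect of these arguments more concrete, the base R sketch below applies each transformation by hand to a small made-up count matrix. It shows each step in isolation and does not attempt to reproduce how text2vec combines them internally. The matrix `counts` is invented for the example, and applying the sublinear scaling only to nonzero counts is an assumption made here to avoid log(0).

```r
# Term counts for 2 documents over 3 terms (hypothetical data).
# "fish" never occurs, as can happen with unused columns of a hashed matrix.
counts = matrix(c(3, 1, 0,
                  0, 4, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("cat", "dog", "fish")))

# norm = "l1": divide each row by its sum, so every document sums to 1
tf_l1 = counts / rowSums(counts)

# norm = "l2": divide each row by its Euclidean length
tf_l2 = counts / sqrt(rowSums(counts^2))

# sublinear_tf = TRUE: replace nonzero counts with 1 + log(count)
# (assumption: zero counts are left at zero)
tf_sub = counts
tf_sub[counts > 0] = 1 + log(counts[counts > 0])

# smooth_idf: adding one to the document frequency keeps the ratio finite
doc_freq = colSums(counts > 0)
idf_raw = log(nrow(counts) / doc_freq)           # Inf for "fish" (division by zero)
idf_smooth = log(nrow(counts) / (doc_freq + 1))  # finite for every term
```

With "l1" normalization the first document becomes (0.75, 0.25, 0), and without smoothing the never-seen term "fish" receives an infinite weight.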

Examples

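library(text2vec)
# Use the first N reviews from the movie_review data set
data("movie_review")
N = 100
# Lowercase and tokenize the reviews
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
# Build a document-term matrix using feature hashing
dtm = create_dtm(itoken(tokens), hash_vectorizer())
# Create the model, then fit it and transform the matrix in one step
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)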

text2vec documentation built on Jan. 12, 2018, 1:04 a.m.