tfidf: Calculate TF-IDF using an input matrix with terms in rows and...

View source: R/tfidf.R

tfidf    R Documentation

Calculate TF-IDF using an input matrix with terms in rows and documents in columns

Description

Calculate TF-IDF using an input matrix with terms in rows and documents in columns.

Usage

tfidf(
  tdMat,
  tfVariant = c("raw", "binary", "frequency", "log", "doubleNorm0.5"),
  idfVariant = c("raw", "smooth", "probabilistic"),
  idfAddOne = TRUE
)

Arguments

tdMat

A term-document matrix with terms in rows, documents in columns, and counts as integers (or logical values) in cells.

tfVariant

Variant of term frequency. See details below.

idfVariant

Variant of inverse document frequency. See details below.

idfAddOne

Logical. Whether one should be added to both the numerator and the denominator when calculating IDF. See details below.

Details

tfVariant accepts the following options:

raw

The input matrix is used as it is.

binary

The input matrix is transformed into logical values.

frequency

Term frequency per document is calculated from the input matrix.

log

Transformation log(1+tfMat)

doubleNorm0.5

Double normalisation 0.5
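
The five options above can be sketched in base R as follows. This is a minimal illustration that follows the textbook definitions in the Wikipedia article cited under References, not the package source. The helper name tfSketch and the toy matrix m are hypothetical, and the package's own handling of edge cases (for example, empty documents) may differ.

```r
# Hypothetical helper illustrating the tfVariant options.
# Assumes terms in rows and documents in columns, as for tdMat.
tfSketch <- function(tdMat,
                     variant = c("raw", "binary", "frequency",
                                 "log", "doubleNorm0.5")) {
  variant <- match.arg(variant)
  switch(variant,
    # counts used as they are
    raw = tdMat,
    # TRUE where the term occurs in the document
    binary = tdMat != 0,
    # counts divided by the total count of each document (column);
    # a document with no terms would yield NaN here
    frequency = sweep(tdMat, 2, colSums(tdMat), "/"),
    # log(1 + count)
    log = log(1 + tdMat),
    # 0.5 + 0.5 * count / (largest count in the same document)
    doubleNorm0.5 = 0.5 + 0.5 * sweep(tdMat, 2, apply(tdMat, 2, max), "/")
  )
}

# Toy input: 3 terms, 2 documents
m <- matrix(c(2, 0,
              1, 1,
              1, 3), ncol = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2", "t3"), c("D1", "D2")))
tfFreq <- tfSketch(m, "frequency")      # each column sums to 1
tfDn   <- tfSketch(m, "doubleNorm0.5")  # values lie between 0.5 and 1
```

With the double normalisation, the most frequent term of each document receives the value 1 and an absent term receives 0.5, which reduces the bias towards long documents.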

idfVariant accepts the following options:

raw

log(N/nt)

smooth

log(1+N/nt)

probabilistic

log((N-nt)/nt)

Here N is the total number of documents in the corpus, and nt is the number of documents in which the term t appears. If idfAddOne is TRUE, one is added to both the numerator and the denominator to prevent division by zero.
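
The three IDF formulas can be sketched in base R as below. The helper name idfSketch is hypothetical, and the way the add-one option is applied (one added to the numerator and to the denominator of each ratio) is an assumed reading of the description of idfAddOne, not a statement about the package source.

```r
# Hypothetical helper illustrating the idfVariant options.
# nt: number of documents containing each term; N: total number of documents.
idfSketch <- function(nt, N,
                      variant = c("raw", "smooth", "probabilistic"),
                      addOne = TRUE) {
  variant <- match.arg(variant)
  a <- if (addOne) 1 else 0   # assumed: added to numerator and denominator
  switch(variant,
    raw           = log((N + a) / (nt + a)),
    smooth        = log(1 + (N + a) / (nt + a)),
    probabilistic = log((N - nt + a) / (nt + a))
  )
}

# Terms found in 5, 2 and 0 of 5 documents
idfWithOne    <- idfSketch(c(5, 2, 0), N = 5)   # 0, log(2), log(6)
idfWithoutOne <- idfSketch(0, N = 5, addOne = FALSE)  # Inf: division by zero
```

The last call shows why the add-one option is useful: a term that appears in no document gives an infinite raw IDF unless the counts are offset.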

References

The Wikipedia article on TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.

Examples

## Toy term-document matrix: 7 terms (rows) by 5 documents (columns)
tiExample <- matrix(c(1,1,1,1,1,
                      1,1,0,0,0,
                      1,0,0,0,0,
                      0,1,0,0,0,
                      0,0,0,1,0,
                      1,0,1,0,1,
                      0,0,0,0,1), ncol=5, byrow=TRUE)
colnames(tiExample) <- sprintf("D%d", seq_len(ncol(tiExample)))
rownames(tiExample) <- sprintf("t%d", seq_len(nrow(tiExample)))
## TF-IDF with the default arguments
tiRes <- tfidf(tiExample)
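
As a hand check of the quantities involved, the document counts and a raw TF times raw IDF product for the same matrix can be computed in base R without the package. This is only a sketch: the names N, nt, idfRaw and manualTfidf are illustrative, the add-one offset is applied to the numerator and denominator as an assumed reading of idfAddOne, and the result is not guaranteed to equal tiRes.

```r
# Rebuild the example matrix so that this block runs on its own
tiExample <- matrix(c(1,1,1,1,1,
                      1,1,0,0,0,
                      1,0,0,0,0,
                      0,1,0,0,0,
                      0,0,0,1,0,
                      1,0,1,0,1,
                      0,0,0,0,1), ncol = 5, byrow = TRUE,
                    dimnames = list(sprintf("t%d", 1:7), sprintf("D%d", 1:5)))

N  <- ncol(tiExample)          # total number of documents: 5
nt <- rowSums(tiExample > 0)   # documents containing each term: 5 2 1 1 1 3 1

# Raw IDF with one added to numerator and denominator (assumed reading)
idfRaw <- log((N + 1) / (nt + 1))

# Raw TF times IDF; the IDF vector has one value per row (term),
# so recycling scales every row by its own IDF
manualTfidf <- tiExample * idfRaw
```

Under these assumptions the term t1, which occurs in every document, has an IDF of log(6/6) = 0 and therefore a weight of zero everywhere, while the rare term t7 has a weight of log(3) in D5.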


bedapub/ribiosMath documentation built on Jan. 29, 2023, 1:48 p.m.