eztfidf: Turn character vectors into tfidf distances easily.


Description

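This function builds a tf-idf representation of a set of documents and returns an eztfidf list. The list holds the cleaned documents (docs) together with helper functions for inspecting tf-idf values (values) and cosine similarities (CosineSimVector, CosineSimMatrix).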

Usage

eztfidf(char_vector, replace_words = c(`\t` = " "))

Arguments

char_vector

A character vector of documents, passed to VectorSource from the tm package. Values may be duplicated, but names must be unique.
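The uniqueness requirement on names can be checked before calling eztfidf. The following is a minimal base R sketch; the document names used here are made up for illustration, and make.unique() is only one possible way to repair clashing names, not something eztfidf is known to do itself.

```r
# Duplicate values are fine as long as every document has its own name.
docs <- c("the hulk", "the hulk", "she-hulk")
names(docs) <- c("hulk_1", "hulk_2", "she_hulk")  # hypothetical names

# anyDuplicated() returns 0 when no name is repeated
names_ok <- anyDuplicated(names(docs)) == 0

# If names clash, make.unique() appends numeric suffixes to the repeats
fixed_names <- make.unique(c("hulk", "hulk", "she_hulk"))
print(fixed_names)
```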

replace_words

A named character vector of replacements. Each occurrence of an element's name in the documents is replaced with that element's value. The default replaces tabs with spaces.
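To illustrate the name-to-value convention, here is a minimal base R sketch. The helper apply_replacements is hypothetical and not part of eztfidf, and it assumes names are matched as literal strings; the package may instead treat them as regular expressions.

```r
# Hypothetical helper, not part of eztfidf: apply each name -> value
# replacement in turn, matching names literally (an assumption).
apply_replacements <- function(docs, replace_words) {
  for (pattern in names(replace_words)) {
    docs <- gsub(pattern, replace_words[[pattern]], docs, fixed = TRUE)
  }
  docs
}

docs <- c("she-hulk", "the silver surfer")
cleaned <- apply_replacements(docs, c("-" = " ", "silver" = "gold"))
print(cleaned)
```

Replacements are applied in the order they appear in the vector, so an earlier replacement can affect what a later one matches.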

Examples

super_heroes <- c(
    'The Flash', 'The HULK', 'she-hulk', 'ant-man', 'Ironman', 'BATMAN',
    'superman', 'the green arrow', 'aqua-man', 'the silver surfer', 'green lantern'
    )
names(super_heroes) <- super_heroes
super_heroes <- gsub('man$', '-MAN', super_heroes, ignore.case = TRUE)  # custom cleaning
x <- eztfidf(
    super_heroes, replace_words = c('-' = ' ', 'silver' = 'gold')
)

# Use numeric index or original names to see changes to docs
x$docs[1:10]
x$docs[c('The HULK','the silver surfer')]

# Inspect bag-of-words tfidf values as a list or matrix
x$values(c('the green arrow','green lantern'))
x$values(c(2,3,8,11), mode = 'matrix')

# Best matching values and cosine similarity matrix easily accessible
x$CosineSimVector(3, top = 3)
x$CosineSimVector('the green arrow', top = 3)
x$CosineSimMatrix(c(2,3,8,11))
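For readers who want to see what such similarity values represent, the following base R sketch computes tf-idf weights and a cosine similarity matrix by hand. It assumes raw term counts multiplied by log2(N / document frequency) and whitespace tokenization; eztfidf and tm may weight and tokenize differently, so the numbers need not match the package's output. The document names a, b and c are made up for illustration.

```r
# Base R sketch of tf-idf and cosine similarity (assumed weighting scheme).
docs <- c(a = "the green arrow", b = "green lantern", c = "the hulk")
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per term
tf <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
dimnames(tf) <- list(names(docs), vocab)

# Inverse document frequency: rarer terms receive larger weights
idf <- log2(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity: dot products divided by the product of row norms
norms <- sqrt(rowSums(tfidf^2))
cos_sim <- tcrossprod(tfidf) / outer(norms, norms)

# Documents ranked by similarity to "a", loosely mirroring a top-n lookup
ranked <- sort(cos_sim["a", ], decreasing = TRUE)
print(round(cos_sim, 3))
```

Documents that share no terms, such as b and c here, get a similarity of zero, while every document has a similarity of one with itself.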

patricklyngrutz/eztfidf documentation built on May 6, 2019, 8:31 p.m.