Bind the term frequency and inverse document frequency of a tidy text dataset to the dataset

Description

Calculate the term frequency and inverse document frequency of a tidy text dataset, along with their product, tf-idf, and bind them to the dataset. Each of these values is added as a column.
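To make the three added columns concrete, here is a minimal base R sketch of the usual tf-idf definitions on a toy table. It assumes tf is a term's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. The data frame `toy` and its column names are hypothetical, and this is an illustration rather than the package's own code.

```r
# Toy data: one row per document-term pair (hypothetical values)
toy <- data.frame(
  doc  = c("a", "a", "b", "b"),
  term = c("cat", "dog", "cat", "fish"),
  n    = c(2, 2, 1, 3),
  stringsAsFactors = FALSE
)

# tf: term count divided by the total count of all terms in that document
doc_totals <- tapply(toy$n, toy$doc, sum)
toy$tf <- toy$n / as.numeric(doc_totals[toy$doc])

# idf (assumed definition): natural log of
# (number of documents / number of documents containing the term).
# Counting rows per term works because each document-term pair appears once.
n_docs <- length(unique(toy$doc))
docs_per_term <- table(toy$term)
toy$idf <- log(n_docs / as.numeric(docs_per_term[toy$term]))

# tf-idf: the product of the two
toy$tf_idf <- toy$tf * toy$idf

toy
```

Under these definitions a term that appears in every document, such as "cat" here, gets an idf of zero and therefore a tf-idf of zero.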

Usage

bind_tf_idf(tbl, term_col, document_col, n_col)

bind_tf_idf_(tbl, term_col, document_col, n_col)

Arguments

tbl

A tidy text dataset with one row per term per document

term_col

Column containing terms

document_col

Column containing document IDs

n_col

Column containing document-term counts

Details

bind_tf_idf is given bare names, while bind_tf_idf_ is given strings and is therefore suitable for programming with.

If the dataset is grouped, the groups are ignored during the calculation but are retained in the result.

The dataset must have exactly one row per document-term combination for this to work.
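One way to check the one-row-per-combination requirement before calling the function is sketched below in base R. The data frame `counts` and its column names are hypothetical, and summing the counts of duplicated pairs is an assumption about how duplicates should be merged, not something the function does for you.

```r
# Hypothetical counts in which the pair (a, cat) appears on two rows
counts <- data.frame(
  doc  = c("a", "a", "a", "b"),
  term = c("cat", "cat", "dog", "cat"),
  n    = c(1, 2, 1, 4),
  stringsAsFactors = FALSE
)

# TRUE if any document-term pair occurs on more than one row
has_duplicates <- anyDuplicated(counts[, c("doc", "term")]) > 0

# Assumed fix: collapse duplicates by summing n within each pair
collapsed <- aggregate(n ~ doc + term, data = counts, FUN = sum)

has_duplicates
collapsed
```

After collapsing, each document-term pair appears exactly once, so the table has the shape the function expects.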

Examples

library(dplyr)
library(janeaustenr)
library(tidytext)  # provides unnest_tokens() and bind_tf_idf()

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

book_words

# find the words most distinctive to each document
book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))