In LudvigOlsen/vocabular2: Document Vocabulary Comparison

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  dpi = 92,
  fig.retina = 2,
  out.width = "100%"
)

vocabular2

The goal of vocabular2 is to compare vocabularies on a set of metrics. There's currently no clear development path for the package. It may become usable in the future, but for now it's not advised to use the code in your projects. I haven't spent enough time thinking about the meaningfulness of the metrics to recommend them. They were simply intuitive to me at 4am on some exam-stressed winter night. It's also very possible that they already exist in the literature under different names. :)

Installation

You can install the development version with:

devtools::install_github("ludvigolsen/vocabular2")

Main functions

compare_vocabs()
get_doc_metrics()
stack_doc_metrics()

Simple Example

Note: By default, negative values are set to 0 for most of the metrics (not TF-IDF and TF-IRF).
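
As a minimal sketch of what "setting negative values to 0" means, here is the idea with made-up numbers and base R's `pmax()`. The values and the variable names are hypothetical, and this is not necessarily how the package does it internally.

```r
# Hypothetical metric values for four terms (illustrative numbers only)
metric_values <- c(love = 0.012, king = -0.004, night = 0, sword = -0.0005)

# Replace every negative value with 0 and keep the rest unchanged
clamped <- pmax(metric_values, 0)
clamped
```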

See the metric formulas below the example.

Attach packages

library(vocabular2)
library(tm)
library(tidyverse)
library(knitr)

Load the included 'hamlet' dataset

# The included dataset with Hamlet lines
# Extracted from https://www.opensourceshakespeare.org/
hamlet %>% head(5)

# Collect the lines for each character
data <- hamlet %>% 
  dplyr::group_by(Character) %>% 
  dplyr::summarise(txt = paste0(Line, collapse = " "))

data

# Assign each text to a variable
# This could be done in a loop if we had a lot of texts
claudius <- data[1, "txt"][[1]]
gertrude <- data[2, "txt"][[1]]
hamlet <- data[3, "txt"][[1]] # note: overwrites the dataset
horatio <- data[4, "txt"][[1]]
ophelia <- data[5, "txt"][[1]]
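
With many texts, the one-variable-per-text assignments above could be replaced by a named list. The following is only a sketch on a toy data frame: `toy`, its placeholder texts, and the assumption that the names should be lower-cased are all illustrative.

```r
# Toy stand-in for the summarised data (one row per character);
# the texts are placeholders, not actual lines from the dataset
toy <- data.frame(
  Character = c("Claudius", "Gertrude", "Ophelia"),
  txt = c("first placeholder text", "second placeholder text", "third placeholder text"),
  stringsAsFactors = FALSE
)

# Build a named list with one text per character
texts <- setNames(as.list(toy$txt), tolower(toy$Character))

# Each text is then available by name
texts$gertrude
```

A list like this can be passed through `lapply()` with a term-counting function, which keeps the names for later steps.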

Count the terms

# Create a term-count tibble for each document

count_terms <- function(t){
  docs <- Corpus(VectorSource(t))
  # do things like removing stopwords, lemmatization, etc.
  docs <- tm_map(docs, removeWords, stopwords("english"))
  docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = TRUE)
  dtm <- TermDocumentMatrix(docs)
  m <- as.matrix(dtm)
  v <- sort(rowSums(m), decreasing=TRUE)
  d <- tibble::tibble(Word = names(v), Count=v)
  d
}

claudius_tc <- count_terms(claudius)
gertrude_tc <- count_terms(gertrude)
hamlet_tc <- count_terms(hamlet)
horatio_tc <- count_terms(horatio)
ophelia_tc <- count_terms(ophelia)

Compare the vocabularies

This is where the metrics are calculated. We get a column per document with a nested tibble containing the metrics.

scores <- compare_vocabs(tc_dfs = list("claudius" = claudius_tc,
                                       "gertrude" = gertrude_tc,
                                       "hamlet" = hamlet_tc,
                                       "horatio" = horatio_tc,
                                       "ophelia" = ophelia_tc))
scores

Extract the metrics for Claudius

get_doc_metrics(scores, "claudius") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Extract the metrics for Gertrude

get_doc_metrics(scores, "gertrude") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Extract the metrics for Hamlet

get_doc_metrics(scores, "hamlet") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Extract the metrics for Horatio

get_doc_metrics(scores, "horatio") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Extract the metrics for Ophelia

get_doc_metrics(scores, "ophelia") %>% 
  arrange(desc(REL_TF_NRTF)) %>% 
  head(10) %>% 
  kable()

Extract and stack metrics for all documents

stack_doc_metrics(scores)

Metrics

TF-IDF and TF-IRF (Term Frequency - Inverse Rest Frequency)

These are highly correlated (>0.999).

$tf(t,d)=\frac{f_{t,d}}{\sum_{t'}^{d}f_{t',d}}$

$idf(t,D)=\log{\frac{|D|}{1+|\{d \in D:t \in d\}|}}$

$irf(t,d,D)=\log{\frac{|D|-1}{1+|\{d' \in D:t \in d' \land d' \not = d \}|}}$

$tfidf(t,d,D) = tf(t,d) \cdot idf(t,D)$

$tfirf(t,d,D) = tf(t,d) \cdot irf(t,d,D)$
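
The formulas above can be sketched in base R on a toy term-by-document count matrix. The matrix `counts` and its terms are made up for illustration, and the code is a direct reading of the formulas rather than the package's own implementation.

```r
# Toy counts: rows are terms, columns are documents (made-up numbers)
counts <- matrix(
  c(4, 0, 1,
    1, 3, 0,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("crown", "mother", "lord"), c("d1", "d2", "d3"))
)

# tf: counts divided by the document total, so each column sums to 1
tf <- sweep(counts, 2, colSums(counts), "/")

n_docs <- ncol(counts)
present <- counts > 0

# idf: log(|D| / (1 + number of documents containing the term))
idf <- log(n_docs / (1 + rowSums(present)))

# irf: as idf, but counting only the *other* documents (terms x docs matrix)
rest_present <- rowSums(present) - present
irf <- log((n_docs - 1) / (1 + rest_present))

# idf has one value per term and is recycled down the rows
tfidf <- tf * idf
tfirf <- tf * irf
```

Because of the `1 +` in the denominators, terms that occur in every document get a negative idf and irf, which is consistent with these two metrics not being truncated at 0.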

TF-RTF (Term Frequency - Rest Term Frequency)

TF-RTF is positive when the term frequency is higher in the current document than the sum of the term frequencies in the rest of the corpus.

$rtf(t,d,D) = \sum_{d' \not = d}^{D}tf(t,d')$

$tfrtf(t,d,D) = tf(t,d)-rtf(t,d,D)$
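
A small base R sketch of TF-RTF on a toy count matrix follows. The numbers are made up, and the code mirrors the formulas above without claiming to match the package internals.

```r
# Toy counts: rows are terms, columns are documents (made-up numbers)
counts <- matrix(
  c(4, 0, 1,
    1, 3, 0,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("crown", "mother", "lord"), c("d1", "d2", "d3"))
)
tf <- sweep(counts, 2, colSums(counts), "/")

# rtf: for each term and document, the summed tf over all other documents
rtf <- rowSums(tf) - tf

# Positive only where the term is more frequent here than in the rest combined
tfrtf <- tf - rtf
tfrtf
```

In this toy example "crown" scores positively for `d1`, while "lord", which is frequent everywhere, is negative in every document.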

TF-NRTF (Term Frequency - Normalized Rest Term Frequency)

As our selected TF function ensures that frequencies add up to 1 document-wise, the NRTF (Normalized Rest Term Frequency) is simply the average term frequency in the other documents, instead of the sum as in RTF.

TF-NRTF is positive when the term frequency is higher in the current document than the average term frequency in the rest of the corpus.

$nrtf(t,d,D) = \frac{rtf(t,d,D)}{|D|-1}$

$tfnrtf(t,d,D) = tf(t,d)-nrtf(t,d,D)$
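
The same toy setup shows how normalizing changes the picture. This is a sketch with made-up counts that follows the formulas above, not the package's code.

```r
# Toy counts: rows are terms, columns are documents (made-up numbers)
counts <- matrix(
  c(4, 0, 1,
    1, 3, 0,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("crown", "mother", "lord"), c("d1", "d2", "d3"))
)
tf <- sweep(counts, 2, colSums(counts), "/")

# rtf: summed tf in the other documents
rtf <- rowSums(tf) - tf

# nrtf: average tf in the other documents
nrtf <- rtf / (ncol(tf) - 1)

tfnrtf <- tf - nrtf
tfnrtf
```

Here "lord" in `d3` has a negative TF-RTF but a positive TF-NRTF, since its frequency exceeds the average of the other two documents but not their sum.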

TF-MRTF (Term Frequency - Maximum Rest Term Frequency)

Instead of the normalized (average) rest term frequency, we use the maximum rest term frequency.

TF-MRTF is positive when the term frequency is higher in the current document than the maximum term frequency in the rest of the corpus.

$Mrtf(t,d,D) = \max{\{tf(t,d'):d' \in D \land d' \not = d\}}$

$tfMrtf(t,d,D) = tf(t,d)-Mrtf(t,d,D)$
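
As an illustration, the maximum over the other documents can be computed by dropping one column at a time. This is a base R sketch on made-up counts; the name `mrtf` is simply chosen here to match the formula.

```r
# Toy counts: rows are terms, columns are documents (made-up numbers)
counts <- matrix(
  c(4, 0, 1,
    1, 3, 0,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("crown", "mother", "lord"), c("d1", "d2", "d3"))
)
tf <- sweep(counts, 2, colSums(counts), "/")

# mrtf: for document j, the row-wise maximum of tf over all columns except j
mrtf <- sapply(seq_len(ncol(tf)), function(j) {
  apply(tf[, -j, drop = FALSE], 1, max)
})
dimnames(mrtf) <- dimnames(tf)

tfmrtf <- tf - mrtf
tfmrtf
```

A term is positive here only in the single document where its term frequency is strictly the highest.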

Relative TF-NRTF (Relative Term Frequency - Normalized Rest Term Frequency)

Where TF-NRTF tends to be dominated by highly frequent words, the Relative TF-NRTF instead uses the relative distance to the NRTF. As that would likely be dominated by very infrequent words, we multiply it by the term frequency.

$\epsilon(t,d,D) = \frac{1}{\sum_{d' \not = d}^{D}f_{t,d'}}$

$rel\_tfnrtf(t,d,D) = tf(t,d)^{\beta}\frac{tfnrtf(t,d,D)}{\log(1 + nrtf(t,d,D) + \epsilon(t,d,D))}$

Epsilon (ε) is added to avoid zero-division. It is calculated to resemble +1 smoothing in the rest population.

The beta (β) exponent allows us to control the influence of the term frequency. By setting it to 0, we simply get the relative difference (log scaled).
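
The Relative TF-NRTF formula can be sketched as follows, reading ε exactly as written above (one over the term's summed count in the other documents). The toy counts are made up and chosen so that every term occurs at least once in the rest of each document; how a term that is absent from the rest should be handled is not covered by this sketch.

```r
# Toy counts: rows are terms, columns are documents (made-up numbers)
counts <- matrix(
  c(4, 0, 1,
    1, 3, 0,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("crown", "mother", "lord"), c("d1", "d2", "d3"))
)
tf <- sweep(counts, 2, colSums(counts), "/")

beta <- 1  # exponent on the term frequency; 0 removes its influence

# epsilon: one over the term's count in the other documents
epsilon <- 1 / (rowSums(counts) - counts)

# nrtf and tfnrtf as defined for TF-NRTF
nrtf <- (rowSums(tf) - tf) / (ncol(tf) - 1)
tfnrtf <- tf - nrtf

rel_tfnrtf <- tf^beta * tfnrtf / log(1 + nrtf + epsilon)
rel_tfnrtf
```

With `beta <- 1`, a term that does not occur in a document gets a score of 0 there, because the leading `tf^beta` factor is 0.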

Relative TF-MRTF (Relative Term Frequency - Maximum Rest Term Frequency)

Similar to Relative TF-NRTF but for MRTF instead.

$rel\_tfMrtf(t,d,D) = tf(t,d)^{\beta}\frac{tfMrtf(t,d,D)}{\log(1 + Mrtf(t,d,D) + \epsilon(t,d,D))}$
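
For completeness, here is the MRTF variant as a compact base R sketch on the same kind of toy matrix. The counts are made up, ε is read as one over the term's summed count in the other documents, and nothing here is taken from the package source.

```r
# Toy counts: rows are terms, columns are documents (made-up numbers)
counts <- matrix(
  c(4, 0, 1,
    1, 3, 0,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("crown", "mother", "lord"), c("d1", "d2", "d3"))
)
tf <- sweep(counts, 2, colSums(counts), "/")

beta <- 1

# mrtf: maximum tf over all documents except the current one
mrtf <- sapply(seq_len(ncol(tf)), function(j) {
  apply(tf[, -j, drop = FALSE], 1, max)
})
dimnames(mrtf) <- dimnames(tf)

# epsilon: one over the term's count in the other documents
epsilon <- 1 / (rowSums(counts) - counts)

rel_tfmrtf <- tf^beta * (tf - mrtf) / log(1 + mrtf + epsilon)
rel_tfmrtf
```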

LudvigOlsen/vocabular2 documentation built on Jan. 4, 2020, 4:15 p.m.
