reference_distribution_distance: Reference distribution distances

View source: R/reference_distribution_distance.R

Description

Calculates the Euclidean distance (up to a constant of proportionality) between each document's term distribution and each of a set of reference distributions.

Usage

reference_distribution_distance(category_reference_distribution,
  document_term_matrix, inverse_frequency_weighting = TRUE,
  large_matrix = FALSE)

Arguments

category_reference_distribution

A simple_triplet_matrix where each row represents the distribution over terms in a particular category. These can be normalized or raw counts.

document_term_matrix

A simple_triplet_matrix where each row represents a document and each column, a term in the vocabulary. The columns in both matrices should match up.

inverse_frequency_weighting

If TRUE, then distances are weighted by the inverse of each term's aggregate count in the document term matrix. This means that differences in frequently occurring terms carry less weight than differences in rarely occurring terms. Defaults to TRUE.

large_matrix

Defaults to FALSE. If TRUE, then a method that is robust to large matrices will be used. Set this to TRUE if you get an error of the form: "'i, j, nrow, ncol' invalid type".

Value

A data frame containing the distance from each document to each reference distribution. The last column indicates the closest reference distribution for each document.
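
To make the description above concrete, the following is a minimal base R sketch of an inverse-frequency-weighted Euclidean distance between documents and reference distributions. It is an illustration only, not the package's implementation: it uses dense matrices rather than simple_triplet_matrix objects, and the helper names (row_normalize, weighted_distances), the row normalization step, the exact form of the weights, and the "closest" column name are all assumptions made for this example.

```r
# Toy inputs (hypothetical data): 2 reference categories and 3 documents
# over a shared 4-term vocabulary. Columns must line up in both matrices.
terms <- paste0("term", 1:4)
reference <- matrix(c(4, 1, 0, 0,
                      0, 0, 2, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("cat_A", "cat_B"), terms))
documents <- matrix(c(3, 1, 0, 0,
                      0, 1, 1, 2,
                      1, 1, 1, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste0("doc", 1:3), terms))

# Assumption: rows are compared as proportions, so raw counts are
# normalized to sum to 1 within each row.
row_normalize <- function(m) m / rowSums(m)

# Hypothetical helper: weighted Euclidean distance from every document
# to every reference distribution.
weighted_distances <- function(reference, documents,
                               inverse_frequency_weighting = TRUE) {
  stopifnot(ncol(reference) == ncol(documents))

  # Weight each term by the inverse of its total count across documents,
  # or use equal weights when weighting is switched off.
  weights <- if (inverse_frequency_weighting) {
    1 / colSums(documents)
  } else {
    rep(1, ncol(documents))
  }
  weights[!is.finite(weights)] <- 0  # terms that never occur get no weight

  ref <- row_normalize(reference)
  docs <- row_normalize(documents)

  dists <- matrix(NA_real_, nrow(docs), nrow(ref),
                  dimnames = list(rownames(docs), rownames(ref)))
  for (k in seq_len(nrow(ref))) {
    diffs <- sweep(docs, 2, ref[k, ], "-")  # document minus reference k
    dists[, k] <- sqrt(rowSums(sweep(diffs^2, 2, weights, "*")))
  }

  # One row per document, one column per reference, plus the closest one.
  result <- as.data.frame(dists)
  result$closest <- colnames(dists)[max.col(-dists, ties.method = "first")]
  result
}

result <- weighted_distances(reference, documents)
print(result)
```

The returned object mirrors the shape described under Value: one distance column per reference distribution, followed by a final column naming the nearest one. Passing inverse_frequency_weighting = FALSE to the sketch gives the plain Euclidean distance between the normalized rows.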


matthewjdenny/SpeedReader documentation built on March 25, 2020, 5:32 p.m.