collocation_frequency: Mapping Collocation Frequency to Source Document

View source: R/collocation_frequency.R

collocation_frequency R Documentation

Description

This function computes how often each collocation in the provided source document appears in the comments (the derivative documents).

Usage

collocation_frequency(
  tbl,
  source_row,
  text_column,
  collocate_length = 5,
  fuzzy = FALSE,
  n_bands = 50,
  threshold = 0.7,
  n_gram_width = 4,
  band_width = 8
)

Arguments

tbl

data frame containing documents, where each row represents a document

source_row

index of the row containing the text to be treated as the source

text_column

string indicating the name of the column containing derivative text

collocate_length

the length of the collocation, in words. Default is 5

fuzzy

whether or not to use fuzzy matching in collocation calculations. Default is FALSE

n_bands

number of bands used in MinHash algorithm passed to zoomerjoin::jaccard_right_join(). Default is 50

threshold

Jaccard similarity threshold above which two collocations are considered a match, passed to zoomerjoin::jaccard_right_join(). Default is 0.7

n_gram_width

width of n-grams used in the Jaccard similarity calculation, passed to zoomerjoin::jaccard_right_join(). Default is 4

band_width

width of band used in MinHash algorithm passed to zoomerjoin::jaccard_right_join(). Default is 8
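
The fuzzy-matching arguments above can be illustrated with a minimal base R sketch. It is not the zoomerjoin implementation: char_ngrams(), jaccard_similarity(), and lsh_candidate_probability() are hypothetical helpers written for this illustration. The sketch assumes that n-grams are overlapping character sequences and that band_width is the number of hashes per band, in which case the textbook MinHash banding formula 1 - (1 - s^band_width)^n_bands approximates the chance that a pair with similarity s is compared at all.

```r
# Hypothetical helper: unique overlapping character n-grams of a string.
char_ngrams <- function(x, width = 4) {
  n <- nchar(x)
  if (n < width) return(x)
  unique(substring(x, seq_len(n - width + 1), width:n))
}

# Hypothetical helper: Jaccard similarity = shared n-grams / all distinct n-grams.
jaccard_similarity <- function(a, b, width = 4) {
  ga <- char_ngrams(a, width)
  gb <- char_ngrams(b, width)
  length(intersect(ga, gb)) / length(union(ga, gb))
}

# Hypothetical helper: textbook MinHash banding probability that a pair with
# similarity `s` lands in the same bucket in at least one band (an assumption
# about how n_bands and band_width interact, not taken from zoomerjoin's code).
lsh_candidate_probability <- function(s, n_bands = 50, band_width = 8) {
  1 - (1 - s^band_width)^n_bands
}

# The two phrases share 10 of their 11 distinct 4-grams.
sim <- jaccard_similarity("the blue bird", "the blue birds", width = 4)
sim >= 0.7                       # would count as a match at threshold = 0.7
lsh_candidate_probability(0.7)   # chance a pair at the threshold is compared
```

Under these assumptions, raising n_bands makes it more likely that true matches are compared, while raising band_width makes the comparison stricter.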

Details

A collocation is a sequence of consecutive words in the source document. For example, the phrase "the blue bird flies" contains one collocation of length 4 ("the blue bird flies"), two collocations of length 3 ("the blue bird" and "blue bird flies"), and three collocations of length 2 ("the blue", "blue bird", and "bird flies"). For each collocation, this function counts the matching phrases in the derivative documents (the 'notes') and divides that count by the number of times the phrase occurs in the source document (the transcript). When fuzzy matching is enabled, indirect matches are also counted, each with a weight of (n*d)/m, where n is the frequency of the fuzzy collocation, d is the Jaccard similarity between the transcript and note collocations, and m is the number of closest matches for the note collocation.
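
The counting described here can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: collocations() and fuzzy_weight() are hypothetical helpers, the note texts are made up for the example, and words are split on whitespace after lowercasing.

```r
# Hypothetical helper: all collocations (consecutive word sequences) of length `len`.
collocations <- function(text, len) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) < len) return(character(0))
  starts <- seq_len(length(words) - len + 1)
  vapply(starts,
         function(i) paste(words[i:(i + len - 1)], collapse = " "),
         character(1))
}

source_text <- "the blue bird flies"
collocations(source_text, 3)   # "the blue bird" "blue bird flies"

# Made-up derivative documents, for illustration only.
notes <- c("a blue bird", "the blue bird sings")

source_colls <- collocations(source_text, 2)
note_colls <- unlist(lapply(notes, collocations, len = 2))

# Exact matching: occurrences in the notes divided by occurrences in the source.
frequency <- vapply(
  unique(source_colls),
  function(cl) sum(note_colls == cl) / sum(source_colls == cl),
  numeric(1)
)
frequency

# Hypothetical helper: weight of an indirect (fuzzy) match, (n * d) / m.
fuzzy_weight <- function(n, d, m) (n * d) / m
fuzzy_weight(n = 4, d = 0.75, m = 2)   # 1.5
```

In this toy example, "blue bird" occurs twice in the notes and once in the source, so its value is 2, while "bird flies" never occurs in the notes and gets 0.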

Value

a data frame of the source (transcript) document with a collocation value for each word

Examples

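# Locate the row that holds the source document
src_row <- which(notepad_example$ID == "source")
# Compute collocation frequencies from the "Text" column
merged_frequency <- collocation_frequency(notepad_example, src_row, "Text")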

highlightr documentation built on April 11, 2026, 1:06 a.m.