View source: R/calculate_ngram_is.R
calculate_ngram_is (R Documentation)
This function calculates the IS (Absorption Index; Morrone, 1996) for all n-grams in the corpus. Only n-grams that both start AND end with lexical words are considered.
calculate_ngram_is(
  dfTag,
  max_ngram = 5,
  term = "lemma",
  pos = c("NOUN", "ADJ", "ADV", "VERB"),
  min_freq = 1,
  min_IS_norm = 0
)
Arguments:

dfTag: A data frame with tagged text data containing the columns doc_id, sentence_id, token_id, lemma/token, and upos.

max_ngram: Maximum length of n-grams to generate (default: 5).

term: Character string indicating which column to use, "lemma" or "token" (default: "lemma").

pos: Character vector of POS tags considered lexical (default: c("NOUN", "ADJ", "ADV", "VERB")).

min_freq: Minimum frequency threshold for n-grams (default: 1).

min_IS_norm: Minimum normalized IS threshold for n-grams (default: 0).
The IS index is calculated as:

IS = (sum_i 1/freq_i) × freq_ngram × n_lexical

where freq_i is the corpus frequency of the i-th word in the n-gram, freq_ngram is the frequency of the n-gram itself, and n_lexical is the number of lexical words it contains. IS_norm is the normalized version, IS / L^2, where L is the n-gram length.
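As a quick illustration of the formula, the IS and IS_norm values for a single n-gram can be computed by hand. The frequencies below are toy numbers invented for the example, not output from this package:

```r
# Toy example: a 2-gram whose two words have assumed corpus frequencies.
word_freqs <- c(10, 4)  # freq_i for each word in the n-gram (illustrative)
ngram_freq <- 3         # freq_ngram: how often the n-gram itself occurs
n_lexical  <- 2         # both words are lexical here
L          <- 2         # n-gram length

IS      <- sum(1 / word_freqs) * ngram_freq * n_lexical
IS_norm <- IS / L^2

IS       # (1/10 + 1/4) * 3 * 2 = 2.1
IS_norm  # 2.1 / 4 = 0.525
```

Rare component words (small freq_i) and frequent n-grams both push IS up, which is why the index highlights cohesive multiword units.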
OPTIMIZATION: Only n-grams that start AND end with lexical words (as defined by the 'pos' parameter) are generated, which significantly reduces computation time.
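The boundary condition above can be sketched as a simple check on a candidate n-gram's first and last POS tags. This is a hypothetical helper for illustration, not the package's actual implementation:

```r
# Hypothetical sketch of the start/end filter: keep an n-gram only if its
# first and last tokens carry a lexical POS tag.
starts_ends_lexical <- function(upos_seq,
                                pos = c("NOUN", "ADJ", "ADV", "VERB")) {
  upos_seq[1] %in% pos && upos_seq[length(upos_seq)] %in% pos
}

starts_ends_lexical(c("NOUN", "ADP", "NOUN"))  # TRUE: noun ... noun
starts_ends_lexical(c("DET", "NOUN"))          # FALSE: starts with a determiner
```

Filtering at generation time avoids building and counting the many candidate n-grams that would be discarded anyway.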
Value: a tibble with columns ngram, n_length, ngram_freq, n_lexical, IS, and IS_norm.
## Not run:
IS <- calculate_ngram_is(dfTag, max_ngram = 4, term = "lemma", min_freq = 2)
head(IS)
## End(Not run)