cnlp_utils_tfidf: Construct the TF-IDF Matrix from Annotation or Data Frame
In statsmaths/cleanNLP: A Tidy Data Model for Natural Language Processing

cnlp_utils_tfidf

R Documentation

Description

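Given a data frame of annotations, `cnlp_utils_tfidf` returns the term frequency-inverse document frequency (tf-idf) matrix computed from the extracted lemmas. The companion function `cnlp_utils_tf` takes the same arguments but defaults to raw counts, uniform idf weights, and no document-frequency filtering, so it returns a plain term-frequency matrix.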

Usage

cnlp_utils_tfidf(
  object,
  tf_weight = c("lognorm", "binary", "raw", "dnorm"),
  idf_weight = c("idf", "smooth", "prob", "uniform"),
  min_df = 0.1,
  max_df = 0.9,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)

cnlp_utils_tf(
  object,
  tf_weight = "raw",
  idf_weight = "uniform",
  min_df = 0,
  max_df = 1,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)
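
The calls above can be tried on a small hand-built token table. This is a minimal sketch, not an example taken from the package: the `tokens` data frame is made up, and `min_df = 0` and `max_df = 1` are chosen only so that no token is filtered out of such a tiny corpus. The calls are guarded so the snippet still runs when cleanNLP is not installed.

```r
# Hypothetical toy input: one row per token occurrence, with the default
# column names expected by `doc_var` and `token_var`.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  lemma  = c("cat", "cat", "dog", "dog", "fish", "dog"),
  stringsAsFactors = FALSE
)

if (requireNamespace("cleanNLP", quietly = TRUE)) {
  # tf-idf with the default weighting schemes, keeping every token
  tfidf <- cleanNLP::cnlp_utils_tfidf(tokens, min_df = 0, max_df = 1)

  # term frequencies only, using the defaults of cnlp_utils_tf
  tf <- cleanNLP::cnlp_utils_tf(tokens)

  print(dim(tfidf))
  print(dim(tf))
}
```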

Arguments

`object`	a data frame containing an identifier for the document (set with `doc_var`) and token (set with `token_var`)
`tf_weight`	the weighting scheme for the term frequency matrix. The selection `lognorm` takes one plus the log of the raw frequency (or zero if the raw frequency is zero), `binary` encodes a zero-one matrix indicating only whether the token occurs in the document at all, `raw` returns raw counts, and `dnorm` uses double normalization.
`idf_weight`	the weighting scheme for the inverse document frequency. The selection `idf` gives the logarithm of the simple inverse frequency, `smooth` gives the logarithm of one plus the simple inverse frequency, and `prob` gives the log odds of the token occurring in a randomly selected document. Set to `uniform` to return just the term frequencies.
`min_df`	the minimum proportion of documents a token should be in to be included in the vocabulary
`max_df`	the maximum proportion of documents a token should be in to be included in the vocabulary
`max_features`	the maximum number of tokens in the vocabulary
`doc_var`	character vector. The name of the column in `object` that contains the document ids. Defaults to "doc_id".
`token_var`	character vector. The name of the column in `object` that contains the tokens. Defaults to "lemma".
`vocabulary`	character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to `NULL`. When supplied, the options `min_df`, `max_df`, and `max_features` are ignored.
`doc_set`	optional character vector of document ids. Useful to create empty rows in the output matrix for documents without data in the input. Most users will want to keep this equal to `NULL`, the default, to have the function compute the document set automatically.
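
The weighting schemes and the document-frequency filter described in the arguments can be written out in base R. The following is a sketch under stated assumptions, not the package's implementation: the toy `tokens` table is made up, the `dnorm` formula uses the common 0.5 smoothing constant (the documentation does not state which constant is used), and the filter is assumed to be inclusive at both bounds.

```r
# Hypothetical toy input: one row per token occurrence.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  lemma  = c("cat", "cat", "dog", "dog", "fish", "dog"),
  stringsAsFactors = FALSE
)

# Raw counts: documents in rows, tokens in columns.
tab <- table(tokens$doc_id, tokens$lemma)
tf_raw <- matrix(as.numeric(tab), nrow = nrow(tab),
                 dimnames = list(rownames(tab), colnames(tab)))

# --- tf_weight options ---
# lognorm: 1 + log(count), or 0 where the count is 0
tf_lognorm <- tf_raw
pos <- tf_raw > 0
tf_lognorm[pos] <- 1 + log(tf_raw[pos])

# binary: 1 if the token occurs in the document, else 0
tf_binary <- (tf_raw > 0) * 1

# dnorm: double normalization (ASSUMPTION: constant 0.5), dividing each
# row by the count of that document's most frequent token
tf_dnorm <- 0.5 + 0.5 * tf_raw / apply(tf_raw, 1, max)

# --- idf_weight options ---
n_docs   <- nrow(tf_raw)
doc_freq <- colSums(tf_raw > 0)       # number of documents containing each token

idf_plain   <- log(n_docs / doc_freq)               # "idf"
idf_smooth  <- log(1 + n_docs / doc_freq)           # "smooth"
idf_prob    <- log((n_docs - doc_freq) / doc_freq)  # "prob"; -Inf if a token is in every document
idf_uniform <- rep(1, ncol(tf_raw))                 # "uniform"

# tf-idf: scale each token column of the tf matrix by its idf weight
tfidf <- sweep(tf_lognorm, 2, idf_plain, `*`)

# --- min_df / max_df filter (ASSUMPTION: inclusive bounds) ---
doc_prop   <- doc_freq / n_docs
keep       <- doc_prop >= 0.1 & doc_prop <= 0.9
vocabulary <- names(doc_prop)[keep]

print(round(tfidf, 3))
print(vocabulary)
```

In this toy corpus "dog" appears in every document, so its `idf` weight is log(3/3) = 0 and its tf-idf column is all zeros; the default `max_df = 0.9` would also remove it from the vocabulary.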