sm_text_tfidf: Construct the TF-IDF Matrix from Annotation or Data Frame

View source: R/text.R

Description

Given a data frame of annotations, this function returns the term frequency-inverse document frequency (tf-idf) matrix computed from the extracted lemmas, or from whichever token column is named in token_var.

Usage

sm_text_tfidf(
  object,
  min_df = 0.1,
  max_df = 0.9,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL
)
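
As a minimal sketch of a call, the snippet below builds a toy annotation table with the default column names and passes it to the function. The data are invented for illustration, and the call runs only if the smodels package is installed. Setting min_df = 0 and max_df = 1 is a choice made here so that no token is filtered out of such a small example.

```r
# Toy annotation table: one row per token, using the default column names.
anno <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  lemma  = c("cat", "sit", "mat", "cat", "run", "dog"),
  stringsAsFactors = FALSE
)

# Run the call only when smodels is available on this machine.
if (requireNamespace("smodels", quietly = TRUE)) {
  tfidf <- smodels::sm_text_tfidf(
    anno,
    min_df = 0,           # do not drop rare tokens
    max_df = 1,           # do not drop common tokens
    doc_var = "doc_id",
    token_var = "lemma"
  )
  print(tfidf)
}
```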

Arguments

object

a data frame containing an identifier for the document (set with doc_var) and token (set with token_var)

min_df

the minimum proportion of documents in which a token must appear to be included in the vocabulary

max_df

the maximum proportion of documents in which a token may appear and still be included in the vocabulary

max_features

the maximum number of tokens in the vocabulary

doc_var

character vector. The name of the column in object that contains the document ids. Defaults to "doc_id".

token_var

character vector. The name of the column in object that contains the tokens. Defaults to "lemma".

vocabulary

character vector. The vocabulary set to use in constructing the matrices. Will be computed within the function if set to NULL. When supplied, the options min_df, max_df, and max_features are ignored.
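
The way min_df, max_df, and max_features narrow the vocabulary can be sketched in base R. This is an illustration only, not the package's implementation: select_vocabulary is a hypothetical helper, and two details are assumptions, namely that both bounds are inclusive and that the size cap keeps the tokens found in the most documents.

```r
# Hypothetical helper, not part of smodels: choose a vocabulary by
# document proportion, then cap its size.
select_vocabulary <- function(object, min_df = 0.1, max_df = 0.9,
                              max_features = 10000,
                              doc_var = "doc_id", token_var = "lemma") {
  # Count each (document, token) pair once so that repeats within a
  # document do not inflate the document frequency.
  pairs <- unique(object[, c(doc_var, token_var)])
  n_docs <- length(unique(object[[doc_var]]))

  # Proportion of documents that contain each token.
  tab <- table(pairs[[token_var]])
  doc_prop <- setNames(as.numeric(tab) / n_docs, names(tab))

  # Assumption: both bounds are inclusive.
  keep <- doc_prop[doc_prop >= min_df & doc_prop <= max_df]

  # Assumption: when capping, prefer tokens found in the most documents.
  keep <- sort(keep, decreasing = TRUE)
  names(keep)[seq_len(min(max_features, length(keep)))]
}

anno <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d3", "d3"),
  lemma  = c("the", "cat", "the", "dog", "the", "cat", "cat"),
  stringsAsFactors = FALSE
)

# "the" is in every document (above max_df), "dog" is in one of three
# (below min_df), so only "cat" survives.
vocab <- select_vocabulary(anno, min_df = 0.5, max_df = 0.9)
print(vocab)
```

A vector produced this way could be passed as the vocabulary argument, in which case the three filtering options are ignored, as noted above.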

Value

a tibble in wide format with term frequencies and tf-idf values.
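
To make the wide format concrete, here is a base-R sketch that builds one row per document and one column per vocabulary token. It is not the package's code: tfidf_wide is a hypothetical helper, it returns plain data frames rather than a tibble, and the weighting idf = log(N / df) is an assumed, commonly used variant that may differ from what sm_text_tfidf computes.

```r
# Hypothetical helper, not part of smodels: wide term-frequency and
# tf-idf tables built from a long token table.
tfidf_wide <- function(object, vocabulary,
                       doc_var = "doc_id", token_var = "lemma") {
  docs <- factor(object[[doc_var]])
  tokens <- factor(object[[token_var]], levels = vocabulary)

  # Term counts: rows are documents, columns are vocabulary tokens.
  # Tokens outside the vocabulary become NA and are dropped by table().
  tf <- unclass(table(docs, tokens))

  # Document frequency: number of documents containing each token.
  df <- colSums(tf > 0)

  # Assumption: idf = log(N / df); tokens seen in no document get 0.
  idf <- ifelse(df > 0, log(nrow(tf) / df), 0)

  # Multiply each column of counts by that token's idf weight.
  tfidf <- sweep(tf, 2, idf, "*")

  list(
    tf = data.frame(doc_id = rownames(tf), tf, check.names = FALSE),
    tfidf = data.frame(doc_id = rownames(tfidf), tfidf, check.names = FALSE)
  )
}

anno <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d3", "d3"),
  lemma  = c("the", "cat", "the", "dog", "the", "cat", "cat"),
  stringsAsFactors = FALSE
)

res <- tfidf_wide(anno, vocabulary = c("cat", "dog"))
print(res$tf)
print(res$tfidf)
```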


statsmaths/smodels documentation built on Jan. 9, 2021, 1:07 p.m.