cnlp_utils_tfidf | R Documentation |
Given annotations, this function returns the term-frequency inverse document frequency (tf-idf) matrix from the extracted lemmas.
cnlp_utils_tfidf(
  object,
  tf_weight = c("lognorm", "binary", "raw", "dnorm"),
  idf_weight = c("idf", "smooth", "prob", "uniform"),
  min_df = 0.1,
  max_df = 0.9,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)
cnlp_utils_tf(
  object,
  tf_weight = "raw",
  idf_weight = "uniform",
  min_df = 0,
  max_df = 1,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)
object |
a data frame containing an identifier for the document
(set with doc_var) and the token (set with token_var) |
tf_weight |
the weighting scheme for the term frequency matrix.
One of "lognorm", "binary", "raw", or "dnorm" |
idf_weight |
the weighting scheme for the inverse document
frequency. One of "idf", "smooth", "prob", or "uniform" |
min_df |
the minimum proportion of documents a token should be in to be included in the vocabulary |
max_df |
the maximum proportion of documents a token should be in to be included in the vocabulary |
max_features |
the maximum number of tokens in the vocabulary |
doc_var |
character vector. The name of the column in
object that contains the document ids |
token_var |
character vector. The name of the column in
object that contains the tokens |
vocabulary |
character vector. The vocabulary set to use in
constructing the matrices. Will be computed
within the function if set to NULL |
doc_set |
optional character vector of document ids. Useful to
create empty rows in the output matrix for documents
without data in the input. Most users will want to keep
this equal to NULL |
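The way min_df, max_df, and max_features narrow the vocabulary can be sketched in base R. This is only an illustration of the argument descriptions above, not the package's own code: the toy data frame is made up, and treating the bounds as inclusive and preferring tokens that occur in more documents when capping the size are assumptions.

```r
# Toy token table using the default column names ("doc_id", "lemma").
# The data are invented for illustration.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3", "d3", "d3"),
  lemma  = c("the", "cat", "sit", "the", "dog", "the", "cat", "run"),
  stringsAsFactors = FALSE
)

min_df <- 0.5
max_df <- 0.9
max_features <- 10000

# Count each (document, lemma) pair once, then compute the proportion
# of documents in which each lemma appears.
n_docs <- length(unique(tokens$doc_id))
pairs <- unique(tokens[, c("doc_id", "lemma")])
doc_count <- vapply(split(pairs$doc_id, pairs$lemma), length, integer(1))
doc_prop <- doc_count / n_docs

# Assumption: bounds are inclusive, and the size cap keeps the lemmas
# found in the most documents.
keep <- doc_prop[doc_prop >= min_df & doc_prop <= max_df]
keep <- sort(keep, decreasing = TRUE)
vocabulary <- names(keep)[seq_len(min(length(keep), max_features))]

print(doc_prop)
print(vocabulary)
```

Here "the" appears in every document and is dropped by max_df, while "sit", "dog", and "run" each appear in one document of three and are dropped by min_df.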
a sparse matrix with dimnames giving the documents and vocabulary.
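A minimal usage sketch follows, assuming the cleanNLP package is installed and that a plain data frame with doc_id and lemma columns is acceptable input, as the argument descriptions suggest. The data frame is invented, and min_df and max_df are widened to 0 and 1 only so that no token is filtered out of such a small example.

```r
# Runs only when cleanNLP is available; otherwise the block is skipped.
if (requireNamespace("cleanNLP", quietly = TRUE)) {
  # Invented token table with the default column names.
  tokens <- data.frame(
    doc_id = c("d1", "d1", "d1", "d2", "d2", "d3", "d3", "d3"),
    lemma  = c("the", "cat", "sit", "the", "dog", "the", "cat", "run"),
    stringsAsFactors = FALSE
  )

  # tf-idf matrix with the document-frequency filters switched off.
  tfidf <- cleanNLP::cnlp_utils_tfidf(tokens, min_df = 0, max_df = 1)

  # Term frequencies using the defaults of cnlp_utils_tf.
  tf <- cleanNLP::cnlp_utils_tf(tokens)

  # Rows are documents, columns are vocabulary entries.
  print(dim(tfidf))
  print(dimnames(tfidf))
  print(tf)
}
```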