document_term_frequencies | R Documentation |
Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document
document_term_frequencies(x, document, ...)

## S3 method for class 'data.frame'
document_term_frequencies(
  x,
  document = colnames(x)[1],
  term = colnames(x)[2],
  ...
)

## S3 method for class 'character'
document_term_frequencies(
  x,
  document = paste("doc", seq_along(x), sep = ""),
  split = "[[:space:][:punct:][:digit:]]+",
  ...
)
x |
a data.frame or data.table containing a field which can be considered
as a document (defaults to the first column in |
document |
If |
... |
further arguments passed on to the methods |
term |
If |
split |
The regular expression to be used if |
A data.table with columns doc_id, term and freq, indicating how many times a term occurred in each document.
If the input dataset already contains a freq column, the result sums freq per document/term combination.
If freq is not in the dataset, each row of the input dataset x is counted with a freq of 1.
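The counting rule for freq can be illustrated in base R. This is a minimal sketch of the behaviour described above, not the package's implementation: the helper name count_terms and the toy data are hypothetical, and the result is a plain data.frame rather than a data.table.

```r
# Hypothetical helper mimicking the described rule in base R:
# sum an existing freq column, otherwise count each row as 1.
count_terms <- function(x) {
  if (!"freq" %in% colnames(x)) {
    x$freq <- 1L
  }
  out <- aggregate(freq ~ doc_id + term, data = x, FUN = sum)
  out[order(out$doc_id, out$term), ]
}

# Toy input without a freq column: every row counts once.
toy <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2"),
  term   = c("beer", "beer", "city", "beer"),
  stringsAsFactors = FALSE
)
counts <- count_terms(toy)
print(counts)

# Same input with a freq column: the existing values are summed.
toy_freq <- cbind(toy, freq = c(2, 3, 1, 4))
counts_freq <- count_terms(toy_freq)
print(counts_freq)
```

In the first call the pair d1/beer appears in two rows and so gets a freq of 2. In the second call the existing freq values 2 and 3 for that pair are added up instead.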
data.frame: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document
character: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document
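For the character method, the idea of splitting text with the default split pattern and then counting terms per document can be sketched in base R. This is an illustration under assumptions, not the package's code: the toy texts are made up, and dropping empty strings is a choice made here for the sketch.

```r
# Hypothetical base-R illustration: split text into terms with the
# default `split` pattern and count the terms per document.
texts <- c("nice flat, nice host!", "Room 12 was clean.")
docs  <- paste("doc", seq_along(texts), sep = "")

# Split on runs of whitespace, punctuation and digits.
pieces <- strsplit(texts, split = "[[:space:][:punct:][:digit:]]+")

# One row per document/term occurrence.
long <- data.frame(
  doc_id = rep(docs, times = lengths(pieces)),
  term   = unlist(pieces),
  stringsAsFactors = FALSE
)
long <- long[long$term != "", ]  # drop empty strings, if any

# Count occurrences and keep only combinations that actually occur.
freqs <- as.data.frame(table(doc_id = long$doc_id, term = long$term),
                       responseName = "freq", stringsAsFactors = FALSE)
freqs <- freqs[freqs$freq > 0, ]
print(freqs)
```

Because digits are part of the split pattern, the "12" in the second text acts as a separator and does not appear as a term.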
##
## Calculate document_term_frequencies on a data.frame
##
data(brussels_reviews_anno)
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")])
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
str(x)

brussels_reviews_anno$my_doc_id <- paste(brussels_reviews_anno$doc_id,
                                         brussels_reviews_anno$sentence_id)
x <- document_term_frequencies(brussels_reviews_anno[, c("my_doc_id", "lemma")])

##
## Calculate document_term_frequencies on a character vector
##
data(brussels_reviews)
x <- document_term_frequencies(x = brussels_reviews$feedback,
                               document = brussels_reviews$id,
                               split = " ")
x <- document_term_frequencies(x = brussels_reviews$feedback,
                               document = brussels_reviews$id,
                               split = "[[:space:][:punct:][:digit:]]+")

##
## document-term-frequencies on several fields to easily include bigram and trigrams
##
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- x[, token_bigram := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)]
x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)]
x <- document_term_frequencies(x = x,
                               document = "doc_id",
                               term = c("token", "token_bigram", "token_trigram"))
head(x)