document_term_frequencies: Aggregate a data.frame to the document/term level by...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

document_term_frequencies

R Documentation

Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Description

Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Usage

document_term_frequencies(x, document, ...)

## S3 method for class 'data.frame'
document_term_frequencies(
  x,
  document = colnames(x)[1],
  term = colnames(x)[2],
  ...
)

## S3 method for class 'character'
document_term_frequencies(
  x,
  document = paste("doc", seq_along(x), sep = ""),
  split = "[[:space:][:punct:][:digit:]]+",
  ...
)

Arguments

`x`	a data.frame or data.table containing a field which can be considered as a document (defaults to the first column in `x`) and a field which can be considered as a term (defaults to the second column in `x`). If the dataset also contains a column called 'freq', this will be summed over instead of counting the number of rows occur by document/term combination. If `x` is a character vector containing several terms, the text will be split by the argument `split` before doing the agregation at the document/term level.
`document`	If `x` is a data.frame, the column in `x` which identifies a document. If `x` is a character vector then `document` is a vector of the same length as `x` where `document[i]` is the document id which corresponds to the text in `x[i]`.
`...`	further arguments passed on to the methods
`term`	If `x` is a data.frame, the column in `x` which identifies a term. Defaults to the second column in `x`.
`split`	The regular expression to be used if `x` is a character vector. This will split the character vector `x` in pieces by the provides split argument. Defaults to splitting according to spaces/punctuations/digits.

Value

a data.table with columns doc_id, term, freq indicating how many times a term occurred in each document. If freq occurred in the input dataset the resulting data will have summed the freq. If freq is not in the dataset, will assume that freq is 1 for each row in the input dataset x.

Methods (by class)

data.frame: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document
character: Create a data.frame with one row per document/term combination indicating the frequency of the term in the document

Examples


##
## Calculate document_term_frequencies on a data.frame
##
data(brussels_reviews_anno)

x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")])
x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "lemma")])
str(x)

brussels_reviews_anno$my_doc_id <- paste(brussels_reviews_anno$doc_id, 
                                         brussels_reviews_anno$sentence_id)
x <- document_term_frequencies(brussels_reviews_anno[, c("my_doc_id", "lemma")])

##
## Calculate document_term_frequencies on a character vector
##
data(brussels_reviews)
x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, 
                               split = " ")
x <- document_term_frequencies(x = brussels_reviews$feedback, document = brussels_reviews$id, 
                               split = "[[:space:][:punct:][:digit:]]+")
                               
##
## document-term-frequencies on several fields to easily include bigram and trigrams
##
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- x[, token_bigram  := txt_nextgram(token, n = 2), by = list(doc_id, sentence_id)]
x <- x[, token_trigram := txt_nextgram(token, n = 3), by = list(doc_id, sentence_id)]
x <- document_term_frequencies(x = x, 
                               document = "doc_id", 
                               term = c("token", "token_bigram", "token_trigram"))
head(x)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.

udpipe index

README.md UDPipe Natural Language Processing - Basic Analytical Use Cases UDPipe Natural Language Processing - Model Building UDPipe Natural Language Processing - Parallel UDPipe Natural Language Processing - Text Annotation UDPipe Natural Language Processing - Topic Modelling Use Cases UDPipe Natural Language Processing - Try it out UDPipe Natural Language Processing - Universe

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

udpipe
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

document_term_frequencies: Aggregate a data.frame to the document/term level by...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Description

Usage

Arguments

Value

Methods (by class)

Examples

Related to document_term_frequencies in udpipe...

R Package Documentation

Browse R Packages

We want your feedback!

udpipe Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

document_term_frequencies: Aggregate a data.frame to the document/term level by... In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

Aggregate a data.frame to the document/term level by calculating how many times a term occurs per document

Description

Usage

Arguments

Value

Methods (by class)

Examples

Related to document_term_frequencies in udpipe...

R Package Documentation

Browse R Packages

We want your feedback!

udpipe
Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

document_term_frequencies: Aggregate a data.frame to the document/term level by...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit