dtm_remove_tfidf: Remove terms from a Document-Term-Matrix and documents with...
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

dtm_remove_tfidf

R Documentation

Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency

Description

Remove terms from a Document-Term-Matrix, and documents left with no terms, based on the term frequency inverse document frequency (tfidf). Select the terms to keep by giving either the maximum number of terms (argument top), a tfidf cutoff (argument cutoff) or a quantile (argument prob).

Usage

dtm_remove_tfidf(dtm, top, cutoff, prob, remove_emptydocs = TRUE)

Arguments

`dtm`	an object returned by `document_term_matrix`
`top`	integer with the number of terms which should be kept as defined by the highest mean tfidf
`cutoff`	numeric cutoff value to keep only terms in `dtm` where the tfidf obtained by `dtm_tfidf` is higher than this value
`prob`	numeric quantile, given as a probability (e.g. `0.6` for the 60 percent quantile), indicating to keep only terms in `dtm` where the tfidf obtained by `dtm_tfidf` is higher than that quantile
`remove_emptydocs`	logical indicating whether to remove documents that contain no terms after the term removal. Defaults to `TRUE`.
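
The three selection arguments can be pictured with a minimal sketch in base R. It does not call udpipe: the `scores` vector is made-up data standing in for the per-term tfidf values, and the strict `>` comparison is an assumption taken from the wording "higher than" above, so the package may treat ties differently.

```r
# Hypothetical per-term tfidf scores (made-up values, not udpipe output)
scores <- c(apartment = 0.2, host = 0.5, metro = 1.4, terrace = 1.1, kitchen = 0.8)

# top: keep the 3 terms with the highest score
keep_top <- names(sort(scores, decreasing = TRUE))[seq_len(3)]

# cutoff: keep terms whose score is above a fixed value
keep_cutoff <- names(scores)[scores > 1.0]

# prob: keep terms whose score is above the 60 percent quantile of all scores
threshold <- quantile(scores, probs = 0.6)
keep_prob <- names(scores)[scores > threshold]
```

Only one of the three rules is needed per call; `top` fixes the number of terms kept, while `cutoff` and `prob` let that number vary with the data.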

Value

a sparse Matrix as returned by sparseMatrix where terms with high tfidf are kept and, if `remove_emptydocs` is `TRUE`, documents without any remaining terms are removed
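
As a rough illustration of what the returned matrix looks like with and without empty-document removal, the sketch below uses a small dense base R matrix with made-up counts. This is a simplification: the function itself returns a sparse Matrix, and the choice of kept terms here is hypothetical rather than computed from tfidf.

```r
# Toy document-term counts (made-up data); rows are documents, columns are terms
m <- matrix(c(2, 0, 1,
              0, 3, 0,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("room", "host", "metro")))

# Suppose the term selection kept only "room" and "metro" (hypothetical choice)
kept <- m[, c("room", "metro"), drop = FALSE]

# Like remove_emptydocs = TRUE: also drop documents left without any term
nonempty <- kept[rowSums(kept) > 0, , drop = FALSE]

dim(kept)      # all documents retained, including the now-empty doc2
dim(nonempty)  # doc2 dropped
```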

Examples

library(udpipe)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)
dim(dtm)

## Keep only the 50 terms with the highest mean tfidf
x <- dtm_remove_tfidf(dtm, top=50)
dim(x)
x <- dtm_remove_tfidf(dtm, top=50, remove_emptydocs = FALSE)
dim(x)

## Keep only terms with tfidf above 1.1
x <- dtm_remove_tfidf(dtm, cutoff=1.1)
dim(x)

## Keep only terms with tfidf above the 60 percent quantile
x <- dtm_remove_tfidf(dtm, prob=0.6)
dim(x)

udpipe documentation built on Jan. 30, 2026, 5:09 p.m.
