dtm_svd_similarity: Semantic Similarity to a Singular Value Decomposition
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

dtm_svd_similarity

R Documentation

Semantic Similarity to a Singular Value Decomposition

Description

Calculate the similarity of a document term matrix to a set of terms based on a Singular Value Decomposition (SVD) embedding matrix.
This can be used to easily construct a sentiment score based on the latent scale defined by a set of positive or negative terms.
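
The idea of scoring documents on a latent scale can be sketched in a few lines of base R. This is a conceptual illustration under stated assumptions, not the package implementation: the toy matrices `dtm_toy` and `emb_toy`, the weights `w_toy` and all numbers are made up, and the inner product is used as the similarity.

```r
## Toy document-term matrix: 2 documents x 3 terms (made-up counts)
dtm_toy <- matrix(c(2, 0, 1,
                    0, 3, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"), c("good", "bad", "clean")))
## Toy embedding: one row per term, 2 latent dimensions (made-up values)
emb_toy <- matrix(c( 1, 0,
                    -1, 0,
                     0, 1),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("good", "bad", "clean"), NULL))
## Hypothetical weights: one positive and one negative term
w_toy <- c(good = 1, bad = -1)

## Project the documents into the embedding space
doc_emb <- dtm_toy %*% emb_toy[colnames(dtm_toy), , drop = FALSE]
## Inner product similarity of each document to each weighted term
sim <- doc_emb %*% t(emb_toy[names(w_toy), , drop = FALSE])
## Linear combination of these similarities using the weights
score <- drop(sim %*% w_toy)
score
```

In this sketch doc1, which mentions "good", gets a positive score and doc2, which only mentions "bad", gets a negative one.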

Usage

dtm_svd_similarity(
  dtm,
  embedding,
  weights,
  terminology = rownames(embedding),
  type = c("cosine", "dot")
)

Arguments

`dtm`	a sparse matrix such as a "dgCMatrix" object which is returned by `document_term_matrix` containing frequencies of terms for each document
`embedding`	a matrix containing the `v` element from a singular value decomposition, i.e. the right singular vectors. The rownames of that matrix should contain terms which are available in the `colnames(dtm)`. See the examples.
`weights`	a numeric vector with weights giving your definition of which terms are positive or negative. The names of this vector should be terms available in the rownames of the embedding matrix. See the examples.
`terminology`	a character vector of terms. The similarity of the `dtm` to the linear combination of the weights is calculated using only these terms. Defaults to all terms from the `embedding` matrix.
`type`	either 'cosine' or 'dot' indicating to respectively calculate cosine similarities or inner product similarities between the `dtm` and the SVD embedding space. Defaults to 'cosine'.
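
The difference between the two values of `type` can be illustrated with base R. This is a minimal sketch, assuming the usual definitions of inner product and cosine similarity; the toy matrix `emb` and its values are made up and the code does not call the package itself.

```r
## Toy embedding: 3 terms in a 2-dimensional space (made-up numbers)
emb <- matrix(c(1, 0,
                2, 0,
                0, 3),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("good", "great", "bad"), NULL))

## Inner product ("dot") similarity between all pairs of terms
sim_dot <- emb %*% t(emb)

## Cosine similarity: the same inner product after scaling each row to unit length
emb_unit   <- emb / sqrt(rowSums(emb^2))
sim_cosine <- emb_unit %*% t(emb_unit)

sim_dot["good", "great"]     # 2: depends on the length of the vectors
sim_cosine["good", "great"]  # 1: same direction
sim_cosine["good", "bad"]    # 0: orthogonal directions
```

The inner product grows with the length of the vectors, while the cosine only reflects their direction and stays between -1 and 1.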

Value

an object of class 'svd_similarity' which is a list with elements

weights: The weights used. These are scaled to sum up to 1 on both the positive and the negative side
type: The type of similarity calculated (either 'cosine' or 'dot')
terminology: A data.frame with columns term, freq and similarity, where similarity is the similarity between the term and the SVD embedding space of the weights and freq is how frequently the term occurs in the dtm. Note that the examples access the similarity column as similarity_weight. This dataset is sorted in descending order by similarity.
similarity: A data.frame with columns doc_id and similarity indicating the similarity between the dtm and the SVD embedding space of the weights. The doc_id is the identifier taken from the rownames of dtm.
scale: A list with elements terminology and weights indicating respectively the similarity in the SVD embedding space between the terminology and each of the weights, and between the weight terms themselves
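
The scaling of the weights can be sketched as follows. This is an illustration only: `scale_weights` is a hypothetical helper, not a function of the package, and it assumes that the negative weights are rescaled so that their absolute values sum up to 1.

```r
## Hypothetical helper: rescale positive and negative weights separately
scale_weights <- function(w){
  pos <- w > 0
  neg <- w < 0
  w[pos] <- w[pos] / sum(w[pos])        # positive side sums to 1
  w[neg] <- w[neg] / abs(sum(w[neg]))   # negative side sums to -1
  w
}

## Made-up weights on a few Dutch terms
w <- c(fantastisch = 1, schoon = 1, vriendelijk = 2, slecht = -1, lawaaiig = -3)
scaled <- scale_weights(w)
scaled
sum(scaled[scaled > 0])
sum(scaled[scaled < 0])
```

With this kind of scaling the positive and the negative terms contribute equally to the scale, regardless of how many terms are listed on each side.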

Examples

library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "nl" & (upos %in% "ADJ" | lemma %in% "niet"))
dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 3)

## Function performing Singular Value Decomposition on sparse/dense data
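dtm_svd <- function(dtm, dim = 5, type = c("RSpectra", "svd"), ...){
  type <- match.arg(type)
  if(type == "svd"){
    ## Base R SVD, keeping only the first 'dim' right singular vectors
    SVD <- svd(dtm, nu = 0, nv = dim, ...)
  }else if(type == "RSpectra"){
    ## Faster sparse SVD, only available if the RSpectra package is installed
    if(!requireNamespace("RSpectra", quietly = TRUE)){
      stop("type = 'RSpectra' needs the RSpectra package, use type = 'svd' instead")
    }
    SVD <- RSpectra::svds(dtm, nu = 0, k = dim, ...)
  }
  ## Label the rows of the right singular vectors with the terms of the dtm
  rownames(SVD$v) <- colnames(dtm)
  SVD$v
}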
#embedding <- dtm_svd(dtm, dim = 5)
embedding <- dtm_svd(dtm, dim = 5, type = "svd")

## Define positive / negative terms and calculate the similarity to these
weights <- setNames(c(1, 1, 1, 1, -1, -1, -1, -1),
                    c("fantastisch", "schoon", "vriendelijk", "net",
                      "lawaaiig", "lastig", "niet", "slecht"))
scores <- dtm_svd_similarity(dtm, embedding = embedding, weights = weights)
scores
str(scores$similarity)
hist(scores$similarity$similarity)

plot(scores$terminology$similarity_weight, log(scores$terminology$freq), 
     type = "n")
text(scores$terminology$similarity_weight, log(scores$terminology$freq), 
     labels = scores$terminology$term)
     
## Not run: 
## More elaborate example using word2vec
## building word2vec model on all Dutch texts, 
## finding similarity of dtm to adjectives only
set.seed(123)
library(word2vec)
text      <- subset(brussels_reviews_anno, language == "nl")
text      <- paste.data.frame(text, term = "lemma", group = "doc_id")
text      <- text$lemma
model     <- word2vec(text, dim = 10, iter = 20, type = "cbow", min_count = 1)
predict(model, newdata = names(weights), type = "nearest", top_n = 3)
embedding <- as.matrix(model)

## End(Not run)
data(brussels_reviews_w2v_embeddings_lemma_nl)
embedding <- brussels_reviews_w2v_embeddings_lemma_nl
adjective <- subset(brussels_reviews_anno, language %in% "nl" & upos %in% "ADJ")
adjective <- txt_freq(adjective$lemma)
adjective <- subset(adjective, freq >= 5 & nchar(key) > 1)
adjective <- adjective$key

scores    <- dtm_svd_similarity(dtm, embedding, weights = weights, type = "dot", 
                                terminology = adjective)
scores
plot(scores$terminology$similarity_weight, log(scores$terminology$freq), 
     type = "n")
text(scores$terminology$similarity_weight, log(scores$terminology$freq), 
     labels = scores$terminology$term, cex = 0.8)

udpipe documentation built on Jan. 30, 2026, 5:09 p.m.
