dtm_svd_similarity | R Documentation |
Calculate the similarity of a document term matrix to a set of terms based on
a Singular Value Decomposition (SVD) embedding matrix.
This can be used to easily construct a sentiment score based on the latent scale defined by a set of positive or negative terms.
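One way to picture such a latent scale is the base R sketch below. It is an illustration under stated assumptions, not the package's internal code: the toy `dtm`, `embedding` and `weights` values are made up, and the scoring rule (an inner product between a frequency-weighted document vector and a weighted sum of the term vectors) is only one plausible reading of the description above.

```r
## Toy document-term matrix: 2 documents x 3 terms (counts are made up)
dtm <- matrix(c(2, 1, 0,
                0, 0, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("good", "great", "bad")))

## Toy embedding: one row per term, 2 latent dimensions (values are made up)
embedding <- matrix(c( 1.0, 0.1,
                       0.9, 0.2,
                      -0.8, 0.3),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("good", "great", "bad"), NULL))

## Positive and negative anchor terms
weights <- c(good = 1, bad = -1)

## Assumed latent scale: weighted sum of the embeddings of the weight terms
scale_vec <- colSums(embedding[names(weights), , drop = FALSE] * weights)

## Assumed document vectors: frequency-weighted sum of the term embeddings
doc_vec <- dtm %*% embedding[colnames(dtm), , drop = FALSE]

## Inner product of every document with the latent scale
score <- drop(doc_vec %*% scale_vec)
score
```

In this toy setup the document made of positive terms lands on the positive side of the scale and the document made of the negative term on the negative side.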
dtm_svd_similarity(
  dtm,
  embedding,
  weights,
  terminology = rownames(embedding),
  type = c("cosine", "dot")
)
dtm |
a sparse matrix such as a "dgCMatrix" object which is returned by |
embedding |
a matrix containing the |
weights |
a numeric vector with weights giving your definition of which terms are positive or negative. The names of this vector should be terms available in the rownames of the embedding matrix. See the examples. |
terminology |
a character vector of terms to limit the calculation of the similarity for the |
type |
either 'cosine' or 'dot' indicating to respectively calculate cosine similarities or inner product similarities between the |
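The difference between the two values of type can be sketched in base R as follows. This is a generic illustration of cosine versus inner product similarity on a made-up toy embedding, not the function's own implementation.

```r
## Toy embedding: 3 terms in a 2-dimensional space (values are made up)
embedding <- matrix(c(1, 0,
                      2, 0,
                      0, 3),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("good", "great", "bad"), NULL))

## Inner product ("dot") similarity between all pairs of terms
sim_dot <- embedding %*% t(embedding)

## Cosine similarity: the same inner product after scaling rows to unit length
row_norm   <- sqrt(rowSums(embedding^2))
sim_cosine <- sim_dot / (row_norm %o% row_norm)

sim_dot
sim_cosine
```

The inner product grows with the length of the vectors, whereas the cosine only reflects their direction and is bounded between -1 and 1.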
an object of class 'svd_similarity' which is a list with elements
weights: The weights used. These are scaled to sum up to 1 on both the positive and the negative side
type: The type of similarity calculated (either 'cosine' or 'dot')
terminology: A data.frame with columns term, freq and similarity where similarity indicates the similarity between the term and the SVD embedding space of the weights and freq is how frequently the term occurs in the dtm. This dataset is sorted in descending order by similarity.
similarity: A data.frame with columns doc_id and similarity indicating the similarity between the dtm and the SVD embedding space of the weights. The doc_id is the identifier taken from the rownames of dtm.
scale: A list with elements terminology and weights indicating respectively the similarity in the SVD embedding space between the terminology and each of the weights and between the weight terms themselves
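The scaling of the weights mentioned for the weights element can be sketched as below. The helper `scale_weights` is hypothetical and only reflects one reading of "scaled to sum up to 1 on the positive and on the negative side"; the package may implement this differently.

```r
## Hypothetical helper: rescale positive weights to sum to 1
## and negative weights to sum to -1 (an assumption, see the text above)
scale_weights <- function(w){
  pos <- w > 0
  neg <- w < 0
  w[pos] <- w[pos] / sum(w[pos])
  w[neg] <- w[neg] / abs(sum(w[neg]))
  w
}

weights <- setNames(c(1, 1, 1, 1, -1, -1, -1, -1),
                    c("fantastisch", "schoon", "vriendelijk", "net",
                      "lawaaiig", "lastig", "niet", "slecht"))
scaled <- scale_weights(weights)
scaled
```

With such a scaling, a lexicon containing many positive and only a few negative terms does not pull the scale towards the side with more terms.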
https://en.wikipedia.org/wiki/Latent_semantic_analysis
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language %in% "nl" & (upos %in% "ADJ" | lemma %in% "niet"))
dtm <- document_term_frequencies(x, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 3)

## Function performing Singular Value Decomposition on sparse/dense data
dtm_svd <- function(dtm, dim = 5, type = c("RSpectra", "svd"), ...){
  type <- match.arg(type)
  if(type == "svd"){
    SVD <- svd(dtm, nu = 0, nv = dim, ...)
  }else if(type == "RSpectra"){
    #Uncomment this if you want to use the faster sparse SVD by RSpectra
    #SVD <- RSpectra::svds(dtm, nu = 0, k = dim, ...)
  }
  rownames(SVD$v) <- colnames(dtm)
  SVD$v
}
#embedding <- dtm_svd(dtm, dim = 5)
embedding <- dtm_svd(dtm, dim = 5, type = "svd")

## Define positive / negative terms and calculate the similarity to these
weights <- setNames(c(1, 1, 1, 1, -1, -1, -1, -1),
                    c("fantastisch", "schoon", "vriendelijk", "net",
                      "lawaaiig", "lastig", "niet", "slecht"))
scores <- dtm_svd_similarity(dtm, embedding = embedding, weights = weights)
scores
str(scores$similarity)
hist(scores$similarity$similarity)

plot(scores$terminology$similarity_weight, log(scores$terminology$freq),
     type = "n")
text(scores$terminology$similarity_weight, log(scores$terminology$freq),
     labels = scores$terminology$term)

## Not run: 
## More elaborate example using word2vec
## building word2vec model on all Dutch texts,
## finding similarity of dtm to adjectives only
set.seed(123)
library(word2vec)
text <- subset(brussels_reviews_anno, language == "nl")
text <- paste.data.frame(text, term = "lemma", group = "doc_id")
text <- text$lemma
model <- word2vec(text, dim = 10, iter = 20, type = "cbow", min_count = 1)
predict(model, newdata = names(weights), type = "nearest", top_n = 3)
embedding <- as.matrix(model)

## End(Not run)
data(brussels_reviews_w2v_embeddings_lemma_nl)
embedding <- brussels_reviews_w2v_embeddings_lemma_nl
adjective <- subset(brussels_reviews_anno, language %in% "nl" & upos %in% "ADJ")
adjective <- txt_freq(adjective$lemma)
adjective <- subset(adjective, freq >= 5 & nchar(key) > 1)
adjective <- adjective$key

scores <- dtm_svd_similarity(dtm, embedding, weights = weights, type = "dot",
                             terminology = adjective)
scores
plot(scores$terminology$similarity_weight, log(scores$terminology$freq),
     type = "n")
text(scores$terminology$similarity_weight, log(scores$terminology$freq),
     labels = scores$terminology$term, cex = 0.8)