Description
Perform text clustering by using semantic embeddings of documents and words to find topics of semantically similar documents.
Usage

top2vec(x, data, control.umap, control.dbscan, control.doc2vec, umap, trace, ...)

Arguments
x: either an object returned by paragraph2vec or a list with elements 'docs' and 'words' containing document and word embeddings (see the examples)

data: optionally, a data.frame with columns 'doc_id' and 'text' representing documents. This dataset is only stored in order to extract the text of the documents most similar to a topic. If it also contains a column 'text_doc2vec', that column will be used to indicate the most relevant topic words by class-based tf-idf

control.umap: a list of arguments to pass on to the UMAP function (see the 'umap' argument)

control.dbscan: a list of arguments to pass on to hdbscan from the dbscan package

control.doc2vec: optionally, a list of arguments to pass on to paragraph2vec

umap: function to apply UMAP. Defaults to umap from the uwot package; can be set to e.g. tumap as shown in the examples

trace: logical indicating to print the evolution of the algorithm

...: further arguments, not used yet
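The "class-based tf-idf" mentioned under the data argument is, roughly, tf-idf computed after concatenating all documents of a topic into one pseudo-document, so that words are ranked per topic rather than per document. A minimal base-R sketch of this idea, on toy data; the package's exact weighting formula may differ:

```r
## Toy illustration of class-based tf-idf: one pseudo-document per topic.
## Simplified weighting for illustration only.
topics <- list(topic1 = c("tax", "budget", "tax", "income"),
               topic2 = c("health", "care", "budget"))
vocab  <- sort(unique(unlist(topics)))
## term frequency of each word within the concatenated text of each topic
tf     <- sapply(topics, function(words) table(factor(words, levels = vocab)))
## inverse 'class' frequency: penalise words occurring in many topics
icf    <- log(1 + length(topics) / rowSums(tf > 0))
ctfidf <- tf * icf
## top words per topic
apply(ctfidf, 2, function(scores) names(sort(scores, decreasing = TRUE))[1])
```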
Value

an object of class 'top2vec' which is a list with elements:
embedding: a list of matrices with word and document embeddings
doc2vec: a doc2vec model
umap: a matrix of representations of the documents of x
dbscan: the result of the hdbscan clustering
data: a data.frame with columns doc_id and text
size: a vector of frequency statistics of topic occurrence
k: the number of clusters
control: a list of control arguments to doc2vec / umap / dbscan
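As a sketch of how these elements can be accessed (mock values for illustration only; a real object comes from top2vec() as in the Examples below):

```r
## Mock top2vec-like list, only to illustrate the element names above;
## the matrices and cluster vector are placeholders, not real model output.
model <- list(
  embedding = list(docs  = matrix(0, nrow = 2, ncol = 5),
                   words = matrix(0, nrow = 3, ncol = 5)),
  umap   = matrix(0, nrow = 2, ncol = 2),
  dbscan = list(cluster = c(1L, 2L)),    # topic assignment per document
  size   = c(`1` = 1L, `2` = 1L),        # topic frequency statistics
  k      = 2L                            # number of clusters
)
model$k              # how many topics were found
dim(model$umap)      # low-dimensional representation of the documents
```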
Note

The topic '0' is the noise topic.
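Because topic '0' collects the noise points, downstream analyses typically drop it. A base-R sketch with a hypothetical cluster vector (in practice taken from model$dbscan$cluster):

```r
## Hypothetical topic assignments as produced by hdbscan; 0 = noise
clusters  <- c(0L, 1L, 1L, 2L, 0L, 2L, 2L)
table(clusters)                      # topic sizes, including the noise topic
non_noise <- which(clusters != 0L)   # keep only documents with a real topic
length(non_noise)
```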
References

https://arxiv.org/abs/2008.09470
Examples

library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
text = be_parliament_2020$text_nl,
stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x <- subset(x, txt_count_words(text) < 1000)
d2v <- paragraph2vec(x, type = "PV-DBOW", dim = 50,
lr = 0.05, iter = 10,
window = 15, hs = TRUE, negative = 0,
sample = 0.00001, min_count = 5,
threads = 1)
# write.paragraph2vec(d2v, "d2v.bin")
# d2v <- read.paragraph2vec("d2v.bin")
model <- top2vec(d2v, data = x,
control.dbscan = list(minPts = 50),
control.umap = list(n_neighbors = 15L, n_components = 4), trace = TRUE)
model <- top2vec(d2v, data = x,
control.dbscan = list(minPts = 50),
control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap,
trace = TRUE)
info <- summary(model, top_n = 7)
info$topwords
info$topdocs
library(udpipe)
info <- summary(model, top_n = 7, type = "c-tfidf")
info$topwords
## Change the model: reduce doc2vec model to 2D
model <- update(model, type = "umap",
n_neighbors = 100, n_components = 2, metric = "cosine", umap = tumap,
trace = TRUE)
info <- summary(model, top_n = 7)
info$topwords
info$topdocs
## Change the model: have minimum 200 points for the core elements in the hdbscan density
model <- update(model, type = "hdbscan", minPts = 200, trace = TRUE)
info <- summary(model, top_n = 7)
info$topwords
info$topdocs
##
## Example on a small sample
## with unrealistic hyperparameter settings especially regarding dim / iter / n_epochs
## in order to have a basic example finishing < 5 secs
##
library(uwot)
library(dbscan)
library(word2vec)
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
text = be_parliament_2020$text_nl,
stringsAsFactors = FALSE)
x <- head(x, 1000)
x$text <- txt_clean_word2vec(x$text)
x <- subset(x, txt_count_words(text) < 1000)
d2v <- paragraph2vec(x, type = "PV-DBOW", dim = 10,
lr = 0.05, iter = 0,
window = 5, hs = TRUE, negative = 0,
sample = 0.00001, min_count = 5)
emb <- list(docs = as.matrix(d2v, which = "docs"),
words = as.matrix(d2v, which = "words"))
model <- top2vec(emb,
data = x,
control.dbscan = list(minPts = 50),
control.umap = list(n_neighbors = 15, n_components = 2,
init = "spectral"),
umap = tumap, trace = TRUE)
info <- summary(model, top_n = 7)
print(info, top_n = c(5, 2))