Let me know if you run into any problems.
Here's a link to the paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3044864
The method in this code differs slightly from the one described in the paper (the paper has not yet been updated to match). It uses the square root of counts, rather than raw counts, in the word co-occurrence matrix. As a result, the first dimension of the output picks up word frequency and document length. This frequency dimension is labeled 'X0' in the output.
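A minimal sketch (not parrot's internal code) of why square-rooted counts make the leading SVD dimension track frequency: build a toy term-document matrix whose columns have very different word frequencies, take the square root, and compare the first right singular vector to the raw frequencies.

```r
## Illustrative toy example, assuming nothing about parrot's internals:
## words drawn with increasingly large Poisson rates are increasingly
## common, and the first singular vector of the sqrt matrix tracks that.
set.seed(1)
lambdas <- c(1, 2, 4, 8, 16)                     # increasingly common words
tdm <- sapply(lambdas, function(l) rpois(50, l)) # 50 docs x 5 words
sv <- svd(sqrt(tdm))
freq <- colSums(tdm)
cor(abs(sv$v[, 1]), freq)  # strongly positive: dimension 1 ~ frequency
```

This is the same relationship the `cor(sqrt(n_words), X0, ...)` check near the end of the example is probing at the document level.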
The pivoting is also done in two stages, with a standardization between stages, to produce a sharper separation between common and rare words and to help with messier data sets (e.g., tweets).
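The between-stage standardization can be sketched with base R's `scale()`; this illustrates only the operation itself, not parrot's exact implementation.

```r
## Illustrative only: column-wise standardization of the kind applied
## between pivot stages (parrot's internal details may differ).
set.seed(2)
m <- matrix(rnorm(40, mean = 5, sd = 3), nrow = 8)
m_std <- scale(m)        # center each column to mean 0, scale to sd 1
colMeans(m_std)          # approximately zero
apply(m_std, 2, sd)      # one
```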
The scale_text function suggests a truncation rank for the SVD. The suggested rank is a function of the vocabulary size and yields approximately the same number of pivot words.
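The truncation itself amounts to keeping only the top k singular vectors. The rank below is set by hand, since scale_text's vocabulary-size formula is not reproduced here.

```r
## Hedged sketch of applying an SVD truncation. The rank k would come
## from scale_text's vocabulary-size heuristic; here it is hypothetical.
set.seed(3)
A <- matrix(rnorm(200 * 50), nrow = 200)  # stand-in for the scaled matrix
k <- 10                                   # hypothetical suggested rank
decomp <- svd(A, nu = k, nv = k)          # keep only k left/right vectors
dim(decomp$u)                             # 200 x k
dim(decomp$v)                             # 50 x k
```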
## Not run:
## install.packages(
## c("stm","ggplot2","gridExtra","Matrix",
## "reshape2","ForeCA","devtools","magrittr"))
#### word embeddings
## install.packages(c("dplyr","readr","tidyr","CCA"))
#### recommended
## install.packages(c("RSpectra","roxygen2"))
#### optional
## install.packages(c("knitr"))
library(devtools)
##
install_github("wilryh/parrot", dependencies=TRUE)
library(stm)
library(parrot)
processed <- textProcessor(
    input_data$text, # input_data: your data frame with a 'text' column
    data.frame(input_data),
    removestopwords=TRUE, lowercase=TRUE, stem=FALSE
)
out <- prepDocuments(
processed$documents, processed$vocab, processed$meta
)
tdm <- doc_to_tdm(out)
embeddings <- read_word_embeddings(
in_vocab=out$vocab,
ovefile = "O2M_overlap.txt" # must add location on your computer
## "path/to/O2M_overlap.txt"
## ovefile2 = "path/to/O2M_oov.txt", # very rare words and misspellings
## available here:
## http://www.cis.uni-muenchen.de/~wenpeng/renamed-meta-emb.tar.gz
## must unpack and replace "path/to/" with location on your computer
)
scores <- scale_text(
meta=out$meta,
tdm=tdm,
embeddings=embeddings[["meta"]],
compress_fast=TRUE,
constrain_outliers=TRUE
)
document_scores <- score_documents(
scores=scores, n_dimensions=10
)
get_keywords(scores, n_dimensions=3, n_words=15)
with(document_scores, cor(sqrt(n_words), X0, use="complete"))
plot_keywords(
scores, x_dimension=1, y_dimension=2, q_cutoff=0.9
)
## End(Not run)