spark.lda: Latent Dirichlet Allocation

Description

spark.lda fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call summary to get a summary of the fitted LDA model, spark.posterior to compute posterior probabilities on new data, spark.perplexity to compute log perplexity on new data, and write.ml/read.ml to save/load fitted models.
Usage

spark.lda(data, ...)
spark.posterior(object, newData)
spark.perplexity(object, data)

## S4 method for signature 'SparkDataFrame'
spark.lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)

## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)

## S4 method for signature 'LDAModel,SparkDataFrame'
spark.perplexity(object, data)

## S4 method for signature 'LDAModel,SparkDataFrame'
spark.posterior(object, newData)

## S4 method for signature 'LDAModel,character'
write.ml(object, path, overwrite = FALSE)
Arguments

data
    A SparkDataFrame for training.

...
    Additional argument(s) passed to the method.

object
    A Latent Dirichlet Allocation model fitted by spark.lda.

newData
    A SparkDataFrame for testing.

features
    Features column name. Either a libSVM-format column or a character-format
    column is valid (a character-format column is used in the sketch after
    this table).

k
    Number of topics.

maxIter
    Maximum number of iterations.

optimizer
    Optimizer used to train the LDA model, "online" or "em"; the default is
    "online".

subsamplingRate
    (For the online optimizer) Fraction of the corpus sampled and used in
    each iteration of mini-batch gradient descent, in range (0, 1].

topicConcentration
    Concentration parameter (commonly named "beta" or "eta") for the prior
    placed on topics' distributions over terms; the default -1 lets Spark
    set it automatically.

docConcentration
    Concentration parameter (commonly named "alpha") for the prior placed on
    documents' distributions over topics ("theta"); the default -1 lets Spark
    set it automatically.

customizedStopWords
    Stopwords to be removed from the given corpus. Ignored if a libSVM-format
    column is used as the features column.

maxVocabSize
    Maximum vocabulary size; default is 1 << 18.

maxTermsPerTopic
    Maximum number of terms to collect for each topic. Defaults to 10.

path
    The directory where the model is saved.

overwrite
    Whether to overwrite if the output path already exists. Default is FALSE,
    which means an exception is thrown if the output path exists.
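The following is a minimal sketch, not part of the original page, of a call that exercises the character-format features column together with several of the tuning arguments above. The column name "text", the toy documents, and the stop-word list are illustrative assumptions.

## Not run:
# Hypothetical corpus: a character-format "text" column (the column name and
# documents are assumptions for this sketch, not part of the API).
corpus <- createDataFrame(
  data.frame(text = c("spark mllib supports topic modeling",
                      "latent dirichlet allocation finds topics",
                      "topics are distributions over terms"),
             stringsAsFactors = FALSE))

# Fit on the character column with the online optimizer; the stop-word list
# here is illustrative.
model <- spark.lda(corpus, features = "text", k = 3, maxIter = 10,
                   optimizer = "online", subsamplingRate = 1.0,
                   customizedStopWords = c("are", "over"))
summary(model)
## End(Not run)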
Value

spark.lda returns a fitted Latent Dirichlet Allocation model.

summary returns summary information of the fitted model, which is a list.
The list includes

docConcentration
    Concentration parameter commonly named "alpha" for the prior placed on
    documents' distributions over topics ("theta").

topicConcentration
    Concentration parameter commonly named "beta" or "eta" for the prior
    placed on topics' distributions over terms.

logLikelihood
    Log likelihood of the entire corpus.

logPerplexity
    Log perplexity.

isDistributed
    TRUE for a distributed model, FALSE for a local model.

vocabSize
    Number of terms in the corpus.

topics
    Top 10 terms and their weights for all topics.

vocabulary
    All terms of the training corpus; NULL if a libSVM-format file was used
    as the training set.

trainingLogLikelihood
    Log likelihood of the observed tokens in the training set, given the
    current parameter estimates:
    log P(docs | topics, topic distributions for docs, Dirichlet
    hyperparameters). Only available for the distributed LDA model (i.e.,
    optimizer = "em").

logPrior
    Log probability of the current parameter estimate:
    log P(topics, topic distributions for docs | Dirichlet hyperparameters).
    Only available for the distributed LDA model (i.e., optimizer = "em").

spark.perplexity returns the log perplexity of the given SparkDataFrame, or
the log perplexity of the training data if the argument "data" is missing.

spark.posterior returns a SparkDataFrame containing vectors of posterior
probabilities in a column named "topicDistribution".
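The summary list can be indexed by the names above. A brief sketch, under the assumption that model has been fitted as in the Examples section below:

## Not run:
stats <- summary(model, maxTermsPerTopic = 5)
stats$topics          # top terms of each topic and their weights
stats$logPerplexity   # log perplexity of the training corpus
stats$isDistributed   # TRUE only when optimizer = "em"

# With the "data" argument missing, the training-set log perplexity is
# returned.
spark.perplexity(model)
## End(Not run)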
Note

spark.lda since 2.1.0

summary(LDAModel) since 2.1.0

spark.perplexity(LDAModel) since 2.1.0

spark.posterior(LDAModel) since 2.1.0

write.ml(LDAModel, character) since 2.1.0
See Also

topicmodels: https://cran.r-project.org/package=topicmodels

read.ml
Examples

## Not run:
text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
model <- spark.lda(data = text, optimizer = "em")

# get a summary of the model
summary(model)

# compute posterior probabilities
posterior <- spark.posterior(model, text)
showDF(posterior)

# compute perplexity
perplexity <- spark.perplexity(model, text)

# save and load the model
path <- "path/to/model"
write.ml(model, path)
savedModel <- read.ml(path)
summary(savedModel)
## End(Not run)