View source: R/ml_clustering.R
ml_lda fits a Latent Dirichlet Allocation model on a spark_tbl. Users can call summary to get a summary of the fitted LDA model.
ml_lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)
## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)
ml_perplexity(object, data)
ml_posterior(object, newData)
## S4 method for signature 'LDAModel,character'
write_ml(object, path, overwrite = FALSE)
data | A spark_tbl for training.
features | Features column name. Either a libSVM-format column or a character-format column is valid.
k | Number of topics.
maxIter | Maximum number of iterations.
optimizer | Optimizer used to train the LDA model, "online" or "em". Default is "online".
subsamplingRate | (For the online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in the range (0, 1].
topicConcentration | Concentration parameter (commonly named
docConcentration | Concentration parameter (commonly named
customizedStopWords | Stopwords to be removed from the given corpus. This parameter is ignored if a libSVM-format column is used as the features column.
maxVocabSize | Maximum vocabulary size. Default is 1 << 18, i.e. 262144.
object | A Latent Dirichlet Allocation model fitted by
maxTermsPerTopic | Maximum number of terms to collect for each topic. Default is 10.
newData | A spark_tbl for testing.
path | The directory where the model is saved.
overwrite | Whether to overwrite the output path if it already exists. Default is FALSE, which throws an exception if the output path exists.
... | Additional argument(s) passed to the method.
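Two defaults in the usage are written as R expressions rather than literal values. The base-R sketch below shows what they evaluate to. `resolve_optimizer` is a hypothetical helper, and the idea that ml_lda resolves `optimizer` through `match.arg` is an assumption that this page does not confirm.

```r
# Hypothetical helper (not part of the package): mimics how a vector default
# such as c("online", "em") is commonly resolved with match.arg().
resolve_optimizer <- function(optimizer = c("online", "em")) {
  match.arg(optimizer)
}

# When the argument is omitted, match.arg() picks the first choice.
default_optimizer <- resolve_optimizer()

# An explicit, valid choice is returned unchanged.
em_optimizer <- resolve_optimizer("em")

# The maxVocabSize default: shift 1 left by 18 bits (the same value as 1 << 18).
default_vocab_size <- bitwShiftL(1, 18)

print(default_optimizer)
print(em_optimizer)
print(default_vocab_size)
```

Under that assumption, omitting `optimizer` selects "online", which matches the default stated in the argument description.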
ml_lda returns a fitted Latent Dirichlet Allocation model.
summary returns summary information about the fitted model as a list.
The list includes
 | concentration parameter commonly named
 | concentration parameter commonly named
 | log likelihood of the entire corpus
 | log perplexity
 | TRUE for a distributed model, FALSE for a local model
 | number of terms in the corpus
 | top 10 terms of each topic and their weights
 | all terms of the training corpus; NULL if a libSVM-format file is used as the training set
 | Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
 | Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
ml_perplexity returns the log perplexity of the given spark_tbl, or the log perplexity of the training data if the "data" argument is missing.
ml_posterior returns a spark_tbl containing posterior probability vectors in a column named "topicDistribution".
summary(LDAModel) since 2.1.0
ml_perplexity(LDAModel) since 2.1.0
ml_posterior(LDAModel) since 2.1.0
write_ml(LDAModel, character) since 2.1.0
topicmodels: https://cran.r-project.org/package=topicmodels
read_ml
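The functions documented on this page can be combined into one workflow. The following is a minimal sketch, not a tested recipe: it assumes the package that provides ml_lda is attached and that the caller already has a spark_tbl with a "features" column. `run_lda_workflow`, `corpus`, and `model_dir` are hypothetical names introduced for illustration, and only the argument names shown in the usage above are used.

```r
# Sketch only: assumes the package that provides ml_lda() is attached and that
# `corpus` is an existing spark_tbl whose "features" column holds the documents.
# `run_lda_workflow`, `corpus`, and `model_dir` are hypothetical names.
run_lda_workflow <- function(corpus, model_dir, k = 10) {
  # Fit the model with the online optimizer (the documented default).
  model <- ml_lda(corpus, features = "features", k = k, maxIter = 20,
                  optimizer = "online")

  # Summary list, collecting up to 10 terms per topic.
  model_summary <- summary(model, maxTermsPerTopic = 10)

  # Log perplexity of the data passed in.
  log_perplexity <- ml_perplexity(model, corpus)

  # spark_tbl with the "topicDistribution" posterior vectors.
  posterior <- ml_posterior(model, corpus)

  # Save the model, replacing any existing output at model_dir.
  write_ml(model, model_dir, overwrite = TRUE)

  list(summary = model_summary,
       log_perplexity = log_perplexity,
       posterior = posterior)
}
```

Defining the function does not need a Spark session, but calling it does. The signature of read_ml is not shown on this page, so reloading the saved model is left out of the sketch.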