ml_lda: Latent Dirichlet Allocation
In danzafar/tidyspark: A Tidy Interface to Spark


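ml_lda fits a Latent Dirichlet Allocation (LDA) model on a spark_tbl. Users can call summary to get a summary of the fitted model, ml_posterior to compute posterior probabilities on new data, ml_perplexity to compute log perplexity on new data, and write_ml to save the fitted model.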

ml_lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)

## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)

ml_perplexity(object, data)

ml_posterior(object, newData)

## S4 method for signature 'LDAModel,character'
write_ml(object, path, overwrite = FALSE)
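
A minimal sketch of how the functions above might be combined, wrapped in a hypothetical helper so that nothing touches Spark until the helper is called. It assumes tidyspark has been attached with `library(tidyspark)`, a Spark session is active, and `docs_tbl` is an existing spark_tbl with a `features` column. The helper name, the argument values, and the comments describing what ml_perplexity and ml_posterior return (which follow the analogous SparkR functions) are assumptions, not part of the package documentation.

```r
# Hypothetical helper: fit an LDA model and collect a few diagnostics.
# Attach tidyspark (library(tidyspark)) and start a Spark session before
# calling it; `docs_tbl` is assumed to be a spark_tbl with a "features" column.
fit_and_inspect_lda <- function(docs_tbl, k = 5, path = NULL) {
  # Fit with the "em" optimizer so that the distributed-model fields
  # (trainingLogLikelihood, logPrior) appear in the summary.
  model <- ml_lda(
    docs_tbl,
    features = "features",
    k = k,
    maxIter = 20,
    optimizer = "em"
  )

  # Summary list: concentrations, log likelihood, top terms per topic, etc.
  stats <- summary(model, maxTermsPerTopic = 5)

  # Assumed: log perplexity of the model on the supplied data.
  perplexity <- ml_perplexity(model, docs_tbl)

  # Assumed: posterior topic probabilities for each row of the supplied data.
  posterior <- ml_posterior(model, docs_tbl)

  # Optionally save the model; with overwrite = FALSE an existing path
  # raises an exception.
  if (!is.null(path)) {
    write_ml(model, path, overwrite = FALSE)
  }

  list(model = model, stats = stats,
       perplexity = perplexity, posterior = posterior)
}
```

Calling `fit_and_inspect_lda(docs_tbl)` would then return the fitted model together with its summary, perplexity, and posterior results in one list.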

`data`	A spark_tbl for training.
`features`	Features column name. Either a libSVM-format column or a character-format column is valid.
`k`	Number of topics.
`maxIter`	Maximum iterations.
`optimizer`	Optimizer used to train the LDA model, either "online" or "em". Default is "online".
`subsamplingRate`	(For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].
`topicConcentration`	Concentration parameter (commonly named `beta` or `eta`) for the prior placed on topic distributions over terms. The default of -1 lets Spark set it automatically. Use `summary` to retrieve the effective topicConcentration. Only a numeric of length 1 is accepted.
`docConcentration`	Concentration parameter (commonly named `alpha`) for the prior placed on document distributions over topics (`theta`). The default of -1 lets Spark set it automatically. Use `summary` to retrieve the effective docConcentration. Only a numeric of length 1 or length `k` is accepted.
`customizedStopWords`	Stop words to remove from the given corpus. This parameter can be ignored if a libSVM-format column is used as the features column.
`maxVocabSize`	Maximum vocabulary size. Default is 1 << 18 (262144).
`object`	A Latent Dirichlet Allocation model fitted by `ml_lda`.
`maxTermsPerTopic`	Maximum number of terms to collect for each topic. Default value of 10.
`newData`	A spark_tbl for testing.
`path`	The directory where the model is saved.
`overwrite`	Whether to overwrite the output path if it already exists. Default is FALSE, which means an exception is thrown if the output path exists.
`...`	additional argument(s) passed to the method.
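
The constraints listed above can be checked locally before a job is submitted. The following base-R sketch is a hypothetical pre-flight helper, not part of tidyspark; it only mirrors the documented ranges and lengths, and its use of `match.arg` to resolve the default optimizer is an assumption about how such a vector default is typically handled in R.

```r
# Hypothetical pre-flight check; not part of tidyspark.
check_lda_args <- function(k = 10,
                           optimizer = c("online", "em"),
                           subsamplingRate = 0.05,
                           topicConcentration = -1,
                           docConcentration = -1,
                           maxVocabSize = bitwShiftL(1, 18)) {
  # match.arg() picks the first choice ("online") when the argument is left
  # at its default vector and rejects anything outside the listed choices.
  optimizer <- match.arg(optimizer)

  stopifnot(
    is.numeric(k), length(k) == 1,
    # subsamplingRate must lie in (0, 1].
    is.numeric(subsamplingRate), length(subsamplingRate) == 1,
    subsamplingRate > 0, subsamplingRate <= 1,
    # topicConcentration: a single number.
    is.numeric(topicConcentration), length(topicConcentration) == 1,
    # docConcentration: one value, or one value per topic.
    is.numeric(docConcentration), length(docConcentration) %in% c(1, k)
  )

  list(optimizer = optimizer, maxVocabSize = maxVocabSize)
}

check_lda_args()$maxVocabSize   # 1 << 18, i.e. 262144
check_lda_args(k = 3, docConcentration = c(0.1, 0.2, 0.3))$optimizer  # "online"
```

A call such as `check_lda_args(subsamplingRate = 0)` stops with an error, because 0 lies outside (0, 1].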

ml_lda returns a fitted Latent Dirichlet Allocation model.

summary returns a list of summary information about the fitted model. The list includes:

`docConcentration`	Concentration parameter (commonly named `alpha`) for the prior placed on document distributions over topics (`theta`).
`topicConcentration`	Concentration parameter (commonly named `beta` or `eta`) for the prior placed on topic distributions over terms.
`logLikelihood`	Log likelihood of the entire corpus.
`logPerplexity`	Log perplexity.
`isDistributed`	TRUE for a distributed model, FALSE for a local model.
`vocabSize`	Number of terms in the corpus.
`topics`	Top terms of each topic and their weights, up to `maxTermsPerTopic` terms per topic (10 by default).
`vocabulary`	All terms of the training corpus; NULL if libSVM-format data was used as the training set.
`trainingLogLikelihood`	Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
`logPrior`	Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").
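
Because the last two fields are documented only for distributed models, code that reads them may want to check `isDistributed` first. The sketch below is a hypothetical base-R accessor, not part of tidyspark; it assumes `s` is the list returned by summary, and the placeholder list used to exercise it contains made-up numbers rather than real model output.

```r
# Hypothetical accessor; `s` is assumed to be the list returned by summary()
# on a fitted LDA model.
em_training_stats <- function(s) {
  # trainingLogLikelihood and logPrior are documented only for distributed
  # models, so return NULL for a local (online-optimizer) model.
  if (!isTRUE(s$isDistributed)) {
    return(NULL)
  }
  c(trainingLogLikelihood = s$trainingLogLikelihood, logPrior = s$logPrior)
}

# Placeholder list standing in for a real summary; the numbers are made up.
mock_summary <- list(isDistributed = TRUE,
                     trainingLogLikelihood = -100,
                     logPrior = -5)
em_training_stats(mock_summary)                 # named numeric vector
em_training_stats(list(isDistributed = FALSE))  # NULL
```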