ml_lda: Latent Dirichlet Allocation


View source: R/ml_clustering.R

Description

ml_lda fits a Latent Dirichlet Allocation model on a spark_tbl. Users can call summary to get a summary of the fitted LDA model.

Usage

ml_lda(
  data,
  features = "features",
  k = 10,
  maxIter = 20,
  optimizer = c("online", "em"),
  subsamplingRate = 0.05,
  topicConcentration = -1,
  docConcentration = -1,
  customizedStopWords = "",
  maxVocabSize = bitwShiftL(1, 18)
)

## S4 method for signature 'LDAModel'
summary(object, maxTermsPerTopic)

ml_perplexity(object, data)

ml_posterior(object, newData)

## S4 method for signature 'LDAModel,character'
write_ml(object, path, overwrite = FALSE)

Arguments

data

A spark_tbl for training.

features

Features column name. Either a libSVM-format column or a character-format column is valid.

k

Number of topics.

maxIter

Maximum iterations.

optimizer

Optimizer used to train the LDA model, either "online" or "em". Default is "online".

subsamplingRate

(For online optimizer) Fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent, in range (0, 1].

topicConcentration

Concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms. The default of -1 lets Spark set the value automatically. Use summary to retrieve the effective topicConcentration. Only a numeric of length 1 is accepted.

docConcentration

Concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta). The default of -1 lets Spark set the value automatically. Use summary to retrieve the effective docConcentration. Only a numeric of length 1 or length k is accepted.

customizedStopWords

Stop words to remove from the given corpus. Ignore this parameter if a libSVM-format column is used as the features column.

maxVocabSize

Maximum vocabulary size. Default is bitwShiftL(1, 18), i.e. 1 << 18.

object

A Latent Dirichlet Allocation model fitted by ml_lda.

maxTermsPerTopic

Maximum number of terms to collect for each topic. Default is 10.

newData

A spark_tbl for testing.

path

The directory where the model is saved.

overwrite

Whether to overwrite the output path if it already exists. Default is FALSE, which throws an exception if the output path exists.

...

additional argument(s) passed to the method.
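
The maxVocabSize default is written as bitwShiftL(1, 18) in the usage and as 1 << 18 in the argument description. The short base-R sketch below shows that both spell the same number, 2^18. The variable names are illustrative only, and the smaller shift is just an example of choosing another power of two.

```r
# bitwShiftL(a, n) shifts the integer a left by n bits, i.e. a * 2^n.
max_vocab <- bitwShiftL(1, 18)
print(max_vocab)            # 262144

# The same value expressed as a power of two.
same_as_power <- max_vocab == 2^18
print(same_as_power)        # TRUE

# A smaller cap, e.g. for a quick experiment (illustrative choice only).
small_vocab <- bitwShiftL(1, 16)
print(small_vocab)          # 65536
```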

Value

ml_lda returns a fitted Latent Dirichlet Allocation model.

summary returns a list of summary information about the fitted model. The list includes:

docConcentration

Concentration parameter (commonly named alpha) for the prior placed on document distributions over topics (theta).

topicConcentration

Concentration parameter (commonly named beta or eta) for the prior placed on topic distributions over terms.

logLikelihood

log likelihood of the entire corpus

logPerplexity

log perplexity

isDistributed

TRUE for a distributed model, FALSE for a local model.

vocabSize

number of terms in the corpus

topics

Top terms and their weights for each topic (10 terms per topic by default; see maxTermsPerTopic).

vocabulary

All terms of the training corpus; NULL if a libSVM-format column was used for training.

trainingLogLikelihood

Log likelihood of the observed tokens in the training set, given the current parameter estimates: log P(docs | topics, topic distributions for docs, Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").

logPrior

Log probability of the current parameter estimate: log P(topics, topic distributions for docs | Dirichlet hyperparameters). Available only for a distributed LDA model (i.e., optimizer = "em").

ml_perplexity returns the log perplexity of the given spark_tbl, or the log perplexity of the training data if the "data" argument is missing.

ml_posterior returns a spark_tbl containing posterior probability vectors in a column named "topicDistribution".
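
The calls described above can be combined roughly as follows. This is a minimal sketch, not a tested recipe: it needs a working Spark installation, and it assumes that spark_session() starts a local session and that spark_tbl() accepts a local data.frame, neither of which is stated on this page. The corpus, the stop-word list, and the variable names are made up for illustration, and passing several stop words as a character vector is likewise an assumption.

```r
library(tidyspark)

# Assumption: spark_session() starts (or attaches to) a Spark session.
spark_session()

# A tiny, made-up corpus held in a character column named "features".
docs <- data.frame(
  features = c(
    "spark runs distributed jobs on a cluster",
    "topic models describe documents as mixtures of topics",
    "a cluster schedules distributed spark jobs",
    "documents share topics and topics share terms"
  ),
  stringsAsFactors = FALSE
)

# Assumption: spark_tbl() converts a local data.frame to a spark_tbl.
corpus <- spark_tbl(docs)

# Fit a 2-topic model with the "em" optimizer.
model <- ml_lda(
  corpus,
  features = "features",
  k = 2,
  maxIter = 10,
  optimizer = "em",
  customizedStopWords = c("a", "on", "as", "of", "and")
)

# summary() returns a list; maxTermsPerTopic caps the terms per topic.
fit <- summary(model, maxTermsPerTopic = 5)
fit$topics            # top terms and weights per topic
fit$docConcentration  # effective alpha chosen on the Spark side
fit$logPrior          # only populated for optimizer = "em"

# Log perplexity of a spark_tbl, and per-document topic distributions.
ml_perplexity(model, corpus)
ml_posterior(model, corpus)
```

Because docConcentration and topicConcentration are left at -1 here, reading them back from the summary list is the way to see the values Spark actually used.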

Note

summary(LDAModel) since 2.1.0

ml_perplexity(LDAModel) since 2.1.0

ml_posterior(LDAModel) since 2.1.0

write_ml(LDAModel, character) since 2.1.0

See Also

topicmodels: https://cran.r-project.org/package=topicmodels

read_ml
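
A fitted model can be persisted with write_ml and, presumably, restored with read_ml. The helper below is a hypothetical sketch: save_and_reload is not part of tidyspark, and the assumption that read_ml takes the saved path as its only argument is not confirmed on this page. It requires a running Spark session and a model returned by ml_lda.

```r
library(tidyspark)

# Hypothetical helper: saves a fitted LDA model and loads it back.
# Assumption: read_ml(path) restores a model written by write_ml().
save_and_reload <- function(model, path) {
  # overwrite = TRUE avoids the exception raised when the path exists.
  write_ml(model, path, overwrite = TRUE)
  read_ml(path)
}

# Usage, given `model` from ml_lda():
# restored <- save_and_reload(model, file.path(tempdir(), "lda_model"))
# summary(restored, maxTermsPerTopic = 5)
```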


danzafar/tidyspark documentation built on Sept. 30, 2020, 12:19 p.m.