madlib.lda: Wrapper for MADlib's Latent Dirichlet Allocation


Description

This function is a wrapper for MADlib's Latent Dirichlet Allocation (LDA). If the connected database is distributed, MADlib parallelizes the computation. Please refer to the MADlib documentation for details of the algorithm's implementation [1].

Usage

madlib.lda(data, topic_num, alpha, beta, iter_num = 20,
  nstart = 1, best = TRUE, ...)

Arguments

data

An object of db.obj class. This is the database table containing the documents on which the algorithm will train. The text of each document should be tokenized into 'words'.

topic_num

Number of topics.

alpha

Dirichlet parameter for the per-document topic multinomial.

beta

Dirichlet parameter for the per-topic word multinomial.

iter_num

Number of iterations.

nstart

Number of repeated random starts.

best

If TRUE, only the model with the minimum perplexity among the nstart runs is returned.

...

Other optional parameters. Not implemented.
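
The data argument expects documents whose text has already been tokenized into words. A minimal sketch of that preprocessing step in base R, done before the data is loaded into the database, might look like the following. The sample documents, the column names docid, contents and words, and the lowercase-and-split rule are illustrative assumptions, not requirements stated by PivotalR.

```r
# Hypothetical raw documents; in practice these come from your own corpus.
docs <- data.frame(
  docid = 1:3,
  contents = c("Human machine interface for lab computer applications",
               "A survey of user opinion of computer system response time",
               "The EPS user interface management system"),
  stringsAsFactors = FALSE
)

# Assumed tokenization rule: lowercase, then split on runs of non-letters.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]  # drop empty strings left by leading separators
}

# One list element per document, each a character vector of words.
docs$words <- lapply(docs$contents, tokenize)

print(docs$words[[3]])
```

Real corpora usually need more careful handling (stop words, punctuation inside words, non-ASCII text), so treat the rule above only as a starting point.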

Value

An lda.madlib object, or a list of such objects when several models are kept (nstart > 1 and best = FALSE). Each lda.madlib object is a list that contains the following items:

assignments

The per-document topic assignments.

document_sums

The per-document topic counts.

model_table

The db.table object for accessing the model table in the database.

output_table

The db.table object for accessing the output table in the database.

tf_table

The db.table object for accessing the term frequency table in the database.

topic_sums

The per-topic sum of assignments.

topics

The per-word association with topics.

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

References

[1] Documentation of LDA in the latest MADlib release, https://madlib.apache.org/docs/latest/group__grp__lda.html

See Also

predict.lda.madlib labels test documents using a learned lda.madlib model.

perplexity.lda.madlib is used for computing the perplexity of a learned lda.madlib model.
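
For intuition about what a perplexity value means, the sketch below computes the textbook definition, the exponential of the negative mean log-probability per token, in plain R. It assumes that definition and uses made-up token probabilities; the function perplexity_from_probs is hypothetical and is not part of PivotalR, and the exact computation MADlib performs is described in [1].

```r
# Textbook perplexity: exp of the negative mean log-probability per token.
# `token_probs` holds the probability the model assigns to each observed token.
perplexity_from_probs <- function(token_probs) {
  stopifnot(all(token_probs > 0), all(token_probs <= 1))
  exp(-mean(log(token_probs)))
}

# Made-up example: a model that is uniformly unsure among 4 words per token.
uniform_probs <- rep(1 / 4, 8)

# Made-up example: a model that is more confident on some tokens.
sharper_probs <- c(0.5, 0.5, 0.25, 0.25)

print(perplexity_from_probs(uniform_probs))
print(perplexity_from_probs(sharper_probs))
```

Under this definition a lower value indicates a better fit, which is why the model with the minimum perplexity is the one kept when best = TRUE.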

Examples

## Not run: 


## Set up the database connection.
## Assume that .port is the port number and .dbname is the database name.
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

## Wrap the existing table of tokenized documents
dat <- db.data.frame("__madlib_pivotalr_lda_data__", conn.id = cid,
                     verbose = FALSE)

## Train a 2-topic model with alpha = beta = 0.1 for 50 iterations
output.db <- madlib.lda(dat, topic_num = 2, alpha = 0.1, beta = 0.1,
                        iter_num = 50)

perplexity.db <- perplexity.lda.madlib(output.db)
print(perplexity.db)

## Run LDA multiple times and get the best one (best = TRUE by default)
output.db <- madlib.lda(dat, topic_num = 2, alpha = 0.1, beta = 0.1,
                        iter_num = 50, nstart = 2)
perplexity.db <- perplexity.lda.madlib(output.db)
print(perplexity.db)

## Run LDA multiple times and keep all models
output.db <- madlib.lda(dat, topic_num = 2, alpha = 0.1, beta = 0.1,
                        iter_num = 50, nstart = 2, best = FALSE)

## output.db is now a list with one lda.madlib object per run
perplexity.db <- perplexity.lda.madlib(output.db[[1]])
print(perplexity.db)

perplexity.db <- perplexity.lda.madlib(output.db[[2]])
print(perplexity.db)

db.disconnect(cid)

## End(Not run)
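
When all models are kept with best = FALSE, the minimum-perplexity selection that best = TRUE performs can be sketched by hand. The perplexity values below are made up so that the snippet runs without a database; in a real session they would come from calling perplexity.lda.madlib on each element of the returned list.

```r
# Hypothetical perplexities for three random starts (made-up numbers).
perplexities <- c(run1 = 152.4, run2 = 147.9, run3 = 160.2)

# Index of the run with the lowest perplexity.
best_index <- which.min(perplexities)
print(names(perplexities)[best_index])

# With a real list of models, the chosen one would be output.db[[best_index]].
```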

PivotalR documentation built on March 13, 2021, 1:06 a.m.