getPrototype: Determine the Prototype LDA
In JonasRieger/ldaPrototype: Prototype of Multiple Latent Dirichlet Allocation Runs

getPrototype

R Documentation

Determine the Prototype LDA

Description

Returns the Prototype LDA of a set of LDAs. The set is given as an LDABatch object, an LDARep object, or a list of LDAs. If the matrix of pairwise S-CLOP scores, sclop, is passed, the scores are not recalculated.

Usage

getPrototype(...)

## S3 method for class 'LDARep'
getPrototype(
  x,
  vocab,
  limit.rel,
  limit.abs,
  atLeast,
  progress = TRUE,
  pm.backend,
  ncpus,
  keepTopics = FALSE,
  keepSims = FALSE,
  keepLDAs = FALSE,
  sclop,
  ...
)

## S3 method for class 'LDABatch'
getPrototype(
  x,
  vocab,
  limit.rel,
  limit.abs,
  atLeast,
  progress = TRUE,
  pm.backend,
  ncpus,
  keepTopics = FALSE,
  keepSims = FALSE,
  keepLDAs = FALSE,
  sclop,
  ...
)

## Default S3 method:
getPrototype(
  lda,
  vocab,
  id,
  job,
  limit.rel,
  limit.abs,
  atLeast,
  progress = TRUE,
  pm.backend,
  ncpus,
  keepTopics = FALSE,
  keepSims = FALSE,
  keepLDAs = FALSE,
  sclop,
  ...
)

Arguments

`...`	additional arguments
`x`	[`named list`] `LDABatch` or `LDARep` object.
`vocab`	[`character`] Vocabulary taken into consideration when merging topic matrices. Default is the vocabulary of the first LDA. Not considered if `sclop` is passed.
`limit.rel`	[0,1] See `jaccardTopics`. Default is `1/500`. Not used for calculation if `sclop` is passed, but should still be passed so that the resulting object records the correct value.
`limit.abs`	[`integer(1)`] See `jaccardTopics`. Default is `10`. Not used for calculation if `sclop` is passed, but should still be passed so that the resulting object records the correct value.
`atLeast`	[`integer(1)`] See `jaccardTopics`. Default is `0`. Not used for calculation if `sclop` is passed, but should still be passed so that the resulting object records the correct value.
`progress`	[`logical(1)`] Should a progress bar be shown for the steps of `mergeTopics` and `jaccardTopics`? Turning it off can make the calculation significantly faster. Default is `TRUE`. Not considered if `sclop` is passed.
`pm.backend`	[`character(1)`] One of "multicore", "socket" or "mpi". If `pm.backend` is set, `parallelStart` is called before the computation starts and `parallelStop` is called afterwards. Not considered if `sclop` is passed.
`ncpus`	[`integer(1)`] Number of (physical) CPUs to use. If `pm.backend` is passed, the default is determined by `availableCores`. Not considered if `sclop` is passed.
`keepTopics`	[`logical(1)`] Should the merged topic matrix from `mergeTopics` be kept? Not considered if `sclop` is passed.
`keepSims`	[`logical(1)`] Should the calculated topic similarity matrix from `jaccardTopics` be kept? Not considered if `sclop` is passed.
`keepLDAs`	[`logical(1)`] Should the considered LDAs be kept?
`sclop`	[`symmetrical named matrix`] (optional) All pairwise S-CLOP scores of the given LDA runs determined by `SCLOP.pairwise`. Matching of names is not implemented yet, so the order of rows and columns must match the order of the LDAs.
`lda`	[`named list`] List of `LDA` objects, named by the corresponding "job.id".
`id`	[`character(1)`] A name for the computation. If not passed, it is set to "LDARep". Not considered for `LDABatch` or `LDARep` objects.
`job`	[`data.frame` or `named vector`] A data.frame or data.table with named columns (at least) "job.id" (`integerish`), "K", "alpha", "eta" and "num.iterations" or a named vector with entries (at least) "K", "alpha", "eta" and "num.iterations". If not passed, it is interpreted from `param` of each LDA. Not considered for `LDABatch` or `LDARep` objects.
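
Because names in `sclop` are not matched to the LDAs, it can help to reorder the matrix explicitly before passing it. The following is a minimal base R sketch of that idea. `align_sclop` is a hypothetical helper, not part of ldaPrototype, and the scores are made up for illustration.

```r
# Hypothetical helper (not part of ldaPrototype): reorder a symmetric, named
# score matrix so that its rows and columns follow the order of `ids`.
align_sclop <- function(sclop, ids) {
  ids <- as.character(ids)
  stopifnot(
    is.matrix(sclop),
    identical(rownames(sclop), colnames(sclop)),
    setequal(rownames(sclop), ids)
  )
  sclop[ids, ids, drop = FALSE]
}

# Made-up scores whose rows and columns are in the order 3, 1, 2.
scores <- matrix(
  c(1.0, 0.7, 0.9,
    0.7, 1.0, 0.8,
    0.9, 0.8, 1.0),
  nrow = 3,
  dimnames = list(c("3", "1", "2"), c("3", "1", "2"))
)

# Bring the matrix into the order of the (assumed) job ids 1, 2, 3.
aligned <- align_sclop(scores, ids = c(1, 2, 3))
rownames(aligned)
```

In practice, `ids` would presumably be the names of the list passed as `lda`, since that list is named by "job.id".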

Details

While LDAPrototype is the overall shortcut for performing multiple LDA runs and choosing their Prototype, getPrototype only covers the step of determining the Prototype. The multiple LDAs have to be generated before this function is used. The function can be used at two or more points of the analysis: after generating the LDAs (whether as an LDABatch or LDARep object) or after determining the pairwise S-CLOP values.
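
The selection step itself can be illustrated in base R. This is a minimal sketch, not the package's implementation: it assumes that the Prototype is the run with the highest mean pairwise S-CLOP score to all other runs, as the package description states, and it uses made-up scores with the diagonal left as `NA`.

```r
# Made-up pairwise scores for four runs, named by (assumed) job ids.
ids <- c("1", "2", "3", "4")
sclop <- matrix(NA_real_, nrow = 4, ncol = 4, dimnames = list(ids, ids))

# Fill the lower triangle column by column: (2,1), (3,1), (4,1), (3,2), (4,2), (4,3).
sclop[lower.tri(sclop)] <- c(0.80, 0.70, 0.60, 0.90, 0.75, 0.65)
# Mirror the lower triangle into the upper triangle to make the matrix symmetric.
sclop[upper.tri(sclop)] <- t(sclop)[upper.tri(sclop)]

# Mean score of each run to all other runs; the NA diagonal is ignored.
mean_score <- rowMeans(sclop, na.rm = TRUE)

# Assumed selection rule: the run with the highest mean score is the Prototype.
protoid <- names(mean_score)[which.max(mean_score)]
protoid
```

Here run "2" has the highest mean score (about 0.82), so it would be selected under the assumed rule.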

To save memory, most interim results are discarded by default; use keepTopics, keepSims and keepLDAs to retain them.

If you use parallel computation, no progress bar is shown.

For details see the details sections of the workflow functions.

Value

[named list] with entries

id: [character(1)] See above.
protoid: [character(1)] Name (ID) of the determined Prototype LDA.
lda: List of LDA objects of the determined Prototype LDA and - if keepLDAs is TRUE - all considered LDAs.
jobs: [data.table] with parameter specifications for the LDAs.
param: [named list] with parameter specifications for limit.rel [0,1], limit.abs [integer(1)] and atLeast [integer(1)]. See above for explanation.
topics: [named matrix] with the counts of vocabulary words (row-wise) per topic (column-wise).
sims: [lower triangular named matrix] with all pairwise Jaccard similarities of the given topics.
wordslimit: [integer] with counts of words determined as relevant based on limit.rel and limit.abs.
wordsconsidered: [integer] with counts of considered words for similarity calculation. Could differ from wordslimit, if atLeast is greater than zero.
sclop: [symmetrical named matrix] with all pairwise S-CLOP scores of the given LDA runs.
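
The relation between wordslimit and wordsconsidered can be sketched in base R. This is an illustration under stated assumptions, not the code of `jaccardTopics`: it assumes that a word is relevant for a topic when its count exceeds both limit.abs and limit.rel times the topic's total count, and that topics with fewer than atLeast relevant words instead take their atLeast most frequent words, ties included. The counts are made up.

```r
# Toy topic matrix: rows are words, columns are topics (made-up counts).
topics <- matrix(
  c(50, 30, 12, 12, 3, 1,
    40,  5,  4,  4, 2, 1),
  ncol = 2,
  dimnames = list(paste0("w", 1:6), c("t1", "t2"))
)
limit.rel <- 1/500
limit.abs <- 10
atLeast <- 3

# Assumed rule: a word is relevant if its count exceeds both thresholds.
rel_threshold <- colSums(topics) * limit.rel
relevant <- topics > limit.abs & sweep(topics, 2, rel_threshold, ">")
wordslimit <- colSums(relevant)

# Assumed fallback: topics with fewer than atLeast relevant words take their
# atLeast most frequent words instead, keeping all words tied at the cutoff.
considered <- relevant
for (k in which(wordslimit < atLeast)) {
  cutoff <- sort(topics[, k], decreasing = TRUE)[atLeast]
  considered[, k] <- topics[, k] >= cutoff
}
wordsconsidered <- colSums(considered)

wordslimit
wordsconsidered
```

In this toy example, topic t2 has only one word above the limits, so the fallback applies; because two words tie at the third-highest count, four words are considered rather than three.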

Examples

library(ldaPrototype) # provides LDARep, reuters_docs and reuters_vocab

res = LDARep(docs = reuters_docs, vocab = reuters_vocab,
   n = 4, K = 10, num.iterations = 30)
topics = mergeTopics(res, vocab = reuters_vocab)
jacc = jaccardTopics(topics, atLeast = 2)
dend = dendTopics(jacc)
sclop = SCLOP.pairwise(jacc)

getPrototype(lda = getLDA(res), sclop = sclop)

proto = getPrototype(res, vocab = reuters_vocab, keepSims = TRUE,
   limit.abs = 20, atLeast = 10)
proto
getPrototype(proto) # = getLDA(proto)
getConsideredWords(proto)
# can exceed 10 if several words tie for the 10th most frequent word
getRelevantWords(proto)
getSCLOP(proto)

JonasRieger/ldaPrototype documentation built on Feb. 5, 2023, 6:45 p.m.