find_topics: Perform topic estimation on a themetadata object

Description

Given a themetadata object, this function converts the OTU counts across samples into a document format and then fits a structural topic model by wrapping the stm function from package stm.

Usage

find_topics(
  themetadata_object,
  K,
  sigma_prior = 0,
  model = NULL,
  iters = 500,
  tol = 1e-05,
  batches = 1,
  init_type = c("Spectral", "LDA", "Random"),
  seed = themetadata_object$seed,
  verbose = FALSE,
  verbose_n = 5,
  control = list()
)

Arguments

themetadata_object

(required) Output of prepare_data.

K

(required) A positive integer indicating the number of topics to be estimated.

sigma_prior

Scalar between 0 and 1. This sets the strength of regularization towards a diagonalized covariance matrix. Setting the value above 0 can be useful if topics are becoming too highly correlated. Defaults to 0.

model

Prefit STM model object to restart an existing model.

iters

Maximum number of EM iterations. Defaults to 500.

tol

Convergence tolerance. Defaults to 1e-5.

batches

Number of groups for memorized inference. Defaults to 1.

init_type

Type of initialization procedure: "Spectral", "LDA", or "Random". Defaults to "Spectral".

seed

Seed for the random number generator to reproduce previous results.

verbose

Logical flag to print progress information. Defaults to FALSE.

verbose_n

Integer determining the intervals at which labels are printed.

control

List of additional parameters that control portions of the optimization. See Details.
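
As a rough sketch of how these arguments combine, the call below sets a nonzero sigma_prior and a fixed seed. The specific values (K = 15, sigma_prior = 0.1, seed = 423) are illustrative assumptions rather than recommendations, and the fit runs only if themetagenomics is installed. The data preparation step mirrors the one in the Examples section.

```r
# Illustrative argument values; these are assumptions, not package recommendations.
args <- list(
  K = 15,                  # number of topics
  sigma_prior = 0.1,       # mild regularization toward a diagonal covariance
  iters = 500,             # maximum number of EM iterations
  tol = 1e-5,              # convergence tolerance
  init_type = "Spectral",
  seed = 423,              # fixed seed for reproducibility
  verbose = TRUE,
  verbose_n = 5
)

# Fit only when the package (and its GEVERS example data) is available.
if (requireNamespace("themetagenomics", quietly = TRUE)) {
  library(themetagenomics)
  dat <- prepare_data(otu_table = GEVERS$OTU, rows_are_taxa = FALSE,
                      tax_table = GEVERS$TAX, metadata = GEVERS$META,
                      formula = ~DIAGNOSIS, refs = "Not IBD",
                      cn_normalize = TRUE, drop = TRUE)
  topics <- do.call(find_topics, c(list(dat), args))
}
```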

Details

Topics are estimated via stm from the stm package. The focus of the themetagenomics pipeline is leveraging both abundance and predicted functional information of 16S rRNA sequencing; hence, the pipeline calls for the use of only "prevalence" information (to use stm terminology). This wrapper therefore removes any options pertaining to "content." If the user is interested in exploring the content component of the STM, then the stm package itself is the ideal place to start. Given that only the prevalence component can be manipulated using find_topics, the following additional parameters can be passed to control as a list (adapted from stm documentation):

gamma.enet

Scalar between 0 and 1 that controls the mix of L1 and L2 regularization, where 0 corresponds to ridge regression and 1 to lasso regression. Defaults to 1.

gamma.ic.k

Complexity penalty of the information criterion used to select the regularization parameter, where 2 corresponds to AIC and log(n) is equivalent to BIC. Defaults to 2.

gamma.maxits

Maximum number of iterations for estimating prevalence. Defaults to 1000.

nits

For LDA initialization, the number of Gibbs sampling iterations. Defaults to 50.

burnin

For LDA initialization, the number of burnin iterations. Defaults to 25.

alpha

For LDA initialization, the samples over topics distribution hyperparameter.

eta

For LDA initialization, the topics over words distribution hyperparameter.

rp.s

For spectral initialization, scalar between 0 and 1 that controls the degree of sparsity of the random projections. Defaults to 0.05.

rp.p

For spectral initialization, the dimensionality of random projections. Defaults to 3000.

rp.d.group.size

For spectral initialization, the block size. Defaults to 2000.

maxV

For spectral initialization, the maximum number of words used during initialization.
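
A control list built from the parameters above might look like the following sketch. The sample count n_samples is a hypothetical placeholder, and the chosen values are assumptions for illustration; only the parameter names come from the list above.

```r
# Hypothetical number of samples (documents); replace with the real count.
n_samples <- 200

control <- list(
  gamma.enet = 0.5,             # halfway between ridge (0) and lasso (1)
  gamma.ic.k = log(n_samples),  # BIC-style penalty instead of the AIC default of 2
  gamma.maxits = 1000           # cap on iterations for prevalence estimation
)

# Passed through the control argument, for example:
# topics <- find_topics(dat, K = 15, control = control)
```

Because log(n) exceeds 2 once n is larger than about 7, the BIC-style setting penalizes complexity more heavily than the AIC default for any realistically sized dataset.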

Value

An object of class topics containing

fit

STM object containing topic model fit

docs

Abundance table in document form, stored as a list with one element per sample. Each element contains a 2-row array, where row 1 contains the vocabulary index of a given taxon and row 2 contains its abundance in that document.

vocab

Character vector containing the vocabulary of taxa IDs, where the position of each ID corresponds to the vocabulary index used in docs.

otu_table

Original otu_table

tax_table

Original tax_table

metadata

Original metadata

ref

Original covariate references

modelframe

Original modelframe

splineinfo

Original splineinfo
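
To make the docs and vocab layout concrete, the base R sketch below converts a small made-up count table into that form. The toy matrix and the helper name counts_to_docs are assumptions for illustration; this is not the package's internal code, and it assumes taxa with zero counts are simply left out of a document.

```r
# Toy OTU table: rows are samples, columns are taxa (made-up counts).
otu <- matrix(c(3, 0, 5,
                0, 2, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("s1", "s2"), c("otu_a", "otu_b", "otu_c")))

# Hypothetical helper: one 2-row integer matrix per sample, nonzero taxa only.
counts_to_docs <- function(counts) {
  docs <- lapply(seq_len(nrow(counts)), function(i) {
    present <- which(counts[i, ] > 0)          # vocabulary indexes with counts
    rbind(as.integer(present),                 # row 1: vocabulary index
          as.integer(counts[i, present]))      # row 2: abundance in this sample
  })
  names(docs) <- rownames(counts)
  docs
}

vocab <- colnames(otu)      # position in vocab = index stored in row 1
docs <- counts_to_docs(otu)
```

With this layout, vocab[docs$s1[1, ]] recovers the taxa IDs present in sample s1.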

References

Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B., & Rand, D.G. (2014). Structural topic models for open-ended survey responses. Am. J. Pol. Sci. 58, 1064–1082.

See Also

glmnet, stm

Examples

formula <- ~DIAGNOSIS
refs <- 'Not IBD'

dat <- prepare_data(otu_table = GEVERS$OTU, rows_are_taxa = FALSE, tax_table = GEVERS$TAX,
                    metadata = GEVERS$META, formula = formula, refs = refs,
                    cn_normalize = TRUE, drop = TRUE)

## Not run: 
topics <- find_topics(dat,K=15)

## End(Not run)

EESI/themetagenomics documentation built on May 10, 2020, 1:40 a.m.