Description
This function fits a Latent Dirichlet Allocation (LDA) topic model to single-cell RNA-seq data.
Arguments

data: A matrix of (non-negative) RNA-seq expression levels, where each row is a gene and each column is a sequenced cell.

method: LDA inference method to use. Can be any unique prefix of ‘maptpx’, ‘Gibbs’ or ‘VEM’ (defaults to ‘maptpx’).

k.topics: Integer (optional). Number of topics to fit in the model. If method is ‘maptpx’, a vector of topic numbers may be given instead, in which case the best-scoring model is returned (see Details).

log.scale: Boolean (optional). Whether the data should be log-scaled.

sd.filter: Numeric or FALSE (optional). Standard-deviation threshold below which genes are removed from the data before fitting (no filtering if set to FALSE).

tot.iter, tol: Numeric parameters (optional) forwarded to the chosen LDA inference method's control class.
Details

Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups (topics) that explain why some parts of the data are similar [Blei, 2003]. Each topic is modelled as a (Dirichlet) distribution over observations and each set of observations is also modelled as a (Dirichlet) distribution over topics. In lieu of the traditional NLP context of word occurrence counts in documents, our model uses RNA-seq observation counts in single cells. Three separate LDA inference methods can be used at the moment:
Gibbs uses the Collapsed Gibbs Sampling method (implemented by Xuan-Hieu Phan and co-authors in the topicmodels package [Phan, 2008]) to infer the parameters of the Dirichlet distributions for a given number of topics. It gives high accuracy but is very time-consuming to run on large numbers of cells and genes.
VEM uses Variational Expectation-Maximisation (as described in [Hoffman, 2010]). This method tends to converge faster than collapsed Gibbs sampling, albeit with lower accuracy.
maptpx uses the method described in [Taddy, 2011] and implemented in the maptpx package to estimate the parameters of the topic model for an increasing number of topics (using previous estimates as a starting point for larger topic numbers). The best model (i.e. number of topics) is selected based on the Bayes factor over the null model. Although potentially less accurate, this method provides the fastest way to train and select from a large number of models when the number of topics is not well known.
When in doubt, the function can be run with its default parameter values and should produce a usable LDA model in reasonable time (using the ‘maptpx’ inference method). The model can then be further refined for a specific number of topics with the slower methods. While larger models (with a large number of topics) might fit the data well, there is a high risk of overfitting, and it is recommended to use the smallest number of topics that still explains the observations well. Anecdotally, a typical number of topics for cell differentiation data (from pluripotent to fully specialised) seems to be around 4 or 5.
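As a rough sketch of that workflow (using the example data from the Examples section below; the refined topic number is a placeholder chosen for illustration):

# Fast first pass: let 'maptpx' (the default method) select a model:
lda.rough = compute.lda(HSMM_expr_matrix)
# Suppose the selected model has 4 topics; refine a single 4-topic model
# with the slower but more accurate collapsed Gibbs sampler:
lda.refined = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")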
Value

An LDA model fitted to data, of class LDA-class (for methods 'Gibbs' or 'VEM') or topics (for 'maptpx').
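The class of the returned object can be checked to see which representation was produced (a sketch, reusing the example data from the Examples section):

lda.gibbs = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")
class(lda.gibbs)   # an LDA-class object (from the topicmodels package)
lda.maptpx = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")
class(lda.maptpx)  # a 'topics' object (from the maptpx package)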
References

Blei, Ng, and Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.

Hoffman, Blei and Bach (2010). “Online Learning for Latent Dirichlet Allocation.” In J Lafferty, CKI Williams, J Shawe-Taylor, R Zemel, A Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 856-864. MIT Press, Cambridge, MA.

Hornik and Grün. “topicmodels: An R package for fitting topic models.” Journal of Statistical Software 40.13 (2011): 1-30.

Phan, Nguyen and Horiguchi. “Learning to classify short and sparse text & web with hidden topics from large-scale data collections.” Proceedings of the 17th International Conference on World Wide Web. ACM, 2008.

Taddy. “On estimation and selection for topic models.” arXiv preprint arXiv:1109.4518 (2011).
See Also

LDA, topics, LDA_Gibbscontrol-class, CTM_VEMcontrol-class
Examples

# Load skeletal myoblast RNA-Seq data from the HSMMSingleCell package:
library(HSMMSingleCell)
data(HSMM_expr_matrix)

# Run LDA inference using the 'maptpx' method for k = 4 topics:
lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="maptpx")

# Run LDA inference using the 'maptpx' method for number of topics k = 3 to 6:
lda.results = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")

# Run LDA inference using the 'Gibbs' (collapsed sampling) method for k = 4 topics:
lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")