compute.lda: LDA model inference

Description

This function fits a Latent Dirichlet Allocation (LDA) model to single-cell RNA-seq data.

Usage

compute.lda(data, method = "maptpx", k.topics = if (method == "maptpx") 2:15
  else 4, log.scale = TRUE, sd.filter = 0.5, tot.iter = if (method ==
  "Gibbs") 200 else 1e+06, tol = if (method == "maptpx") 0.05 else 10^-5)
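The defaults in the signature above depend on the chosen method. The following base R sketch reproduces only the default expressions to show what they evaluate to for each method. `lda.defaults` is a hypothetical helper, not part of cellTree, and resolving the method prefix with `match.arg` before the defaults are evaluated is an assumption about how prefix matching could work, not a statement about the package's internals.

```r
# Hypothetical helper: mirrors the default arguments of compute.lda
# without fitting anything.
lda.defaults <- function(method = "maptpx",
                         k.topics = if (method == "maptpx") 2:15 else 4,
                         tot.iter = if (method == "Gibbs") 200 else 1e+06,
                         tol = if (method == "maptpx") 0.05 else 10^-5) {
  # Assumption: a unique prefix is expanded to the full method name.
  # Default arguments are evaluated lazily, so they see the expanded name.
  method <- match.arg(method, c("maptpx", "Gibbs", "VEM"))
  list(method = method, k.topics = k.topics, tot.iter = tot.iter, tol = tol)
}

str(lda.defaults())      # maptpx: k.topics = 2:15, tot.iter = 1e+06, tol = 0.05
str(lda.defaults("G"))   # Gibbs:  k.topics = 4,    tot.iter = 200,   tol = 1e-05
str(lda.defaults("VEM")) # VEM:    k.topics = 4,    tot.iter = 1e+06, tol = 1e-05
```

Note that `match.arg` is case-sensitive, so in this sketch "G" resolves to "Gibbs" but "g" would raise an error.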

Arguments

data

A matrix of (non-negative) RNA-seq expression levels where each row is a gene and each column is a sequenced cell.

method

LDA inference method to use. Can be any unique prefix of ‘maptpx’, ‘Gibbs’ or ‘VEM’ (defaults to ‘maptpx’)

k.topics

Integer (optional). Number of topics to fit in the model. If method is ‘maptpx’, k.topics can be a vector of possible topic numbers and the best model (evaluated on Bayes factor vs. a null single-topic model) will be returned.

log.scale

Boolean (optional). Whether the data should be log-scaled.

sd.filter

Numeric or FALSE (optional). Standard-deviation threshold below which genes should be removed from the data (no filtering if set to FALSE).

tot.iter, tol

Numeric parameters (optional) forwarded to the chosen LDA inference method's control class.
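As a rough illustration of what log.scale and sd.filter describe, the sketch below applies a log transform and a per-gene standard-deviation filter to a toy matrix using base R only. `preprocess.expr` is a hypothetical helper, not the package's code; the `log(x + 1)` transform and the order of the two steps are assumptions.

```r
# Hypothetical helper: approximates the preprocessing implied by
# the log.scale and sd.filter arguments.
preprocess.expr <- function(data, log.scale = TRUE, sd.filter = 0.5) {
  if (log.scale) {
    data <- log(data + 1)  # assumption: log(x + 1) keeps zero counts finite
  }
  if (!identical(sd.filter, FALSE)) {
    gene.sd <- apply(data, 1, sd)                       # one value per gene (row)
    data <- data[gene.sd >= sd.filter, , drop = FALSE]  # drop low-variance genes
  }
  data
}

# Toy matrix: rows are genes, columns are cells.
toy <- rbind(geneA = c(0, 10, 200, 0),
             geneB = c(5, 5, 5, 5),
             geneC = c(1, 2, 1, 2))

rownames(preprocess.expr(toy))                     # "geneA"
rownames(preprocess.expr(toy, sd.filter = FALSE))  # all three genes kept
```

Here geneB is constant and geneC varies very little after log-scaling, so both fall below the 0.5 threshold.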

Details

Latent Dirichlet allocation (LDA) is a generative model in which sets of observations are explained by unobserved groups (topics) that account for why some parts of the data are similar [Blei, 2003]. Each topic is modelled as a (Dirichlet) distribution over observations and each set of observations is also modelled as a (Dirichlet) distribution over topics. In lieu of the traditional NLP context of word occurrence counts in documents, our model uses RNA-seq observation counts in single cells. Three separate LDA inference methods can be used at the moment: ‘maptpx’, ‘Gibbs’ and ‘VEM’.

When in doubt, the function can be run with its default parameter values and should produce a usable LDA model in reasonable time (using the ‘maptpx’ inference method). The model can be further refined for a specific number of topics with slower methods. While larger models (using a large number of topics) might fit the data well, there is a high risk of overfitting and it is recommended to use the smallest possible number of topics that still explains the observations well. Anecdotally, a typical number of topics for cell differentiation data (from pluripotent to fully specialised) would seem to be around 4 or 5.
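To make the generative view of LDA described above concrete, here is a minimal simulation sketch in base R, with cells in the role of documents and genes in the role of words. It is not cellTree code: `rdirichlet.sketch`, the toy dimensions, the concentration value 0.5 and the 1000 reads per cell are all assumptions chosen for illustration.

```r
set.seed(1)

# Draw n samples from a Dirichlet(alpha) by normalising Gamma variates.
rdirichlet.sketch <- function(n, alpha) {
  x <- matrix(rgamma(n * length(alpha), shape = alpha),
              nrow = n, byrow = TRUE)
  x / rowSums(x)
}

n.genes <- 50
n.cells <- 10
k <- 3

# Each topic is a distribution over genes (k x n.genes).
topic.gene <- rdirichlet.sketch(k, rep(0.5, n.genes))
# Each cell is a distribution over topics (n.cells x k).
cell.topic <- rdirichlet.sketch(n.cells, rep(0.5, k))

# Per-cell gene probabilities are the topic mixture (n.cells x n.genes).
gene.prob <- cell.topic %*% topic.gene

# Sample 1000 reads per cell; the result has genes in rows and cells in
# columns, matching the layout expected for the data argument.
counts <- sapply(seq_len(n.cells), function(i) {
  rmultinom(1, size = 1000, prob = gene.prob[i, ])
})

dim(counts)
```

Inference runs this process in reverse: given only a count matrix like `counts`, it estimates the topic and cell distributions.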

Value

An LDA model fitted to data, of class LDA-class (for methods 'Gibbs' or 'VEM') or topics (for 'maptpx').

References

See Also

LDA, topics, LDA_Gibbscontrol-class, CTM_VEMcontrol-class

Examples

# Load cellTree, which provides compute.lda:
library(cellTree)

# Load skeletal myoblast RNA-Seq data from HSMMSingleCell package:
library(HSMMSingleCell)
data(HSMM_expr_matrix)

# Run LDA inference using 'maptpx' method for k = 4 topics:
lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="maptpx")

# Run LDA inference using 'maptpx' method for number of topics k = 3 to 6
# (the best model among these is returned):
lda.results = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")

# Run LDA inference using 'Gibbs' (collapsed sampling) method for k = 4 topics:
lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")

cellTree documentation built on Nov. 8, 2020, 5:05 p.m.