vis: Launch in interactive visualize to explore topic effects
In themetagenomics: Exploring Thematic Structure and Predicted Functionality of 16s rRNA Amplicon Data

Description Usage Arguments Details References See Also Examples

Launch in interactive visualize to explore topic effects

vis(object, ...)

## S3 method for class 'effects'
vis(
  object,
  topic_effects,
  type = c("taxa", "binary", "continuous", "functions"),
  seed = object$seed$next_seed,
  ...
)

## S3 method for class 'binary'
vis(object, taxa_grp_n = 7, ...)

## S3 method for class 'continuous'
vis(object, lambda_step = 0.1, taxa_reg_n = 8, ...)

## S3 method for class 'functions'
vis(
  object,
  topic_effects,
  beta_min = 1e-05,
  ui_level = 0.8,
  gene_min = 0,
  pw_min = 20,
  ...
)

## S3 method for class 'taxa'
vis(
  object,
  taxa_bar_n = 30,
  top_n = 7,
  method = c("huge", "simple"),
  corr_thresh = 0.01,
  lambda_step = 0.01,
  ...
)

## S3 method for class 'topics'
vis(
  object,
  taxa_bar_n = 30,
  top_n = 7,
  method = c("huge", "simple"),
  corr_thresh = 0.01,
  lambda_step = 0.01,
  ...
)

`object`	(required) Output of `find_topics`, or `est.topics`, or `est.functions`.
`...`	Additional arguments for methods.
`topic_effects`	Output of `est.topics`.
`type`	Type of visualization to perform.
`seed`	Seed for the random number generator to reproduce previous results.
`taxa_grp_n`	Number of taxa group names to display (remaining are renamed to other). Defaults to 7.
`lambda_step`	Value designating the lambda stepsize for calculating taxa relevance. Recommended to be between .01 and .1. Defaults to .1.
`taxa_reg_n`	Number of most relevant taxa within topic to regress. Defaults to 8.
`beta_min`	Minimum probability in topics over taxa distribution to set to 0. Defaults to 1e-5.
`ui_level`	Uncertainty level for plot intervals. Defaults to .8.
`gene_min`	Mininum count for gene set table. Defaults to 0.
`pw_min`	Maximum number of pathways to show in heatmap. for Defaults to 20.
`taxa_bar_n`	Number of taxa to show in the frequency bar plot. Defaults to 30.
`top_n`	Number of taxonomic groups to colorize in the frequency bar plot. Defaults to 7.
`method`	Method for estimating topic correlations links. Defaults to huge.
`corr_thresh`	Threshold to set correlations to 0 when method is set to simple. Defaults to .01.

Taxa

Integrates the samples over topics p(s|k) and topics over taxa p(k|t) distributions from the STM, the topic correlations from the p(s|k) component, the covariate effects from the p(s|k) component, and their relationship with the raw taxonomic abundances. The covariate effects for each topic are shown as a scatterplot of posterior weights with error bars corresponding the global approximation of uncertainty. If the covariate chosen is binary, then the posterior regression weights with uncertainty intervals are shown. This is analogous to the mean difference between factor levels in the posterior predictive distribution. For continuous covariates, the points again represent the mean regression weights (i.e., the posterior slope estimate of the covariate). If, however, a spline or polynomial expansion was used, then the figure shows the posterior estimates of the standard deviation of the predicted topic probabilities from the posterior predictive distribution. Colors indicate whether a given point was positive (red) or negative (blue) and did not enclose 0 at a user defined uncertainty interval.

The ordination figure maintains the color coding just described. The ordination is performed on p(k|t) via either PCoA (using either Jensen-Shannon, Euclidean, Hellinger, Bray-Curtis, Jaccard, or Chi-squared distance) or t-SNE. The latter iterates through decreasing perplexity values (starting at 30) until the algorithm succeeds. The top 2 or 3 axes can be shown. The radius of the topic points corresponds to the topic frequencies marginalized over taxa.

The bar plot behaves in accordance with LDAvis. When no topics are chosen, the overall taxa frequencies are shown. These frequencies do not equal the abundances found in the initial abundance table. Instead, they show p(k|t) multiplied by the marginal topic distribution (in counts). To determine the initial order in which taxa are shown, these two distributions are compared via Kullback-Liebler divergence and then weighted by the overall taxa frequency. The coloration of the bars indicates the taxonomic group the individual taxa belong to. The groups shown are determined based on the abundance of that group in the raw abundance table. When a topic is selected, the relative frequency of a given taxa in that topic is shown in red.

λ controls relevance of taxa within a topic, which in turn is used to adjust the order in which the taxa are shown when a topic is selected. Relevance is essentially a weighted sum between the probability of taxa in a given topic and the probability of taxa in a given topic relative to the overall frequency of that taxa. Adjusting λ influences the relative weighting such that

r = λ x log p(t|k) + λ x log p(t|k)/p(x)

The correlation graph shows the topic correlations from p(s|k) ~ MVN(mu,sigma). Again, the coloration described above is conserved. The size of the nodes reflects the magnitude of the covariate posterior regression weight, whereas the width of the edges represents the value of the positive correlation between the connected nodes. By default, the graph estimates are determined using the the huge package, which first performs a nonparanormal transformation of p(s|k), followed by a Meinhuasen and Buhlman procedure. Alternatively, by choosing the simple method, the correlations are simply a thresholded MAP estimate of p(s|k).

Binary

Integrates the topics over taxa p(k|t) distribution from the STM, binary covariate effects from the p(s|k) component, and their relationship with the raw taxonomic abundances. The covariate effects for each topic are shown as a scatterplot of posterior weights with error bars corresponding the global approximation of uncertainty. If the covariate chosen is binary, then the posterior regression weights with uncertainty intervals are shown. This is analogous to the mean difference between factor levels in the posterior predictive distribution. For continuous covariates, the points again represent the mean regression weights (i.e., the posterior slope estimate of the covariate). Colors indicate whether a given point was positive (red) or negative (blue) and did not enclose 0 at a user defined uncertainty interval.

Selecting a topic estimate generates violin plots showing the p(s|k) distribution, split based on chosen binary covariate effects. The slider allows the user to threshold the number of points shown, based on their values in p(s|k). Highlighting points in the violin plots generates bar plots that show their abundances (or relative abundances) in the raw abundance table.

Continuous

Integrates the samples over topics p(t|s) and the topics over taxa p(k|t) distributions from the STM, binary and continuous covariate effects from the p(s|k) component, and their relationship with the raw taxonomic abundances. The covariate effects for each topic are shown as a scatterplot of posterior weights with error bars corresponding the global approximation of uncertainty. If the covariate chosen is binary, then the posterior regression weights with uncertainty intervals are shown. This is analogous to the mean difference between factor levels in the posterior predictive distribution. For continuous covariates, the points again represent the mean regression weights (i.e., the posterior slope estimate of the covariate). If, however, a spline or polynomial expansion was used, then the figure shows the posterior estimates of the standard deviation of the predicted topic probabilities from the posterior predictive distribution.

Selecting a topic estimate generates three panels. The top panel shows the posterior estimates of the selected continuous covariate. If binary covariates were present in the model formula, then the continuous effect given the binary covariate is shown as two regression lines, along with their corresponding uncertainty intervals. The points show the true p(k|s) values determined by the STM as a function of the selected continuous covariate. The middle panel then shows the raw abundances (or relative abundances) of most relavent taxa. Relavence can be control by adjusting λ where

r = λ x log p(t|k) + λ x log p(t|k)/p(x)

If binary covariates were provided in the model formula, selected split will split the regressions based on the selected covariate. Each figure overlays a linear best fit (red) and loess fit (red) to facilitate interpretation. The bottom panel shows these taxa combined.

Functions

Integrates the taxa over topics p(t|k) and gene functions over topics p(g|k) distributions, along with and the covariate effects from the p(s|k) component. The covariate effects for each topic are shown as a scatterplot of posterior weights with error bars corresponding the global approximation of uncertainty. If the covariate chosen is binary, then the posterior regression weights with uncertainty intervals are shown. This is analogous to the mean difference between factor levels in the posterior predictive distribution. For continuous covariates, the points again represent the mean regression weights (i.e., the posterior slope estimate of the covariate). If, however, a spline or polynomial expansion was used, then the figure shows the posterior estimates of the standard deviation of the predicted topic probabilities from the posterior predictive distribution. Colors indicate whether a given point was positive (red) or negative (blue) and did not enclose 0 at a user defined uncertainty interval.

The upper heatmap shows p(t|k), clustered via Wards method on a user chosen distance metric. Topics are ranked to right based on the weights from the aforementioned scatterplot. The lower heatmap shows the weights for the pathway-topic interaction from the multilevel Bayesian model. Positive and negative weight estimates that do not enclose zero at a chosen uncertainty level are marked with red and blue crosses, respectively. The pathway ordering is done via Wards method on Euclidean distance. Upon selected a cell within the pathway-topic heatmap, a table of genes is returned, ranking the genes in terms of abundance that belong to a given pathway-topic combination.

Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B., & Rand, D.G. (2014). Structural topic models for open-ended survey responses. Am. J. Pol. Sci. 58, 1064–1082.

Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proc. Work. Interact. Lang. Learn. Vis. Interfaces.

Zhao, T., & Liu., H. (2012) The huge Package for High-dimensional Undirected Graph Estimation in R. Journal of Machine Learning Research.

igraph_to_networkD3, huge, topicCorr, Rtsne

formula <- ~DIAGNOSIS
refs <- 'Not IBD'

dat <- prepare_data(otu_table=GEVERS$OTU,rows_are_taxa=FALSE,tax_table=GEVERS$TAX,
                    metadata=GEVERS$META,formula=formula,refs=refs,
                    cn_normalize=TRUE,drop=TRUE)

## Not run: 
vis(topic_effects,type='taxa')
vis(topic_effects,type='binary')

## End(Not run)

formula <- ~PCDAI

dat <- prepare_data(otu_table=GEVERS$OTU,rows_are_taxa=FALSE,tax_table=GEVERS$TAX,
                    metadata=GEVERS$META,formula=formula,refs=refs,
                    cn_normalize=TRUE,drop=TRUE)

## Not run: 
vis(topic_effects,type='continuous')

functions <- predict(topics,reference_path='/references/ko_13_5_precalculated.tab.gz')

function_effects <- est(functions,level=3,
                        iters=500,method='hmc',
                        prior=c('laplace','t','laplace'))

vis(function_effects,topic_effects)

## End(Not run)