UniFrac!
Test stability of linear models over reduced space
Synthetic data - continuous mixture of three different populations - simplex. We know the exact reduced dimensions (at least for cmdscale)
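A minimal sketch of such a synthetic dataset, assuming the three populations are fixed taxon-probability vectors and each observation mixes them with weights drawn uniformly from the simplex. The names `make_synthetic`, `n_taxa` and `depth` are hypothetical, and the Gamma shape used for the source populations is an arbitrary choice, not something fixed by these notes. The returned weights are the known latent coordinates that a recovered ordination could be compared against.

```python
import random


def simplex_weights(rng):
    """Draw mixture weights uniformly from the 2-simplex, i.e. Dirichlet(1, 1, 1)."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(3)]
    total = sum(g)
    return [x / total for x in g]


def make_synthetic(n_obs, n_taxa, depth, seed=0):
    """Return (weights, counts): latent mixture weights and one row of reads per observation."""
    rng = random.Random(seed)
    # Three hypothetical source populations, each a probability vector over taxa.
    sources = []
    for _ in range(3):
        g = [rng.gammavariate(0.5, 1.0) for _ in range(n_taxa)]
        s = sum(g)
        sources.append([x / s for x in g])
    weights, counts = [], []
    for _ in range(n_obs):
        w = simplex_weights(rng)
        # Taxon probabilities of this observation: convex combination of the sources.
        p = [sum(w[k] * sources[k][j] for k in range(3)) for j in range(n_taxa)]
        row = [0] * n_taxa
        for taxon in rng.choices(range(n_taxa), weights=p, k=depth):
            row[taxon] += 1
        weights.append(w)
        counts.append(row)
    return weights, counts
```

Because the weights live on a triangle, the true latent space is two-dimensional regardless of `n_taxa`.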
Language - avoid "sample" whenever possible - "bootstrap" - the process, "draw" (from bootstrap), "observation" (a row in the data), "read" for a single element of an observation/draw, "
Show how all forms of consistency behave as I increase the dimension of the latent space (they should quickly grow to 1)
Parallel coordinates plots to show more dimensions
Find a threshold (number of samples kept) where the original MDS structure ceases to be recognizable
How to evaluate this??? Maybe residual error after procrustes? But what about qualitative structure?
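One possible way to compute the Procrustes residual for two-dimensional ordinations, as a sketch only. It assumes both configurations list the same observations in the same order, allows translation, scaling, rotation and reflection, and returns a disparity between 0 (identical shape) and 1. The function name `procrustes_residual_2d` is hypothetical. A single number like this cannot answer the question about qualitative structure; it only measures overall misfit.

```python
import math


def _center_and_scale(points):
    """Translate to the centroid and scale to unit Frobenius norm."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(p[0] - mx, p[1] - my) for p in points]
    norm = math.sqrt(sum(x * x + y * y for x, y in centered))
    if norm == 0.0:
        raise ValueError("configuration has no spread")
    return [(x / norm, y / norm) for x, y in centered]


def procrustes_residual_2d(reference, other):
    """Residual sum of squares after optimally aligning `other` to `reference`."""
    if len(reference) != len(other):
        raise ValueError("configurations must contain the same observations")
    X = _center_and_scale(reference)
    Y = _center_and_scale(other)
    best = 0.0
    # Try pure rotations (flip = 1) and rotations of the mirrored configuration (flip = -1).
    for flip in (1.0, -1.0):
        a = sum(x1 * y1 + x2 * flip * y2 for (x1, x2), (y1, y2) in zip(X, Y))
        b = sum(x2 * y1 - x1 * flip * y2 for (x1, x2), (y1, y2) in zip(X, Y))
        # The best rotation angle yields an alignment score of sqrt(a^2 + b^2).
        best = max(best, math.hypot(a, b))
    # With optimal scaling, the residual of unit-norm configurations is 1 - score^2.
    return max(0.0, 1.0 - best * best)
```

Tracking this residual as fewer samples are kept would give a curve from which a threshold could be read, under the assumption that the residual tracks recognizability at all.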
Use Enterotype example data from phyloseq - they are known to have had issues
Manuscript outline:
Name: metagenboot: Visual Diagnostics for Metagenomic Datasets via Dirichlet-Multinomial Bootstrapping
jackknife beta diversity (in QIIME jackknifed_beta_diversity.py) - works by simply stacking the PCoA plots over one another (maybe works because the jackknife = removing one sample at a time so the changes are unlikely to be large) + confidence ellipses
Dunthorn 17: We see that the two clusters for Panama are probably not real
Confidence ellipses (as used in QIIME) are likely misleading, as the model-based resamples (or even rarefied samples) look nothing like an ellipse.
Why not do a big MDS over all samples? This has two problems. The first is practical: NMDS (at least the vegan implementation) tends to have convergence issues for large datasets and takes looooong to compute. The second is theoretical: I've just artificially added similar points to the dataset. The more samples I take, the more the optimization function of NMDS will be rewarded for keeping the bootstrap samples close to the original, even when this means representing the distances between the original points poorly. And this is actually what happens, at least to an extent - for example, when using the DESeq2 model to resample Tijana's data, NMDS run on the big matrix resulted in the original observations neatly grouped by the DESeq2 predictors, making the plot hugely different from what you see when you run NMDS on the originals alone.
Falsifying approach: Reliable clustering survives high-noise resampling. Homogeneity must be visible after low-noise resampling.
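A rough sketch of how a single Dirichlet-multinomial draw with a noise knob might look. The parameterization is an assumption, not necessarily what the package does: the Dirichlet parameters are the smoothed observed proportions times a hypothetical `precision`, so a small `precision` gives high-noise resampling and a large one gives low-noise resampling. The `pseudocount` and the function name `dm_draw` are likewise illustrative.

```python
import random


def dm_draw(observation, precision, rng, pseudocount=0.5):
    """One Dirichlet-multinomial draw with the same number of reads as `observation`."""
    depth = sum(observation)
    n_taxa = len(observation)
    total = depth + pseudocount * n_taxa
    # Dirichlet parameters: smoothed proportions scaled by `precision` (assumed form).
    alphas = [precision * (c + pseudocount) / total for c in observation]
    norm = 0.0
    while norm == 0.0:  # redraw in the rare case that every Gamma variate underflows
        gammas = [rng.gammavariate(a, 1.0) for a in alphas]
        norm = sum(gammas)
    probs = [g / norm for g in gammas]
    # Multinomial step: assign each read to a taxon according to the drawn probabilities.
    draw = [0] * n_taxa
    for taxon in rng.choices(range(n_taxa), weights=probs, k=depth):
        draw[taxon] += 1
    return draw
```

Under this sketch, the falsifying approach would amount to checking that clusters persist across draws at a low `precision` and that a homogeneous group still looks homogeneous at a high `precision`.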