Model.Microbiome is a novel platform for testing the performance and assumptions underlying ecological inference of microbiomes. Whereas previous benchmarking and modeling efforts have used case-control frameworks and/or randomized real data, none have tested the accuracy of ecological inference following a sequence count normalization procedure. Previous efforts have focused on the ability to accurately detect a change in relative abundance (i.e. differential abundance analysis). But these tests do not reflect how well ecological relationships are preserved through the normalization process (e.g. relationships between taxa and their environment, or among taxa). This software does not attempt to address bias introduced by sequencing, DNA purification, or field sampling efforts. It is only meant as a tool to better understand the effect that any post-hoc sequence count normalization has on our ability to infer underlying ecological relationships.
To further clarify the limits of this software: it is not intended to be applied to metagenomics analyses derived from shotgun sequence data. There are database and sequencing biases particular to whole genome sequencing that must be addressed separately and that may confound the inferences made with this software. THIS SOFTWARE IS NOT DESIGNED TO ADDRESS SHOTGUN, ION TORRENT, OR SIMILAR SEQUENCING. This software is designed specifically for amplicon (i.e. metabarcoding) studies of a single gene locus used to infer population structure. For more detailed information about how the indexes are constructed, refer to (DOI: ... )
This software was written under R version 3.6.1. R version 4 has since been released with major updates to some of the core mechanics. Model.MicrobiomeV2 will be updated for R 4.0 soon.
This tutorial will address the primary functionalities of Model.Microbiome, and provide an example of how it can be used in the course of an experimental workflow.
First, let's install Model.Microbiome:
# update anything from my local computer (for me as I work, not you...) ####
# setwd("/Users/maullabmacbookpro/Documents/GitHub/Model.Microbiome")
# library(devtools)
# library(roxygen2)
# document()

# now install ####
devtools::install_github("djeppschmidt/Model.Microbiome")
Now load all the other packages we need (or download and load if you haven't already):
# load required packages ####
library(Model.Microbiome)
library(reshape2)
library(ggplot2)
library(vegan)
library(dplyr)
library(plyr)
library(phyloseq)
library(viridis)
library(ranacapa)
library(edgeR)
library(limma)
library(GLDEX)
library(stats)
library(igraph)
library(car)
library(DESeq2)
library(kableExtra)
The core functionality of Model.Microbiome is to create a model microbial community, then simulate sequencing by subsampling from it. The simulated count data can then be normalized using any normalization method. This is handled by the run.analysis2() function. I recommend running this function wrapped in suppressWarnings(). It is normal for the linear model functions that Model.Microbiome calls to give warnings whenever a fit is perfect; these warnings stack up across the simulation and become overwhelming.
Building the model community works sequentially:

1. An environment is created, consisting of 5 environmental parameters, 30 sites, and 6 statistically distinct "experimental" conditions (a conceptual sketch follows this list).
  a. Environmental parameters 1-3 change in mean and variance according to the experimental condition.
  b. Environmental parameters 4 and 5 are unrelated to the experimental design, but do affect taxon abundance.
  c. The sampling environment consists of 30 samples, divided into 6 experimental conditions/treatments (i.e. 5 replicates per condition).
  d. The environmental parameters have the same mean and variance within each condition/treatment every time the software is run; however, the values of any given run are randomly drawn from those distributions. Therefore no two runs will have exactly the same environment, but they will represent exactly the same experiment.
NOTE: SEE SECTION DETAILS FOR CAVEATS ABOUT THIS SELECTION PROCESS!!
NOTE: While Model.Microbiome does construct networks and taxon associations in downstream analyses, it does not model those relationships explicitly; no species function refers explicitly to another species.
2. The community is sampled.
  a. A subsampling routine creates the "sequence output" table.
3. Sequence counts are normalized.
  a. Model.Microbiome applies user-supplied normalization functions to the subsampled data table, and generates metrics to aid with interpretation. It can accommodate any number of arbitrary methods.
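To make step 1 concrete, here is a minimal conceptual sketch of drawing condition-specific environmental parameters. This is illustrative only, not the package's internal routine, and the means and standard deviations are hypothetical placeholders:

# hypothetical sketch of condition-specific environment generation
conditions <- rep(1:6, each = 5)         # 6 conditions x 5 replicates = 30 sites
cond.mean  <- c(1, 2, 3, 4, 5, 6)        # placeholder per-condition means for parameter 1
cond.sd    <- c(0.5, 0.5, 1, 1, 2, 2)    # placeholder per-condition SDs for parameter 1
F1 <- rnorm(30, mean = cond.mean[conditions], sd = cond.sd[conditions]) # design-linked parameter
F4 <- rnorm(30, mean = 3, sd = 1)        # unrelated to the design, but still affects abundance

Each run re-draws the values while the per-condition means and SDs stay fixed, which is why every run is a new environment but the same experiment.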
knitr::include_graphics("/Users/maullabmacbookpro/Documents/GitHub/Model.Microbiome/Model_Process.jpg")
So let's look at the function. It requires several inputs:
commonN: number of taxa that have a global distribution
groupN: number of taxa that are group-specific
singleN: number of taxa that are sample-specific
NOTE: there is a pool of 700 species functions that the software selects from. For each of these parameters it uses a random selection routine, with resampling among the groups. This means that:
THERE IS A CHANCE THAT A TAXON MAY END UP REPRESENTING SEVERAL GROUPS, OR SAMPLES.
D: mean sampling depth (i.e. simulated sequencing depth)
V: variance of the sampling depth
method: names of the functions to be used for count normalization (see below for details)
Let's look at an example using a normalization method included with Model.Microbiome. This implementation creates a single output object that contains the phyloseq objects created along the way, and a number of different statistics. In order to get it to work, we need at least one method function:
method<-c("QSeq") model<-suppressWarnings(run.analysis2(commonN=30, groupN=20, singleN=5, D=500, V=250, method))
Note that rarefaction curves are generated automatically for the reference condition in each simulation run. This helps us judge whether our subsampling level is undersampling in general. It also allows us to examine the diversity of each sample; we can see that there are distinctions in diversity among the experimental conditions.
The QSeq method is a native Model.Microbiome method, but let's look at how it is constructed, as a framework for how users can create their own functions that work with Model.Microbiome:
# these functions are already provided as part of Model.Microbiome
# ps is a phyloseq object for input
# val is the mean value around which all taxa will be scaled
# scale is a vector of total counts of the reference samples used for normalization
make.scaled2<-function(ps, val, scale){
  scaled<-data.frame(mapply(`*`,
    data.frame(as.matrix(otu_table(transform_sample_counts(ps, function(x) x/sum(x))))),
    scale * val))
  names<-rownames(data.frame(as.matrix(otu_table(ps))))
  rownames(scaled)<-names
  scaled<-round(scaled)
  p2<-ps
  otu_table(p2)<-otu_table(scaled, taxa_are_rows=T)
  p2
}

# wrapper function to apply the normalization function in Model.Microbiome
QSeq<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=2*mean(sample_sums(ps)), scale)
  out
}
Notice that this is actually one function nested within another. The wrapper helps the function play nicely with Model.Microbiome. It's easy to replicate wrappers that provide the underlying function with different inputs if we want to test different potential conditions. The wrapper takes the un-normalized count table as input, extracts the information needed to implement the normalization function, then applies the normalization function. The normalization function should return a phyloseq object. For example, our QSeq function is meant to replicate the normalization technique called quantitative sequencing. This technique combines quantitative data from qPCR with sequence count distributions to normalize the sequence counts in each sample by the relative total sequence counts of that sample compared to other samples. To do this, though, we need to decide what the mean value for the sample-wise abundance normalization should be. We could make several versions of the QSeq function to test what the optimal mean sample-wise total abundance is relative to the sampling depth. To do this we change the val parameter as follows:
QSeq0.5<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=0.5*mean(sample_sums(ps)), scale)
  out
}
QSeq1<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=mean(sample_sums(ps)), scale)
  out
}
QSeq2<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=2*mean(sample_sums(ps)), scale)
  out
}
QSeq3<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=3*mean(sample_sums(ps)), scale)
  out
}
QSeq10<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=10*mean(sample_sums(ps)), scale)
  out
}
QSeq100<-function(ps){
  scale<-sample_data(ps)$DensityF
  out<-make.scaled2(ps, val=100*mean(sample_sums(ps)), scale)
  out
}
In this way we end up with a number of different functions that can be called independently, but that work on the same underlying mechanism. This is important if we have a method that we want to optimize in some way.
If we want to use all of our methods in our simulation study, then we simply do:
method<-c("QSeq0.5", "QSeq1", "QSeq2", "QSeq3", "QSeq10", "QSeq100")
Then we implement it in the analysis pipeline:
model<-suppressWarnings(run.analysis2(commonN=30, groupN=20, singleN=5, D=500, V=250, method))
We can call the phyloseq object from any one of the normalizations (as well as the reference data and the raw sequence counts):
model$model$comm
model$raw$comm
model$QSeq0.5$comm
model$QSeq1$comm
model$QSeq2$comm
model$QSeq3$comm
model$QSeq10$comm
model$QSeq100$comm
Model.Microbiome also automatically calculates several metrics. For example, it automatically calculates an experiment-wide PERMANOVA for each method and each factor. So, for example, if we look at the reference, we have a separate PERMANOVA for the experimental category:
model$model$PERMANOVA$Category
And each of the environmental factors:
model$model$PERMANOVA$F1
model$model$PERMANOVA$F2
model$model$PERMANOVA$F3
model$model$PERMANOVA$F4
model$model$PERMANOVA$F5
We can compare any of these across treatments:
model$model$PERMANOVA$F1    # reference condition!
model$raw$PERMANOVA$F1      # not normalized
model$QSeq0.5$PERMANOVA$F1  # lowest mean value
model$QSeq1$PERMANOVA$F1    #...
model$QSeq2$PERMANOVA$F1    #...
model$QSeq3$PERMANOVA$F1    #...
model$QSeq10$PERMANOVA$F1
model$QSeq100$PERMANOVA$F1  # highest mean value
You'll notice that the QSeq PERMANOVA tables are very similar to the reference PERMANOVA table, and that the un-normalized table is somewhat different. This is great! It tells us that we are recapturing some of the information lost to subsampling.
Model.Microbiome uses a few metrics to understand bias introduced through the normalization process. All the metrics in Model.Microbiome consist of the output from an analysis conducted on a normalized dataset (treatment) divided by the same analysis conducted on the reference community. For example, if we are interested in the relationship between a taxon and the environment, we could construct a linear model where the abundance of taxon x depends on environment A. The R-squared of this model would represent the amount of information captured by the model. If we applied this model to both the reference community and the normalized community, then we would have a measure of the "actual" relationship and the inferred relationship after normalization, respectively. The ratio of these values gives us a comparable value where 1 represents a perfect match; a value greater than 1 indicates that the normalization method leads to overestimating the explanatory power of the environmental factor on the abundance of the taxon; and a value less than 1 indicates that the normalization underestimates the explanatory power of the environmental parameter. All of the indices are based on this interpretation principle.
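As a toy illustration of this ratio (the vectors below are hypothetical, not package internals):

# toy example of the interpretation principle above
envA   <- 1:10                           # hypothetical environmental gradient
x.ref  <- 2 * envA + rnorm(10)           # reference abundances of taxon x
x.norm <- x.ref + rnorm(10, sd = 3)      # abundances after subsampling + normalization
r2.ref  <- summary(lm(x.ref ~ envA))$r.squared
r2.norm <- summary(lm(x.norm ~ envA))$r.squared
r2.norm / r2.ref   # ~1 = preserved; <1 = underestimated; >1 = overestimated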
Here is a brief description of each index:
The Loss of Information Index (LII) is based on a linear model approach, but instead of regressing the taxa against the environment and comparing the outputs, it regresses the treatment (normalized data) against the reference (pre-sampling). Thus the R-squared value of the model is a direct measure of how well reference abundance predicts normalized abundance. The LII output has several components:
$Index : the index value, which is the sum over all taxa of (1 - the linear regression R-squared) divided by the total number of taxa. This is a single value for the whole run. Lower scores equate to less overall information lost. The LII can be interpreted as "mean information lost per taxon".
$R : a named list of the per-taxon R-squared values underlying the Index, one per taxon in the run.
$diff : a list of the (1 - R-squared) values that enter the Index, one for each taxon. This is useful for identifying which taxa deviate substantially, which helps with troubleshooting the normalization method and identifying why any patterns might exist.
All of these can be accessed from the output object (see output object map).
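The following is a hedged sketch of the LII logic described above, using hypothetical matrices (the package computes all of this for you):

# sketch of the LII computation: one regression per taxon,
# normalized (treatment) abundances against reference abundances
ref <- matrix(rpois(50 * 30, lambda = 20), nrow = 50)        # hypothetical reference, taxa x sites
trt <- ref + matrix(rpois(50 * 30, lambda = 2), nrow = 50)   # hypothetical normalized counts
R   <- sapply(seq_len(nrow(ref)), function(i) summary(lm(trt[i, ] ~ ref[i, ]))$r.squared)
diff  <- 1 - R                       # per-taxon information lost
Index <- sum(diff) / length(diff)    # mean information lost per taxon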
The second set of metrics consist of linear regressions of the taxa (or whole community) against environmental and experimental design predictors. This test allows us to diagnose the effect that our normalization protocol has on our ability to infer relationships between the taxa and the environment / experimental condition. The primary outputs of this process are then taken as a ratio to the reference for interpretation.
$lmRatiotab : a list of taxa and their R-squared values for taxon-vs-environment regressions, taken as a ratio to the reference

$lmRatiotab.model : a list of taxa and their R-squared values for taxon-vs-experimental-design regressions, taken as a ratio to the reference

lmRatiotab or lmRatiotab.model values close to 1 represent near-perfect retention of the ecological relationship. Values that approach 0 or exceed 1.5 represent substantial loss of accuracy. There should be no negative values.
There are also metrics that reflect beta diversity. The R-squared values from PERMANOVAs of the experimental conditions and environmental parameters against a Bray-Curtis transformation of the count table are used to calculate an R-ratio between the reference community and the treatment communities. Also, modularity is calculated based on a correlation matrix using a number of different construction methods. These methods will be addressed in more detail later.
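As a preview, here is one plausible construction of such a modularity score. This is an assumption for illustration, not necessarily the exact construction Model.Microbiome uses:

# hypothetical modularity from a thresholded Pearson correlation matrix
counts <- matrix(rpois(20 * 30, lambda = 10), nrow = 20)   # hypothetical taxa x sites
cormat <- cor(t(counts), method = "pearson")               # taxon-taxon correlations
adj    <- (abs(cormat) > 0.6) * 1                          # keep only strong edges
diag(adj) <- 0
g <- igraph::graph_from_adjacency_matrix(adj, mode = "undirected")
igraph::modularity(igraph::cluster_fast_greedy(g))         # community-structure score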
knitr::include_graphics("/Users/maullabmacbookpro/Documents/GitHub/Model.Microbiome/Output_Map.jpg")
The speciesMeta object stores a few taxon-specific metrics that help us interpret the outcomes:
$prevalence is a taxon-wise measure of how many sites the taxon actually occurs in (in the reference community).
$mean_abundance is a taxon-wise measure of the mean abundance of each taxon across all sites (in the reference community)
$sd_abundance is a taxon-wise measure of the standard deviation in abundance of each taxon across all sites (again, reference community).
$M.Eval is the mean of the relative abundance of the taxon at each site divided by the sampling depth of that site. This is meant to represent an unweighted estimate of the likelihood that a taxon will be sampled at any site, given that it exists in that site. Lower values mean the taxon is generally less abundant, and less likely to be sampled at any of the sites.
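Reading that definition literally, M.Eval could be reconstructed as follows. This is an illustrative reconstruction on a hypothetical count table; the package's internal computation may differ:

# illustrative reconstruction of M.Eval from a reference count table
ref    <- matrix(rpois(50 * 30, lambda = 20), nrow = 50)   # hypothetical taxa x sites
depth  <- colSums(ref)                                     # sampling depth per site
relab  <- sweep(ref, 2, depth, "/")                        # relative abundance per site
M.Eval <- rowMeans(sweep(relab, 2, depth, "/"))            # mean of (rel. abundance / depth)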
These taxon-specific data can be useful for exploring patterns when it's time to interpret an experimental run. Let's look at some example plots that demonstrate the utility of these metadata:
First, let's examine the relationship between the expected likelihood of sampling a taxon and the amount of information that is lost. In this plot, the y axis (here, for the QSeq0.5 method) represents 1 - (R-squared of treatment vs. reference abundance) for each taxon in the community. Thus higher values equate to more information lost:
ggplot(model$model$SpeciesMeta, aes(x=log10(M.Eval), y=log(QSeq0.5), col=prevalence))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()
The color represents prevalence. We can see that, in this case, there are many taxa that lose no information (along the bottom of the graph); most of these are taxa that occur in only one sample. Let's color the points differently:
ggplot(model$model$SpeciesMeta, aes(x=M.Eval, y=QSeq0.5, col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=log(M.Eval), y=log(QSeq0.5), col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()
We can see that taxa with high mean abundance are much more likely to retain their information through the sampling.
ggplot(model$model$SpeciesMeta, aes(x=log10(M.Eval), y=log(QSeq0.5), col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=log10(M.Eval), y=log(QSeq1), col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=log10(M.Eval), y=log(QSeq10), col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=log10(M.Eval), y=log(QSeq100), col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()
We can see that, in particular, there is a shift in the amount of information lost in the middle M.Eval range. At high M.Eval, relatively little information is lost, and so the values are more stable. At the low end, the lost-information effects are dominated by whether or not the taxa happen to be detected at all; in the middle, taxa are regularly detected but their abundance depends heavily on the sampling depth of the sample. This is therefore the range where we regain the most information.
ggplot(model$model$SpeciesMeta, aes(x=prevalence, y=QSeq0.5, col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=prevalence, y=QSeq1, col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=prevalence, y=QSeq10, col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=prevalence, y=QSeq100, col=log(mean_abundance)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(model$model$SpeciesMeta, aes(x=log(mean_abundance), y=QSeq0.5, col=log(prevalence)))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()
We can see that taxa that have higher prevalence also tend to retain more information; the one caveat is the taxa that exist in only one site. As long as they are detected, they retain perfect information.
We can explore to see how well species abundance and prevalence explain the LII:
# how well do prevalence and mean abundance predict lost information?
summary(lm(model$model$SpeciesMeta$QSeq0.5~model$model$SpeciesMeta$prevalence*log(model$model$SpeciesMeta$mean_abundance)))
summary(lm(model$model$SpeciesMeta$QSeq100~model$model$SpeciesMeta$prevalence*log(model$model$SpeciesMeta$mean_abundance)))
Pay close attention to the R-squared values of the models. We can see that prevalence and mean abundance explain the largest amount of variance in the LII model. This doesn't necessarily mean that this model performs better: imagine that the relative patterns of all taxa are perfectly preserved; then there would be no relationship between mean abundance, prevalence, and the information lost through sampling. Conversely, high predictive power of abundance and prevalence on lost information doesn't necessarily mean that the normalization is creating more bias in the dataset, as there are likely real effects of relative abundance on the likelihood of detection in subsampling. This is to say that the statistics and the plots should be examined in the context of how the normalization procedure is known to operate, in order to determine whether it is creating bias. In this instance, QSeq0.5 decreases the R-squared of the model because some taxa are rounded down to 0, creating an anomaly that is removed at QSeq10 and QSeq100.
A note about lmRatio and lmRatio.model: as a reminder, the lmRatio and lmRatio.model values are the ratio of R-squared values of a model using normalized versus reference count tables. lmRatio regresses taxa against the environmental variables; lmRatio.model regresses taxa against the categorical experimental condition. Values of 1 indicate perfect retention of the relationship through sampling and normalization; values between 0 and 1 indicate underestimation of the strength of the relationship; values over 1 indicate overestimation. So, let's look at how mean abundance relates to the accuracy of the linear model on a taxon-by-taxon basis:
plot(log(model$model$SpeciesMeta$mean_abundance), model$QSeq0.5$lmRatiotab,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="QSeq0.5 R-ratio of Taxa vs Environment",
     xlab="log(Mean Abundance)", ylab="R-Ratio")  # xlab corrected: the x axis is mean abundance

plot(log(model$model$SpeciesMeta$mean_abundance), model$QSeq100$lmRatiotab,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="QSeq100 R-ratio of Taxa vs Environment",
     xlab="log(Mean Abundance)", ylab="R-Ratio")
Notice that QSeq0.5 has many more taxa that have completely lost the relationship between taxon and environment, and a few where the relationship is overestimated. QSeq100 has none that have lost their relationship completely. While QSeq100 has more taxa with an overestimated relationship to the environmental variable, the accuracy of the estimate is still high, with most of those taxa still being within 10% error. Here are the same plots, but for the categorical variables:
plot(log(model$model$SpeciesMeta$mean_abundance), model$QSeq0.5$lmRatiotab.model,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="QSeq0.5 R-ratio of Taxa vs Category",
     xlab="log(Mean Abundance)", ylab="R-Ratio")  # xlab corrected: the x axis is mean abundance

plot(log(model$model$SpeciesMeta$mean_abundance), model$QSeq100$lmRatiotab.model,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="QSeq100 R-ratio of Taxa vs Category",
     xlab="log(Mean Abundance)", ylab="R-Ratio")
We can clearly see from both linear model R-ratio plots that there are a number of taxa for which any relationship was lost completely. These were likely taxa that were rounded out of the community in QSeq0.5 but not QSeq100. We can also see that, overall, the general pattern of relationships is maintained across the rest of the taxa. The rate of overestimation of the relationship between taxa and explanatory variables also remains the same.
You may notice that certain taxa have an R-squared of 0; some of them improve as we increase the overall scaling factor. This happens when taxa have very low abundance in a particular sample: if their scaled count is less than 0.5, it gets rounded down to 0. So ideally we want to pick a scaling value that lets us keep all the information from sampling (i.e. such that in all samples, the minimum detected taxon has a count of at least 1). This is an important observation if we want to use this method on a real dataset, because it gives us a criterion for choosing the scaling factor.
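A small helper along these lines can check whether a candidate val would round any nonzero count down to 0. This is not part of Model.Microbiome; min.scaled.count is a hypothetical name, and it assumes taxa are stored as rows, as in make.scaled2 above:

# hypothetical helper implementing the criterion above
min.scaled.count <- function(ps, val, scale){
  relab  <- as.matrix(otu_table(transform_sample_counts(ps, function(x) x / sum(x))))
  scaled <- sweep(relab, 2, scale * val, "*")   # same scaling as make.scaled2
  min(scaled[scaled > 0])   # should be >= 0.5, or some counts will round to 0
}
min.scaled.count(model$raw$comm,
                 val   = 2 * mean(sample_sums(model$raw$comm)),
                 scale = sample_data(model$raw$comm)$DensityF)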
We can see qualitatively that our higher QSeq methods increase the accuracy of our taxon-specific models by a small margin. We can evaluate this statistically:
# label each ratio by the method it came from (labels corrected to match the data)
levenetest<-data.frame(
  "ratio"  = c(model$QSeq0.5$lmRatiotab, model$QSeq100$lmRatiotab),
  "method" = c(rep("QSeq0.5", length(model$QSeq0.5$lmRatiotab)),
               rep("QSeq100", length(model$QSeq100$lmRatiotab))))
car::leveneTest(ratio ~ method, data = levenetest)
var(model$QSeq0.5$lmRatiotab)
var(model$QSeq100$lmRatiotab)
This indicates that there is greater variation in the lmRatio values for QSeq0.5 than for QSeq100. Remember that deviation from 1 represents lost accuracy in the information being retained; we can therefore interpret higher variance as lower accuracy in the relationships being modeled after the transformation. This provides a solid, evidence-based approach for optimizing normalization methods.
It might be instructive at this point to look at the same distribution without any normalization, to see how much the normalization improves our inference about the relationship of taxa to their environment:
plot(model$model$SpeciesMeta$prevalence, model$raw$lmRatiotab,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="Not normalized vs Environment",
     xlab="Taxon Prevalence", ylab="Rsquared-Ratio")

plot(model$model$SpeciesMeta$prevalence, model$QSeq100$lmRatiotab,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="QSeq100 vs Environment",
     xlab="Taxon Prevalence", ylab="Rsquared-Ratio")

plot(model$model$SpeciesMeta$prevalence, model$raw$lmRatiotab.model,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="Not normalized vs Category",
     xlab="Taxon Prevalence", ylab="Rsquared-Ratio")

plot(model$model$SpeciesMeta$prevalence, model$QSeq100$lmRatiotab.model,
     col=gray(model$model$SpeciesMeta$M.Eval),
     main="QSeq100 vs Category",
     xlab="Taxon Prevalence", ylab="Rsquared-Ratio")
We can see that sometimes the differences are subtle, but other times normalization can remove large artifacts.
Now let's look at how relationships among taxa are affected by normalization (or the lack thereof). We can access the R-ratio for correlations among taxa. Typically the first step in a network analysis is to correlate taxa to one another. Model.Microbiome does this automatically, using a Pearson correlation. All we need to do is access this information using the getTaxCor.Tab function. This function outputs an object that contains a table of variances and median correlation ratios. These are calculated in a similar fashion to the correlation ratios of taxa to the environment:
Let a be the correlation coefficient between two taxa in the reference. Let b be the correlation coefficient between the same two taxa after a normalization treatment.

The correlation ratio is defined as b/a.
NOTE: because occasionally a = 0, resulting in an undefined ratio, we have chosen to convert all a = 0 to a = min(a/10). This removes infinite values while retaining the information that these instances mis-estimate the relationship between taxa, without punishing the mis-estimation too harshly (a single infinite value pushes the mean to infinity, even if only one exists in the dataset).
Thus, for each normalization method, we extract the variance of all ratios and the median ratio.
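As an illustration of the bookkeeping, here is a toy computation with hypothetical values, using one reading of the min(a/10) substitution (the minimum nonzero magnitude divided by 10):

# toy correlation-ratio computation with the zero-substitution above
a <- c(0, 0.2, -0.5, 0.8)     # hypothetical reference correlations for some taxon pairs
b <- c(0.1, 0.25, -0.4, 0.7)  # hypothetical post-normalization correlations
a[a == 0] <- min(abs(a[a != 0])) / 10  # avoid infinite ratios from a = 0
ratio <- b / a
median(ratio); var(ratio)     # the summaries Model.Microbiome stores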
head(model$raw$taxCor.Ratio)
A negative value indicates that the relationship among the taxa has the opposite sign in the reference compared to the treatment. Ideally all ratios should be 1 if the information is perfectly preserved. We can subset the dataset by specific taxa, and use metadata about those taxa to better understand why the relationships change. If we look at taxon 153, we see that it exists in 22 sites. We can look at how the prevalence and abundance of other taxa relate to the predicted relationships between taxa. First let's identify low-prevalence and high-prevalence taxa.
# create values for subsetting
nm<-rownames(model$model$SpeciesMeta)
nm.h<-nm[model$model$SpeciesMeta$prevalence>20]
nm.l<-nm[model$model$SpeciesMeta$prevalence==1]
# check taxa are in the right order (nm.h, to match the plots below)
table(rownames(model$model$SpeciesMeta)==model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.h[[1]],1])
# plot vs prevalence
plot(model$model$SpeciesMeta$prevalence, model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.h[[1]],3], xlab="Prevalence", ylab="taxcor.Ratio")
plot(model$model$SpeciesMeta$mean_abundance, model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.h[[1]],3], xlab="mean Abundance", ylab="taxcor.Ratio")
plot(model$model$SpeciesMeta$sd_abundance, model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.h[[1]],3], xlab="SD Abundance", ylab="taxcor.Ratio")
We can see that, in general, the largest errors come from taxa where prevalence and mean abundance are both low. However, in this example the one exception is a highly prevalent but low-abundance taxon. Let's examine how this pattern holds up for a low-prevalence taxon.
# check taxa are in the right order
table(rownames(model$model$SpeciesMeta)==model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.l[[1]],1])
# plot vs prevalence
plot(model$model$SpeciesMeta$prevalence, model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.l[[1]],3], xlab="Prevalence", ylab="taxcor.Ratio")
plot(model$model$SpeciesMeta$mean_abundance, model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.l[[1]],3], xlab="mean Abundance", ylab="taxcor.Ratio")
plot(model$model$SpeciesMeta$sd_abundance, model$QSeq0.5$taxCor.Ratio[model$QSeq0.5$taxCor.Ratio$Var2==nm.l[[1]],3], xlab="SD Abundance", ylab="taxcor.Ratio")
It is important to note that there are some negative values. The TaxCor is a Pearson correlation; the other metrics use an R-squared value, which precludes negatives. In this case there is an interpretation: the relationship between the taxa has the opposite sign in the reference than in the normalized data. If the relationship is strong enough after normalization, this would be interpreted as a potential false positive.
Model.Microbiome includes several R-ratio metrics for whole-community patterns. These are based on the PERMANOVA R-squared value. Model.Microbiome conducts a PERMANOVA of each environmental factor and the categorical factor against a Bray-Curtis transformation of the count table. These are done separately because the order in which factors are added to the model influences the model outcomes; it is easiest to get comparable results by running each factor in isolation. For example:
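For reference, a single-factor PERMANOVA of this kind can be reproduced roughly along the following lines; the package already stores these results for you, and the F1 column name in the sample metadata is an assumption here:

# hedged sketch of a single-factor PERMANOVA (done internally by Model.Microbiome)
ps   <- model$raw$comm
bray <- phyloseq::distance(ps, method = "bray")   # Bray-Curtis dissimilarity
meta <- data.frame(sample_data(ps))
vegan::adonis(bray ~ F1, data = meta)             # one factor at a time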
model$raw$PERMANOVA$F1
The R squared part of the table is taken as a ratio of the reference:
model$raw$PERMANOVA$F1Rratio
We can examine how well each of the environmental and categorical factors are maintained:
model$raw$PERMANOVA$CategoryRratio
model$raw$PERMANOVA$F2Rratio
model$raw$PERMANOVA$F3Rratio
model$raw$PERMANOVA$F4Rratio
model$raw$PERMANOVA$F5Rratio
And of course, we can plot an ordination:
bray_NMDS = ordinate(model$QSeq10$comm, "NMDS", "bray")
plot_ordination(model$QSeq10$comm, bray_NMDS, "samples", color="Factor")
We can compare the ordinations across datasets:
bray_NMDS.ref = ordinate(model$model$comm, "NMDS", "bray")
plot_ordination(model$model$comm, bray_NMDS.ref, "samples", color="Factor")

bray_NMDS.raw = ordinate(model$raw$comm, "NMDS", "bray")
plot_ordination(model$raw$comm, bray_NMDS.raw, "samples", color="Factor")

bray_NMDS.Q = ordinate(model$QSeq10$comm, "NMDS", "bray")
plot_ordination(model$QSeq10$comm, bray_NMDS.Q, "samples", color="Factor")
While it looks qualitatively like the QSeq ordination might be closer to the reference, it's hard to tell. We can look at the R-squared ratio values to figure out which is more accurate. Here's a spot test:
# raw
model$raw$PERMANOVA$F4Rratio
# QSeq
model$QSeq10$PERMANOVA$F4Rratio
# raw
model$raw$PERMANOVA$CategoryRratio
# QSeq
model$QSeq10$PERMANOVA$CategoryRratio
The category variables look nearly identical, whereas the environmental variables are much better modeled by the QSeq approach. In the next section, we'll look at how to run multiple simulations to aggregate and statistically evaluate the differences between these approaches within the modeling environment.
Model.Microbiome has a wrapper that allows the user to iteratively generate random communities with the same experimental conditions. The inputs are the same as before, plus one additional input that tells the function how many times to iterate. In this case we'll do 10. We use suppressWarnings because the linear model often produces warnings when it has a perfect fit, and these can clutter the output.
model10<-suppressWarnings(BENCHMARK.MM(reps=10, commonN=30, groupN=20, singleN=5, D=500, V=250, method))
model10 is now a list object with 10 items, each one a replicate simulation. This is the basis for extracting performance metrics and doing statistical analyses to compare different normalization methods. Let's look at some analyses that are possible when we can automate community generation.
We first extract the LII indicator with the function Summarize.LII(), then we melt the dataframe and do a statistical test on it:
model10.LIIsummary<-Summarize.LII(model10, method)
model10.LIIsummary<-melt(model10.LIIsummary, value.name = "Value")
summary(aov(Value~Var2 + Error(Var1/Var2), data=model10.LIIsummary)) # blocks of method are nested within each run
TukeyHSD(aov(Value~Var2, data=model10.LIIsummary)) # ignoring the blocks because TukeyHSD can't handle them; still tells us which method differs most
plot(TukeyHSD(aov(Value~Var2, data=model10.LIIsummary))) # visualize confidence intervals
Clearly, the worst-performing method is QSeq0.5. This is likely because it effectively leads to undersampling, since its detection threshold is so high. Ideally we should choose a scaling value for this method that is sufficiently high that all sequences passing QC in our pipeline are included in the sampling, and none are rounded to 0 in any sample.
Let's look at how all these functionalities can be tied together into a comprehensive test of the strengths (or weaknesses) of a normalization procedure.
We can look at the difference in PERMANOVA outputs:
model10.PCat<-Summarize.PERMANOVA.Rratio(model10, method, "CategoryRratio")
model10.PF1<-Summarize.PERMANOVA.Rratio(model10, method, "F1Rratio")
model10.PF2<-Summarize.PERMANOVA.Rratio(model10, method, "F2Rratio")
model10.PF3<-Summarize.PERMANOVA.Rratio(model10, method, "F3Rratio")
model10.PF4<-Summarize.PERMANOVA.Rratio(model10, method, "F4Rratio")
model10.PF5<-Summarize.PERMANOVA.Rratio(model10, method, "F5Rratio")
model10.PCat
model10.PF1
model10.PF2
model10.PF3
model10.PF4
model10.PF5
NOTE: In general, there are many outputs a researcher may want to generate and summarize that we have not anticipated. Here are some functions for extracting the R-ratio from the linear models; they provide a template for an extraction approach that works well with the output object. This should give researchers a means to extract and summarize any metric in any way they want. We will use both of these functions later in the analysis.
# get lmRatio ####
Summarize.lmRatiotab.Mean<-function(trt, method){
  Ftab<-matrix(NA, nrow = length(trt), ncol = length(method)) # make matrix
  for(i in 1:length(trt)){
    for(j in 1:length(method)){
      Ftab[i,j]<-median(trt[[i]][[method[j]]]$lmRatiotab, na.rm=T)
      #print(sum(trt[[i]][j]$PERMANOVA$aov.tab$F.Model))
    }
  }
  rownames(Ftab)<-names(trt)
  colnames(Ftab)<-method
  Ftab
}

Summarize.lmRatiotab.Var<-function(trt, method){
  Ftab<-matrix(NA, nrow = length(trt), ncol = length(method)) # make matrix
  for(i in 1:length(trt)){
    for(j in 1:length(method)){
      Ftab[i,j]<-var(trt[[i]][[method[j]]]$lmRatiotab, na.rm=T)
      #print(sum(trt[[i]][j]$PERMANOVA$aov.tab$F.Model))
    }
  }
  rownames(Ftab)<-names(trt)
  colnames(Ftab)<-method
  Ftab
}
The following examples come from an actual experiment run to identify biases associated with the QSeq protocol, and to optimize QSeq for a number of different interpretive lenses.
Let's say that we want to understand how the effect size of geographic structure within a community interacts with the performance of our normalization technique (or, in the case of an experimental design, what the direct assembly effects of the treatment are). Model.Microbiome gives us the power to vary the geographic structure/direct treatment effects on the community by determining the proportion of each sample that is explicitly determined by the grouping function:
# biogeography begin ####
model10.A<-suppressWarnings(BENCHMARK.MM(reps=10, commonN=53, groupN=1,  singleN=1,  D=500, V=250, method)) # essentially no filtration by group
model10.B<-suppressWarnings(BENCHMARK.MM(reps=10, commonN=35, groupN=15, singleN=5,  D=500, V=250, method)) # ~27% filtration by the group
model10.C<-suppressWarnings(BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=250, method)) # ~36% filtration by the group
model10.D<-suppressWarnings(BENCHMARK.MM(reps=10, commonN=10, groupN=30, singleN=15, D=500, V=250, method)) # ~55% filtration by the group
First, let's look at how the structure of the data might interact with our normalization strategy to influence how well information is retained. For this we will extract the LII value. As a reminder, LII correlates each taxon in the normalized dataset against its abundance in the reference dataset. The index itself is a sum of (1 - R-squared); this means that low LII values indicate that the R-squared of the taxa against their references was close to 1 (i.e. little information was lost). Conversely, higher values indicate that a lot of information was lost.
# get LII Biogeography ####
method2<-c("raw", method) # add raw data to the methods list so it is extracted too

# extract the LII from each level of sparsity ####
SummaryLII.model10.A<-as.data.frame(Summarize.LII(model10.A, method2))
SummaryLII.model10.B<-as.data.frame(Summarize.LII(model10.B, method2))
SummaryLII.model10.C<-as.data.frame(Summarize.LII(model10.C, method2))
SummaryLII.model10.D<-as.data.frame(Summarize.LII(model10.D, method2))

# prepare data for merging datasets
SummaryLII.model10.A$Level<-c(rep(1,10))
SummaryLII.model10.A<-melt(SummaryLII.model10.A, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.B$Level<-c(rep(2,10))
SummaryLII.model10.B<-melt(SummaryLII.model10.B, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C$Level<-c(rep(3,10))
SummaryLII.model10.C<-melt(SummaryLII.model10.C, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.D$Level<-c(rep(4,10))
SummaryLII.model10.D<-melt(SummaryLII.model10.D, id.vars = "Level", measure.vars = method2)

# merge datasets
model10Levels<-do.call("rbind", list(SummaryLII.model10.A, SummaryLII.model10.B, SummaryLII.model10.C, SummaryLII.model10.D))

# summarize for plotting
model10Levels.summary2<-data_summary2(model10Levels, varname="value", groupnames=c("Level", "variable"))
model10Levels.summary1<-data_summary(model10Levels, varname="value", groupnames=c("Level", "variable"))

# plot
ggplot(model10Levels.summary2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("LII Value (Median +/- min/max)")+
  theme_classic()

ggplot(model10Levels.summary1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("LII Value (Mean +/- standard dev)")+
  theme_classic()
Here we can see clearly that the degree of structure in the dataset interacts with the normalization method to affect its accuracy. In general, less information is lost with the QSeq methods, with the clear exception that QSeq0.5 performs worse than not normalizing under certain conditions. We can confirm this with a statistical test:
Sparsity.LII<-summary(aov(model10Levels$value~model10Levels$Level*model10Levels$variable))
Sparsity.LII
TukeyHSD(aov(model10Levels$value~model10Levels$Level*model10Levels$variable))
Conclusion: we can see clearly that the degree of structure in the dataset interacts with the normalization method to affect how much information is retained on a taxon-by-taxon basis. In all cases the QSeq normalization method outperforms un-normalized data at low structure, but at high levels of structure only the QSeq2 and QSeq3 parameterizations continue to consistently outperform the raw data. Remember that LII reports the degradation of R-squared, and so is a direct measure of lost information, but it does not tell us how well we can detect environmental drivers of taxon abundance.
In this case interpretation is a little more challenging, because we are less interested in the mean value, and more interested in the variance of the values. This is because our metric is a ratio of the R-squared (variance explained) of each taxon by the environment to the same measure for the taxon in the reference dataset. This tells us the relative accuracy with which the normalized dataset predicts the relationship of each taxon to the environment. Ideally the mean value should be 1 (a perfect fit). But it is possible for the mean to be 1 while the deviation is substantial, if the error is balanced in the positive and negative directions. In other words, if the errors in the positive and negative directions are equal in size, the mean will still be 1 even when those errors are large. Thus we might have a perfect fit of the mean, yet no taxon that is accurately modeled. To account for this, we put larger emphasis on the variance, using a Levene's test of equality of variance; but because variances may not be equally balanced in the positive and negative directions, we are also still interested in changes in the average value. Both are necessary for accurate interpretation. We find, especially for this analysis, that outliers unduly affect the mean, so we prefer the median for understanding the most common effect.
Summarize.lmRatiotab.Var(model10.A, method2)

SM.lmRatio.model10.A<-as.data.frame(Summarize.lmRatiotab.Mean(model10.A, method2))
SV.lmRatio.model10.A<-as.data.frame(Summarize.lmRatiotab.Var(model10.A, method2))
SM.lmRatio.model10.B<-as.data.frame(Summarize.lmRatiotab.Mean(model10.B, method2))
SV.lmRatio.model10.B<-as.data.frame(Summarize.lmRatiotab.Var(model10.B, method2))
SM.lmRatio.model10.C<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C, method2))
SV.lmRatio.model10.C<-as.data.frame(Summarize.lmRatiotab.Var(model10.C, method2))
SM.lmRatio.model10.D<-as.data.frame(Summarize.lmRatiotab.Mean(model10.D, method2))
SV.lmRatio.model10.D<-as.data.frame(Summarize.lmRatiotab.Var(model10.D, method2))

# prepare data for merging datasets
SM.lmRatio.model10.A$Level<-c(rep(1,10))
SV.lmRatio.model10.A$Level<-c(rep(1,10))
SM.lmRatio.model10.B$Level<-c(rep(2,10))
SV.lmRatio.model10.B$Level<-c(rep(2,10))
SM.lmRatio.model10.C$Level<-c(rep(3,10))
SV.lmRatio.model10.C$Level<-c(rep(3,10))
SM.lmRatio.model10.D$Level<-c(rep(4,10))
SV.lmRatio.model10.D$Level<-c(rep(4,10))
SM.lmRatio.model10.A<-melt(SM.lmRatio.model10.A, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.A<-melt(SV.lmRatio.model10.A, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.B<-melt(SM.lmRatio.model10.B, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.B<-melt(SV.lmRatio.model10.B, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C<-melt(SM.lmRatio.model10.C, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C<-melt(SV.lmRatio.model10.C, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.D<-melt(SM.lmRatio.model10.D, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.D<-melt(SV.lmRatio.model10.D, id.vars = "Level", measure.vars = method2)

# merge datasets
SM.lmRatio.model10Levels<-do.call("rbind", list(SM.lmRatio.model10.A, SM.lmRatio.model10.B, SM.lmRatio.model10.C, SM.lmRatio.model10.D))
SV.lmRatio.model10Levels<-do.call("rbind", list(SV.lmRatio.model10.A, SV.lmRatio.model10.B, SV.lmRatio.model10.C, SV.lmRatio.model10.D))

# summarize for plotting
SM.lmRatio.model10.p1<-data_summary(SM.lmRatio.model10Levels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10.p1<-data_summary(SV.lmRatio.model10Levels, varname="value", groupnames=c("Level", "variable"))
SM.lmRatio.model10.p2<-data_summary2(SM.lmRatio.model10Levels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10.p2<-data_summary2(SV.lmRatio.model10Levels, varname="value", groupnames=c("Level", "variable"))

# plot
ggplot(SM.lmRatio.model10.p1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Mean lmRatio (Mean +/- sd)")+
  theme_classic()

ggplot(SV.lmRatio.model10.p1, aes(x=Level, y=value, group = variable, color=variable))+
  scale_colour_viridis_d(direction=-1)+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Variance lmRatio (Mean +/- sd)")+
  theme_classic()

ggplot(SM.lmRatio.model10.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Median lmRatio (Median +/- min/max)")+
  theme_classic()

ggplot(SV.lmRatio.model10.p2, aes(x=Level, y=log(value), group = variable, color=variable))+
  scale_colour_viridis_d(direction=-1)+
  geom_errorbar(aes(ymin=log(low), ymax=log(high)), width=.1, position=position_dodge(.3)) +
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("log Variance lmRatio (Median +/- min/max)")+
  theme_classic()
... and run statistics to see the effect:
Sparsity.lmEnvM<-summary(aov(SM.lmRatio.model10Levels$value~SM.lmRatio.model10Levels$Level*SM.lmRatio.model10Levels$variable))
Sparsity.lmEnvM
TukeyHSD(aov(SM.lmRatio.model10Levels$value~SM.lmRatio.model10Levels$Level*SM.lmRatio.model10Levels$variable))

Sparsity.lmEnvV<-summary(aov(SV.lmRatio.model10Levels$value~SV.lmRatio.model10Levels$Level*SV.lmRatio.model10Levels$variable))
Sparsity.lmEnvV
TukeyHSD(aov(SV.lmRatio.model10Levels$value~SV.lmRatio.model10Levels$Level*SV.lmRatio.model10Levels$variable))
Conclusion: there is an interaction between the normalization method and the internal structure of the community in our ability to accurately describe the relationship between taxa and the environmental gradients. According to this analysis we should prefer QSeq3, QSeq10, and QSeq100.
What about our ability to detect the experimental design? Following the same methodology:
Summarize.lmRatiotabModel.Mean<-function(trt, method){
  Ftab<-matrix(NA, nrow = length(trt), ncol = length(method)) # make matrix
  for(i in 1:length(trt)){
    for(j in 1:length(method)){
      Ftab[i,j]<-median(trt[[i]][[method[j]]]$lmRatiotab.model, na.rm=T)
      #print(sum(trt[[i]][j]$PERMANOVA$aov.tab$F.Model))
    }
  }
  rownames(Ftab)<-names(trt)
  colnames(Ftab)<-method
  Ftab
}

Summarize.lmRatiotabModel.Var<-function(trt, method){
  Ftab<-matrix(NA, nrow = length(trt), ncol = length(method)) # make matrix
  for(i in 1:length(trt)){
    for(j in 1:length(method)){
      Ftab[i,j]<-var(trt[[i]][[method[j]]]$lmRatiotab.model, na.rm=T)
      #print(sum(trt[[i]][j]$PERMANOVA$aov.tab$F.Model))
    }
  }
  rownames(Ftab)<-names(trt)
  colnames(Ftab)<-method
  Ftab
}

SM.lmRatioM.model10.A<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.A, method2))
SV.lmRatioM.model10.A<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.A, method2))
SM.lmRatioM.model10.B<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.B, method2))
SV.lmRatioM.model10.B<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.B, method2))
SM.lmRatioM.model10.C<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C, method2))
SV.lmRatioM.model10.C<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C, method2))
SM.lmRatioM.model10.D<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.D, method2))
SV.lmRatioM.model10.D<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.D, method2))

# prepare data for merging datasets
SM.lmRatioM.model10.A$Level<-c(rep(1,10))
SV.lmRatioM.model10.A$Level<-c(rep(1,10))
SM.lmRatioM.model10.B$Level<-c(rep(2,10))
SV.lmRatioM.model10.B$Level<-c(rep(2,10))
SM.lmRatioM.model10.C$Level<-c(rep(3,10))
SV.lmRatioM.model10.C$Level<-c(rep(3,10))
SM.lmRatioM.model10.D$Level<-c(rep(4,10))
SV.lmRatioM.model10.D$Level<-c(rep(4,10))
SM.lmRatioM.model10.A<-melt(SM.lmRatioM.model10.A, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.A<-melt(SV.lmRatioM.model10.A, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.B<-melt(SM.lmRatioM.model10.B, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.B<-melt(SV.lmRatioM.model10.B, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.C<-melt(SM.lmRatioM.model10.C, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.C<-melt(SV.lmRatioM.model10.C, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.D<-melt(SM.lmRatioM.model10.D, id.vars = "Level", measure.vars = method2) # was method; corrected to method2 to match the others
SV.lmRatioM.model10.D<-melt(SV.lmRatioM.model10.D, id.vars = "Level", measure.vars = method2)

# merge datasets
SM.lmRatioM.model10Levels<-do.call("rbind", list(SM.lmRatioM.model10.A, SM.lmRatioM.model10.B, SM.lmRatioM.model10.C, SM.lmRatioM.model10.D))
SV.lmRatioM.model10Levels<-do.call("rbind", list(SV.lmRatioM.model10.A, SV.lmRatioM.model10.B, SV.lmRatioM.model10.C, SV.lmRatioM.model10.D))

# summarize for plotting
SM.lmRatioM.model10.p1<-data_summary(SM.lmRatioM.model10Levels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatioM.model10.p1<-data_summary(SV.lmRatioM.model10Levels, varname="value", groupnames=c("Level", "variable"))
SM.lmRatioM.model10.p2<-data_summary2(SM.lmRatioM.model10Levels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatioM.model10.p2<-data_summary2(SV.lmRatioM.model10Levels, varname="value", groupnames=c("Level", "variable"))

# plot
ggplot(SM.lmRatioM.model10.p1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d()+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Median lmRatio.model (Mean +/- sd)")+
  theme_classic()

ggplot(SV.lmRatioM.model10.p1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Variance lmRatio.model (Mean +/- sd)")+
  theme_classic()

ggplot(SM.lmRatioM.model10.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d()+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Median lmRatio.model (Median +/- min/max)")+
  theme_classic()

ggplot(SV.lmRatioM.model10.p2, aes(x=Level, y=log(value), group = variable, color=variable))+
  geom_errorbar(aes(ymin=log(low), ymax=log(high)), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Variance lmRatio.model (Median +/- min/max)")+
  theme_classic()
And again we do the stats:
Sparsity.lmModelM<-summary(aov(SM.lmRatioM.model10Levels$value~SM.lmRatioM.model10Levels$Level*SM.lmRatioM.model10Levels$variable))
Sparsity.lmModelM
TukeyHSD(aov(SM.lmRatioM.model10Levels$value~SM.lmRatioM.model10Levels$variable))

Sparsity.lmModelV<-summary(aov(SV.lmRatioM.model10Levels$value~SV.lmRatioM.model10Levels$Level*SV.lmRatioM.model10Levels$variable))
Sparsity.lmModelV
TukeyHSD(aov(SV.lmRatioM.model10Levels$value~SV.lmRatioM.model10Levels$Level*SV.lmRatioM.model10Levels$variable))
An emerging pattern here is that raw, unprocessed data generally performs worst when there is less structure and the environmental variables act more strongly as a filter. Also, within the QSeq method, higher scaling values that do not omit low-abundance taxa perform better, and still higher values perform better yet.
Now let's check the accuracy of the inferred relationships among taxa:
# outputs an object with both variance and median tables
TaxRatio.model10.A<-getTaxCor.Tab(model10.A, method2)
TaxRatio.model10.B<-getTaxCor.Tab(model10.B, method2)
TaxRatio.model10.C<-getTaxCor.Tab(model10.C, method2)
TaxRatio.model10.D<-getTaxCor.Tab(model10.D, method2)

# separate median and variance tables, and melt
V.TaxRatio.model10.A<-melt(TaxRatio.model10.A$V.tax)
M.TaxRatio.model10.A<-melt(TaxRatio.model10.A$Median.tax)
V.TaxRatio.model10.B<-melt(TaxRatio.model10.B$V.tax)
M.TaxRatio.model10.B<-melt(TaxRatio.model10.B$Median.tax)
V.TaxRatio.model10.C<-melt(TaxRatio.model10.C$V.tax)
M.TaxRatio.model10.C<-melt(TaxRatio.model10.C$Median.tax)
V.TaxRatio.model10.D<-melt(TaxRatio.model10.D$V.tax)
M.TaxRatio.model10.D<-melt(TaxRatio.model10.D$Median.tax)

# prepare data for merging datasets
V.TaxRatio.model10.A<-as.data.frame(V.TaxRatio.model10.A)
M.TaxRatio.model10.A<-as.data.frame(M.TaxRatio.model10.A)
V.TaxRatio.model10.B<-as.data.frame(V.TaxRatio.model10.B)
M.TaxRatio.model10.B<-as.data.frame(M.TaxRatio.model10.B)
V.TaxRatio.model10.C<-as.data.frame(V.TaxRatio.model10.C)
M.TaxRatio.model10.C<-as.data.frame(M.TaxRatio.model10.C)
V.TaxRatio.model10.D<-as.data.frame(V.TaxRatio.model10.D)
M.TaxRatio.model10.D<-as.data.frame(M.TaxRatio.model10.D)
V.TaxRatio.model10.A$Level<-c(rep(1,nrow(V.TaxRatio.model10.A)))
V.TaxRatio.model10.B$Level<-c(rep(2,nrow(V.TaxRatio.model10.B)))
V.TaxRatio.model10.C$Level<-c(rep(3,nrow(V.TaxRatio.model10.C)))
V.TaxRatio.model10.D$Level<-c(rep(4,nrow(V.TaxRatio.model10.D)))
M.TaxRatio.model10.A$Level<-c(rep(1,nrow(M.TaxRatio.model10.A)))
M.TaxRatio.model10.B$Level<-c(rep(2,nrow(M.TaxRatio.model10.B)))
M.TaxRatio.model10.C$Level<-c(rep(3,nrow(M.TaxRatio.model10.C)))
M.TaxRatio.model10.D$Level<-c(rep(4,nrow(M.TaxRatio.model10.D)))

# merge datasets
V.TaxRatio.model10Levels<-do.call("rbind", list(V.TaxRatio.model10.A, V.TaxRatio.model10.B, V.TaxRatio.model10.C, V.TaxRatio.model10.D))
M.TaxRatio.model10Levels<-do.call("rbind", list(M.TaxRatio.model10.A, M.TaxRatio.model10.B, M.TaxRatio.model10.C, M.TaxRatio.model10.D))

# summarize for plotting
V.TaxRatio.model10.p1<-data_summary(V.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
M.TaxRatio.model10.p1<-data_summary(M.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
V.TaxRatio.model10.p2<-data_summary2(V.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
M.TaxRatio.model10.p2<-data_summary2(M.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))

# plot
ggplot(V.TaxRatio.model10.p1, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Variance taxRatio (Mean +/- sd)")+
  theme_classic()

ggplot(M.TaxRatio.model10.p1, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Mean taxRatio (Mean +/- sd)")+
  theme_classic()

ggplot(V.TaxRatio.model10.p2, aes(x=Level, y=log10(value), group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=log10(low), ymax=log10(high)), width=.1, position=position_dodge(.3)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("log Variance taxRatio (Median +/- min/max)")+  # ylab corrected to match the log10 y axis
  theme_classic()

ggplot(M.TaxRatio.model10.p2, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Median taxRatio (Median +/- min/max)")+  # ylab corrected: this y axis is not logged
  theme_classic()
Note: in general, the results of this portion of the simulation show that QSeq0.5 underperforms compared to all other QSeq methods; that QSeq methods are generally better at estimating the mean value than un-normalized data (i.e. they show less systematic under- or overestimation); but that at lower degrees of structure the un-normalized dataset often has less variance around the mean. However, this portion of the simulation tends to be less stable, and occasionally fails to produce a significant result. Therefore I will give guidelines on how to interpret the outputs, but not discuss specific conclusions.
Generally, the variance should approach zero and the mean should approach 1. The farther the variance deviates from zero (and the wider the range of variance values), the less accurately relationships among taxa are estimated on a taxon-by-taxon basis. The farther the mean deviates from 1, the greater the systematic bias (over- or under-estimation) in the strength of the relationship. Let's check the stats:
Sparsity.TaxM<-summary(aov(M.TaxRatio.model10Levels$value~M.TaxRatio.model10Levels$Level*M.TaxRatio.model10Levels$Var2))
Sparsity.TaxM
TukeyHSD(aov(M.TaxRatio.model10Levels$value~M.TaxRatio.model10Levels$Var2))
Sparsity.TaxV<-summary(aov(V.TaxRatio.model10Levels$value~V.TaxRatio.model10Levels$Level*V.TaxRatio.model10Levels$Var2))
Sparsity.TaxV
TukeyHSD(aov(V.TaxRatio.model10Levels$value~V.TaxRatio.model10Levels$Level*V.TaxRatio.model10Levels$Var2))
#leveneTest(value~as.factor(Level), data=V.TaxRatio.model10.p)
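If you want those two interpretation criteria condensed into a single table, here is a minimal scorecard sketch (my own hypothetical helper, not a Model.Microbiome function), assuming the melted tables built above with their value, Var2, and Level columns:
# Hypothetical helper: mean absolute deviation of the per-rep median taxRatio
# from 1 (systematic bias), and mean of the per-rep variance (imprecision),
# reported per normalization method.
taxRatio.scorecard<-function(M.tab, V.tab){
  bias<-aggregate(abs(value-1)~Var2, data=M.tab, FUN=mean)
  imprecision<-aggregate(value~Var2, data=V.tab, FUN=mean)
  out<-merge(bias, imprecision, by="Var2")
  colnames(out)<-c("Method", "MeanAbsBias", "MeanVariance")
  out[order(out$MeanAbsBias),]
}
taxRatio.scorecard(M.TaxRatio.model10Levels, V.TaxRatio.model10Levels)
Methods near the top of the table have the least systematic bias; compare MeanVariance across rows for precision.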
Now let's look at a common community-level statistic. PERMANOVA outputs:
# get PERMANOVA Rratio ####
model10A.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.A, method2, "CategoryRratio"))
model10A.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.A, method2, "F1Rratio"))
model10A.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.A, method2, "F2Rratio"))
model10A.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.A, method2, "F3Rratio"))
model10A.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.A, method2, "F4Rratio"))
model10A.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.A, method2, "F5Rratio"))
model10B.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.B, method2, "CategoryRratio"))
model10B.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.B, method2, "F1Rratio"))
model10B.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.B, method2, "F2Rratio"))
model10B.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.B, method2, "F3Rratio"))
model10B.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.B, method2, "F4Rratio"))
model10B.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.B, method2, "F5Rratio"))
model10C.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C, method2, "CategoryRratio"))
model10C.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C, method2, "F1Rratio"))
model10C.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C, method2, "F2Rratio"))
model10C.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C, method2, "F3Rratio"))
model10C.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C, method2, "F4Rratio"))
model10C.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C, method2, "F5Rratio"))
model10D.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.D, method2, "CategoryRratio"))
model10D.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.D, method2, "F1Rratio"))
model10D.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.D, method2, "F2Rratio"))
model10D.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.D, method2, "F3Rratio"))
model10D.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.D, method2, "F4Rratio"))
model10D.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.D, method2, "F5Rratio"))
model10A.Permanova.model$Level<-c(rep(1,10))
model10A.Permanova.env1$Level<-c(rep(1,10))
model10A.Permanova.env2$Level<-c(rep(1,10))
model10A.Permanova.env3$Level<-c(rep(1,10))
model10A.Permanova.env4$Level<-c(rep(1,10))
model10A.Permanova.env5$Level<-c(rep(1,10))
model10B.Permanova.model$Level<-c(rep(2,10))
model10B.Permanova.env1$Level<-c(rep(2,10))
model10B.Permanova.env2$Level<-c(rep(2,10))
model10B.Permanova.env3$Level<-c(rep(2,10))
model10B.Permanova.env4$Level<-c(rep(2,10))
model10B.Permanova.env5$Level<-c(rep(2,10))
model10C.Permanova.model$Level<-c(rep(3,10))
model10C.Permanova.env1$Level<-c(rep(3,10))
model10C.Permanova.env2$Level<-c(rep(3,10))
model10C.Permanova.env3$Level<-c(rep(3,10))
model10C.Permanova.env4$Level<-c(rep(3,10))
model10C.Permanova.env5$Level<-c(rep(3,10))
model10D.Permanova.model$Level<-c(rep(4,10))
model10D.Permanova.env1$Level<-c(rep(4,10))
model10D.Permanova.env2$Level<-c(rep(4,10))
model10D.Permanova.env3$Level<-c(rep(4,10))
model10D.Permanova.env4$Level<-c(rep(4,10))
model10D.Permanova.env5$Level<-c(rep(4,10))
permanova.model10Levels.model<-do.call("rbind", list(model10A.Permanova.model,model10B.Permanova.model,model10C.Permanova.model,model10D.Permanova.model))
permanova.model10Levels.env1<-do.call("rbind", list(model10A.Permanova.env1,model10B.Permanova.env1,model10C.Permanova.env1,model10D.Permanova.env1))
permanova.model10Levels.env2<-do.call("rbind", list(model10A.Permanova.env2,model10B.Permanova.env2,model10C.Permanova.env2,model10D.Permanova.env2))
permanova.model10Levels.env3<-do.call("rbind", list(model10A.Permanova.env3,model10B.Permanova.env3,model10C.Permanova.env3,model10D.Permanova.env3))
permanova.model10Levels.env4<-do.call("rbind", list(model10A.Permanova.env4,model10B.Permanova.env4,model10C.Permanova.env4,model10D.Permanova.env4))
permanova.model10Levels.env5<-do.call("rbind", list(model10A.Permanova.env5,model10B.Permanova.env5,model10C.Permanova.env5,model10D.Permanova.env5))
permanova.model10Levels.model<-melt(permanova.model10Levels.model, id.vars = "Level", measure.vars = method2)
permanova.model10Levels.env1<-melt(permanova.model10Levels.env1, id.vars = "Level", measure.vars = method2)
permanova.model10Levels.env2<-melt(permanova.model10Levels.env2, id.vars = "Level", measure.vars = method2)
permanova.model10Levels.env3<-melt(permanova.model10Levels.env3, id.vars = "Level", measure.vars = method2)
permanova.model10Levels.env4<-melt(permanova.model10Levels.env4, id.vars = "Level", measure.vars = method2)
permanova.model10Levels.env5<-melt(permanova.model10Levels.env5, id.vars = "Level", measure.vars = method2)
permanova.model10Levels.modelP<-data_summary2(permanova.model10Levels.model, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env1P<-data_summary2(permanova.model10Levels.env1, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env2P<-data_summary2(permanova.model10Levels.env2, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env3P<-data_summary2(permanova.model10Levels.env3, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env4P<-data_summary2(permanova.model10Levels.env4, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env5P<-data_summary2(permanova.model10Levels.env5, varname="value", groupnames=c("Level", "variable"))
ggplot(permanova.model10Levels.modelP, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova modelRatio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10Levels.env1P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env1Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10Levels.env2P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env2Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10Levels.env3P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env3Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10Levels.env4P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env4Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10Levels.env5P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env5Ratio (Median +/- min/max)")+
  theme_classic()
permanova.model10Levels.modelP1<-data_summary(permanova.model10Levels.model, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env1P1<-data_summary(permanova.model10Levels.env1, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env2P1<-data_summary(permanova.model10Levels.env2, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env3P1<-data_summary(permanova.model10Levels.env3, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env4P1<-data_summary(permanova.model10Levels.env4, varname="value", groupnames=c("Level", "variable"))
permanova.model10Levels.env5P1<-data_summary(permanova.model10Levels.env5, varname="value", groupnames=c("Level", "variable"))
ggplot(permanova.model10Levels.modelP1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova modelRatio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10Levels.env1P1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env1Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10Levels.env2P1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env2Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10Levels.env3P1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env3Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10Levels.env4P1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env4Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10Levels.env5P1, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.3)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.3))+
  geom_point(position=position_dodge(.3))+
  xlab("Degree of Structure")+
  ylab("Permanova env5Ratio (Mean +/- sd)")+
  theme_classic()
PERMANOVA results are very consistent: the experimental design is typically captured equally well across normalization methods. However, QSeq methods significantly increase the accuracy of relating specific environmental gradients to community structure.
Statistics:
Sparsity.PM<-summary(aov(permanova.model10Levels.model$value~permanova.model10Levels.model$Level*permanova.model10Levels.model$variable))
Sparsity.PF1<-summary(aov(permanova.model10Levels.env1$value~permanova.model10Levels.env1$Level*permanova.model10Levels.env1$variable))
Sparsity.PF2<-summary(aov(permanova.model10Levels.env2$value~permanova.model10Levels.env2$Level*permanova.model10Levels.env2$variable))
Sparsity.PF3<-summary(aov(permanova.model10Levels.env3$value~permanova.model10Levels.env3$Level*permanova.model10Levels.env3$variable))
Sparsity.PF4<-summary(aov(permanova.model10Levels.env4$value~permanova.model10Levels.env4$Level*permanova.model10Levels.env4$variable))
Sparsity.PF5<-summary(aov(permanova.model10Levels.env5$value~permanova.model10Levels.env5$Level*permanova.model10Levels.env5$variable))
Sparsity.PM
Sparsity.PF1
Sparsity.PF2
Sparsity.PF3
Sparsity.PF4
Sparsity.PF5
leveneTest(permanova.model10Levels.model$value~permanova.model10Levels.model$variable)
leveneTest(permanova.model10Levels.env1$value~permanova.model10Levels.env1$variable)
leveneTest(permanova.model10Levels.env2$value~permanova.model10Levels.env2$variable)
leveneTest(permanova.model10Levels.env3$value~permanova.model10Levels.env3$variable)
leveneTest(permanova.model10Levels.env4$value~permanova.model10Levels.env4$variable)
leveneTest(permanova.model10Levels.env5$value~permanova.model10Levels.env5$variable)
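Throughout this tutorial, data_summary() feeds the "Mean +/- sd" panels and data_summary2() the "Median +/- min/max" panels (its output carries low and high columns). If you want to recompute either summary outside the plotting pipeline, here is a hedged base-R sketch inferred from how their outputs are used above; the tutorial's own helpers remain the reference implementations.
# Hedged re-implementations (assumptions: data_summary returns value/sd per
# group; data_summary2 returns value/low/high per group, as the plots imply).
summarize_mean_sd<-function(df, varname, groupnames){
  f<-stats::as.formula(paste(varname, "~", paste(groupnames, collapse="+")))
  m<-aggregate(f, data=df, FUN=mean)
  s<-aggregate(f, data=df, FUN=sd)
  names(m)[names(m)==varname]<-"value"
  m$sd<-s[[varname]]
  m
}
summarize_median_range<-function(df, varname, groupnames){
  f<-stats::as.formula(paste(varname, "~", paste(groupnames, collapse="+")))
  out<-aggregate(f, data=df, FUN=median)
  names(out)[names(out)==varname]<-"value"
  out$low<-aggregate(f, data=df, FUN=min)[[varname]]
  out$high<-aggregate(f, data=df, FUN=max)[[varname]]
  out
}
# e.g., comparable to the data_summary() call above:
summarize_mean_sd(permanova.model10Levels.env1, "value", c("Level", "variable"))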
Let's say we want to know how undersampling interacts with our normalization protocol to affect our ability to detect relationships between taxa and the environment. We can vary the simulated sequencing depth:
# make sequencing depth ####
#method<-c("QSeq0.5", "QSeq1", "QSeq2", "QSeq3", "QSeq10")
model10.C1<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=100, V=50, method)
model10.C2<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=200, V=100, method)
model10.C3<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=250, method)
model10.C4<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=1000, V=500, method)
model10.C5<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=2000, V=1000, method)
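Since the five calls above differ only in D and V (with V = D/2 throughout), the same series can be generated more compactly; a sketch assuming the same BENCHMARK.MM arguments as above:
# compact, equivalent generation of the depth series
depths<-c(100, 200, 500, 1000, 2000)
model10.Cseries<-lapply(depths, function(D){
  BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=D, V=D/2, method)
})
names(model10.Cseries)<-paste0("D", depths)
# model10.Cseries$D500 is equivalent to model10.C3 above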
LII output:
SummaryLII.model10.C1<-as.data.frame(Summarize.LII(model10.C1, method2))
SummaryLII.model10.C2<-as.data.frame(Summarize.LII(model10.C2, method2))
SummaryLII.model10.C3<-as.data.frame(Summarize.LII(model10.C3, method2))
SummaryLII.model10.C4<-as.data.frame(Summarize.LII(model10.C4, method2))
SummaryLII.model10.C5<-as.data.frame(Summarize.LII(model10.C5, method2))
# prepare data for merging datasets
SummaryLII.model10.C1$Level<-c(rep(100,10))
SummaryLII.model10.C1<-melt(SummaryLII.model10.C1, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C2$Level<-c(rep(200,10))
SummaryLII.model10.C2<-melt(SummaryLII.model10.C2, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C3$Level<-c(rep(500,10))
SummaryLII.model10.C3<-melt(SummaryLII.model10.C3, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C4$Level<-c(rep(1000,10))
SummaryLII.model10.C4<-melt(SummaryLII.model10.C4, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C5$Level<-c(rep(2000,10))
SummaryLII.model10.C5<-melt(SummaryLII.model10.C5, id.vars = "Level", measure.vars = method2)
# merge datasets
model10.SeqLevels<-do.call("rbind", list(SummaryLII.model10.C1,SummaryLII.model10.C2,SummaryLII.model10.C3,SummaryLII.model10.C4,SummaryLII.model10.C5))
# summarize for plotting
model10.SeqLevels.summary<-data_summary(model10.SeqLevels, varname="value", groupnames=c("Level", "variable"))
model10.SeqLevels.summary2<-data_summary2(model10.SeqLevels, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(model10.SeqLevels.summary, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("LII Value (Mean +/- sd)")+
  theme_classic()
ggplot(model10.SeqLevels.summary2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("LII Value (Median +/- min/max)")+
  theme_classic()
These results are consistent with our previous experiment: QSeq2-100 lose the least information, QSeq0.5 generally performs the worst, and un-normalized data outperforms only QSeq0.5 (and, under certain conditions, QSeq1). All methods perform best at the highest sequencing depth.
Depth.LII<-summary(aov(model10.SeqLevels$value~model10.SeqLevels$Level*model10.SeqLevels$variable))
Depth.LII
TukeyHSD(aov(model10.SeqLevels$value~model10.SeqLevels$Level*model10.SeqLevels$variable))
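The TukeyHSD() tables for these interaction models get long. Here is a small hedged helper (mine, not part of Model.Microbiome) that keeps only contrasts with an adjusted p-value below a threshold; note it treats Level as a factor so that Tukey contrasts are defined for it:
# keep only significant pairwise contrasts from a TukeyHSD table
sig.tukey<-function(fit, alpha=0.05){
  tk<-TukeyHSD(fit)
  lapply(tk, function(m) m[m[,"p adj"]<alpha, , drop=FALSE])
}
sig.tukey(aov(value~as.factor(Level)*variable, data=model10.SeqLevels))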
Now let's look at how well the environmental gradients predict taxon abundance:
# sequencing depth env. lm ratio ####
SM.lmRatio.model10.C1<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C1, method2))
SV.lmRatio.model10.C1<-as.data.frame(Summarize.lmRatiotab.Var(model10.C1, method2))
SM.lmRatio.model10.C2<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C2, method2))
SV.lmRatio.model10.C2<-as.data.frame(Summarize.lmRatiotab.Var(model10.C2, method2))
SM.lmRatio.model10.C3<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C3, method2))
SV.lmRatio.model10.C3<-as.data.frame(Summarize.lmRatiotab.Var(model10.C3, method2))
SM.lmRatio.model10.C4<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C4, method2))
SV.lmRatio.model10.C4<-as.data.frame(Summarize.lmRatiotab.Var(model10.C4, method2))
SM.lmRatio.model10.C5<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C5, method2))
SV.lmRatio.model10.C5<-as.data.frame(Summarize.lmRatiotab.Var(model10.C5, method2))
# prepare data for merging datasets
SM.lmRatio.model10.C1$Level<-c(rep(100,10))
SV.lmRatio.model10.C1$Level<-c(rep(100,10))
SM.lmRatio.model10.C2$Level<-c(rep(200,10))
SV.lmRatio.model10.C2$Level<-c(rep(200,10))
SM.lmRatio.model10.C3$Level<-c(rep(500,10))
SV.lmRatio.model10.C3$Level<-c(rep(500,10))
SM.lmRatio.model10.C4$Level<-c(rep(1000,10))
SV.lmRatio.model10.C4$Level<-c(rep(1000,10))
SM.lmRatio.model10.C5$Level<-c(rep(2000,10))
SV.lmRatio.model10.C5$Level<-c(rep(2000,10))
SM.lmRatio.model10.C1<-melt(SM.lmRatio.model10.C1, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C1<-melt(SV.lmRatio.model10.C1, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C2<-melt(SM.lmRatio.model10.C2, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C2<-melt(SV.lmRatio.model10.C2, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C3<-melt(SM.lmRatio.model10.C3, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C3<-melt(SV.lmRatio.model10.C3, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C4<-melt(SM.lmRatio.model10.C4, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C4<-melt(SV.lmRatio.model10.C4, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C5<-melt(SM.lmRatio.model10.C5, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C5<-melt(SV.lmRatio.model10.C5, id.vars = "Level", measure.vars = method2)
# merge datasets
SM.lmRatio.model10.CLevels<-do.call("rbind", list(SM.lmRatio.model10.C1,SM.lmRatio.model10.C2,SM.lmRatio.model10.C3,SM.lmRatio.model10.C4,SM.lmRatio.model10.C5))
SV.lmRatio.model10.CLevels<-do.call("rbind", list(SV.lmRatio.model10.C1,SV.lmRatio.model10.C2,SV.lmRatio.model10.C3,SV.lmRatio.model10.C4,SV.lmRatio.model10.C5))
# summarize for plotting
SM.lmRatio.model10C.p<-data_summary(SM.lmRatio.model10.CLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10C.p<-data_summary(SV.lmRatio.model10.CLevels, varname="value", groupnames=c("Level", "variable"))
SM.lmRatio.model10C.p2<-data_summary2(SM.lmRatio.model10.CLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10C.p2<-data_summary2(SV.lmRatio.model10.CLevels, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(SM.lmRatio.model10C.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Median lmRatio (Mean +/- sd)")+
  theme_classic()
ggplot(SV.lmRatio.model10C.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Variance lmRatio (Mean +/- sd)")+
  theme_classic()
ggplot(SM.lmRatio.model10C.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Median lmRatio (Median +/- min/max)")+
  theme_classic()
ggplot(SV.lmRatio.model10C.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Variance lmRatio (Median +/- min/max)")+
  theme_classic()
Typically, QSeq0.5 performs poorly while the other methods are essentially comparable. At low sequencing depth, un-normalized data often performs somewhere between QSeq1 and QSeq0.5; this effect is significant in some runs of the simulation but not in others.
Depth.lmEnvM<-summary(aov(SM.lmRatio.model10.CLevels$value~SM.lmRatio.model10.CLevels$Level*SM.lmRatio.model10.CLevels$variable))
Depth.lmEnvM
TukeyHSD(aov(SM.lmRatio.model10.CLevels$value~SM.lmRatio.model10.CLevels$Level*SM.lmRatio.model10.CLevels$variable))
Depth.lmEnvV<-summary(aov(SV.lmRatio.model10.CLevels$value~SV.lmRatio.model10.CLevels$Level*SV.lmRatio.model10.CLevels$variable))
Depth.lmEnvV
TukeyHSD(aov(SV.lmRatio.model10.CLevels$value~SV.lmRatio.model10.CLevels$Level*SV.lmRatio.model10.CLevels$variable))
leveneTest(value~as.factor(Level)*variable, data=SV.lmRatio.model10.CLevels)
And the same steps, but for the categorical (experimental design) variable:
# lm ratio model sequencing depth ####
SM.lmRatioM.model10.C1<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C1, method2))
SV.lmRatioM.model10.C1<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C1, method2))
SM.lmRatioM.model10.C2<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C2, method2))
SV.lmRatioM.model10.C2<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C2, method2))
SM.lmRatioM.model10.C3<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C3, method2))
SV.lmRatioM.model10.C3<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C3, method2))
SM.lmRatioM.model10.C4<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C4, method2))
SV.lmRatioM.model10.C4<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C4, method2))
SM.lmRatioM.model10.C5<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C5, method2))
SV.lmRatioM.model10.C5<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C5, method2))
# prepare data for merging datasets
SM.lmRatioM.model10.C1$Level<-c(rep(100,10))
SV.lmRatioM.model10.C1$Level<-c(rep(100,10))
SM.lmRatioM.model10.C2$Level<-c(rep(200,10))
SV.lmRatioM.model10.C2$Level<-c(rep(200,10))
SM.lmRatioM.model10.C3$Level<-c(rep(500,10))
SV.lmRatioM.model10.C3$Level<-c(rep(500,10))
SM.lmRatioM.model10.C4$Level<-c(rep(1000,10))
SV.lmRatioM.model10.C4$Level<-c(rep(1000,10))
SM.lmRatioM.model10.C5$Level<-c(rep(2000,10))
SV.lmRatioM.model10.C5$Level<-c(rep(2000,10))
SM.lmRatioM.model10.C1<-melt(SM.lmRatioM.model10.C1, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.C1<-melt(SV.lmRatioM.model10.C1, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.C2<-melt(SM.lmRatioM.model10.C2, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.C2<-melt(SV.lmRatioM.model10.C2, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.C3<-melt(SM.lmRatioM.model10.C3, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.C3<-melt(SV.lmRatioM.model10.C3, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.C4<-melt(SM.lmRatioM.model10.C4, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.C4<-melt(SV.lmRatioM.model10.C4, id.vars = "Level", measure.vars = method2)
SM.lmRatioM.model10.C5<-melt(SM.lmRatioM.model10.C5, id.vars = "Level", measure.vars = method2)
SV.lmRatioM.model10.C5<-melt(SV.lmRatioM.model10.C5, id.vars = "Level", measure.vars = method2)
# merge datasets
SM.lmRatioM.model10CLevels<-do.call("rbind", list(SM.lmRatioM.model10.C1,SM.lmRatioM.model10.C2,SM.lmRatioM.model10.C3,SM.lmRatioM.model10.C4,SM.lmRatioM.model10.C5))
SV.lmRatioM.model10CLevels<-do.call("rbind", list(SV.lmRatioM.model10.C1,SV.lmRatioM.model10.C2,SV.lmRatioM.model10.C3,SV.lmRatioM.model10.C4,SV.lmRatioM.model10.C5))
# summarize for plotting
SM.lmRatioM.model10C.p<-data_summary(SM.lmRatioM.model10CLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatioM.model10C.p<-data_summary(SV.lmRatioM.model10CLevels, varname="value", groupnames=c("Level", "variable"))
SM.lmRatioM.model10C.p2<-data_summary2(SM.lmRatioM.model10CLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatioM.model10C.p2<-data_summary2(SV.lmRatioM.model10CLevels, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(SM.lmRatioM.model10C.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Median lmRatio.model (Mean +/- sd)")+
  theme_classic()
ggplot(SV.lmRatioM.model10C.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Variance lmRatio.model (Mean +/- sd)")+ # this axis is not logged
  theme_classic()
ggplot(SM.lmRatioM.model10C.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Median lmRatio.model (Median +/- min/max)")+
  theme_classic()
ggplot(SV.lmRatioM.model10C.p2, aes(x=Level, y=log(value), group = variable, color=variable))+
  geom_errorbar(aes(ymin=log(low), ymax=log(high)), width=.1, position=position_dodge(30)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("log Variance lmRatio.model (Median +/- min/max)")+
  theme_classic()
It is typical for the categorical model to show greater agreement across methods than the environmental gradients do.
Depth.lmModelM<-summary(aov(SM.lmRatioM.model10CLevels$value~SM.lmRatioM.model10CLevels$Level*SM.lmRatioM.model10CLevels$variable))
Depth.lmModelM
TukeyHSD(aov(SM.lmRatioM.model10CLevels$value~SM.lmRatioM.model10CLevels$Level*SM.lmRatioM.model10CLevels$variable))
Depth.lmModelV<-summary(aov(SV.lmRatioM.model10CLevels$value~SV.lmRatioM.model10CLevels$Level*SV.lmRatioM.model10CLevels$variable))
Depth.lmModelV
TukeyHSD(aov(SV.lmRatioM.model10CLevels$value~SV.lmRatioM.model10CLevels$Level*SV.lmRatioM.model10CLevels$variable))
Let's look at the taxon relationships:
TaxRatio.model10.C1<-getTaxCor.Tab(model10.C1, method2)
TaxRatio.model10.C2<-getTaxCor.Tab(model10.C2, method2)
TaxRatio.model10.C3<-getTaxCor.Tab(model10.C3, method2)
TaxRatio.model10.C4<-getTaxCor.Tab(model10.C4, method2)
TaxRatio.model10.C5<-getTaxCor.Tab(model10.C5, method2)
V.TaxRatio.model10.C1<-melt(TaxRatio.model10.C1$V.tax)
M.TaxRatio.model10.C1<-melt(TaxRatio.model10.C1$Median.tax)
V.TaxRatio.model10.C2<-melt(TaxRatio.model10.C2$V.tax)
M.TaxRatio.model10.C2<-melt(TaxRatio.model10.C2$Median.tax)
V.TaxRatio.model10.C3<-melt(TaxRatio.model10.C3$V.tax)
M.TaxRatio.model10.C3<-melt(TaxRatio.model10.C3$Median.tax)
V.TaxRatio.model10.C4<-melt(TaxRatio.model10.C4$V.tax)
M.TaxRatio.model10.C4<-melt(TaxRatio.model10.C4$Median.tax)
V.TaxRatio.model10.C5<-melt(TaxRatio.model10.C5$V.tax)
M.TaxRatio.model10.C5<-melt(TaxRatio.model10.C5$Median.tax)
V.TaxRatio.model10.C1<-as.data.frame(V.TaxRatio.model10.C1)
M.TaxRatio.model10.C1<-as.data.frame(M.TaxRatio.model10.C1)
V.TaxRatio.model10.C2<-as.data.frame(V.TaxRatio.model10.C2)
M.TaxRatio.model10.C2<-as.data.frame(M.TaxRatio.model10.C2)
V.TaxRatio.model10.C3<-as.data.frame(V.TaxRatio.model10.C3)
M.TaxRatio.model10.C3<-as.data.frame(M.TaxRatio.model10.C3)
V.TaxRatio.model10.C4<-as.data.frame(V.TaxRatio.model10.C4)
M.TaxRatio.model10.C4<-as.data.frame(M.TaxRatio.model10.C4)
V.TaxRatio.model10.C5<-as.data.frame(V.TaxRatio.model10.C5)
M.TaxRatio.model10.C5<-as.data.frame(M.TaxRatio.model10.C5)
# prepare data for merging datasets
V.TaxRatio.model10.C1$Level<-c(rep(100,nrow(V.TaxRatio.model10.C1)))
V.TaxRatio.model10.C2$Level<-c(rep(200,nrow(V.TaxRatio.model10.C2)))
V.TaxRatio.model10.C3$Level<-c(rep(500,nrow(V.TaxRatio.model10.C3)))
V.TaxRatio.model10.C4$Level<-c(rep(1000,nrow(V.TaxRatio.model10.C4)))
V.TaxRatio.model10.C5$Level<-c(rep(2000,nrow(V.TaxRatio.model10.C5)))
M.TaxRatio.model10.C1$Level<-c(rep(100,nrow(M.TaxRatio.model10.C1)))
M.TaxRatio.model10.C2$Level<-c(rep(200,nrow(M.TaxRatio.model10.C2)))
M.TaxRatio.model10.C3$Level<-c(rep(500,nrow(M.TaxRatio.model10.C3)))
M.TaxRatio.model10.C4$Level<-c(rep(1000,nrow(M.TaxRatio.model10.C4)))
M.TaxRatio.model10.C5$Level<-c(rep(2000,nrow(M.TaxRatio.model10.C5)))
# merge datasets
V.TaxRatio.model10Levels<-do.call("rbind", list(V.TaxRatio.model10.C1,V.TaxRatio.model10.C2,V.TaxRatio.model10.C3,V.TaxRatio.model10.C4,V.TaxRatio.model10.C5))
M.TaxRatio.model10Levels<-do.call("rbind", list(M.TaxRatio.model10.C1,M.TaxRatio.model10.C2,M.TaxRatio.model10.C3,M.TaxRatio.model10.C4,M.TaxRatio.model10.C5))
# summarize for plotting
V.TaxRatio.model10.Seq.p<-data_summary(V.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
M.TaxRatio.model10.Seq.p<-data_summary(M.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
V.TaxRatio.model10.Seq.p2<-data_summary2(V.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
M.TaxRatio.model10.Seq.p2<-data_summary2(M.TaxRatio.model10Levels, varname="value", groupnames=c("Level", "Var2"))
# plot
ggplot(V.TaxRatio.model10.Seq.p, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(100)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(100))+
  geom_point(position=position_dodge(100))+
  xlab("Sequencing Depth")+
  ylab("Variance taxRatio (Mean +/- sd)")+
  theme_classic()
ggplot(M.TaxRatio.model10.Seq.p, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(100)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(100))+
  geom_point(position=position_dodge(100))+
  xlab("Sequencing Depth")+
  ylab("Median taxRatio (Mean +/- sd)")+
  theme_classic()
ggplot(V.TaxRatio.model10.Seq.p2, aes(x=Level, y=log10(value), group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=log10(low), ymax=log10(high)), width=.1, position=position_dodge(100)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(100))+
  geom_point(position=position_dodge(100))+
  xlab("Sequencing Depth")+
  ylab("log10 Variance taxRatio (Median +/- min/max)")+
  theme_classic()
ggplot(M.TaxRatio.model10.Seq.p2, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(100)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(100))+
  geom_point(position=position_dodge(100))+
  xlab("Sequencing Depth")+
  ylab("Median taxRatio (Median +/- min/max)")+
  theme_classic()
Remember that, in a perfect world, the variance should approach zero and the mean should approach 1. Greater deviations from these values indicate less accurate retention of taxon-taxon relationships.
Depth.TaxM<-summary(aov(M.TaxRatio.model10Levels$value~M.TaxRatio.model10Levels$Level*M.TaxRatio.model10Levels$Var2))
Depth.TaxM
TukeyHSD(aov(M.TaxRatio.model10Levels$value~M.TaxRatio.model10Levels$Level*M.TaxRatio.model10Levels$Var2))
Depth.TaxV<-summary(aov(V.TaxRatio.model10Levels$value~V.TaxRatio.model10Levels$Level*V.TaxRatio.model10Levels$Var2))
Depth.TaxV
TukeyHSD(aov(V.TaxRatio.model10Levels$value~V.TaxRatio.model10Levels$Level*V.TaxRatio.model10Levels$Var2))
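Because the melted depth tables reuse the same column layout as the sparsity experiment, the hedged taxRatio.scorecard() sketch from earlier can be reapplied unchanged:
taxRatio.scorecard(M.TaxRatio.model10Levels, V.TaxRatio.model10Levels)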
model10C1.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1, method2, "CategoryRratio"))
model10C1.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1, method2, "F1Rratio"))
model10C1.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1, method2, "F2Rratio"))
model10C1.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1, method2, "F3Rratio"))
model10C1.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1, method2, "F4Rratio"))
model10C1.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1, method2, "F5Rratio"))
model10C2.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C2, method2, "CategoryRratio"))
model10C2.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C2, method2, "F1Rratio"))
model10C2.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C2, method2, "F2Rratio"))
model10C2.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C2, method2, "F3Rratio"))
model10C2.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C2, method2, "F4Rratio"))
model10C2.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C2, method2, "F5Rratio"))
model10C3.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C3, method2, "CategoryRratio"))
model10C3.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C3, method2, "F1Rratio"))
model10C3.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C3, method2, "F2Rratio"))
model10C3.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C3, method2, "F3Rratio"))
model10C3.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C3, method2, "F4Rratio"))
model10C3.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C3, method2, "F5Rratio"))
model10C4.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C4, method2, "CategoryRratio"))
model10C4.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C4, method2, "F1Rratio"))
model10C4.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C4, method2, "F2Rratio"))
model10C4.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C4, method2, "F3Rratio"))
model10C4.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C4, method2, "F4Rratio"))
model10C4.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C4, method2, "F5Rratio"))
model10C5.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C5, method2, "CategoryRratio"))
model10C5.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C5, method2, "F1Rratio"))
model10C5.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C5, method2, "F2Rratio"))
model10C5.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C5, method2, "F3Rratio"))
model10C5.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C5, method2, "F4Rratio"))
model10C5.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C5, method2, "F5Rratio"))
model10C1.Permanova.model$Level<-c(rep(100,10))
model10C1.Permanova.env1$Level<-c(rep(100,10))
model10C1.Permanova.env2$Level<-c(rep(100,10))
model10C1.Permanova.env3$Level<-c(rep(100,10))
model10C1.Permanova.env4$Level<-c(rep(100,10))
model10C1.Permanova.env5$Level<-c(rep(100,10))
model10C2.Permanova.model$Level<-c(rep(200,10))
model10C2.Permanova.env1$Level<-c(rep(200,10))
model10C2.Permanova.env2$Level<-c(rep(200,10))
model10C2.Permanova.env3$Level<-c(rep(200,10))
model10C2.Permanova.env4$Level<-c(rep(200,10))
model10C2.Permanova.env5$Level<-c(rep(200,10))
model10C3.Permanova.model$Level<-c(rep(500,10))
model10C3.Permanova.env1$Level<-c(rep(500,10))
model10C3.Permanova.env2$Level<-c(rep(500,10))
model10C3.Permanova.env3$Level<-c(rep(500,10))
model10C3.Permanova.env4$Level<-c(rep(500,10))
model10C3.Permanova.env5$Level<-c(rep(500,10))
model10C4.Permanova.model$Level<-c(rep(1000,10))
model10C4.Permanova.env1$Level<-c(rep(1000,10))
model10C4.Permanova.env2$Level<-c(rep(1000,10))
model10C4.Permanova.env3$Level<-c(rep(1000,10))
model10C4.Permanova.env4$Level<-c(rep(1000,10))
model10C4.Permanova.env5$Level<-c(rep(1000,10))
model10C5.Permanova.model$Level<-c(rep(2000,10))
model10C5.Permanova.env1$Level<-c(rep(2000,10))
model10C5.Permanova.env2$Level<-c(rep(2000,10))
model10C5.Permanova.env3$Level<-c(rep(2000,10))
model10C5.Permanova.env4$Level<-c(rep(2000,10))
model10C5.Permanova.env5$Level<-c(rep(2000,10))
permanova.model10CLevels.model<-do.call("rbind", list(model10C1.Permanova.model,model10C2.Permanova.model,model10C3.Permanova.model,model10C4.Permanova.model,model10C5.Permanova.model))
permanova.model10CLevels.env1<-do.call("rbind", list(model10C1.Permanova.env1,model10C2.Permanova.env1,model10C3.Permanova.env1,model10C4.Permanova.env1,model10C5.Permanova.env1))
permanova.model10CLevels.env2<-do.call("rbind", list(model10C1.Permanova.env2,model10C2.Permanova.env2,model10C3.Permanova.env2,model10C4.Permanova.env2,model10C5.Permanova.env2))
permanova.model10CLevels.env3<-do.call("rbind", list(model10C1.Permanova.env3,model10C2.Permanova.env3,model10C3.Permanova.env3,model10C4.Permanova.env3,model10C5.Permanova.env3))
permanova.model10CLevels.env4<-do.call("rbind", list(model10C1.Permanova.env4,model10C2.Permanova.env4,model10C3.Permanova.env4,model10C4.Permanova.env4,model10C5.Permanova.env4))
permanova.model10CLevels.env5<-do.call("rbind", list(model10C1.Permanova.env5,model10C2.Permanova.env5,model10C3.Permanova.env5,model10C4.Permanova.env5,model10C5.Permanova.env5))
permanova.model10CLevels.model<-melt(permanova.model10CLevels.model, id.vars = "Level", measure.vars = method2)
permanova.model10CLevels.env1<-melt(permanova.model10CLevels.env1, id.vars = "Level", measure.vars = method2)
permanova.model10CLevels.env2<-melt(permanova.model10CLevels.env2, id.vars = "Level", measure.vars = method2)
permanova.model10CLevels.env3<-melt(permanova.model10CLevels.env3, id.vars = "Level", measure.vars = method2)
permanova.model10CLevels.env4<-melt(permanova.model10CLevels.env4, id.vars = "Level", measure.vars = method2)
permanova.model10CLevels.env5<-melt(permanova.model10CLevels.env5, id.vars = "Level", measure.vars = method2)
permanova.model10CLevels.modelP<-data_summary2(permanova.model10CLevels.model, varname="value", groupnames=c("Level", "variable"))
permanova.model10CLevels.env1P<-data_summary2(permanova.model10CLevels.env1, varname="value", groupnames=c("Level", "variable"))
permanova.model10CLevels.env2P<-data_summary2(permanova.model10CLevels.env2, varname="value", groupnames=c("Level", "variable"))
permanova.model10CLevels.env3P<-data_summary2(permanova.model10CLevels.env3, varname="value", groupnames=c("Level", "variable"))
permanova.model10CLevels.env4P<-data_summary2(permanova.model10CLevels.env4, varname="value", groupnames=c("Level", "variable"))
permanova.model10CLevels.env5P<-data_summary2(permanova.model10CLevels.env5, varname="value", groupnames=c("Level", "variable"))
ggplot(permanova.model10CLevels.modelP, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+ # this experiment varies depth, not degree of structure
  ylab("Permanova modelRatio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10CLevels.env1P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Permanova env1Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10CLevels.env2P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Permanova env2Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10CLevels.env3P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Permanova env3Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10CLevels.env4P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Permanova env4Ratio (Median +/- min/max)")+
  theme_classic()
ggplot(permanova.model10CLevels.env5P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(30)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(30))+
  geom_point(position=position_dodge(30))+
  xlab("Sequencing Depth")+
  ylab("Permanova env5Ratio (Median +/- min/max)")+
  theme_classic()
Again, results are similar to the sparsity example: the experimental design is equally strongly detected across normalization methods, but un-normalized data significantly underperforms in capturing the relationship between community structure and the environmental gradients.
Depth.PM<-summary(aov(permanova.model10CLevels.model$value~permanova.model10CLevels.model$Level*permanova.model10CLevels.model$variable))
TukeyHSD(aov(permanova.model10CLevels.model$value~permanova.model10CLevels.model$Level*permanova.model10CLevels.model$variable))
Depth.PF1<-summary(aov(permanova.model10CLevels.env1$value~permanova.model10CLevels.env1$Level*permanova.model10CLevels.env1$variable))
TukeyHSD(aov(permanova.model10CLevels.env1$value~permanova.model10CLevels.env1$Level*permanova.model10CLevels.env1$variable))
Depth.PF2<-summary(aov(permanova.model10CLevels.env2$value~permanova.model10CLevels.env2$Level*permanova.model10CLevels.env2$variable))
TukeyHSD(aov(permanova.model10CLevels.env2$value~permanova.model10CLevels.env2$Level*permanova.model10CLevels.env2$variable))
Depth.PF3<-summary(aov(permanova.model10CLevels.env3$value~permanova.model10CLevels.env3$Level*permanova.model10CLevels.env3$variable))
TukeyHSD(aov(permanova.model10CLevels.env3$value~permanova.model10CLevels.env3$Level*permanova.model10CLevels.env3$variable))
Depth.PF4<-summary(aov(permanova.model10CLevels.env4$value~permanova.model10CLevels.env4$Level*permanova.model10CLevels.env4$variable))
TukeyHSD(aov(permanova.model10CLevels.env4$value~permanova.model10CLevels.env4$Level*permanova.model10CLevels.env4$variable))
Depth.PF5<-summary(aov(permanova.model10CLevels.env5$value~permanova.model10CLevels.env5$Level*permanova.model10CLevels.env5$variable))
TukeyHSD(aov(permanova.model10CLevels.env5$value~permanova.model10CLevels.env5$Level*permanova.model10CLevels.env5$variable))
Depth.PM
Depth.PF1
Depth.PF2
Depth.PF3
Depth.PF4
Depth.PF5
leveneTest(permanova.model10CLevels.model$value~permanova.model10CLevels.model$variable)
leveneTest(permanova.model10CLevels.env1$value~permanova.model10CLevels.env1$variable)
leveneTest(permanova.model10CLevels.env2$value~permanova.model10CLevels.env2$variable)
leveneTest(permanova.model10CLevels.env3$value~permanova.model10CLevels.env3$variable)
leveneTest(permanova.model10CLevels.env4$value~permanova.model10CLevels.env4$variable)
leveneTest(permanova.model10CLevels.env5$value~permanova.model10CLevels.env5$variable)
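The sig.tukey() sketch from earlier applies here too, e.g. for the first environmental gradient:
sig.tukey(aov(value~as.factor(Level)*variable, data=permanova.model10CLevels.env1))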
By now you should be able to interpret the results, so I won't explain everything (they are very consistent across experiments). Next, let's hold the mean sequencing depth constant (D=500) and vary the among-sample variance in depth:
# sequencing variance ####
model10.C1V1<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=50, method)
model10.C1V2<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=100, method)
model10.C1V3<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=250, method)
model10.C1V4<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=500, method)
model10.C1V5<-BENCHMARK.MM(reps=10, commonN=20, groupN=20, singleN=15, D=500, V=1000, method)
LII output (here Level encodes the ratio of depth variance to mean depth, V/D; with D=500, V=50-1000 gives 0.1-2):
# sequencing variance LII ####
SummaryLII.model10.C1V1<-as.data.frame(Summarize.LII(model10.C1V1, method2))
SummaryLII.model10.C1V2<-as.data.frame(Summarize.LII(model10.C1V2, method2))
SummaryLII.model10.C1V3<-as.data.frame(Summarize.LII(model10.C1V3, method2))
SummaryLII.model10.C1V4<-as.data.frame(Summarize.LII(model10.C1V4, method2))
SummaryLII.model10.C1V5<-as.data.frame(Summarize.LII(model10.C1V5, method2))
# prepare data for merging datasets
SummaryLII.model10.C1V1$Level<-c(rep(0.1,10))
SummaryLII.model10.C1V1<-melt(SummaryLII.model10.C1V1, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C1V2$Level<-c(rep(0.2,10))
SummaryLII.model10.C1V2<-melt(SummaryLII.model10.C1V2, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C1V3$Level<-c(rep(0.5,10))
SummaryLII.model10.C1V3<-melt(SummaryLII.model10.C1V3, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C1V4$Level<-c(rep(1,10))
SummaryLII.model10.C1V4<-melt(SummaryLII.model10.C1V4, id.vars = "Level", measure.vars = method2)
SummaryLII.model10.C1V5$Level<-c(rep(2,10))
SummaryLII.model10.C1V5<-melt(SummaryLII.model10.C1V5, id.vars = "Level", measure.vars = method2)
# merge datasets
model10.SeqVarLevels<-do.call("rbind", list(SummaryLII.model10.C1V1,SummaryLII.model10.C1V2,SummaryLII.model10.C1V3,SummaryLII.model10.C1V4,SummaryLII.model10.C1V5))
# summarize for plotting
model10.SeqVarLevels.summary<-data_summary(model10.SeqVarLevels, varname="value", groupnames=c("Level", "variable"))
model10.SeqVarLevels.summary2<-data_summary2(model10.SeqVarLevels, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(model10.SeqVarLevels.summary, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.1))+
  geom_point(position=position_dodge(.1))+
  xlab("Sequencing Variance")+
  ylab("LII Value (Mean +/- sd)")+
  theme_classic()
ggplot(model10.SeqVarLevels.summary2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.1))+
  geom_point(position=position_dodge(.1))+
  xlab("Sequencing Variance")+
  ylab("LII Value (Median +/- min/max)")+
  theme_classic()
Variance.LII<-summary(aov(model10.SeqVarLevels$value~model10.SeqVarLevels$Level*model10.SeqVarLevels$variable))
Variance.LII
TukeyHSD(aov(model10.SeqVarLevels$value~model10.SeqVarLevels$Level*model10.SeqVarLevels$variable))
Linear model outputs:
# sequencing variance lm ####
SM.lmRatio.model10.C1V1<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C1V1, method2))
SV.lmRatio.model10.C1V1<-as.data.frame(Summarize.lmRatiotab.Var(model10.C1V1, method2))
SM.lmRatio.model10.C1V2<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C1V2, method2))
SV.lmRatio.model10.C1V2<-as.data.frame(Summarize.lmRatiotab.Var(model10.C1V2, method2))
SM.lmRatio.model10.C1V3<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C1V3, method2))
SV.lmRatio.model10.C1V3<-as.data.frame(Summarize.lmRatiotab.Var(model10.C1V3, method2))
SM.lmRatio.model10.C1V4<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C1V4, method2))
SV.lmRatio.model10.C1V4<-as.data.frame(Summarize.lmRatiotab.Var(model10.C1V4, method2))
SM.lmRatio.model10.C1V5<-as.data.frame(Summarize.lmRatiotab.Mean(model10.C1V5, method2))
SV.lmRatio.model10.C1V5<-as.data.frame(Summarize.lmRatiotab.Var(model10.C1V5, method2))
# prepare data for merging datasets
SM.lmRatio.model10.C1V1$Level<-c(rep(0.1,10))
SV.lmRatio.model10.C1V1$Level<-c(rep(0.1,10))
SM.lmRatio.model10.C1V2$Level<-c(rep(0.2,10))
SV.lmRatio.model10.C1V2$Level<-c(rep(0.2,10))
SM.lmRatio.model10.C1V3$Level<-c(rep(0.5,10))
SV.lmRatio.model10.C1V3$Level<-c(rep(0.5,10))
SM.lmRatio.model10.C1V4$Level<-c(rep(1,10))
SV.lmRatio.model10.C1V4$Level<-c(rep(1,10))
SM.lmRatio.model10.C1V5$Level<-c(rep(2,10))
SV.lmRatio.model10.C1V5$Level<-c(rep(2,10))
SM.lmRatio.model10.C1V1<-melt(SM.lmRatio.model10.C1V1, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C1V1<-melt(SV.lmRatio.model10.C1V1, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C1V2<-melt(SM.lmRatio.model10.C1V2, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C1V2<-melt(SV.lmRatio.model10.C1V2, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C1V3<-melt(SM.lmRatio.model10.C1V3, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C1V3<-melt(SV.lmRatio.model10.C1V3, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C1V4<-melt(SM.lmRatio.model10.C1V4, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C1V4<-melt(SV.lmRatio.model10.C1V4, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.C1V5<-melt(SM.lmRatio.model10.C1V5, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.C1V5<-melt(SV.lmRatio.model10.C1V5, id.vars = "Level", measure.vars = method2)
# merge datasets
SM.lmRatio.model10.CVLevels<-do.call("rbind", list(SM.lmRatio.model10.C1V1,SM.lmRatio.model10.C1V2,SM.lmRatio.model10.C1V3,SM.lmRatio.model10.C1V4,SM.lmRatio.model10.C1V5))
SV.lmRatio.model10.CVLevels<-do.call("rbind", list(SV.lmRatio.model10.C1V1,SV.lmRatio.model10.C1V2,SV.lmRatio.model10.C1V3,SV.lmRatio.model10.C1V4,SV.lmRatio.model10.C1V5))
# summarize for plotting
SM.lmRatio.model10CV.p<-data_summary(SM.lmRatio.model10.CVLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10CV.p<-data_summary(SV.lmRatio.model10.CVLevels, varname="value", groupnames=c("Level", "variable"))
SM.lmRatio.model10CV.p2<-data_summary2(SM.lmRatio.model10.CVLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10CV.p2<-data_summary2(SV.lmRatio.model10.CVLevels, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(SM.lmRatio.model10CV.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Median lmRatio (Mean +/- sd)")+
  theme_classic()
ggplot(SV.lmRatio.model10CV.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Variance lmRatio (Mean +/- sd)")+
  theme_classic()
ggplot(SM.lmRatio.model10CV.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Median lmRatio (Median +/- min/max)")+
  theme_classic()
ggplot(SV.lmRatio.model10CV.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Variance lmRatio (Median +/- min/max)")+
  theme_classic()
Variance.lmEnvM<-summary(aov(SM.lmRatio.model10.CVLevels$value~SM.lmRatio.model10.CVLevels$Level*SM.lmRatio.model10.CVLevels$variable))
Variance.lmEnvM
TukeyHSD(aov(SM.lmRatio.model10.CVLevels$value~SM.lmRatio.model10.CVLevels$Level*SM.lmRatio.model10.CVLevels$variable))
Variance.lmEnvV<-summary(aov(SV.lmRatio.model10.CVLevels$value~SV.lmRatio.model10.CVLevels$Level*SV.lmRatio.model10.CVLevels$variable))
Variance.lmEnvV
TukeyHSD(aov(SV.lmRatio.model10.CVLevels$value~SV.lmRatio.model10.CVLevels$Level*SV.lmRatio.model10.CVLevels$variable))
# sequencing variance lm model ####
SM.lmRatio.model10.MC1V1<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C1V1, method2))
SV.lmRatio.model10.MC1V1<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C1V1, method2))
SM.lmRatio.model10.MC1V2<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C1V2, method2))
SV.lmRatio.model10.MC1V2<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C1V2, method2))
SM.lmRatio.model10.MC1V3<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C1V3, method2))
SV.lmRatio.model10.MC1V3<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C1V3, method2))
SM.lmRatio.model10.MC1V4<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C1V4, method2))
SV.lmRatio.model10.MC1V4<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C1V4, method2))
SM.lmRatio.model10.MC1V5<-as.data.frame(Summarize.lmRatiotabModel.Mean(model10.C1V5, method2))
SV.lmRatio.model10.MC1V5<-as.data.frame(Summarize.lmRatiotabModel.Var(model10.C1V5, method2))
# prepare data for merging datasets
SM.lmRatio.model10.MC1V1$Level<-c(rep(0.1,10))
SV.lmRatio.model10.MC1V1$Level<-c(rep(0.1,10))
SM.lmRatio.model10.MC1V2$Level<-c(rep(0.2,10))
SV.lmRatio.model10.MC1V2$Level<-c(rep(0.2,10))
SM.lmRatio.model10.MC1V3$Level<-c(rep(0.5,10))
SV.lmRatio.model10.MC1V3$Level<-c(rep(0.5,10))
SM.lmRatio.model10.MC1V4$Level<-c(rep(1,10))
SV.lmRatio.model10.MC1V4$Level<-c(rep(1,10))
SM.lmRatio.model10.MC1V5$Level<-c(rep(2,10))
SV.lmRatio.model10.MC1V5$Level<-c(rep(2,10))
SM.lmRatio.model10.MC1V1<-melt(SM.lmRatio.model10.MC1V1, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.MC1V1<-melt(SV.lmRatio.model10.MC1V1, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.MC1V2<-melt(SM.lmRatio.model10.MC1V2, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.MC1V2<-melt(SV.lmRatio.model10.MC1V2, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.MC1V3<-melt(SM.lmRatio.model10.MC1V3, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.MC1V3<-melt(SV.lmRatio.model10.MC1V3, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.MC1V4<-melt(SM.lmRatio.model10.MC1V4, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.MC1V4<-melt(SV.lmRatio.model10.MC1V4, id.vars = "Level", measure.vars = method2)
SM.lmRatio.model10.MC1V5<-melt(SM.lmRatio.model10.MC1V5, id.vars = "Level", measure.vars = method2)
SV.lmRatio.model10.MC1V5<-melt(SV.lmRatio.model10.MC1V5, id.vars = "Level", measure.vars = method2)
# merge datasets
SM.lmRatio.model10.MCVLevels<-do.call("rbind", list(SM.lmRatio.model10.MC1V1,SM.lmRatio.model10.MC1V2,SM.lmRatio.model10.MC1V3,SM.lmRatio.model10.MC1V4,SM.lmRatio.model10.MC1V5))
SV.lmRatio.model10.MCVLevels<-do.call("rbind", list(SV.lmRatio.model10.MC1V1,SV.lmRatio.model10.MC1V2,SV.lmRatio.model10.MC1V3,SV.lmRatio.model10.MC1V4,SV.lmRatio.model10.MC1V5))
# summarize for plotting
SM.lmRatio.model10MCV.p<-data_summary(SM.lmRatio.model10.MCVLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10MCV.p<-data_summary(SV.lmRatio.model10.MCVLevels, varname="value", groupnames=c("Level", "variable"))
SM.lmRatio.model10MCV.p2<-data_summary2(SM.lmRatio.model10.MCVLevels, varname="value", groupnames=c("Level", "variable"))
SV.lmRatio.model10MCV.p2<-data_summary2(SV.lmRatio.model10.MCVLevels, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(SM.lmRatio.model10MCV.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Median lmRatio.model (Mean +/- sd)")+
  theme_classic()
ggplot(SV.lmRatio.model10MCV.p, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Variance lmRatio.model (Mean +/- sd)")+
  theme_classic()
ggplot(SM.lmRatio.model10MCV.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Median lmRatio.model (Median +/- min/max)")+
  theme_classic()
ggplot(SV.lmRatio.model10MCV.p2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(.01)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(.01))+
  geom_point(position=position_dodge(.01))+
  xlab("Sequencing Variance")+
  ylab("Variance lmRatio.model (Median +/- min/max)")+
  theme_classic()
Variance.lmModelM<-summary(aov(SM.lmRatio.model10.MCVLevels$value~SM.lmRatio.model10.MCVLevels$Level*SM.lmRatio.model10.MCVLevels$variable))
Variance.lmModelM
TukeyHSD(aov(SM.lmRatio.model10.MCVLevels$value~SM.lmRatio.model10.MCVLevels$Level*SM.lmRatio.model10.MCVLevels$variable))
Variance.lmModelV<-summary(aov(SV.lmRatio.model10.MCVLevels$value~SV.lmRatio.model10.MCVLevels$Level*SV.lmRatio.model10.MCVLevels$variable))
Variance.lmModelV
TukeyHSD(aov(SV.lmRatio.model10.MCVLevels$value~SV.lmRatio.model10.MCVLevels$Level*SV.lmRatio.model10.MCVLevels$variable))
Taxon correlation outputs:
# sequencing variants taxcor ####
TaxRatio.model10.C1V1<-getTaxCor.Tab(model10.C1V1,method2)
TaxRatio.model10.C1V2<-getTaxCor.Tab(model10.C1V2,method2)
TaxRatio.model10.C1V3<-getTaxCor.Tab(model10.C1V3,method2)
TaxRatio.model10.C1V4<-getTaxCor.Tab(model10.C1V4,method2)
TaxRatio.model10.C1V5<-getTaxCor.Tab(model10.C1V5,method2)
V.TaxRatio.model10.C1V1<-melt(TaxRatio.model10.C1V1$V.tax)
M.TaxRatio.model10.C1V1<-melt(TaxRatio.model10.C1V1$Median.tax)
V.TaxRatio.model10.C1V2<-melt(TaxRatio.model10.C1V2$V.tax)
M.TaxRatio.model10.C1V2<-melt(TaxRatio.model10.C1V2$Median.tax)
V.TaxRatio.model10.C1V3<-melt(TaxRatio.model10.C1V3$V.tax)
M.TaxRatio.model10.C1V3<-melt(TaxRatio.model10.C1V3$Median.tax)
V.TaxRatio.model10.C1V4<-melt(TaxRatio.model10.C1V4$V.tax)
M.TaxRatio.model10.C1V4<-melt(TaxRatio.model10.C1V4$Median.tax)
V.TaxRatio.model10.C1V5<-melt(TaxRatio.model10.C1V5$V.tax)
M.TaxRatio.model10.C1V5<-melt(TaxRatio.model10.C1V5$Median.tax)
V.TaxRatio.model10.C1V1<-as.data.frame(V.TaxRatio.model10.C1V1)
M.TaxRatio.model10.C1V1<-as.data.frame(M.TaxRatio.model10.C1V1)
V.TaxRatio.model10.C1V2<-as.data.frame(V.TaxRatio.model10.C1V2)
M.TaxRatio.model10.C1V2<-as.data.frame(M.TaxRatio.model10.C1V2)
V.TaxRatio.model10.C1V3<-as.data.frame(V.TaxRatio.model10.C1V3)
M.TaxRatio.model10.C1V3<-as.data.frame(M.TaxRatio.model10.C1V3)
V.TaxRatio.model10.C1V4<-as.data.frame(V.TaxRatio.model10.C1V4)
M.TaxRatio.model10.C1V4<-as.data.frame(M.TaxRatio.model10.C1V4)
V.TaxRatio.model10.C1V5<-as.data.frame(V.TaxRatio.model10.C1V5)
M.TaxRatio.model10.C1V5<-as.data.frame(M.TaxRatio.model10.C1V5)
# prepare data for merging datasets
V.TaxRatio.model10.C1V1$Level<-c(rep(0.1,nrow(V.TaxRatio.model10.C1V1)))
V.TaxRatio.model10.C1V2$Level<-c(rep(0.2,nrow(V.TaxRatio.model10.C1V2)))
V.TaxRatio.model10.C1V3$Level<-c(rep(0.5,nrow(V.TaxRatio.model10.C1V3)))
V.TaxRatio.model10.C1V4$Level<-c(rep(1,nrow(V.TaxRatio.model10.C1V4)))
V.TaxRatio.model10.C1V5$Level<-c(rep(2,nrow(V.TaxRatio.model10.C1V5)))
M.TaxRatio.model10.C1V1$Level<-c(rep(0.1,nrow(M.TaxRatio.model10.C1V1)))
M.TaxRatio.model10.C1V2$Level<-c(rep(0.2,nrow(M.TaxRatio.model10.C1V2)))
M.TaxRatio.model10.C1V3$Level<-c(rep(0.5,nrow(M.TaxRatio.model10.C1V3)))
M.TaxRatio.model10.C1V4$Level<-c(rep(1,nrow(M.TaxRatio.model10.C1V4)))
M.TaxRatio.model10.C1V5$Level<-c(rep(2,nrow(M.TaxRatio.model10.C1V5)))
# merge datasets
V.TaxRatio.model10VLevels<-do.call("rbind", list(V.TaxRatio.model10.C1V1, V.TaxRatio.model10.C1V2, V.TaxRatio.model10.C1V3, V.TaxRatio.model10.C1V4, V.TaxRatio.model10.C1V5))
M.TaxRatio.model10VLevels<-do.call("rbind", list(M.TaxRatio.model10.C1V1, M.TaxRatio.model10.C1V2, M.TaxRatio.model10.C1V3, M.TaxRatio.model10.C1V4, M.TaxRatio.model10.C1V5))
# summarize for plotting
V.TaxRatio.model10.SeqV.p<-data_summary(V.TaxRatio.model10VLevels, varname="value", groupnames=c("Level", "Var2"))
M.TaxRatio.model10.SeqV.p<-data_summary(M.TaxRatio.model10VLevels, varname="value", groupnames=c("Level", "Var2"))
V.TaxRatio.model10.SeqV.p2<-data_summary2(V.TaxRatio.model10VLevels, varname="value", groupnames=c("Level", "Var2"))
M.TaxRatio.model10.SeqV.p2<-data_summary2(M.TaxRatio.model10VLevels, varname="value", groupnames=c("Level", "Var2"))
# plot
ggplot(V.TaxRatio.model10.SeqV.p, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Sequencing Variance")+
  ylab("log10 Variance taxRatio (Mean +/- sd)")+
  theme_classic()
ggplot(M.TaxRatio.model10.SeqV.p, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Sequencing Variance")+
  ylab("Median taxRatio (Mean +/- sd)")+
  theme_classic()
ggplot(V.TaxRatio.model10.SeqV.p2, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.1)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Sequencing Variance")+
  ylab("log10 Variance taxRatio (Median +/- min/max)")+
  theme_classic()
ggplot(M.TaxRatio.model10.SeqV.p2, aes(x=Level, y=value, group = Var2, color=Var2))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.1)) +
  scale_color_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Sequencing Variance")+
  ylab("Median taxRatio (Median +/- min/max)")+
  theme_classic()
Variance.TaxV<-summary(aov(V.TaxRatio.model10VLevels$value~V.TaxRatio.model10VLevels$Level*V.TaxRatio.model10VLevels$Var2))
Variance.TaxV
TukeyHSD(aov(V.TaxRatio.model10VLevels$value~V.TaxRatio.model10VLevels$Level*V.TaxRatio.model10VLevels$Var2))
Variance.TaxM<-summary(aov(M.TaxRatio.model10VLevels$value~M.TaxRatio.model10VLevels$Level*M.TaxRatio.model10VLevels$Var2))
Variance.TaxM
TukeyHSD(aov(M.TaxRatio.model10VLevels$value~M.TaxRatio.model10VLevels$Level*M.TaxRatio.model10VLevels$Var2))
PERMANOVA outputs:
# sequencing variants PERMANOVA ####
model10C1V1.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V1, method2, "CategoryRratio"))
model10C1V1.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V1, method2, "F1Rratio"))
model10C1V1.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V1, method2, "F2Rratio"))
model10C1V1.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V1, method2, "F3Rratio"))
model10C1V1.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V1, method2, "F4Rratio"))
model10C1V1.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V1, method2, "F5Rratio"))
model10C1V2.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V2, method2, "CategoryRratio"))
model10C1V2.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V2, method2, "F1Rratio"))
model10C1V2.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V2, method2, "F2Rratio"))
model10C1V2.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V2, method2, "F3Rratio"))
model10C1V2.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V2, method2, "F4Rratio"))
model10C1V2.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V2, method2, "F5Rratio"))
model10C1V3.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V3, method2, "CategoryRratio"))
model10C1V3.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V3, method2, "F1Rratio"))
model10C1V3.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V3, method2, "F2Rratio"))
model10C1V3.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V3, method2, "F3Rratio"))
model10C1V3.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V3, method2, "F4Rratio"))
model10C1V3.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V3, method2, "F5Rratio"))
model10C1V4.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V4, method2, "CategoryRratio"))
model10C1V4.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V4, method2, "F1Rratio"))
model10C1V4.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V4, method2, "F2Rratio"))
model10C1V4.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V4, method2, "F3Rratio"))
model10C1V4.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V4, method2, "F4Rratio"))
model10C1V4.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V4, method2, "F5Rratio"))
model10C1V5.Permanova.model<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V5, method2, "CategoryRratio"))
model10C1V5.Permanova.env1<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V5, method2, "F1Rratio"))
model10C1V5.Permanova.env2<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V5, method2, "F2Rratio"))
model10C1V5.Permanova.env3<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V5, method2, "F3Rratio"))
model10C1V5.Permanova.env4<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V5, method2, "F4Rratio"))
model10C1V5.Permanova.env5<-as.data.frame(Summarize.PERMANOVA.Rratio(model10.C1V5, method2, "F5Rratio"))
# prepare data for merging datasets
model10C1V1.Permanova.model$Level<-c(rep(0.1,10))
model10C1V1.Permanova.env1$Level<-c(rep(0.1,10))
model10C1V1.Permanova.env2$Level<-c(rep(0.1,10))
model10C1V1.Permanova.env3$Level<-c(rep(0.1,10))
model10C1V1.Permanova.env4$Level<-c(rep(0.1,10))
model10C1V1.Permanova.env5$Level<-c(rep(0.1,10))
model10C1V2.Permanova.model$Level<-c(rep(0.2,10))
model10C1V2.Permanova.env1$Level<-c(rep(0.2,10))
model10C1V2.Permanova.env2$Level<-c(rep(0.2,10))
model10C1V2.Permanova.env3$Level<-c(rep(0.2,10))
model10C1V2.Permanova.env4$Level<-c(rep(0.2,10))
model10C1V2.Permanova.env5$Level<-c(rep(0.2,10))
model10C1V3.Permanova.model$Level<-c(rep(0.5,10))
model10C1V3.Permanova.env1$Level<-c(rep(0.5,10))
model10C1V3.Permanova.env2$Level<-c(rep(0.5,10))
model10C1V3.Permanova.env3$Level<-c(rep(0.5,10))
model10C1V3.Permanova.env4$Level<-c(rep(0.5,10))
model10C1V3.Permanova.env5$Level<-c(rep(0.5,10))
model10C1V4.Permanova.model$Level<-c(rep(1,10))
model10C1V4.Permanova.env1$Level<-c(rep(1,10))
model10C1V4.Permanova.env2$Level<-c(rep(1,10))
model10C1V4.Permanova.env3$Level<-c(rep(1,10))
model10C1V4.Permanova.env4$Level<-c(rep(1,10))
model10C1V4.Permanova.env5$Level<-c(rep(1,10))
model10C1V5.Permanova.model$Level<-c(rep(2,10))
model10C1V5.Permanova.env1$Level<-c(rep(2,10))
model10C1V5.Permanova.env2$Level<-c(rep(2,10))
model10C1V5.Permanova.env3$Level<-c(rep(2,10))
model10C1V5.Permanova.env4$Level<-c(rep(2,10))
model10C1V5.Permanova.env5$Level<-c(rep(2,10))
# merge datasets
permanova.model10CVLevels.model<-do.call("rbind", list(model10C1V1.Permanova.model, model10C1V2.Permanova.model, model10C1V3.Permanova.model, model10C1V4.Permanova.model, model10C1V5.Permanova.model))
permanova.model10CVLevels.env1<-do.call("rbind", list(model10C1V1.Permanova.env1, model10C1V2.Permanova.env1, model10C1V3.Permanova.env1, model10C1V4.Permanova.env1, model10C1V5.Permanova.env1))
permanova.model10CVLevels.env2<-do.call("rbind", list(model10C1V1.Permanova.env2, model10C1V2.Permanova.env2, model10C1V3.Permanova.env2, model10C1V4.Permanova.env2, model10C1V5.Permanova.env2))
permanova.model10CVLevels.env3<-do.call("rbind", list(model10C1V1.Permanova.env3, model10C1V2.Permanova.env3, model10C1V3.Permanova.env3, model10C1V4.Permanova.env3, model10C1V5.Permanova.env3))
permanova.model10CVLevels.env4<-do.call("rbind", list(model10C1V1.Permanova.env4, model10C1V2.Permanova.env4, model10C1V3.Permanova.env4, model10C1V4.Permanova.env4, model10C1V5.Permanova.env4))
permanova.model10CVLevels.env5<-do.call("rbind", list(model10C1V1.Permanova.env5, model10C1V2.Permanova.env5, model10C1V3.Permanova.env5, model10C1V4.Permanova.env5, model10C1V5.Permanova.env5))
permanova.model10CVLevels.model<-melt(permanova.model10CVLevels.model, id.vars = "Level", measure.vars = method2)
permanova.model10CVLevels.env1<-melt(permanova.model10CVLevels.env1, id.vars = "Level", measure.vars = method2)
permanova.model10CVLevels.env2<-melt(permanova.model10CVLevels.env2, id.vars = "Level", measure.vars = method2)
permanova.model10CVLevels.env3<-melt(permanova.model10CVLevels.env3, id.vars = "Level", measure.vars = method2)
permanova.model10CVLevels.env4<-melt(permanova.model10CVLevels.env4, id.vars = "Level", measure.vars = method2)
permanova.model10CVLevels.env5<-melt(permanova.model10CVLevels.env5, id.vars = "Level", measure.vars = method2)
# summarize for plotting
permanova.model10CVLevels.modelP<-data_summary(permanova.model10CVLevels.model, varname="value", groupnames=c("Level", "variable"))
permanova.model10CVLevels.env1P<-data_summary(permanova.model10CVLevels.env1, varname="value", groupnames=c("Level", "variable"))
permanova.model10CVLevels.env2P<-data_summary(permanova.model10CVLevels.env2, varname="value", groupnames=c("Level", "variable"))
permanova.model10CVLevels.env3P<-data_summary(permanova.model10CVLevels.env3, varname="value", groupnames=c("Level", "variable"))
permanova.model10CVLevels.env4P<-data_summary(permanova.model10CVLevels.env4, varname="value", groupnames=c("Level", "variable"))
permanova.model10CVLevels.env5P<-data_summary(permanova.model10CVLevels.env5, varname="value", groupnames=c("Level", "variable"))
# plot
ggplot(permanova.model10CVLevels.modelP, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.01)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.01))+
  geom_point(position=position_dodge(0.01))+
  xlab("Relative Variance (variance / mean)")+
  ylab("Permanova modelRatio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10CVLevels.env1P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Relative Variance (variance / mean)")+
  ylab("Permanova env1Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10CVLevels.env2P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Relative Variance (variance / mean)")+
  ylab("Permanova env2Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10CVLevels.env3P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Relative Variance (variance / mean)")+
  ylab("Permanova env3Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10CVLevels.env4P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Relative Variance (variance / mean)")+
  ylab("Permanova env4Ratio (Mean +/- sd)")+
  theme_classic()
ggplot(permanova.model10CVLevels.env5P, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(0.1)) +
  scale_colour_viridis_d(direction=-1)+
  geom_line(position=position_dodge(0.1))+
  geom_point(position=position_dodge(0.1))+
  xlab("Relative Variance (variance / mean)")+
  ylab("Permanova env5Ratio (Mean +/- sd)")+
  theme_classic()
Variance.PM<-summary(aov(permanova.model10CVLevels.model$value~permanova.model10CVLevels.model$Level*permanova.model10CVLevels.model$variable))
Variance.PM
TukeyHSD(aov(permanova.model10CVLevels.model$value~permanova.model10CVLevels.model$Level*permanova.model10CVLevels.model$variable))
Variance.PF1<-summary(aov(permanova.model10CVLevels.env1$value~permanova.model10CVLevels.env1$Level*permanova.model10CVLevels.env1$variable))
Variance.PF1
TukeyHSD(aov(permanova.model10CVLevels.env1$value~permanova.model10CVLevels.env1$Level*permanova.model10CVLevels.env1$variable))
Variance.PF2<-summary(aov(permanova.model10CVLevels.env2$value~permanova.model10CVLevels.env2$Level*permanova.model10CVLevels.env2$variable))
Variance.PF2
TukeyHSD(aov(permanova.model10CVLevels.env2$value~permanova.model10CVLevels.env2$Level*permanova.model10CVLevels.env2$variable))
Variance.PF3<-summary(aov(permanova.model10CVLevels.env3$value~permanova.model10CVLevels.env3$Level*permanova.model10CVLevels.env3$variable))
Variance.PF3
TukeyHSD(aov(permanova.model10CVLevels.env3$value~permanova.model10CVLevels.env3$Level*permanova.model10CVLevels.env3$variable))
Variance.PF4<-summary(aov(permanova.model10CVLevels.env4$value~permanova.model10CVLevels.env4$Level*permanova.model10CVLevels.env4$variable))
Variance.PF4
TukeyHSD(aov(permanova.model10CVLevels.env4$value~permanova.model10CVLevels.env4$Level*permanova.model10CVLevels.env4$variable))
Variance.PF5<-summary(aov(permanova.model10CVLevels.env5$value~permanova.model10CVLevels.env5$Level*permanova.model10CVLevels.env5$variable))
Variance.PF5
TukeyHSD(aov(permanova.model10CVLevels.env5$value~permanova.model10CVLevels.env5$Level*permanova.model10CVLevels.env5$variable))
leveneTest(permanova.model10CVLevels.model$value~permanova.model10CVLevels.model$variable)
leveneTest(permanova.model10CVLevels.env1$value~permanova.model10CVLevels.env1$variable)
leveneTest(permanova.model10CVLevels.env2$value~permanova.model10CVLevels.env2$variable)
leveneTest(permanova.model10CVLevels.env3$value~permanova.model10CVLevels.env3$variable)
leveneTest(permanova.model10CVLevels.env4$value~permanova.model10CVLevels.env4$variable)
leveneTest(permanova.model10CVLevels.env5$value~permanova.model10CVLevels.env5$variable)
Conclusions:
1) Performance of all methods is highest at low sparsity, high sampling depth, and low sampling variance.
2) Model.Microbiome provides an empirical framework for optimizing sequence count normalization methods before making ecological inferences; in our example, the QSeq0.5 parameterization consistently underperformed relative to the other parameterizations.
Model.Microbiome constructs networks and network plots using two heuristics for setting the network threshold, and it includes helper functions for accessing the calculated network statistics and the generated plots. These are illustrated in the example below. Whenever we build a network, we must choose a threshold that determines which relationships are strong enough to include. The first method uses a static threshold, set at a correlation coefficient of 0.8; the second is a dynamic heuristic that identifies the maximum threshold value yielding a network with 250 edges (a sketch of this heuristic follows the code below). We include the second method because the network statistics of two graphs with different numbers of edges are, in principle, not comparable. In practice, however, we have found that while the metric values may change, the overall interpretation of each plot does not. We also find that, much like other measures of beta diversity (here interpreted as changing species associations across sites), statistics derived from the network are sensitive to the network construction method but robust to the normalization method applied to the community. Let's take a look at an example using the community structure examples from earlier in the tutorial:
method3<-c("model", "raw", "QSeq0.5", "QSeq10", "QSeq100")
plotstats.A<-getNetStats(model10.A, method3) # get statistics
plotstats.A$D.module.table # modularity value table for dynamic threshold
plotstats.A$S.module.table # modularity value table for static threshold
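For intuition, here is a minimal sketch of the dynamic-threshold heuristic, assuming a samples-by-taxa abundance matrix comm. The function name, the use of Pearson correlation, and the Louvain clustering step are illustrative assumptions, not the package's internal implementation; getNetStats() handles all of this for you.

# Minimal sketch of the dynamic-threshold heuristic described above.
# Assumptions: `comm` is a samples-x-taxa abundance matrix; Pearson
# correlation and Louvain clustering are illustrative choices, not
# necessarily what Model.Microbiome uses internally.
library(igraph)
dynamicThresholdNet <- function(comm, target.edges = 250) {
  cors <- cor(comm)                          # taxon-taxon correlation matrix
  diag(cors) <- 0                            # ignore self-correlations
  vals <- sort(abs(cors[upper.tri(cors)]), decreasing = TRUE)
  thr  <- vals[target.edges]                 # max threshold that keeps ~250 edges
  adj  <- (abs(cors) >= thr) * 1             # binary adjacency matrix
  g    <- graph_from_adjacency_matrix(adj, mode = "undirected", diag = FALSE)
  list(graph = g, threshold = thr,
       modularity = modularity(cluster_louvain(g)))
}

Holding the edge count fixed is what makes modularity values comparable across normalization methods, since modularity depends on graph density.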
A close look at the module tables suggests that the characteristics of the un-normalized dataset are driving the network statistics of the normalized sets. For example, if we look at rep1, the values for QSeq0.5 and QSeq10 are both close or identical to the values for the raw dataset. This means that this normalization technique is not necessarily improving the accuracy of the network statistics. The pattern does not depend on whether the network is constructed using a dynamic or a static threshold. We can look at the threshold cutoffs for the dynamic network construction to see if there are any patterns:
plotstats.A$D.Threshold.table # table of threshold values for dynamic threshold
As before, the QSeq methods do not provide an improvement over the raw dataset. We can use the same approach to look at differences in internal structure; the pattern is the same:
plotstats.B<-getNetStats(model10.B, method3) # get statistics
plotstats.B$D.module.table # modularity value table for dynamic threshold
plotstats.B$D.Threshold.table # table of threshold values for dynamic threshold
plotstats.B$S.module.table # modularity value table for static threshold
plotstats.B$S.Threshold.table # table of threshold values for static threshold (0.8)
plotstats.C<-getNetStats(model10.C, method3) # get statistics
plotstats.C$D.module.table # modularity value table for dynamic threshold
plotstats.C$D.Threshold.table # table of threshold values for dynamic threshold
plotstats.C$S.module.table # modularity value table for static threshold
plotstats.C$S.Threshold.table # table of threshold values for static threshold (0.8)
plotstats.D<-getNetStats(model10.D, method3) # get statistics
plotstats.D$D.module.table # modularity value table for dynamic threshold
plotstats.D$D.Threshold.table # table of threshold values for dynamic threshold
plotstats.D$S.module.table # modularity value table for static threshold
plotstats.D$S.Threshold.table # table of threshold values for static threshold (0.8)
What this tells us is that this normalization technique does not have a large influence on the overall modularity statistic of the network; it does not improve our ability to identify clusters of associating or dissociating taxa. These metrics may still be useful for identifying normalization methods that significantly improve, or significantly degrade, the accuracy with which the pre-sampling network structure is recovered.
We can also visualize the plots:
getNetPlot(model10.A[1],method3)
Below are some figures generated from these data for the paper. In each experiment, LII is plotted across the experimental gradient for a representative subset of the normalization methods:
ggplot(model10Levels.summary1[model10Levels.summary1$variable == "raw" |
                              model10Levels.summary1$variable == "QSeq0.5" |
                              model10Levels.summary1$variable == "QSeq2" |
                              model10Levels.summary1$variable == "QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.2, position=position_dodge(.3), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.3), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.3), size=1)+
  xlab("Degree of Structure")+
  ylab("LII Value (Mean +/- standard dev)")+
  theme_classic()
#model10.SeqLevels.summary2
ggplot(model10.SeqLevels.summary[model10.SeqLevels.summary$variable == "raw" |
                                 model10.SeqLevels.summary$variable == "QSeq0.5" |
                                 model10.SeqLevels.summary$variable == "QSeq2" |
                                 model10.SeqLevels.summary$variable == "QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=100, position=position_dodge(70), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(70), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(70), size=1)+
  xlab("Sampling Depth")+
  ylab("LII Value (Mean +/- standard dev)")+
  theme_classic()
ggplot(model10.SeqVarLevels.summary[model10.SeqVarLevels.summary$variable == "raw" |
                                    model10.SeqVarLevels.summary$variable == "QSeq0.5" |
                                    model10.SeqVarLevels.summary$variable == "QSeq2" |
                                    model10.SeqVarLevels.summary$variable == "QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("LII Value (Mean +/- standard dev)")+
  theme_classic()
par(mfrow=c(2,2))
barplot(sort(taxa_sums(model10.A$rep1$model$comm)/1000000, TRUE), las = 2,
        main ="Rank Abundance Lowest sparsity", ylab="Million Counts")
barplot(sort(taxa_sums(model10.B$rep1$model$comm)/1000000, TRUE), las = 2,
        main ="Rank Abundance Low sparsity", ylab="Million Counts")
barplot(sort(taxa_sums(model10.C$rep1$model$comm)/1000000, TRUE), las = 2,
        main ="Rank Abundance High sparsity", ylab="Million Counts")
barplot(sort(taxa_sums(model10.D$rep1$model$comm)/1000000, TRUE), las = 2,
        main ="Rank Abundance Highest sparsity", ylab="Million Counts")
#permanova.model10CVLevels.modelP
ggplot(permanova.model10CVLevels.modelP[permanova.model10CVLevels.modelP$variable=="raw" |
                                        permanova.model10CVLevels.modelP$variable=="QSeq0.5" |
                                        permanova.model10CVLevels.modelP$variable=="QSeq2" |
                                        permanova.model10CVLevels.modelP$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("PERMANOVA R-Squared Value (Mean +/- standard dev)")+
  ggtitle("Experimental Model")+
  theme_classic()
#permanova.model10CVLevels.env1P
ggplot(permanova.model10CVLevels.env1P[permanova.model10CVLevels.env1P$variable=="raw" |
                                       permanova.model10CVLevels.env1P$variable=="QSeq0.5" |
                                       permanova.model10CVLevels.env1P$variable=="QSeq2" |
                                       permanova.model10CVLevels.env1P$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("PERMANOVA R-Squared Value (Mean +/- standard dev)")+
  ggtitle("Environmental Factor 1")+
  theme_classic()
#permanova.model10CVLevels.env2P
ggplot(permanova.model10CVLevels.env2P[permanova.model10CVLevels.env2P$variable=="raw" |
                                       permanova.model10CVLevels.env2P$variable=="QSeq0.5" |
                                       permanova.model10CVLevels.env2P$variable=="QSeq2" |
                                       permanova.model10CVLevels.env2P$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("PERMANOVA R-Squared Value (Mean +/- standard dev)")+
  ggtitle("Environmental Factor 2")+
  theme_classic()
#permanova.model10CVLevels.env3P
ggplot(permanova.model10CVLevels.env3P[permanova.model10CVLevels.env3P$variable=="raw" |
                                       permanova.model10CVLevels.env3P$variable=="QSeq0.5" |
                                       permanova.model10CVLevels.env3P$variable=="QSeq2" |
                                       permanova.model10CVLevels.env3P$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("PERMANOVA R-Squared Value (Mean +/- standard dev)")+
  ggtitle("Environmental Factor 3")+
  theme_classic()
#permanova.model10CVLevels.env4P
ggplot(permanova.model10CVLevels.env4P[permanova.model10CVLevels.env4P$variable=="raw" |
                                       permanova.model10CVLevels.env4P$variable=="QSeq0.5" |
                                       permanova.model10CVLevels.env4P$variable=="QSeq2" |
                                       permanova.model10CVLevels.env4P$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("PERMANOVA R-Squared Value (Mean +/- standard dev)")+
  ggtitle("Environmental Factor 4")+
  theme_classic()
#permanova.model10CVLevels.env5P
ggplot(permanova.model10CVLevels.env5P[permanova.model10CVLevels.env5P$variable=="raw" |
                                       permanova.model10CVLevels.env5P$variable=="QSeq0.5" |
                                       permanova.model10CVLevels.env5P$variable=="QSeq2" |
                                       permanova.model10CVLevels.env5P$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Variance (variance / mean)")+
  ylab("PERMANOVA R-Squared Value (Mean +/- standard dev)")+
  ggtitle("Environmental Factor 5")+
  theme_classic()
#SM.lmRatio.model10CV.p
ggplot(SM.lmRatio.model10CV.p[SM.lmRatio.model10CV.p$variable=="raw" |
                              SM.lmRatio.model10CV.p$variable=="QSeq0.5" |
                              SM.lmRatio.model10CV.p$variable=="QSeq2" |
                              SM.lmRatio.model10CV.p$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Sampling Variance (variance / mean)")+
  ylab("Linear Model Mean of Variance R-Ratio (Mean +/- standard dev)")+
  ggtitle("Variance of Linear Model R-Ratio")+
  theme_classic()
SV.lmRatio.model10MCV.p
#SV.lmRatio.model10MCV.p
ggplot(SV.lmRatio.model10MCV.p[SV.lmRatio.model10MCV.p$variable=="raw" |
                               SV.lmRatio.model10MCV.p$variable=="QSeq0.5" |
                               SV.lmRatio.model10MCV.p$variable=="QSeq2" |
                               SV.lmRatio.model10MCV.p$variable=="QSeq100",],
       aes(x=Level, y=value, group = variable, color=variable, size=1))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(.08), size=1) +
  #scale_colour_viridis_d(direction=-1)+
  scale_colour_grey()+
  geom_line(position=position_dodge(.08), size=1, aes(linetype=variable))+
  geom_point(position=position_dodge(.08), size=1)+
  xlab("Relative Sampling Variance (variance / mean)")+
  ylab("Linear Model Variance of Variance R-Ratio (Mean +/- standard dev)")+
  ggtitle("Variance of Linear Model R-Ratio")+
  theme_classic()
sum.tab<-data.frame("LII"=c(Sparsity.LII[[1]]$`Pr(>F)`[2],Depth.LII[[1]]$`Pr(>F)`[2],Variance.LII[[1]]$`Pr(>F)`[2]), "P Model"=c(Sparsity.PM[[1]]$`Pr(>F)`[2], Depth.PM[[1]]$`Pr(>F)`[2], Variance.PM[[1]]$`Pr(>F)`[2]),"P Env 1"=c(Sparsity.PF1[[1]]$`Pr(>F)`[2], Depth.PF1[[1]]$`Pr(>F)`[2], Variance.PF1[[1]]$`Pr(>F)`[2]),"P Env 2"=c(Sparsity.PF2[[1]]$`Pr(>F)`[2],Depth.PF2[[1]]$`Pr(>F)`[2], Variance.PF2[[1]]$`Pr(>F)`[2]),"P Env 3"=c(Sparsity.PF3[[1]]$`Pr(>F)`[2], Depth.PF3[[1]]$`Pr(>F)`[2], Variance.PF3[[1]]$`Pr(>F)`[2]),"P Env 4"=c(Sparsity.PF4[[1]]$`Pr(>F)`[2], Depth.PF4[[1]]$`Pr(>F)`[2], Variance.PF4[[1]]$`Pr(>F)`[2]),"P Env 5"=c(Sparsity.PF5[[1]]$`Pr(>F)`[2], Depth.PF5[[1]]$`Pr(>F)`[2], Variance.PF5[[1]]$`Pr(>F)`[2]), "lm Model.M"=c(Sparsity.lmModelM[[1]]$`Pr(>F)`[2], Depth.lmModelM[[1]]$`Pr(>F)`[2], Variance.lmModelM[[1]]$`Pr(>F)`[2]), "lm Model.V"=c(Sparsity.lmModelV[[1]]$`Pr(>F)`[2], Depth.lmModelV[[1]]$`Pr(>F)`[2], Variance.lmModelV[[1]]$`Pr(>F)`[2]), "lm Env.M"=c(Sparsity.lmEnvM[[1]]$`Pr(>F)`[2], Depth.lmEnvM[[1]]$`Pr(>F)`[2], Variance.lmEnvM[[1]]$`Pr(>F)`[2]), "lm Env.V"=c(Sparsity.lmEnvV[[1]]$`Pr(>F)`[2], Depth.lmEnvV[[1]]$`Pr(>F)`[2], Variance.lmEnvV[[1]]$`Pr(>F)`[2]), "Tax.M"=c(Sparsity.TaxM[[1]]$`Pr(>F)`[2], Depth.TaxM[[1]]$`Pr(>F)`[2], Variance.TaxM[[1]]$`Pr(>F)`[2]), "Tax.V"=c(Sparsity.TaxV[[1]]$`Pr(>F)`[2], Depth.TaxV[[1]]$`Pr(>F)`[2], Variance.TaxV[[1]]$`Pr(>F)`[2])) rownames(sum.tab)<-c("Sparsity", "Depth", "Variance") sum.tab<-data.frame(t(round(sum.tab, digits=5))) sum.tab %>% mutate((.), Sparsity=cell_spec(Sparsity, "html", color = ifelse(Sparsity > 0.05, "black", "red")), Depth=cell_spec(Depth, "html", color = ifelse(Depth > 0.05, "black", "red")), Variance=cell_spec(Variance, "html", color = ifelse(Variance > 0.05, "black", "red")), ) %>% kable(format = "html", escape = F, align = "c",caption = "Main effect of normalization method") %>% kable_styling("hover", full_width = F) %>% pack_rows("Loss of Information Index", 1, 1) %>% pack_rows("PERMANOVA", 2, 7) %>% pack_rows("Linear Model", 8, 11) %>% pack_rows("Taxa", 12, 13)
sum.tab<-data.frame("LII"=c(Sparsity.LII[[1]]$`Pr(>F)`[1],Depth.LII[[1]]$`Pr(>F)`[1],Variance.LII[[1]]$`Pr(>F)`[1]), "P Model"=c(Sparsity.PM[[1]]$`Pr(>F)`[1], Depth.PM[[1]]$`Pr(>F)`[1], Variance.PM[[1]]$`Pr(>F)`[1]),"P Env 1"=c(Sparsity.PF1[[1]]$`Pr(>F)`[1], Depth.PF1[[1]]$`Pr(>F)`[1], Variance.PF1[[1]]$`Pr(>F)`[1]),"P Env 2"=c(Sparsity.PF1[[1]]$`Pr(>F)`[1],Depth.PF1[[1]]$`Pr(>F)`[1], Variance.PF1[[1]]$`Pr(>F)`[1]),"P Env 3"=c(Sparsity.PF3[[1]]$`Pr(>F)`[1], Depth.PF3[[1]]$`Pr(>F)`[1], Variance.PF3[[1]]$`Pr(>F)`[1]),"P Env 4"=c(Sparsity.PF4[[1]]$`Pr(>F)`[1], Depth.PF4[[1]]$`Pr(>F)`[1], Variance.PF4[[1]]$`Pr(>F)`[1]),"P Env 5"=c(Sparsity.PF5[[1]]$`Pr(>F)`[1], Depth.PF5[[1]]$`Pr(>F)`[1], Variance.PF5[[1]]$`Pr(>F)`[1]), "lm Model.M"=c(Sparsity.lmModelM[[1]]$`Pr(>F)`[1], Depth.lmModelM[[1]]$`Pr(>F)`[1], Variance.lmModelM[[1]]$`Pr(>F)`[1]), "lm Model.V"=c(Sparsity.lmModelV[[1]]$`Pr(>F)`[1], Depth.lmModelV[[1]]$`Pr(>F)`[1], Variance.lmModelV[[1]]$`Pr(>F)`[1]), "lm Env.M"=c(Sparsity.lmEnvM[[1]]$`Pr(>F)`[1], Depth.lmEnvM[[1]]$`Pr(>F)`[1], Variance.lmEnvM[[1]]$`Pr(>F)`[1]), "lm Env.V"=c(Sparsity.lmEnvV[[1]]$`Pr(>F)`[1], Depth.lmEnvV[[1]]$`Pr(>F)`[1], Variance.lmEnvV[[1]]$`Pr(>F)`[1]), "Tax.M"=c(Sparsity.TaxM[[1]]$`Pr(>F)`[1], Depth.TaxM[[1]]$`Pr(>F)`[1], Variance.TaxM[[1]]$`Pr(>F)`[1]), "Tax.V"=c(Sparsity.TaxV[[1]]$`Pr(>F)`[1], Depth.TaxV[[1]]$`Pr(>F)`[1], Variance.TaxV[[1]]$`Pr(>F)`[1])) rownames(sum.tab)<-c("Sparsity", "Depth", "Variance") sum.tab<-data.frame(t(round(sum.tab, digits=5))) sum.tab %>% mutate((.), Sparsity=cell_spec(Sparsity, "html", color = ifelse(Sparsity > 0.05, "black", "red")), Depth=cell_spec(Depth, "html", color = ifelse(Depth > 0.05, "black", "red")), Variance=cell_spec(Variance, "html", color = ifelse(Variance > 0.05, "black", "red")), ) %>% kable(format = "html", escape = F, align = "c",caption = "Main effect of experimental condition (gradient)") %>% kable_styling("hover", full_width = F) %>% pack_rows("Loss of Information Index", 1, 1) %>% pack_rows("PERMANOVA", 2, 7) %>% pack_rows("Linear Model", 8, 11) %>% pack_rows("Taxa", 12, 13)
sum.tab<-data.frame("LII"=c(Sparsity.LII[[1]]$`Pr(>F)`[3],Depth.LII[[1]]$`Pr(>F)`[3],Variance.LII[[1]]$`Pr(>F)`[3]), "P Model"=c(Sparsity.PM[[1]]$`Pr(>F)`[3], Depth.PM[[1]]$`Pr(>F)`[3], Variance.PM[[1]]$`Pr(>F)`[3]),"P Env 1"=c(Sparsity.PF1[[1]]$`Pr(>F)`[3], Depth.PF1[[1]]$`Pr(>F)`[3], Variance.PF1[[1]]$`Pr(>F)`[3]),"P Env 2"=c(Sparsity.PF3[[1]]$`Pr(>F)`[3],Depth.PF3[[1]]$`Pr(>F)`[3], Variance.PF3[[1]]$`Pr(>F)`[3]),"P Env 3"=c(Sparsity.PF3[[1]]$`Pr(>F)`[3], Depth.PF3[[1]]$`Pr(>F)`[3], Variance.PF3[[1]]$`Pr(>F)`[3]),"P Env 4"=c(Sparsity.PF4[[1]]$`Pr(>F)`[3], Depth.PF4[[1]]$`Pr(>F)`[3], Variance.PF4[[1]]$`Pr(>F)`[3]),"P Env 5"=c(Sparsity.PF5[[1]]$`Pr(>F)`[3], Depth.PF5[[1]]$`Pr(>F)`[3], Variance.PF5[[1]]$`Pr(>F)`[3]), "lm Model.M"=c(Sparsity.lmModelM[[1]]$`Pr(>F)`[3], Depth.lmModelM[[1]]$`Pr(>F)`[3], Variance.lmModelM[[1]]$`Pr(>F)`[3]), "lm Model.V"=c(Sparsity.lmModelV[[1]]$`Pr(>F)`[3], Depth.lmModelV[[1]]$`Pr(>F)`[3], Variance.lmModelV[[1]]$`Pr(>F)`[3]), "lm Env.M"=c(Sparsity.lmEnvM[[1]]$`Pr(>F)`[3], Depth.lmEnvM[[1]]$`Pr(>F)`[3], Variance.lmEnvM[[1]]$`Pr(>F)`[3]), "lm Env.V"=c(Sparsity.lmEnvV[[1]]$`Pr(>F)`[3], Depth.lmEnvV[[1]]$`Pr(>F)`[3], Variance.lmEnvV[[1]]$`Pr(>F)`[3]), "Tax.M"=c(Sparsity.TaxM[[1]]$`Pr(>F)`[3], Depth.TaxM[[1]]$`Pr(>F)`[3], Variance.TaxM[[1]]$`Pr(>F)`[3]), "Tax.V"=c(Sparsity.TaxV[[1]]$`Pr(>F)`[3], Depth.TaxV[[1]]$`Pr(>F)`[3], Variance.TaxV[[1]]$`Pr(>F)`[3])) rownames(sum.tab)<-c("Sparsity", "Depth", "Variance") sum.tab<-data.frame(t(round(sum.tab, digits=5))) sum.tab %>% mutate((.), Sparsity=cell_spec(Sparsity, "html", color = ifelse(Sparsity > 0.05, "black", "red")), Depth=cell_spec(Depth, "html", color = ifelse(Depth > 0.05, "black", "red")), Variance=cell_spec(Variance, "html", color = ifelse(Variance > 0.05, "black", "red")), ) %>% kable(format = "html", escape = F, align = "c",caption = "Interaction between test condition and normalization method") %>% kable_styling("hover", full_width = F) %>% pack_rows("Loss of Information Index", 1, 1) %>% pack_rows("PERMANOVA", 2, 7) %>% pack_rows("Linear Model", 8, 11) %>% pack_rows("Taxa", 12, 13)
We have found that the reference community sometimes shows substantive differences between networks built with the static and dynamic methods, but that in subsampled data these differences are reduced. There is considerable variation in outcomes, which raises the question of what characteristics of reference datasets lead to different outcomes under the two network-building methods. This is not a question we will try to answer here.
A couple of post-script comments about interpreting metabarcoding output data:
It's important to be aware of which metrics are robust to changes in methodology and which are not. In general, beta diversity is robust: the relationships of the dominant taxa carry more weight in most beta diversity metrics, so they tend to be captured and replicated easily across analyses. Alpha diversity is not like this. We have chosen not to address it in this software because, in practice, alpha diversity counts depend heavily on the type of sequence QC that is employed, and we do not model sequence quality or attempt to address sequence QC. In general, we treat analyses that hinge on alpha diversity in microbiome studies as highly suspect, because endogenous factors, like the presence of disease, will skew the taxon abundance distribution and affect the accuracy and comparability of an alpha diversity estimate. Alpha diversity estimates typically rely on the assumption that either the sampling effort was the same across samples, or that the taxon abundance distribution of one sample can be inferred from other samples. In practice, neither assumption is essentially ever met. I'll explain:
Sampling effort is not the same as sequencing depth. The principle of sampling effort is rooted in physical effort: it standardizes counts by the area or distance covered, which yields results that accurately represent changes in the density of sampled species across sites. It is well understood that sequence data do not do this. However, there is a widely held misconception that taking the same number of sequences from every sample in a dataset (i.e., even rarefaction) standardizes sampling effort. It does not. It is the equivalent of surveying a rainforest and a temperate deciduous forest by identifying the first 100 trees we encounter in each. This approach tells us nothing about the density of the respective trees; it only gives us information about the relative position of each tree within the species abundance distribution of each forest. To standardize sampling effort, we need an index of species abundance that actually embodies information about the abundance of the species. For example, if in this survey we also record how far we travel to encounter 100 trees, and finish by dividing the tree counts by the total distance surveyed, then we would have an output value that tells us something about the abundance of each species independent of the abundance of the others. This is the key purpose of normalizing for sampling effort. Because even rarefaction does not produce a spatially explicit estimate of taxon abundance, it is not, in principle, a normalization of sampling effort, and approaches that do not infer a taxon density are difficult to use reliably for estimating alpha diversity.
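To make the transect arithmetic concrete, here is a toy calculation; the counts and distances are invented purely for illustration:

# Hypothetical survey: the "first 100 trees" rule applied to two forests
# of very different density (distances are invented for illustration).
counts      <- c(rainforest = 100, deciduous = 100)
distance.km <- c(rainforest = 0.1, deciduous = 10)
counts / distance.km # trees per km: 1000 vs 10
# Even rarefaction keeps only `counts`, which are identical here, so the
# hundredfold difference in density is invisible to downstream analyses.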
The taxon abundance distribution is often inferred across samples. There are instances where this may be appropriate, but in most environmental microbiome studies it is not. Endogenous factors, like the presence of a pathogen, can produce a large change in the species abundance distribution at a single site. It may be possible, given ecological knowledge of specific community members, to justify such an assumption based on their known behavior and interactions. But no current software accounts for the ecological roles of microbes when inferring the species abundance curve of undersampled experimental units. The current approaches to inferring species abundance curves are therefore inherently unreliable, and blind to the role that species interactions play in structuring the species abundance curve.
This is all to say: alpha diversity metrics in microbiome metabarcoding studies should be taken with a hefty grain of salt, and the analysis of metabarcoding studies should probably not hinge on the alpha diversity estimates.