In Djeppschmidt/Model.Microbiome: Environment And Community Simulation Package

knitr::opts_chunk$set(echo = TRUE)

R Markdown

This tutorial will walk you through the use of the Model.Microbiome package for modeling, and how it can be used to test new software. The primary function of Model.Microbiome is to create a series of randomly generated communities. These communities are subsampled to simulate the sequencing process. Then any normalization effort may be applied to these data, and compared for accuracy. This model software does not attempt to model PCR, sequencing, or any other bias that may be introduced on the benchtop or by the DNA isolation procedure. These should be addressed separately. This software only attempts to address inferential bias that is introduced during the sequence count normalization process.

Table of contents: Introduction to the model and it's basic function/structure: Exploring effects of sampling depth: Exploring effects of sparcity: Analysis from the paper: Toolkit for new functions:

First, let's install Model.Microbiome:

devtools::install_github("djeppschmidt/Model.Microbiome")

Now load all the other packages we need (or download and load if you haven't already):

# load required packages ####

library(Model.Microbiome)
library(reshape2)
library(ggplot2)
library(vegan)
library(dplyr)
library(plyr)
library(phyloseq)
library(viridis)
library(ranacapa)
library(edgeR)
library(limma)
library(GLDEX)
library(stats)

The central function of this package creates an array (or list of objects) that include a data table of taxon abundance that represents the community being sampled; a subsampled data table (i.e. the corresponding sequencing output), and any outputs from normalization procedures that the user may choose. These community datasets are organized as phyloseq objects that contain the otu table, taxa names, and environmental data. Each of the datasets created has the same experimental design; there are 5 replicates of six experimental conditions. The community data is generated using a stepwise process. First, the environmental metadata is defined. There are five environmental parameters; three of which are related to the experimental design. These are randomly generated each time a new community is made; the specific values are drawn from a normal distribution with specific mean and variance for each experimental condition. The last two environmental variables have a mean and variance that is not captured by the categorical experimental design (ie is pulled from the same distribution across the dataset). The end result is a table of "environmental data" for each sample that includes values that are reflected in the experimental design, and those that are not. This is important as environmental studies are often designed with a tacit understanding of the environmental gradient; in such cases important environmental variables may or may not behave as anticipated by the researcher. Including variables that may be of interest but are not a-priori accurately predicted by the researcher is important for ensuring that statistical softwares do not over-fit taxon abundance to the experimental design and lose resolution on unanticipated environmental drivers.

Model.Microbiome includes 700 individual species equations that relate species abundance to the five environmental parameters. After the environment is created, the community is assembled. In this step, the user may vary the proportion of taxa that may be detected at a global scale; within a group; and within a particular sample. For example, we might say that 20 taxa may exist globally, 30 taxa may exist that are endemic to their experimental condition; and 10 taxa may exist only in the single site. In this case model.microbiome will randomly select 20 taxa from the list of 700 to be in the species list of all sites. Then it will randomly select 30 taxa for each experimental condition that will be in the species list of all sites that are in that experimental condition. And finally, Model.Microbiome will select 10 taxa for each site from the 700. These taxon lists are compiled so that each site includes uniuqe, groupwise, and globally distributed taxa. Then their abundance is calculated. Within each round of selection taxa are selected without replacement, but taxa are selected with replacement between rounds. This means that site-wise selection is blinded to the global selection, and to other site-specific selection. Therefore it is possible that a taxon is selected as a global taxon and a site-specific; or that two sites may include the same organism as site-specific. This is an intentional behavior to model sparcity and the role of historicity in ecological process. By modulating the proportion of global, group, and sitewise taxa, we can test how each normalization procedure handles sparcity. Another thing to note is that just because the abundance of a taxon is calculated at a site does not mean it has a positive count value. Many taxa in the global set will only have positive values in a subset of experimental conditions. This is an intended effect.

Once the species lists are created for each site, then Model.Microbiome calculates their abundance.

The user designates the number of communities that are created, and selects parameters to define the sparcity (i.e. how likely a taxon will appear only in one sample), mean sampling depth (i.e. sequencing depth), variance of sample-wise sequencing depth around the mean, etc. The following procedure will outline two methods for how one may designate the normalization procedure.

Model.Microbiome uses get() to import functions, giving the user platform the flexibility define new normalization functions, and import them into the analysis chain. There are two ways to include new functions. The first is to include them as a parameter in the core function. This requires defining a function. Here is an example:

# define the normalization function
Relative.Abundance<-function(x){require(phyloseq)
}
# create a list of names
method<-c("Relative.Abundance")

# reps = number of test cases generated
# commonN = number of taxa that will be calculated for all samples
# groupN = number of taxa that will be calculated within groups
# singleN = number of taxa that will be calculated only in a single sample
# D = sequencing depth (ie number of samples taken on average)
# V = variance of sequencing depth
# method = list of names of applied methods
testrun<-suppressWarnings(benchmark.MM2(reps=10, commonN=73, groupN=1, singleN=1, D=500, V=250, method))

The second approach also demonstrates the core functionality that allows the user to also add any statistical function/analysis to the array. Again for this process, we define a function

# define the normalization function
Relative.Abundance<-function(x){require(phyloseq)
}
# create a list of names
method<-c("Relative.Abundance")

testrun2<-

We then do our first modeling run. In this case, reps is the number of times that we repeat the model with the same conditions. CommonN is the number of taxa that are modeled acorss all sites; groupN is the number of taxa that are modeled at each site; singleN are the number of taxa that are modeled at single sites; D is the smean ampling depth, V is the variation around that depth. These are the parameters that are necessary to set up and execute the sampling run. It should be noted that large values for the reps (e.g. greater than 100) may take longer than an hour to run. 1000's of reps could take days, depending on the processing power of your machine. I recommend starting with 10 reps to get the feel of the output structure.

#Test runs:
C.D500<-suppressWarnings(benchmark.MM(reps=10, commonN=73, groupN=1, singleN=1, D=500, V=250))

runtest<-run.analysis2(commonN=73, groupN=1, singleN=1, D=500, V=250, method=c("RA"))
#

Notice that each repeated run of the function can be referenced from the object (e.g. 'C.D500$rep1$...'). The output format for this model is as lists of lists (or a complex array). It contains phyloseq objects of the community. Within each

# mostly universal taxa
# benchmark with total = 75; different sequencing depth (low sparsity)

C.D500<-suppressWarnings(benchmark.MM(reps=10, commonN=73, groupN=1, singleN=1, D=500, V=250))

C.D1000<-suppressWarnings(benchmark.MM(reps=10, commonN=73, groupN=1, singleN=1, D=1000, V=500))
C.D2000<-suppressWarnings(benchmark.MM(reps=10, commonN=73, groupN=1, singleN=1, D=2000, V=1000))
C.D10000<-suppressWarnings(benchmark.MM(reps=10, commonN=73, groupN=1, singleN=1, D=10000, V=5000))

# increasing rare taxa (medium-low sparsity)
#G.D500<-suppressWarnings(benchmark.MM(reps=1, commonN=50, groupN=15, singleN=10, D=500, V=250))
# benchmark with total = 75; different sequencing depth
G.D500<-suppressWarnings(benchmark.MM(reps=10, commonN=50, groupN=15, singleN=10, D=500, V=250))
G.D1000<-suppressWarnings(benchmark.MM(reps=10, commonN=50, groupN=15, singleN=10, D=1000, V=500))
G.D2000<-suppressWarnings(benchmark.MM(reps=10, commonN=50, groupN=15, singleN=10, D=2000, V=1000))
G.D10000<-suppressWarnings(benchmark.MM(reps=10, commonN=50, groupN=15, singleN=10, D=10000, V=5000))

# even distribution of rare taxa (medium sparcity)
# benchmark with total = 75; different sequencing depth
E.D500<-suppressWarnings(benchmark.MM(reps=10, commonN=25, groupN=25, singleN=25, D=500, V=250))
E.D1000<-suppressWarnings(benchmark.MM(reps=10, commonN=25, groupN=25, singleN=25, D=1000, V=500))
E.D2000<-suppressWarnings(benchmark.MM(reps=10, commonN=25, groupN=25, singleN=25, D=2000, V=1000))
E.D10000<-suppressWarnings(benchmark.MM(reps=10, commonN=25, groupN=25, singleN=25, D=10000, V=5000))

# (medium-high sparsity)
# benchmark with total = 75; different sequencing depth
S.D500<-suppressWarnings(benchmark.MM(reps=10, commonN=15, groupN=20, singleN=40, D=500, V=250))
S.D1000<-suppressWarnings(benchmark.MM(reps=10, commonN=15, groupN=20, singleN=40, D=1000, V=500))
S.D2000<-suppressWarnings(benchmark.MM(reps=10, commonN=15, groupN=20, singleN=40, D=2000, V=1000)) 
S.D10000<-suppressWarnings(benchmark.MM(reps=10, commonN=15, groupN=20, singleN=40, D=10000, V=5000)) 

# (High sparsity)
# benchmark with total = 75; different sequencing depth
V.D500<-suppressWarnings(benchmark.MM(reps=10, commonN=10, groupN=15, singleN=50, D=500, V=250))
V.D1000<-suppressWarnings(benchmark.MM(reps=10, commonN=10, groupN=15, singleN=50, D=1000, V=500))
V.D2000<-suppressWarnings(benchmark.MM(reps=10, commonN=10, groupN=15, singleN=50, D=2000, V=1000))
V.D10000<-suppressWarnings(benchmark.MM(reps=10, commonN=10, groupN=15, singleN=50, D=10000, V=5000))

# boxplot of R squared from model

box.df<-data.frame("reference"=C.D500$rep1$model$R$lm,"deseqVST"=C.D500$rep1$deseqVST$DMI$R$lm,"limmaVST"=C.D500$rep1$limma$DMI$R$lm,"RA"=C.D500$rep1$RA$DMI$R$lm,"erare"=C.D500$rep1$eRare$DMI$R$lm,"scaled"=C.D500$rep1$scaled$DMI$R$lm)
box.df$names<-rownames(box.df)
m.box<-melt(data=box.df, id.vars="names", measure.vars = c("reference", "RA", "scaled", "erare", "limmaVST", "deseqVST"))

keep<-names(C.D500$rep1$scaled$LII$R)[C.D500$rep1$scaled$LII$R>0.9]

box.df2<-data.frame("reference"=C.D500$rep1$model$R$lm[names(C.D500$rep1$model$R$lm)%in%keep],"deseqVST"=C.D500$rep1$deseqVST$DMI$R$lm[names(C.D500$rep1$deseqVST$DMI$R$lm)%in%keep],"limmaVST"=C.D500$rep1$limma$DMI$R$lm[names(C.D500$rep1$limma$DMI$R$lm)%in%keep],"RA"=C.D500$rep1$RA$DMI$R$lm[names(C.D500$rep1$RA$DMI$R$lm)%in%keep],"erare"=C.D500$rep1$eRare$DMI$R$lm[names(C.D500$rep1$eRare$DMI$R$lm)%in%keep],"scaled"=C.D500$rep1$scaled$DMI$R$lm[names(C.D500$rep1$scaled$DMI$R$lm)%in%keep])
box.df$names<-rownames(box.df)
box.df2$names<-rownames(box.df2)

m.box<-melt(data=box.df, id.vars="names", measure.vars = c("reference", "RA", "scaled", "erare", "limmaVST", "deseqVST"))

m.box2<-melt(data=box.df2, id.vars="names", measure.vars = c("reference", "RA", "scaled", "erare", "limmaVST", "deseqVST"))


ggplot(m.box, aes(x=variable, y=value))+
  geom_violin() +
  #geom_boxplot()+
  geom_jitter()+
  theme_classic()

ggplot(m.box2, aes(x=variable, y=value))+
  geom_violin() +
  #geom_boxplot()+
  geom_jitter()+
  theme_classic()

om<-prune_taxa(taxa_names(C.D500$rep1$model$comm)%in%keep,C.D500$rep1$model$comm)
plot_bar(om)

om2<-prune_taxa(taxa_names(C.D500$rep1$raw$comm)%in%keep,C.D500$rep1$raw$comm)
names<-data.frame("names"=taxa_names(om2), "number"=c(1:length(taxa_names(om2))))
rownames(names)<-names$names
#tax_table(om2)<-names
plot_bar(om2)

om3<-otu_table(prune_taxa(taxa_names(C.D500$rep1$model$comm)%in%keep,C.D500$rep1$model$comm))
om4<-otu_table(prune_taxa(taxa_names(C.D500$rep1$scaled$comm)%in%keep,C.D500$rep1$deseqVST$comm))

box.LII<-data.frame("deseqVST"=C.D500$rep1$deseqVST$LII$R,"limmaVST"=C.D500$rep1$limma$LII$R,"RA"=C.D500$rep1$RA$LII$R,"erare"=C.D500$rep1$eRare$LII$R,"scaled"=C.D500$rep1$scaled$LII$R)
box.LII$names<-c(1:length(box.LII$RA))
#box.LII$names<-rownames(box.df)
m.boxLII<-melt(data=box.LII, id.vars="names", measure.vars = c("RA", "scaled", "erare", "limmaVST", "deseqVST"))

ggplot(m.boxLII, aes(x=variable, y=value))+
  geom_violin() +
  #geom_boxplot()+
  geom_jitter()+
  theme_classic()

# index workspace
table(C.D10000$rep3$model$SpeciesMeta$scaledLII==1)
table(C.D10000$rep3$model$SpeciesMeta$deseqLII==1)
table(C.D10000$rep3$model$SpeciesMeta$limmaLII==1)
table(C.D10000$rep3$model$SpeciesMeta$RALII==1)

table(S.D500$rep3$model$SpeciesMeta$scaledLII==1)
table(S.D500$rep3$model$SpeciesMeta$deseqLII==1)
table(S.D500$rep3$model$SpeciesMeta$limmaLII==1)
table(S.D500$rep3$model$SpeciesMeta$RALII==1)
S.D10000

length(i$rawLII[i$rawLII<0.1])

ext.LII2<-function(tst, threshold){
out<-list("raw"=c(rep(NA, length(tst))),"RA"=c(rep(NA, length(tst))), "scaled"=c(rep(NA, length(tst))), "pRare"=c(rep(NA, length(tst))), "eRare"=c(rep(NA, length(tst))), "deseqVST"=c(rep(NA, length(tst))),"deseqVST_scaled"=c(rep(NA, length(tst))), "limmaVST"=c(rep(NA, length(tst))))

  for(i in 1:length(tst)) {out$raw[i]<-length(tst[[i]]$model$SpeciesMeta$rawLII[tst[[i]]$model$SpeciesMeta$rawLII<threshold])/length(tst[[i]]$model$SpeciesMeta$rawLII)}
  for(i in 1:length(tst)) {out$RA[i]<-length(tst[[i]]$model$SpeciesMeta$RALII[tst[[i]]$model$SpeciesMeta$RALII<threshold])/length(tst[[i]]$model$SpeciesMeta$RALII)}
  for(i in 1:length(tst)) {out$scaled[i]<-length(tst[[i]]$model$SpeciesMeta$scaledLII[tst[[i]]$model$SpeciesMeta$scaledLII<threshold])/length(tst[[i]]$model$SpeciesMeta$scaledLII)}
  for(i in 1:length(tst)) {out$pRare[i]<-length(tst[[i]]$model$SpeciesMeta$pRareLII[tst[[i]]$model$SpeciesMeta$pRareLII<threshold])/length(tst[[i]]$model$SpeciesMeta$pRareLII)}
  for(i in 1:length(tst)) {out$eRare[i]<-length(tst[[i]]$model$SpeciesMeta$eRareLII[tst[[i]]$model$SpeciesMeta$eRareLII<threshold])/length(tst[[i]]$model$SpeciesMeta$eRareLII)}
  for(i in 1:length(tst)) {out$deseqVST[i]<-length(tst[[i]]$model$SpeciesMeta$deseqLII[tst[[i]]$model$SpeciesMeta$deseqLII<threshold])/length(tst[[i]]$model$SpeciesMeta$deseqLII)}
  for(i in 1:length(tst)) {out$deseqVST_scaled[i]<-length(tst[[i]]$model$SpeciesMeta$deseqSC.LII[tst[[i]]$model$SpeciesMeta$deseqSC.LII<threshold])/length(tst[[i]]$model$SpeciesMeta$deseqSC.LII)}
  for(i in 1:length(tst)) {out$limmaVST[i]<-length(tst[[i]]$model$SpeciesMeta$limmaLII[tst[[i]]$model$SpeciesMeta$limmaLII<threshold])/length(tst[[i]]$model$SpeciesMeta$limmaLII)}
  out1<-as.data.frame(out)
  out1
  #out2<-boxplot(out1)
  #out4<-c("table"=out1,"plot"=out2)
  #out4
}
# benchmark with total = 75; different sequencing depth
LP10_C.D500<-ext.LII2(C.D500, threshold=0.1)
LP10_C.D1000<-ext.LII2(C.D1000,threshold=0.1)
LP10_C.D2000<-ext.LII2(C.D2000,threshold=0.1)
LP10_C.D10000<-ext.LII2(C.D10000,threshold=0.1)
# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
LP10_G.D500<-ext.LII2(G.D500,threshold=0.1)
LP10_G.D1000<-ext.LII2(G.D1000,threshold=0.1)
LP10_G.D2000<-ext.LII2(G.D2000,threshold=0.1)
LP10_G.D10000<-ext.LII2(G.D10000,threshold=0.1)

# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
LP10_E.D500<-ext.LII2(E.D500,threshold=0.1)
LP10_E.D1000<-ext.LII2(E.D1000,threshold=0.1)
LP10_E.D2000<-ext.LII2(E.D2000,threshold=0.1)
LP10_E.D10000<-ext.LII2(E.D10000,threshold=0.1)

# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
LP10_S.D500<-ext.LII2(S.D500,threshold=0.1)
LP10_S.D1000<-ext.LII2(S.D1000,threshold=0.1)
LP10_S.D2000<-ext.LII2(S.D2000,threshold=0.1)
LP10_S.D10000<-ext.LII2(S.D10000,threshold=0.1)

LP10_V.D500<-ext.LII2(S.D500,threshold=0.1)
LP10_V.D1000<-ext.LII2(S.D1000,threshold=0.1)
LP10_V.D2000<-ext.LII2(S.D2000,threshold=0.1)
LP10_V.D10000<-ext.LII2(S.D10000,threshold=0.1)

# prepare Loss of information for plotting

LP10_C.D500$Level<-c(rep(500, 10))
LP10_C.D1000$Level<-c(rep(1000, 10))
LP10_C.D2000$Level<-c(rep(2000, 10))
LP10_C.D10000$Level<-c(rep(10000, 10))

LP10_G.D500$Level<-c(rep(500, 10))
LP10_G.D1000$Level<-c(rep(1000, 10))
LP10_G.D2000$Level<-c(rep(2000, 10))
LP10_G.D10000$Level<-c(rep(10000, 10))

LP10_E.D500$Level<-c(rep(500, 10))
LP10_E.D1000$Level<-c(rep(1000, 10))
LP10_E.D2000$Level<-c(rep(2000, 10))
LP10_E.D10000$Level<-c(rep(10000, 10))

LP10_S.D500$Level<-c(rep(500, 10))
LP10_S.D1000$Level<-c(rep(1000, 10))
LP10_S.D2000$Level<-c(rep(2000, 10))
LP10_S.D10000$Level<-c(rep(10000, 10))

LP10_V.D500$Level<-c(rep(500, 10))
LP10_V.D1000$Level<-c(rep(1000, 10))
LP10_V.D2000$Level<-c(rep(2000, 10))
LP10_V.D10000$Level<-c(rep(10000, 10))

# rep
LP10_C.D500$rep<-c(1:10)
LP10_C.D1000$rep<-c(11:20)
LP10_C.D2000$rep<-c(21:30)
LP10_C.D10000$rep<-c(31:40)

LP10_G.D500$rep<-c(41:50)
LP10_G.D1000$rep<-c(51:60)
LP10_G.D2000$rep<-c(61:70)
LP10_G.D10000$rep<-c(71:80)

LP10_E.D500$rep<-c(81:90)
LP10_E.D1000$rep<-c(91:100)
LP10_E.D2000$rep<-c(101:110)
LP10_E.D10000$rep<-c(111:120)

LP10_S.D500$rep<-c(121:130)
LP10_S.D1000$rep<-c(131:140)
LP10_S.D2000$rep<-c(141:150)
LP10_S.D10000$rep<-c(161:170)

LP10_V.D500$rep<-c(181:190)
LP10_V.D1000$rep<-c(201:210)
LP10_V.D2000$rep<-c(211:220)
LP10_V.D10000$rep<-c(221:230)

LIP10.C<-do.call("rbind", list(LP10_C.D500,LP10_C.D1000,LP10_C.D2000,LP10_C.D10000))
LP10.C2<-melt(data=LIP10.C, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LP10.C2<-data_summary2(LP10.C2, varname="value", groupnames=c("Level", "variable"))
LIP10.G<-do.call("rbind", list(LP10_G.D500,LP10_G.D1000,LP10_G.D2000,LP10_G.D10000))
LP10.G2<-melt(data=LIP10.G, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LP10.G2<-data_summary2(LP10.G2, varname="value", groupnames=c("Level", "variable"))
LIP10.E<-do.call("rbind", list(LP10_E.D500,LP10_E.D1000,LP10_E.D2000,LP10_E.D10000))
LP10.E2<-melt(data=LIP10.E, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LP10.E2<-data_summary2(LP10.E2, varname="value", groupnames=c("Level", "variable"))
LIP10.S<-do.call("rbind", list(LP10_S.D500,LP10_S.D1000,LP10_S.D2000,LP10_S.D10000))
LP10.S2<-melt(data=LIP10.S, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LP10.S2<-data_summary2(LP10.S2, varname="value", groupnames=c("Level", "variable"))
LIP10.V<-do.call("rbind", list(LP10_V.D500,LP10_V.D1000,LP10_V.D2000,LP10_V.D10000))
LP10.V2<-melt(data=LIP10.V, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LP10.V2<-data_summary2(LP10.V2, varname="value", groupnames=c("Level", "variable"))

ggplot(LP10.C2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.2)) +
  geom_line(position=position_dodge(0.2))+
  geom_point(position=position_dodge(0.2))+
  xlab("Seq. Depth")+
  ylab("Percent of taxa with LI < 0.1")+
  scale_color_viridis_d(option="C")+
  theme_classic()

ggplot(LP10.G2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.2)) +
  geom_line(position=position_dodge(0.2))+
  geom_point(position=position_dodge(0.2))+
  xlab("Seq. Depth")+
  ylab("Percent of taxa with LI < 0.1")+
  scale_color_viridis_d(option="C")+
  theme_classic()

ggplot(LP10.E2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.2)) +
  geom_line(position=position_dodge(0.2))+
  geom_point(position=position_dodge(0.2))+
  xlab("Seq. Depth")+
  ylab("Percent of taxa with LI < 0.1")+
  scale_color_viridis_d(option="C")+
  theme_classic()

ggplot(LP10.S2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.2)) +
  geom_line(position=position_dodge(0.2))+
  geom_point(position=position_dodge(0.2))+
  xlab("Seq. Depth")+
  ylab("Percent of taxa with LI < 0.1")+
  scale_color_viridis_d(option="C")+
  theme_classic()

ggplot(LP10.V2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(0.2)) +
  geom_line(position=position_dodge(0.2))+
  geom_point(position=position_dodge(0.2))+
  xlab("Seq. Depth")+
  ylab("Percent of taxa with LI < 0.1")+
  scale_color_viridis_d(option="C")+
  theme_classic()

LIP10.C$sparsity<-c(rep(1, 40))
LIP10.E$sparsity<-c(rep(3, 40))
LIP10.G$sparsity<-c(rep(2, 40))
LIP10.S$sparsity<-c(rep(4, 40))
LIP10.V$sparsity<-c(rep(5, 40))

LIP10.stats<-do.call("rbind", list(LIP10.C, LIP10.E, LIP10.G,LIP10.S, LIP10.V))
LIP10.stats$sparsity<-as.factor(LIP10.stats$sparsity)
LIP10.stats$rep<-as.factor(LIP10.stats$rep)
LIP10.stats$Level<-as.factor(LIP10.stats$Level)
#Permanova.stats$sparsity<-as.factor(Permanova.stats$sparsity)
LIP10.stats<-melt(data=LIP10.stats, id.vars=c("Level","rep", "sparsity"), measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

#summary(aov(value~variable*sparsity*Level+variable:rep, data=LIP10.stats))
summary(aov(value~variable*sparsity*Level, data=LIP10.stats))

#TukeyHSD(aov(value~variable*sparsity+variable:rep, data=LIP10.stats))
TukeyHSD(aov(value~variable*sparsity*Level, data=LIP10.stats))

ggrare(S.D500$rep1$raw$comm, 50, color="Factor")
ggrare(S.D500$rep1$model$comm, 500, color="Factor")

ggplot(C.D500$rep3$model$SpeciesMeta, aes(x=log(M.Eval),y=scaledLII,col=scaledDMI))+geom_point()+ scale_color_viridis_c(option="C")

ggplot(C.D500$rep3$model$SpeciesMeta, aes(x=log(M.Eval),y=deseqLII,col=deseqDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep3$model$SpeciesMeta, aes(x=log(M.Eval),y=limmaLII,col=abs(limmaDMI)))+geom_point()+ scale_color_viridis_c(option="C")

ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=rawLII,y=rawDMI,col=log(M.Eval)))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=RALII,y=RADMI,col=log(M.Eval)))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=limmaLII,y=limmaDMI,col=log(M.Eval)))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=eRareLII,y=eRareDMI,col=log(M.Eval)))+geom_point()+ scale_color_viridis_c(option="C")

ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=deseqDMI,y=scaledDMI, col=log(M.Eval)))+geom_point()+ scale_color_viridis_c(option="C")
#ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=deseqLII,y=scaledLII, col=log(M.Eval)))+geom_point()+ scale_color_viridis_c(option="C")
#ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=deseqLII,y=deseqDMI,col=sd_abundance))+geom_point()
#ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=deseqLII,y=deseqDMI,col=mean_abundance))+geom_point()

ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=scaledLII,col=scaledDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep2$model$SpeciesMeta, aes(x=log10(M.Eval),y=scaledLII,col=scaledDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep3$model$SpeciesMeta, aes(x=log10(M.Eval),y=scaledLII,col=scaledDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep4$model$SpeciesMeta, aes(x=log10(M.Eval),y=scaledLII,col=scaledDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep5$model$SpeciesMeta, aes(x=log10(M.Eval),y=scaledLII,col=scaledDMI))+geom_point()+ scale_color_viridis_c(option="C")


ggplot(C.D10000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=deseqLII,col=scaledLII/deseqLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep2$model$SpeciesMeta, aes(x=log10(M.Eval),y=deseqLII,col=scaledLII/deseqLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep3$model$SpeciesMeta, aes(x=log10(M.Eval),y=deseqLII,col=scaledLII/deseqLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep4$model$SpeciesMeta, aes(x=log10(M.Eval),y=deseqLII,col=scaledLII/deseqLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep5$model$SpeciesMeta, aes(x=log10(M.Eval),y=deseqLII,col=scaledLII/deseqLII))+geom_point()+ scale_color_viridis_c(option="C")


ggplot(C.D10000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=limmaLII,col=scaledLII/limmaLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep2$model$SpeciesMeta, aes(x=log10(M.Eval),y=limmaLII,col=scaledLII/limmaLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep3$model$SpeciesMeta, aes(x=log10(M.Eval),y=limmaLII,col=scaledLII/limmaLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep4$model$SpeciesMeta, aes(x=log10(M.Eval),y=limmaLII,col=scaledLII/limmaLII))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D10000$rep5$model$SpeciesMeta, aes(x=log10(M.Eval),y=limmaLII,col=scaledLII/limmaLII))+geom_point()+ scale_color_viridis_c(option="C")

ggplot(C.D1000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(scaledLII),col=scaledDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(C.D1000$rep2$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(scaledLII),col=scaledDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  theme_classic()
ggplot(C.D1000$rep3$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(scaledLII),col=scaledDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  theme_classic()
ggplot(C.D1000$rep4$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(scaledLII),col=scaledDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  theme_classic()
ggplot(C.D1000$rep5$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(scaledLII),col=scaledDMI))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  theme_classic()

ggplot(C.D1000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(rawLII),col=rawDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  scale_x_continuous(name="Log E value", limits=c(-4, 0)) +
  scale_y_continuous(name="Log raw LII value", limits=c(-8, 0))+
  theme_classic()

ggplot(C.D1000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(RALII),col=RADMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  scale_x_continuous(name="Log E value", limits=c(-4, 0)) +
  scale_y_continuous(name="Log relative abundance LII value", limits=c(-8, 0))+
  theme_classic()
ggplot(C.D1000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(scaledLII),col=scaledDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  scale_x_continuous(name="Log E value", limits=c(-4, 0)) +
  scale_y_continuous(name="Logs scaled LII value", limits=c(-8, 0))+
  theme_classic()
ggplot(C.D1000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(deseqLII),col=deseqDMI))+
  geom_point()+ 
  scale_color_viridis_c(option="C")+
  scale_x_continuous(name="Log E value", limits=c(-4, 0)) +
  scale_y_continuous(name="Log deseq LII value", limits=c(-8, 0))+
  theme_classic()
ggplot(C.D1000$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(limmaLII),col=limmaDMI))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  scale_x_continuous(name="Log E value", limits=c(-4, 0)) +
  scale_y_continuous(name="Log limma LII value", limits=c(-8, 0))+
  theme_classic()

ggplot(C.D1000$rep1$model$SpeciesMeta[C.D1000$rep1$model$SpeciesMeta$deseqLII==0,], aes(x=log10(M.Eval),y=log(limmaLII),col=limmaDMI))+
  geom_point()+
  scale_color_viridis_c(option="C")+
  scale_x_continuous(name="Log E value", limits=c(-4, 0)) +
  scale_y_continuous(name="Log limma LII value", limits=c(-8, 0))+
  theme_classic()

t<-data.frame("names"=taxa_names(C.D1000$rep1$model$comm), "names2"=taxa_names(C.D1000$rep1$model$comm), "names3"=taxa_names(C.D1000$rep1$model$comm))
rownames(t)<-taxa_names(C.D1000$rep1$model$comm)
t<-tax_table(t)
taxa_names(t)<-taxa_names(C.D1000$rep1$model$comm)

tax_table(C.D1000$rep1$model$comm)<-t
tax_table(C.D1000$rep1$limma$comm)<-t
tax_table(C.D1000$rep1$deseqVST$comm)<-t
tax_table(C.D1000$rep1$scaled$comm)<-t

keep<-row.names(C.D1000$rep1$model$SpeciesMeta)[C.D1000$rep1$model$SpeciesMeta$rawLII==0]
reference_r<-prune_taxa(taxa_names(C.D1000$rep1$model$comm)%in%keep, C.D1000$rep1$model$comm)
limma_r<-prune_taxa(taxa_names(C.D1000$rep1$limma$comm)%in%keep, C.D1000$rep1$limma$comm)
deseq_r<-prune_taxa(taxa_names(C.D1000$rep1$deseqVST$comm)%in%keep, C.D1000$rep1$deseqVST$comm)
scaled_r<-prune_taxa(taxa_names(C.D1000$rep1$scaled$comm)%in%keep, C.D1000$rep1$scaled$comm)

plot_bar(reference_r, fill="ta1")
plot_bar(limma_r, fill="ta1")
plot_bar(deseq_r, fill="ta1")
plot_bar(scaled_r, fill="ta1")

ggplot(C.D500$rep1$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(deseqLII),col=deseqDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep2$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(deseqLII),col=deseqDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep3$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(deseqLII),col=deseqDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep4$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(deseqLII),col=deseqDMI))+geom_point()+ scale_color_viridis_c(option="C")
ggplot(C.D500$rep5$model$SpeciesMeta, aes(x=log10(M.Eval),y=log(deseqLII),col=deseqDMI))+geom_point()+ scale_color_viridis_c(option="C")

sample_sums(C.D500$rep1$raw$comm)
C.D500$rep1$metrics$skewness
C.D500$rep1$metrics$stats
C.D500$rep1$metrics$Richness
C.D500$rep1$metrics$lost
RI<-ext.RI(C.D500)
sk2<-NULL
for(i in 1:length(C.D500)){sk2[i]<-median(apply(X = otu_table(C.D500[[i]]$raw$comm), MARGIN=2,FUN = function(x){skewness(x)}))}
sk<-NULL
for(i in 1:length(C.D500)){sk[i]<-C.D500[[i]]$metrics$skewness}

ri<-NULL
for(i in 1:length(C.D500)){ri[i]<-estimate_richness(C.D500[[i]]$raw$comm, measures="Observed")}#/C.D500[[i]]$metrics$Richness}
#estimate_richness(C.D500$rep1$raw$comm)

for(i in 1:length(C.D500)){ri[i]<-mean(ri[[i]])}
#C.D500$rep1$metrics$Richness-estimate_richness(C.D500$rep1$raw$comm, measures="Observed")

# explore data
ggrare(C.D500$rep1$raw$comm, 50, color="Factor")
ggrare(C.D500$rep1$model$comm, 500, color="Factor")
ggrare(C.D500$rep2$raw$comm, 50, color="Factor")
ggrare(C.D500$rep2$model$comm, 500, color="Factor")

ggrare(C.D500$rep2$raw$comm, 50, color="Factor")
ggrare(C.D500$rep2$model$comm, 500, color="Factor")
ggrare(C.D500$rep37$raw$comm, 50, color="Factor")
ggrare(C.D500$rep37$model$comm, 500, color="Factor")

View(as.data.frame(as.matrix(otu_table(C.D500$rep2$raw$comm))))
View(as.data.frame(as.matrix(otu_table(C.D500$rep1$raw$comm))))
plot(RI$scaled/RI$deseqVST, sk)
abline(v=1,col="red", lwd=3, lty=2)
plot(RI$scaled/RI$deseqVST, sk2)
abline(v=1,col="red", lwd=3, lty=2)
plot(RI$scaled/RI$deseqVST, ri)
abline(v=1,col="red", lwd=3, lty=2)
plot(sk,ri, cex=c(RI$scaled/RI$deseqVST))

# PERMANOVA output ####

# benchmark with total = 75; different sequencing depth
P_C.D500<-ext.PERMANOVA2(C.D500)
P_C.D1000<-ext.PERMANOVA2(C.D1000)
P_C.D2000<-ext.PERMANOVA2(C.D2000)
P_C.D10000<-ext.PERMANOVA2(C.D10000)
# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
P_G.D500<-ext.PERMANOVA(G.D500)
P_G.D1000<-ext.PERMANOVA(G.D1000)
P_G.D2000<-ext.PERMANOVA(G.D2000)
P_G.D10000<-ext.PERMANOVA(G.D10000)

# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
P_S.D500<-ext.PERMANOVA(S.D500)
P_S.D1000<-ext.PERMANOVA(S.D1000)
P_S.D2000<-ext.PERMANOVA(S.D2000)
P_S.D10000<-ext.PERMANOVA(S.D10000)

# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
P_E.D500<-ext.PERMANOVA(E.D500)
P_E.D1000<-ext.PERMANOVA(E.D1000)
P_E.D2000<-ext.PERMANOVA(E.D2000)
P_E.D10000<-ext.PERMANOVA(E.D10000)

P_V.D500<-ext.PERMANOVA(V.D500)
P_V.D1000<-ext.PERMANOVA(V.D1000)
P_V.D2000<-ext.PERMANOVA(V.D2000)
P_V.D10000<-ext.PERMANOVA(V.D10000)

# prepare permanova data for plotting ####

P_C.D500$Level<-c(rep(500, 10))
P_C.D1000$Level<-c(rep(1000, 10))
P_C.D2000$Level<-c(rep(2000, 10))
P_C.D10000$Level<-c(rep(10000, 10))

P_G.D500$Level<-c(rep(500, 10))
P_G.D1000$Level<-c(rep(1000, 10))
P_G.D2000$Level<-c(rep(2000, 10))
P_G.D10000$Level<-c(rep(10000, 10))

P_E.D500$Level<-c(rep(500, 10))
P_E.D1000$Level<-c(rep(1000, 10))
P_E.D2000$Level<-c(rep(2000, 10))
P_E.D10000$Level<-c(rep(10000, 10))

P_S.D500$Level<-c(rep(500, 10))
P_S.D1000$Level<-c(rep(1000, 10))
P_S.D2000$Level<-c(rep(2000, 10))
P_S.D10000$Level<-c(rep(10000, 10))

P_V.D500$Level<-c(rep(500, 10))
P_V.D1000$Level<-c(rep(1000, 10))
P_V.D2000$Level<-c(rep(2000, 10))
P_V.D10000$Level<-c(rep(10000, 10))

#rep
P_C.D500$rep<-c(1:10)
P_C.D1000$rep<-c(11:20)
P_C.D2000$rep<-c(21:30)
P_C.D10000$rep<-c(31:40)

P_G.D500$rep<-c(51:60)
P_G.D1000$rep<-c(61:70)
P_G.D2000$rep<-c(71:80)
P_G.D10000$rep<-c(81:90)

P_E.D500$rep<-c(91:100)
P_E.D1000$rep<-c(101:110)
P_E.D2000$rep<-c(111:120)
P_E.D10000$rep<-c(121:130)

P_S.D500$rep<-c(131:140)
P_S.D1000$rep<-c(141:150)
P_S.D2000$rep<-c(151:160)
P_S.D10000$rep<-c(161:170)

P_V.D500$rep<-c(171:180)
P_V.D1000$rep<-c(181:190)
P_V.D2000$rep<-c(191:200)
P_V.D10000$rep<-c(201:210)

# permanova plots
ext.PERMANOVA2<-function(tst){
out<-list("raw"=c(rep(NA, length(tst))),"RA"=c(rep(NA, length(tst))), "scaled"=c(rep(NA, length(tst))), "pRare"=c(rep(NA, length(tst))), "eRare"=c(rep(NA, length(tst))), "deseqVST"=c(rep(NA, length(tst))),"deseqVST_scaled"=c(rep(NA, length(tst))), "limmaVST"=c(rep(NA, length(tst))))

  for(i in 1:length(tst)) {out$raw[i]<-sum(tst[[i]]$raw$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$RA[i]<-sum(tst[[i]]$RA$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$scaled[i]<-sum(tst[[i]]$scaled$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$pRare[i]<-sum(tst[[i]]$pRare$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$eRare[i]<-sum(tst[[i]]$eRare$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$deseqVST[i]<-sum(tst[[i]]$deseqVST$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$deseqVST_scaled[i]<-sum(tst[[i]]$deseqVST_scaled$PERMANOVA$aov.tab$R2[2:6])}
  for(i in 1:length(tst)) {out$limmaVST[i]<-sum(tst[[i]]$limma$PERMANOVA$aov.tab$R2[2:6])}
  out1<-as.data.frame(out)
  out1
  #out2<-boxplot(out1)
  #out4<-c("table"=out1,"plot"=out2)
  #out4
}
Permanova.C<-do.call("rbind", list(P_C.D500,P_C.D1000,P_C.D2000,P_C.D10000))
Permanova.C2<-melt(data=Permanova.C, id.vars="Level", measure.vars = c("raw", "RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
library(ggplot2)

Permanova.C3<-data_summary2(Permanova.C2, varname="value", groupnames=c("Level", "variable"))
Permanova.C2<-data_summary(Permanova.C2, varname="value", groupnames=c("Level", "variable"))

ggplot(Permanova.C2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Mean +/- se)")+
  theme_classic()


ggplot(Permanova.C3, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Median +/- min/max)")+
  theme_classic()

# permanova plots

Permanova.G<-do.call("rbind", list(P_G.D500,P_G.D1000,P_G.D2000,P_G.D10000))
Permanova.G2<-melt(data=Permanova.G, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

Permanova.G3<-data_summary2(Permanova.G2, varname="value", groupnames=c("Level", "variable"))
Permanova.G2<-data_summary(Permanova.G2, varname="value", groupnames=c("Level", "variable"))

ggplot(Permanova.G2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Mean +/- se)")+
  theme_classic()


ggplot(Permanova.G3, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Median +/- min/max)")+
  theme_classic()

# permanova plots

Permanova.E<-do.call("rbind", list(P_E.D500,P_E.D1000,P_E.D2000,P_E.D10000))
Permanova.E2<-melt(data=Permanova.E, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

Permanova.E3<-data_summary2(Permanova.E2, varname="value", groupnames=c("Level", "variable"))
Permanova.E2<-data_summary(Permanova.E2, varname="value", groupnames=c("Level", "variable"))

ggplot(Permanova.E2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Mean +/- se)")+
  theme_classic()


ggplot(Permanova.E3, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Median +/- min/max)")+
  theme_classic()

# permanova plots

Permanova.S<-do.call("rbind", list(P_S.D500,P_S.D1000,P_S.D2000,P_S.D10000))
Permanova.S2<-melt(data=Permanova.S, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

Permanova.S3<-data_summary2(Permanova.S2, varname="value", groupnames=c("Level", "variable"))
Permanova.S2<-data_summary(Permanova.S2, varname="value", groupnames=c("Level", "variable"))

ggplot(Permanova.S2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Mean +/- se)")+
  theme_classic()


ggplot(Permanova.S3, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Median +/- min/max)")+
  theme_classic()

# permanova plots

Permanova.V<-do.call("rbind", list(P_V.D500,P_V.D1000,P_V.D2000,P_V.D10000))
Permanova.V2<-melt(data=Permanova.V, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

Permanova.V3<-data_summary2(Permanova.V2, varname="value", groupnames=c("Level", "variable"))
Permanova.V2<-data_summary(Permanova.V2, varname="value", groupnames=c("Level", "variable"))

ggplot(Permanova.V2, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=value-sd, ymax=value+sd), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Mean +/- se)")+
  theme_classic()


ggplot(Permanova.V3, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("PERMANOVA R.sq. Index Value (Median +/- min/max)")+
  theme_classic()

Permanova.C$sparsity<-c(rep(1, 400))
Permanova.E$sparsity<-c(rep(3, 400))
Permanova.G$sparsity<-c(rep(2, 400))
Permanova.S$sparsity<-c(rep(4, 400))
Permanova.V$sparsity<-c(rep(5, 400))

Permanova.stats<-do.call("rbind", list(Permanova.C, Permanova.E, Permanova.G, Permanova.S, Permanova.V))
Permanova.stats$sparsity<-as.factor(Permanova.stats$sparsity)
Permanova.stats$rep<-as.factor(Permanova.stats$rep)
#Permanova.stats$sparsity<-as.factor(Permanova.stats$sparsity)
P.stats<-melt(data=Permanova.stats, id.vars=c("Level","rep", "sparsity"), measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

summary(aov(value~variable*sparsity+variable:rep, data=P.stats))
TukeyHSD(aov(value~variable*sparsity+variable:rep, data=P.stats))

# LII output ####

# benchmark with total = 75; different sequencing depth
L_C.D500<-ext.LII(C.D500)
L_C.D1000<-ext.LII(C.D1000)
L_C.D2000<-ext.LII(C.D2000)
L_C.D10000<-ext.LII(C.D10000)
# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
L_G.D500<-ext.LII(G.D500)
L_G.D1000<-ext.LII(G.D1000)
L_G.D2000<-ext.LII(G.D2000)
L_G.D10000<-ext.LII(G.D10000)

# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
L_E.D500<-ext.LII(E.D500)
L_E.D1000<-ext.LII(E.D1000)
L_E.D2000<-ext.LII(E.D2000)
L_E.D10000<-ext.LII(E.D10000)

# seq depth at mostly universal taxa
# benchmark with total = 75; different sequencing depth
L_S.D500<-ext.LII(S.D500)
L_S.D1000<-ext.LII(S.D1000)
L_S.D2000<-ext.LII(S.D2000)
L_S.D10000<-ext.LII(S.D10000)

L_V.D500<-ext.LII(S.D500)
L_V.D1000<-ext.LII(S.D1000)
L_V.D2000<-ext.LII(S.D2000)
L_V.D10000<-ext.LII(S.D10000)

# prepare Loss of information for plotting

L_C.D500$Level<-c(rep(500, 100))
L_C.D1000$Level<-c(rep(1000, 100))
L_C.D2000$Level<-c(rep(2000, 100))
L_C.D10000$Level<-c(rep(10000, 100))

L_G.D500$Level<-c(rep(500, 100))
L_G.D1000$Level<-c(rep(1000, 100))
L_G.D2000$Level<-c(rep(2000, 100))
L_G.D10000$Level<-c(rep(10000, 100))

L_E.D500$Level<-c(rep(500, 100))
L_E.D1000$Level<-c(rep(1000, 100))
L_E.D2000$Level<-c(rep(2000, 100))
L_E.D10000$Level<-c(rep(10000, 100))

L_S.D500$Level<-c(rep(500, 100))
L_S.D1000$Level<-c(rep(1000, 100))
L_S.D2000$Level<-c(rep(2000, 100))
L_S.D10000$Level<-c(rep(10000, 100))

L_V.D500$Level<-c(rep(500, 100))
L_V.D1000$Level<-c(rep(1000, 100))
L_V.D2000$Level<-c(rep(2000, 100))
L_V.D10000$Level<-c(rep(10000, 100))

# rep
L_C.D500$rep<-c(1:100)
L_C.D1000$rep<-c(101:200)
L_C.D2000$rep<-c(201:300)
L_C.D10000$rep<-c(301:400)

L_G.D500$rep<-c(401:500)
L_G.D1000$rep<-c(501:600)
L_G.D2000$rep<-c(601:700)
L_G.D10000$rep<-c(701:800)

L_E.D500$rep<-c(801:900)
L_E.D1000$rep<-c(901:1000)
L_E.D2000$rep<-c(1001:1100)
L_E.D10000$rep<-c(1101:1200)

L_S.D500$rep<-c(1201:1300)
L_S.D1000$rep<-c(1301:1400)
L_S.D2000$rep<-c(1401:1500)
L_S.D10000$rep<-c(1601:1700)

L_V.D500$rep<-c(1801:1900)
L_V.D1000$rep<-c(2001:2100)
L_V.D2000$rep<-c(2101:2200)
L_V.D10000$rep<-c(2201:2300)

# plot LII ####
LII.C<-do.call("rbind", list(L_C.D500,L_C.D1000,L_C.D2000,L_C.D10000))
LII.C2<-melt(data=LII.C, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LII.C2<-data_summary2(LII.C, varname="value", groupnames=c("Level", "variable"))

LII.C500<-LII.C[LII.C$Level==500,]
LII.C1000<-LII.C[LII.C$Level==1000,]
LII.C2000<-LII.C[LII.C$Level==2000,]
LII.C10000<-LII.C[LII.C$Level==10000,]

ggplot(LII.C, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line()+
  geom_point()+
  xlab("Seq. Depth")+
  ylab("Lost Information Index Value (Median +/- min/max)")+
  theme_classic()

LII.G<-do.call("rbind", list(L_G.D500,L_G.D1000,L_G.D2000,L_G.D10000))
LII.G2<-melt(data=LII.G, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LII.G2<-data_summary2(LII.G2, varname="value", groupnames=c("Level", "variable"))

LII.G500<-LII.G[LII.G$Level==500,]
LII.G1000<-LII.G[LII.G$Level==1000,]
LII.G2000<-LII.G[LII.G$Level==2000,]
LII.G10000<-LII.G[LII.G$Level==10000,]

ggplot(LII.G, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line()+
  geom_point()+
  xlab("Seq. Depth")+
  ylab("Lost Information Index Value (Median +/- min/max)")+
  theme_classic()

LII.E<-do.call("rbind", list(L_E.D500,L_E.D1000,L_E.D2000,L_E.D10000))
LII.E2<-melt(data=LII.E, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LII.E2<-data_summary2(LII.E2, varname="value", groupnames=c("Level", "variable"))

LII.E500<-LII.E[LII.E$Level==500,]
LII.E1000<-LII.E[LII.E$Level==1000,]
LII.E2000<-LII.E[LII.E$Level==2000,]
LII.E10000<-LII.E[LII.E$Level==10000,]

ggplot(LII.E, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Lost Information Index Value (Median +/- min/max)")+
  theme_classic()

LII.S<-do.call("rbind", list(L_S.D500,L_S.D1000,L_S.D2000,L_S.D10000))
LII.S2<-melt(data=LII.S, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LII.S2<-data_summary2(LII.S2, varname="value", groupnames=c("Level", "variable"))

LII.S500<-LII.S[LII.S$Level==500,]
LII.S1000<-LII.S[LII.S$Level==1000,]
LII.S2000<-LII.S[LII.S$Level==2000,]
LII.S10000<-LII.S[LII.S$Level==10000,]

ggplot(LII.S, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Lost Information Index Value (Median +/- min/max)")+
  theme_classic()

LII.V<-do.call("rbind", list(L_V.D500,L_V.D1000,L_V.D2000,L_V.D10000))
LII.V2<-melt(data=LII.V, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
LII.V2<-data_summary2(LII.V2, varname="value", groupnames=c("Level", "variable"))

LII.V500<-LII.V[LII.V$Level==500,]
LII.V1000<-LII.V[LII.V$Level==1000,]
LII.V2000<-LII.V[LII.V$Level==2000,]
LII.V10000<-LII.V[LII.V$Level==10000,]

ggplot(LII.V, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Lost Information Index Value (Median +/- min/max)")+
  theme_classic()

with(LII.V, plot())

LII.C$sparsity<-c(rep(1, 400))
LII.E$sparsity<-c(rep(3, 400))
LII.G$sparsity<-c(rep(2, 400))
LII.S$sparsity<-c(rep(4, 400))
LII.V$sparsity<-c(rep(5, 400))

L.stats<-do.call("rbind", list(LII.C, LII.E, LII.G, LII.S, LII.V))
L.stats<-melt(data=L.stats, id.vars=c("Level","rep", "sparsity"), measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

summary(aov(value~variable*sparsity+variable:rep, data=L.stats))
TukeyHSD(aov(value~variable*sparsity+variable:rep, data=L.stats))

# extract over-modeling index ####
O_C.D500<-ext.OI(C.D500)
O_C.D1000<-ext.OI(C.D1000)
O_C.D2000<-ext.OI(C.D2000)
O_C.D10000<-ext.OI(C.D10000)

O_G.D500<-ext.OI(G.D500)
O_G.D1000<-ext.OI(G.D1000)
O_G.D2000<-ext.OI(G.D2000)
O_G.D10000<-ext.OI(G.D10000)

O_E.D500<-ext.OI(E.D500)
O_E.D1000<-ext.OI(E.D1000)
O_E.D2000<-ext.OI(E.D2000)
O_E.D10000<-ext.OI(E.D10000)

O_S.D500<-ext.OI(S.D500)
O_S.D1000<-ext.OI(S.D1000)
O_S.D2000<-ext.OI(S.D2000)
O_S.D10000<-ext.OI(S.D10000)

O_V.D500<-ext.OI(V.D500)
O_V.D1000<-ext.OI(V.D1000)
O_V.D2000<-ext.OI(V.D2000)
O_V.D10000<-ext.OI(V.D10000)
# prepare to plot overmodeling index 
O_C.D500$Level<-c(rep(500, 100))
O_C.D1000$Level<-c(rep(1000, 100))
O_C.D2000$Level<-c(rep(2000, 100))
O_C.D10000$Level<-c(rep(10000, 100))

O_G.D500$Level<-c(rep(500, 100))
O_G.D1000$Level<-c(rep(1000, 100))
O_G.D2000$Level<-c(rep(2000, 100))
O_G.D10000$Level<-c(rep(10000, 100))

O_E.D500$Level<-c(rep(500, 100))
O_E.D1000$Level<-c(rep(1000, 100))
O_E.D2000$Level<-c(rep(2000, 100))
O_E.D10000$Level<-c(rep(10000, 100))

O_S.D500$Level<-c(rep(500, 100))
O_S.D1000$Level<-c(rep(1000, 100))
O_S.D2000$Level<-c(rep(2000, 100))
O_S.D10000$Level<-c(rep(10000, 100))

O_V.D500$Level<-c(rep(500, 100))
O_V.D1000$Level<-c(rep(1000, 100))
O_V.D2000$Level<-c(rep(2000, 100))
O_V.D10000$Level<-c(rep(10000, 100))

# rep

O_C.D500$rep<-c(1:100)
O_C.D1000$rep<-c(101:200)
O_C.D2000$rep<-c(201:300)
O_C.D10000$rep<-c(301:400)

O_G.D500$rep<-c(401:500)
O_G.D1000$rep<-c(601:700)
O_G.D2000$rep<-c(701:800)
O_G.D10000$rep<-c(801:900)

O_E.D500$rep<-c(901:1000)
O_E.D1000$rep<-c(1001:1100)
O_E.D2000$rep<-c(1101:1200)
O_E.D10000$rep<-c(1201:1300)

O_S.D500$rep<-c(1301:1400)
O_S.D1000$rep<-c(1401:1500)
O_S.D2000$rep<-c(1501:1600)
O_S.D10000$rep<-c(1601:1700)

O_V.D500$rep<-c(1701:1800)
O_V.D1000$rep<-c(1801:1900)
O_V.D2000$rep<-c(1901:2000)
O_V.D10000$rep<-c(2001:2100)

# plot overmodeling index ####
OI.C<-do.call("rbind", list(O_C.D500,O_C.D1000,O_C.D2000,O_C.D10000))
OI.C2<-melt(data=OI.C, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
OI.C2<-data_summary2(OI.C2, varname="value", groupnames=c("Level", "variable"))

ggplot(OI.C, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Overmodeling Index Value (Median +/- min/max)")+
  theme_classic()

OI.G<-do.call("rbind", list(O_G.D500,O_G.D1000,O_G.D2000,O_G.D10000))
OI.G2<-melt(data=OI.G, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
OI.G2<-data_summary2(OI.G2, varname="value", groupnames=c("Level", "variable"))

ggplot(OI.G, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Overmodeling Index Value (Median +/- min/max)")+
  theme_classic()

OI.E<-do.call("rbind", list(O_E.D500,O_E.D1000,O_E.D2000,O_E.D10000))
OI.E2<-melt(data=OI.E, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
OI.E2<-data_summary2(OI.E2, varname="value", groupnames=c("Level", "variable"))

ggplot(OI.E, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Overmodeling Index Value (Median +/- min/max)")+
  theme_classic()

OI.S<-do.call("rbind", list(O_S.D500,O_S.D1000,O_S.D2000,O_S.D10000))
OI.S2<-melt(data=OI.S, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
OI.S2<-data_summary2(OI.S2, varname="value", groupnames=c("Level", "variable"))

ggplot(OI.S, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Overmodeling Index Value (Median +/- min/max)")+
  theme_classic()

OI.V<-do.call("rbind", list(O_V.D500,O_V.D1000,O_V.D2000,O_V.D10000))
OI.V2<-melt(data=OI.V, id.vars="Level", measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))
OI.V2<-data_summary2(OI.V2, varname="value", groupnames=c("Level", "variable"))

ggplot(OI.C, aes(x=Level, y=value, group = variable, color=variable))+
  geom_errorbar(aes(ymin=low, ymax=high), width=.1, position=position_dodge(300)) +
  geom_line(position=position_dodge(300))+
  geom_point(position=position_dodge(300))+
  xlab("Seq. Depth")+
  ylab("Overmodeling Index Value (Median +/- min/max)")+
  theme_classic()

with(plot())

OI.C$sparsity<-c(rep(1, 400))
OI.E$sparsity<-c(rep(3, 400))
OI.G$sparsity<-c(rep(2, 400))
OI.S$sparsity<-c(rep(4, 400))
OI.V$sparsity<-c(rep(5, 400))

O.stats<-do.call("rbind", list(OI.C, OI.E, OI.G, OI.S, OI.V))
#O.stats$sparsity<-as.factor(O.stats$sparsity)
O.stats<-melt(data=O.stats, id.vars=c("Level", "rep", "sparsity"), measure.vars = c("raw","RA", "scaled", "eRare", "pRare", "limmaVST", "deseqVST", "deseqVST_scaled"))

summary(aov(value~variable*sparsity+variable:rep, data=O.stats))
TukeyHSD(aov(value~variable*sparsity+variable:rep, data=O.stats))