knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE)
rm(list=ls())

library(PERFect)
library(ggplot2)
library(knitr)
library(kableExtra)

set.seed(12341)

Installation

The PERFect software is freely available at https://github.com/katiasmirn/PERFect and can be installed using the following codes:

library(devtools)  
install_github("katiasmirn/PERFect")

Introduction

This vignette explains the idea of filtering loss $FL(J)$ and demonstrates the PERFect permutation method on a small subset of the mock community data (Brooks et al. 2015). There are in total $10$ samples ($n = 10$) and $20$ taxa ($p = 20$) which contain $7$ signal taxa. The two tables below shows ordered taxa abundance by columns, where the first table gives $13$ noise taxa and the second table gives $7$ signal taxa. For example, $N_1$ is the least abundant noise taxon and $N_{13}$ is the most abundant noise taxon; $S_1$ is the least abundant signal taxon and $S_7$ is the most abundant signal taxon.

data(mock2)
Counts <- mock2$Counts
taxa <- c("Bradyrhizobiaceae.cluster49", "Propionibacterium.acnes", "Gemella.OTU86",
          "Comamonadaceae.cluster57", "Streptococcus.gordonii", "Caulobacter.leidyi",
          "Finegoldia.magna", "Aerococcus.christensenii", "Agromyces.cluster54",
          "Mycoplasma.orale_faucium", "Enterobacteriaceae.cluster31",
          "Lactobacillus.jensenii", "Fusobacterium.cluster48", "Methylophilus.cluster11",
          "Coriobacteriaceae.OTU27", "Prevotella.cluster2",
          "Delftia.acidovorans_lacustris_tsuruhatensis", "Hyphomicrobium.methylovorum",
          "Bifidobacterium.longum_infantis_suis", "Streptococcus.parasanguinis")
Ind <- apply(Counts[,taxa], 2, function(x) which(x!=0))
sampleID <- c(31,78, 39, 5, 9, 76, 22, 105, 34, 82)
data <- Counts[sampleID,c(taxa, mock2$TrueTaxa)]
data <- data[,apply(data, 2, Matrix::nnzero) >0]
data<- data[,-3]
#sort by abundance
data <- data[, NP_Order(data)]
#rename columns by noise and signal
SignTaxa <- colnames(data) %in% mock2$TrueTaxa
colnames(data)[SignTaxa] <-paste0("S~", 1:sum(SignTaxa),"~")
colnames(data)[!SignTaxa] <-paste0("N~", 1:(dim(data)[2] -sum(SignTaxa)),"~")
rownames(data) <- paste("Sample ", 1:nrow(data))


kable(data[,1:13], 
      format='html', 
      digits=2, 
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left") 

kable(data[,14:20], 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")

Method Illustration

Filtering loss function

The filtering loss measures taxa contribution to the total covariance. We assume that if the set of $J$ taxa is not important, then removing these taxa will not dramatically decrease the OTU table total covariance (defined in section 2.1 of Smirnova et al. 2018). We define the loss due to filtering out the set of $J$ taxa as $$ FL (J_j)= 1- \frac{\text{covariance magnitude after removing $J$ taxa}}{\text{total covariance}}. $$

The filtering loss $FL(J_j)$ is a number between $0$ and $1$, with values close to $0$ if the set of taxa $J$ has small contribution to the total covariance and $1$ otherwise.

To find filtering loss, taxa are first re-arranged in ascending order from left to right according to their number of occurrences in the $n$ samples, i.e. the abundant taxa are to the right of the table. Once taxa are ordered, the filtering loss is calculated sequentially by removing the taxa from left to right.

We evaluate the $j^{th}$ taxon contribution to the signal using the difference in filtering loss statistic $$ DFL(j + 1) = FL(J_{j+1}) - FL(J_j). $$ The values of $DFL(j + 1)$ are close $0$ if the $j^{th}$ taxon is included erroneously, and significantly different from $0$ if this taxon has high contribution to the signal.

To illustrate the ability of this filtering loss to distinguish between noise and signal taxa, the filtering loss values for a noise taxon, a signal taxon and a combination of noise and signal taxa are shown in the table below. Specifically, this table compares the filtering loss values and their corresponding log values due to removing:

1) The $10^{th}$ least abundant noise taxon, $J_1 = {N_{10}}$ 2) The first least abundant signal taxon $J_2 = {S_1}$ 3) All noise taxa, $J_3 = {N_1, \dots, N_{13}}$ 4) All noise taxa and the first least abundant signal taxon, $J_4 = {N_1, \dots, N_{13}, S_1}$.

From the table, one can notice a large difference between log filtering loss value of the $10^{th}$ least abundant noise taxon and that of the first least abundant signal taxon ($-14.786$ v.s. $-6.970$ respectively). This implies that the data loss due to removing $N_{10}$ is minimal and dramatically less than the loss due to removing $S_1$. Moreover, since the loss from omitting all noise taxa is still less the loss from omitting one signal taxon, it shows that all noise taxa cumulatively have less contribution than a single signal taxon.

X <- as.matrix(data)
Netw <- t(as.matrix(X))%*%as.matrix(X)
den <- psych::tr(t(Netw)%*%Netw)
#removing 10th noise taxon
X_R <- X[,-which(colnames(X) == "N~10~")]
#calculate corresponding norm
Netw_R <- t(X_R)%*%X_R
num <-  psych::tr(t(Netw_R)%*%Netw_R)
FL <-  1 - (num/den)
#df <- data.frame(num = num, den = den, FL = FL, log_FL = round(log(FL), 3))
df <- data.frame(FL = FL, log_FL = round(log(FL), 3))

#removing first least abundant signal taxon
X_R <- X[,-which(colnames(X) == "S~1~")]
Netw_R <- t(X_R)%*%X_R
num <-  psych::tr(t(Netw_R)%*%Netw_R)
FL <-  1 - (num/den)
#df <- rbind(df, c(num, den, FL, round(log(FL), 3)))
df <- rbind(df, c(FL, round(log(FL), 3)))

#removing all noise taxa
X_R <- X[,-which(substr(colnames(X), start = 1, stop = 2) %in% "N~")]
Netw_R <- t(X_R)%*%X_R
num <-  psych::tr(t(Netw_R)%*%Netw_R)
FL <-  1 - (num/den)
#df <- rbind(df, c(num, den, FL, round(log(FL), 3)))
df <- rbind(df, c(FL, round(log(FL), 3)))

#removing all noise + first signal taxon
X_R <- X[,-c(which(substr(colnames(X), start = 1, stop = 2) %in% "N~"),
             which(colnames(X) == "S~1~"))]
Netw_R <- t(X_R)%*%X_R
num <-  psych::tr(t(Netw_R)%*%Netw_R)
FL <-  1 - (num/den)
#df <- rbind(df, c(num, den, FL, round(log(FL), 3)))
df <- rbind(df, c(FL, round(log(FL), 3)))

rownames(df) <- c("N~10~", "S~1~", "Noise taxa", "Noise taxa and S~1~")
colnames(df) <- c("Filtering Loss","Log Filtering Loss")
df[,1] <- format(df[,1], width = 1, digits = 5, format = "g")

kable(df, 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")
#function to output results for each simulation run
resSummary <-function(X, filtX,  time = NA){

  rank_pvals = NULL 
  rank_pres = NULL
  ntotal <- dim(X)[2]
  npres <- dim(filtX)[2] #number of taxa left

  pfilt <- (ntotal - npres)/ntotal #percentage of taxa filtered

  #combine into results vector
  res <- c(ntotal, npres, pfilt)
  names(res) <- c("ntotal", "npres", "pfilt")

  return(list(res = res,  time = time))

}

Permutation Filtering algorithm

This algorithm's objective is to remove noise taxa while identifying and retaining signal taxa. It takes an OTU table and a test critical value $\alpha$ as inputs and produces a reduced OTU table with less taxa. The following example illustrates this algorithm.

Input: OTU table above, test critical value $\alpha = 0.1$

  1. Run simultaneous PERFect algorithm to obtain taxa p-values $p_j, j=1, \dots, p$

The simultaneous PERFect algorithm yields preliminary taxa significance. Taxa can also be ordered by abundance and this step is not necessary; however method evaluation on mock data has shown that ordering taxa by simultaneous PERFect p-values improves permutation PERFect performance.

#########################
#quantiles from fit a
##########################
start <- Sys.time()
res_sim <- PERFect_sim(X=data)
end <-  Sys.time()-start
summary_sim <- resSummary(X = data, filtX = res_sim$filtX, time = end)
  1. Order columns of $X$ such that $p_1 \geq p_2 \geq p_p$.

The table below shows taxa and their corresponding simultaneous PERFect p-values. Some taxa that were initially less important according to the the abundance ordering gained higher rank according to simultaneous PERFect p-values. For example, the $2^{nd}$ least important taxon $N_2$ in the abundance ordering becomes the $7^{th}$ least important taxon in the simultaneous PERFect p-values ordering.

#add p-values column to the data
pvals  <- c(1, round(res_sim$pvals, 2))
pvals.sort  <- sort.int(pvals, decreasing = TRUE, index.return = TRUE)
#sort data by p-values
data <- data[, pvals.sort$ix]
#save X for illustrating method steps
X = as.matrix(data)
#add p-values to the data
data <- rbind(pvals.sort$x, data)
rownames(data)[1] <- "Sim pvalues"
data[1,1] <- NA


kable(data[1,1:13], 
      format='html', 
      digits=2, 
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 11, bootstrap_options = "striped", full_width = TRUE, position = "left") 

kable(data[1,14:20], 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")
  1. For taxon $j = 1, \dots, p-1$

                 Let $J_j = {1, \dots, j}$

                 Calculate $DFL(j+1) = FL(J+1) - FL(J)$

          end

The test statistic $DFL$ and its corresponding values on the log scale for each taxon are displayed below. The values of $\log(DFL)$ range between $-23.71$ and $-13.98$ for the noise taxa and increase dramatically to $-6.79$ for the value of the first signal taxon $S_6$. The $\log(DFL)$ values for signal taxa range between $-7.5$ and $-0.12$, which is much larger compared to corresponding statistic values for the noise taxa.

Order.vec <- pvals_Order(data, res_sim)
Order_Ind <- rep(1:length(Order.vec))#convert to numeric indicator values
DFL <- DiffFiltLoss(X = X, Order_Ind, Plot = TRUE, Taxa_Names = Order.vec)

#add log DFL values to the data
data <- rbind(c(NA, round(log(DFL$DFL), 2)), data)
rownames(data)[1] <- "log(DFL)"
#rearrange rows to have DFL below sim-pvalues
data <- data[c(2,1, 3:nrow(data)),]
#data

#display DFL and log(DFL) values
df <- rbind(c(NA,DFL$DFL),c(NA, log(DFL$DFL)))
df[1,] <- format(df[1,], width = 1, digits = 3, format = "g")
df[2,] <- round(as.numeric(df[2,]), 2)
rownames(df) <- c("DFL", "log(DFL)")
colnames(df)[1]<- "N~1~"
kable(df[,1:8], 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")
kable(df[,9:13], 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")
kable(df[,14:20], 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")
  1. For taxon $j = 1, \dots, p-1$

             For permutation $1, \dots, k$

                         Randomly select $J^_{j+1} \subset {1, \dots, p}$ with $|J^_{j+1}| =j+1$

                         Calculate $DFL^(j+1) = FL(J^_{j+1}) - FL(J^*)$

               end

         end

In this step of the algorithm, we construct the permutation distribution for each set of $j$ taxa to evaluate significance of corresponding $\log(DFL)$ values. To build the distribution of $d\mathcal{F}{j+1}$ and test statistical hypothesis \begin{equation} {\small H_0: d\mathcal{F}{j+1} = 0 \hspace{10mm}\mbox{vs}\hspace{10mm} H_A: d\mathcal{F}{j+1} > 0, } \end{equation} we randomly select $k$ sets $J^_{j+1}$ of taxa labels and calculate the corresponding $DFL^(j+1)$ value for each sample. For example, to obtain 6 permutations ($k = 6$) for $5$ taxa, we draw sets of size $|J^_{j+1}| = 5$. Then, the $DFL^$ for these samples are calculated according to this permutation ordering. In particular, for the first permutation, $$ DFL^(5) = FL^({N_8, N{11}, S_2, S_6, N_4}) - FL^*({N_8, N_{11}, S_2, S_6}). $$

The values of $DFL^*$ for other sets of taxa are calculated similarly.

## Some internal functions

########################################
#Function to calculate j^th DFL loss
########################################
DiffFiltLoss_j <- function(Perm_Order,Netw, j){
  J_jp1 <-  Perm_Order[seq_len(j+1)]
  #calculate corresponding norm ratios this is the value of the matrix ||
  DFL <-  (2*t(Netw[J_jp1[max(NROW(J_jp1))],-J_jp1])%*%Netw[J_jp1[max(NROW(J_jp1))],-J_jp1]+
             Netw[J_jp1[max(NROW(J_jp1))], J_jp1[max(NROW(J_jp1))]]^2)

  return(DFL)
}
##################################
#Randomly permute labels n times for a fixed j
#################################
Perm_j_s <- function(j, Netw, k,p, p2 = NULL){
  #Netw = X'X - data matrix with taxa in columns and samples in rows
  #k - number of permutations used
  #create a list of k*p arrangements of orders
  if(is.null(p2)){p2 <- p}
  labl <- sapply(seq_len(k),function(x) NULL)
  labl <- lapply(labl,function(x)  sample(seq_len(p),p2))
  FL_j <- sapply(labl,DiffFiltLoss_j, Netw = Netw, j=j)
  return(FL_j)
}

###################################################
#Obtain sampling distribution using permutations of DFL
###################################################
sampl_distr <- function(X, k){
  p <- dim(X)[2]
  Netw <- t(X)%*%X
  #full_norm <- psych::tr(t(Netw)%*%Netw)#this is full norm value
  full_norm <- sum(Netw*Netw)
  #For each taxon j, create a distribution of its DFL's by permuting the labels
  res_all <- lapply(seq_len(p-1),function(x) x)

  # Calculate the number of cores
  no_cores <- parallel::detectCores()-1
  # Initiate cluster, start parrallel processing
  cl <- parallel::makeCluster(no_cores)
  #load variables for each core
  parallel::clusterExport(cl,c("DiffFiltLoss_j","Perm_j_s","Netw","k","p"),envir=environment())
  #parallel apply
  FL_j <- parallel::parLapply(cl, res_all, function(x) Perm_j_s(j = x, Netw =Netw, k=k, p =p, p2 = x+1))
  #FL_j <- lapply(res_all, function(x) Perm_j_s(j = x, Netw =Netw, k=k, p =p, p2 = x+1))
  # End the parallel processing
  parallel::stopCluster(cl)

  #divide by the full matrix norm values
  res_pres <- lapply(FL_j, function(x) {x/full_norm})
  return(res_pres)
}

Perm_j_s_il <- function(j, Netw, k,p, p2 = NULL){
  #Netw = X'X - data matrix with taxa in columns and samples in rows
  #k - number of permutations used  
  #create a list of k*p arraments of orders
  if(is.null(p2)){p2 <- p}
  labl <- sapply(1:k,function(x) NULL)
  labl <- lapply(labl,function(x)  sample(1:p,p2))
  FL_j <- sapply(labl,DiffFiltLoss_j, Netw = Netw, j=j)
  return(list(labl = labl, FL_j = FL_j))
}
#we illustrate computation of the first k=6 permutations for each taxon
p <- dim(X)[2] 
Netw <- t(X)%*%X
full_norm <- psych::tr(t(Netw)%*%Netw)#this is full norm value
#For each taxon j, create a distribution of its DFL's by permuting the labels 
res_all <- lapply(1:(p-1),function(x) x)
FL <- lapply(res_all, function(x) Perm_j_s_il(j = x, Netw =Netw, k=6, p =p, p2 = x+1))
#extract every 2nd subelement for each element of list FL to get FL_j values
FL_j <- lapply(FL, "[[", 2)
#divide by the full matrix norm values to get DFL*(j+1) values
res_pres <- lapply(FL_j, function(x) {x/full_norm})

#extract every 1st subelement for each element of list of selected labels
labl <- lapply(FL, "[[", 1)
labl<- lapply(labl, function(x) matrix(unlist(x), ncol = 6, byrow = FALSE))


#take sets of 5 taxa
T5 <- apply(labl[[4]], 2, function(x) Order.vec[x])
df <- T5
df <- rbind(df,round(log(res_pres[[4]]),2))
rownames(df) <- c(paste0("Taxa ", 1:5),"log(DFL*)")
colnames(df) <- paste0("Permutation ",1:6)

kable(df, 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 11, bootstrap_options = "striped", full_width = TRUE, position = "left")
  1. For taxon $j = 1, \dots, p-1$

                 Using quantile matching fit the normal distribution to the

                 logarithm of the sample $DFL^*(j+1), j = 1, \dots, p-1$ to obtain

                 the null distribution $X_j \sim \mbox{SN}(\widehat\xi_j, \widehat\omega_j^2, \widehat\alpha_j)$

          end

Here, after $k$ values of $\log(DFL^)$ for each group of $j$ taxa are obtained, they are fit to a Skew Normal distribution with three parameters $\xi,\omega$ and $\alpha$ using quantile matching method. For example, to estimate the parameters of $5$ taxa distribution, we would use $\log(DFL^)$ values from the $6$ columns in the table above to fit Skew Normal distribution to the sample ${-21.25,-23.71,-4.7,-13.98,-3.95,-6.79}$.

This step is necessary to disentangle the null and alternative distributions contributing to the observed permutation distribution. When we use $k$ permutations to build the reference distribution of filtering loss differences, it always includes both the values from the null and from the alternative hypotheses as illustrated above in the example of step 4 of the algorithm. Therefore, we need to make the assumption that the resulting permutation distribution is a mixture of observed differences under the null and alternative distributions. The p-value is a measure of how extreme a value of statistic (here $DFL(5)$) is relative to the null distribution. If we use fully non-parametric approach and simply construct the p-value as the proportion of times when $\log DFL(5)$ is above $\log DFL^*(5)$, then we use mixture of null and alternative as the null distribution. Under this approach, the p-value would not be appropriately calculated and we may lose more taxa than necessary.

#distribution based on k = 1000 permutations
dfl_distr <- sampl_distr(X = X, k=1000) #, sample = FALSE

#distribution fit to skew-normal 
lfl <- lapply(dfl_distr, function(x) log(x[!x==0]))
lp <- list(xi = mean(lfl[[1]]), omega = sd(lfl[[1]]), alpha = 1.5)
fit <- lapply(lfl, function(x) fitdistrplus::qmedist(x, "sn", probs=c(0.10, 0.25, 0.5), start=lp))  
est <- lapply(fit, function(x) x$estimate)

#histogram of log(DFL^*) values for the 5th taxon
i=4
lfl <- data.frame(log(dfl_distr[[i]][!dfl_distr[[i]]==0]))
    if(length(dfl_distr[[i]][dfl_distr[[i]]==0])>0){
      print(paste("taxon", i, "number of zeroes = ", 
                  length(dfl_distr[[i]][dfl_distr[[i]]==0])))}
    names(lfl) <- c("DFL")
    #plot histogram
    x = "DFL"
    ord_map = aes_string(x = x)
    #hist <- ggplot(lfl, ord_map) + geom_histogram()
    hist <- ggplot(lfl, ord_map) + geom_histogram(bins = 30, aes(y=..density..), 
                                                  col = "red", fill = "green", alpha =0.2)+
      theme(panel.background = element_rect(fill = "white"), 
            panel.grid.major = element_line(colour = "grey90"),
            axis.text.x  = element_text( size=10))+
      ggtitle("") + xlab("log differences in filtering loss") + ylab("Density")

#add skew-normal fit
hist <- hist + stat_function(fun = dsn, 
            args = list(xi = est[[i]][1], omega = est[[i]][2], alpha = est[[i]][3]), colour="blue")
hist   

In this example, we increase the number of permutations to $k=1000$, which is necessary to get a reasonable distribution for the values of $DFL^*(j+1)$. The histogram of $5$ taxa distribution is given below with the blue line indicating Skew Normal fit with parameters ($\hat{\xi}, \hat{\omega},\hat{\alpha}$) = (r round(c(est[[i]][1],est[[i]][2],est[[i]][3]),2)). Since we have only $20$ taxa in this example, the distribution fit is not as accurate as for the larger number of taxa, but it illustrates the idea of the algorithm.

  1. For taxon $j = 1, \dots, p-1$

                 Calculate the p-value $p_j$ for $DFL(j+1), j = 1, \dots, p-1$ as

                 $p_j := P[X_j > \log{DFL(j+1)}]$

         end

The $\log(DFL)$ value for the $5^{th}$ taxon in simultaneous p-values ordering was calculated in step 3 as r round(log(DFL$DFL[[4]]),2). Therefore, we calculate the $5$th taxon p-value as $p_{5} = P[X_{5} > -19.72]$ = r round(1 - psn(x=-19.72, xi = -25.26, omega = 10.09, alpha = 34.41),4), where $X_{5} \sim \mbox{SN}(\widehat\xi_{5} = -25.26, \widehat\omega_{5}^2 = 10.09, \widehat\alpha_{5} = 34.41)$.

#all p-values
for(i in 1:(p-1)){
pvals[i] <- 1 - psn(x=log(DFL$DFL[i]), 
              xi = est[[i]][1], omega = est[[i]][2], alpha = est[[i]][3])}

data <- rbind(c(NA, round(pvals,2)), data)
rownames(data)[1] <- "Perm pvalues"
  1. Average $3$ subsequent p-values

The $4$ rows of the example OTU table below combines taxa PERFect simultaneous p-values, their corresponding $\log(DFL)$ values, raw PERFect permutation p-values, and their corresponding averaged values. The p-values of noise taxa are large and the p-values for the signal taxa are small as expected.

pvals_avg <- zoo::rollmean(pvals, k=3, align="left",  fill=NA )
#replace na's with original values
pvals_avg[is.na(pvals_avg)] <- pvals[is.na(pvals_avg)]

data <- rbind(c(NA, round(pvals_avg, 2)), data)
rownames(data)[1] <- "Avg Perm pvalues"

#rearrange rows
data <- data[c(3,4,2,1, 5:nrow(data)),]

#nice display
kable(data[1:4,1:13], 
      format='html', 
      digits=2, 
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 10, bootstrap_options = "striped", full_width = TRUE, position = "left") 

kable(data[1:4,14:20], 
      format='html',
      row.names=TRUE,
      escape = FALSE,
      padding=-3L) %>% kable_styling(font_size = 14, bootstrap_options = "striped", full_width = TRUE, position = "left")
  1. Filter the set of taxa $J_j$ with the first p-value such that $p_{j+1} \leq \alpha$

The OTU table above indicates that the first significantly small averaged p-value is $p_{14}= 0.02 \leq 0.1 :=\alpha$, which occurs at the first signal taxon $S_6$. Thus the algorithm has successfully filtered noise taxa and preserved all the true signal taxa.

#select taxa that are kept in the data set at significance level alpha
Ind <- min(which(pvals_avg <=0.1))
#if jth DFL is significant, then throw away all taxa 1:j 
filtX <- X[,-(1:Ind)]

# kable(filtX, 
#       format='html',
#       row.names=TRUE,
#       escape = FALSE,
#       padding=-3L) %>% kable_styling(font_size = 12)
#   

Workflow

The filtering methods for this package are wrapped into two main functions, PERFect_sim() (performing simultaneous filtering) and PERFect_perm() (performing permutation filtering). First, we load the OTU matrix with 240 samples and 46 taxa.

data(mock2)
Counts <- mock2$Counts
dim(Counts)

By default, the function PERFect_sim() takes the data table, $X$, as a matrix or data frame, orders it by the taxa abundance, uses 10%, 25% and 50% quantiles for matching the log of DFL to a Skew-Normal distribution and then calculates the p-value for each taxon at the significance level of $\alpha$ = 0.1. The function PERFect_sim() only needs a taxa table as the input, and other parameters are set to default.

res_sim <- PERFect_sim(X = Counts)

The "filtX" object from the result stores the filtered OTU matrix, which consists of 240 samples and 10 taxa in this example. We can identify the remaining taxa by looking into the column of the OTU matrix.

dim(res_sim$filtX)      
colnames(res_sim$filtX) # signal taxa

The p-values for all taxa can be plot using the function pvals_Plots().

p <- pvals_Plots(PERFect = res_sim, X = Counts, quantiles = c(0.25, 0.5, 0.8, 0.9), alpha=0.05)
p$plot + ggtitle("Simultanenous Filtering")

Alternatively, we can use permutation filtering PERFect_perm() which is more robust than simultaneous filtering. By default, this function generates k = 10000 permutations for each taxa, thus it can be computationally expensive for a large OTU matrix. We offer user a fast algorithm which employs an unbalanced binary search that optimally finds the cutoff taxon without building the permutation distribution for all taxa. The codes for these options are shown below.

res_perm <- PERFect_perm(X = Counts, Order = "pvals", pvals_sim = res_sim, algorithm = "full")
res_perm2 <- PERFect_perm(X = Counts, Order = "pvals", pvals_sim = res_sim, algorithm = "fast")
p1 <- pvals_Plots(res_perm, Counts)
p1 <- p1$plot + ggtitle("Full Algorithm")
p2 <- pvals_Plots(res_perm2, Counts)
p2 <- p2$plot + ggtitle("Fast Algorithm")
ggpubr::ggarrange(p1,p2,ncol = 2)

The figure above illustrates the plot of permutation PERFect p-values calculated by the full and fast algorithm for the mock2 dataset. Although both methods achieve the similar cutoff taxon, the fast algorithm only calculate 11 out 46 p-values hence is more computationally efficient.

References

  1. Smirnova, E., Huzurbazar, S., and Jafari, F.(2018). PERFect: PERmutation Filtering test for microbiome data, Biostatistics, kxy020, pp. 1-17. https://doi.org/10.1093/biostatistics/kxy020.

  2. Brooks, J. P., Edwards, D. J., Harwich, M. D., Rivera, M. C., Fettweis, J. M., Serrano, M. G., Reris, R. A., Sheth, N. U., Huang, B., Gigerd, P. and others. (2015). The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiology, 15, pp. 1–14.



katiasmirn/PERFect documentation built on Sept. 17, 2019, 11:54 a.m.