knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Scope and purpose

In this vignette, we will walk through how to use tinyvamp to estimate and compare detection efficiencies across batches or experiments. We will consider the Phase 2 data of Costea et al. (2017), who spiked-in a synthetic community to 10 samples, which were then sequenced using three shotgun metagenomic sequencing experimental protocols (protocols H, Q and W). Taxon abundances were also estimated using flow cytometry.

Our analysis here will estimate the detection efficiencies of the three shotgun protocols compared to flow cytometry. We will also test the null hypothesis that detection efficiencies of the three shotgun protocols are the same.

Because of its flexibility, fitting a model using tinyvamp involves specifying a lot of parameters. The focus of this vignette will therefore be on how we set up the model to estimate detection efficiencies, rather than why we set up the model in this way. Please see the tinyvamp paper and references therein for those details!

Comparing detection efficiencies across experiments

Setup

We will start by loading the relevant packages. You might need to install logsum and fastnnls if you haven't already -- you can do this with remotes::install_github("ailurophilia/logsum") and remotes::install_github("ailurophilia/fastnnls").

library(tidyverse)
library(tinyvamp)

Now let's load the relevant data and inspect it. This dataset contains the MetaPhlAn2-estimated relative abundances of taxa in 29 samples, formatted in the usual MetaPhlAn2 way (percentages listed at all taxonomic levels):

data("costea2017_metaphlan2_profiles")

Since we are going to compare relative detections of the taxa in the synthetic community that was spiked into all samples, we will start by filtering to only the species in the spiked-in community (10 taxa). To do this, let's grab the composition data and save the column of species names:

data("costea2017_mock_composition")
mock_taxa <- costea2017_mock_composition$Taxon

Now, we can tidy the MetaPhlAn2 data and pull out the rows for our spike-in species:

costea2017_metaphlan2_profiles_species <- costea2017_metaphlan2_profiles %>% 
  filter(!str_detect(Clade, "t__")) %>% 
  filter(str_detect(Clade, paste(mock_taxa, collapse="|"))) %>% 
  mutate(Clade = str_remove(Clade, ".*s__"))

We can now see our final relative abundance table as follows:

costea2017_metaphlan2_profiles_species

Construct observation matrix W

We now need to grab the flow cytometry data and combine it with our sequencing data. Since two observations were taken on all taxa except Vibrio cholerae, let's start by averaging the flow cytometry measurements, then combining this data with the relative abundance data:

species_by_sample <- costea2017_mock_composition %>% 
  rename(cells_per_ml = 2) %>%
  group_by(Taxon) %>%
  summarize(cells_per_ml = mean(cells_per_ml)) %>%
  inner_join(costea2017_metaphlan2_profiles_species, by = c("Taxon" = "Clade"))

We can then finally arrange our data into the $n \times J$ matrix $\mathbf{W}$ that we will be modeling with tinyvamp:

W <- species_by_sample %>% 
  select(-Taxon) %>% 
  as.matrix %>% 
  t
colnames(W) <- species_by_sample$Taxon

Because it's a matrix, typing W will spew out a bunch of numbers -- but feel free to check it out in an interactive environment!

Specify sample-by-specimen and other design matrices

To fit tinyvamp, we need to specify the detection design matrix $\mathbf{X}$, which tells the software which samples were sequenced with which protocols. To do that, let's load the sample data and join it to our abundance data (this makes sure we don't mix up the order of the rows across datasets):

data("costea2017_sample_data")
protocol_df <- W %>% 
  as_tibble(rownames="Run_accession") %>%
  full_join(costea2017_sample_data) %>% 
  select(Run_accession, Protocol)
protocol_df

There are lots of different ways we could set up our comparisons between protocols. For example, we could estimate detection efficiencies with respect to the flow cytometry measurements. That is, we could estimate detection effects for protocol H vs flow cytometry; protocol W vs flow cytometry; protocol W vs flow cytometry (of course, we would get a different estimate of the detection effects for each taxon). Here's how we could create that matrix:

X <- protocol_df %>%
  mutate(hh = ifelse(Protocol == "H" & !is.na(Protocol), 1, 0),
         qq = ifelse(Protocol == "Q" & !is.na(Protocol), 1, 0),
         ww = ifelse(Protocol == "W" & !is.na(Protocol), 1, 0)) %>% 
  select(hh, qq, ww) %>%
  as.matrix

TODO(AW) can we remove the as.matrix?

However, we are interested in assessing the evidence against detection efficiencies being the same between protocols H, Q and W. To make our lives easy for testing this hypothesis later, we're going to estimate the model slightly differently: we're going to create a "intercept" variable as well as a protocol Q variable and a protocol W variable. The intercept variable equals 1 for all sequencing protocols, and the Q variable equals 1 for only samples sequenced with the protocol Q and zero for all other samples. You could also think of the intercept as an indicator for "this sample was sequenced". We're going to take those three variables as our $\mathbf{X}$ matrix.

X <- protocol_df %>%
  mutate(intercept = ifelse(!is.na(Protocol), 1, 0),
         qq = ifelse(Protocol == "Q" & !is.na(Protocol), 1, 0),
         ww = ifelse(Protocol == "W" & !is.na(Protocol), 1, 0)) %>% 
  select(intercept, qq, ww) %>%
  as.matrix

By parameterizing the model this way, we can test whether detection efficiencies being the same between protocols H, Q and W by testing the hypothesis $\beta_{Q} = \beta_{W} = \mathbf{0}$. (This may be clarified below -- so if you're confused, dear reader, read on!)

Since $\mathbf{X}$ and $\mathbf{\beta}$ go together in the model, lets also initialize our estimate of $\mathbf{\beta}$, and tell the model what we know about $\mathbf{\beta}$. Let initialize the estimation algorithm at "all taxa have the same detection efficiencies under all protocols":

B <- matrix(0, ncol = 10, nrow = 3)

(Remember that there are 10 taxa and 3 protocol variables: Intercept, Protocol Q and Protocol W)

What do we know about $\mathbf{\beta}$? Nothing, right? Well, almost. We always need to choose which taxon we're comparing our efficiencies against. Let's choose the last taxon, Yersinia pseudotuberculosis.

B_fixed_indices <- matrix(FALSE, ncol = 10,nrow = 3)
B_fixed_indices[,10] <- TRUE

Under this parametrization and constraints, we interpret

Now let's move on to set up the sample design matrix, which links samples to specimens. $\mathbf{Z}$ is a $n \times K$ matrix, where $n$ is the number of samples and $K$ is the number of specimens. Here, there is only one specimen -- the synthetic community that was spiked-in to every sample. So here's our $\mathbf{Z}$:

Z <- matrix(1, nrow = 30, ncol = 1)

We now need to initialize our relative abundance matrix $\mathbf{p}$, a $K \times J$. Let's initialize it at the sample relative abundance from the flow cytometry data, and tell the algorithm that we don't know any of the entries of $\mathbf{p}$.

P <- matrix(W[1,]/sum(W[1,]),nrow =1, ncol = 10)
P_fixed_indices <- matrix(FALSE,nrow = 1, ncol = 10)

Let's tell tinyvamp that we don't know the sample intensities $\mathbf{\gamma}$, and initialize them at the log-total number of reads across all taxa:

gammas <- apply(W,1,function(x) log(sum(x)))
gammas_fixed_indices <- rep(FALSE, length(gammas))

Almost done! We're not going to try to estimate any contamination in this model -- so $\mathbf{\tilde{Z}} = 0$, $\mathbf{\tilde{X}} = 0$, $\tilde{\gamma}$ could be anything (we set it to zero), and we tell it "don't estimate $\mathbf{\tilde{p}}$", like so:

Z_tilde <- matrix(0, nrow= 30, ncol = 1)
Z_tilde_gamma_cols <- 1
gamma_tilde <- matrix(0,ncol = 1, nrow = 1)
gamma_tilde_fixed_indices <- TRUE
P_tilde <- matrix(0,nrow =1, ncol = 10)
P_tilde_fixed_indices <- matrix(TRUE,nrow = 1, ncol = 10)
X_tilde <- matrix(0, ncol = 3, nrow= 1)

Fit full and null models

Woohoo! It's time to fit the model.

full_model <- estimate_parameters(W = W,
                                  X = X,
                                  Z = Z,
                                  Z_tilde = Z_tilde,
                                  Z_tilde_gamma_cols = Z_tilde_gamma_cols,
                                  gammas = gammas,
                                  gammas_fixed_indices = gammas_fixed_indices,
                                  P = P,
                                  P_fixed_indices = P_fixed_indices,
                                  B = B,
                                  B_fixed_indices = B_fixed_indices,
                                  X_tilde = X_tilde,
                                  P_tilde = P_tilde,
                                  P_tilde_fixed_indices = P_tilde_fixed_indices,
                                  gamma_tilde = gamma_tilde,
                                  gamma_tilde_fixed_indices = gamma_tilde_fixed_indices,
                                  alpha_tilde = NULL,
                                  Z_tilde_list = NULL)
full_model$optimization_status

Woohoo, done! Pretty fast considering the number of parameters in this model (67). For this analysis we just use the default optimization settings, but you can play around with them and learn more through ?estimate_parameters

full_model is a list that contains a bunch of information. You can what's in there with full_model %>% names. Let's dive right in and check out some of the estimates:

full_model$varying %>% as_tibble

TODO(picture!)

Now, to test... we need to tell tinyvamp what the null hypothesis is. We do this as follows:

### Create null model specification for test
null_param <- full_model
### set second and third rows of B equal to zero
null_param$B[2:3,] <- 0
null_param$B_fixed_indices[2:3,] <- TRUE

This next step can take ages! Good to parallelize if possible.

# bootstrap_test <-  bootstrap_lrt(W = W,
#                                  fitted_model= full_model,
#                                  null_param = null_param,
#                                  n_boot = 1000,
#                                  ncores = 5, 
#                                  parallelize = TRUE)

Fit reparametrization with $X_i = [1_H 1_Q 1_W]$ to (more easily) get CIs for protocol-specific effects (also can bootstrap from original model, but currently a bit more involved to pull out CIs for quantities of form $A%*%beta$ than for beta itself)

X_repar <- X
X_repar[,1] <- X_repar[,1] - X_repar[,2] - X_repar[,3]
full_reparam  <-
  estimate_parameters(W = W,
                      X = X_repar,
                      Z = Z,
                      Z_tilde = Z_tilde,
                      Z_tilde_gamma_cols = 1,
                      gammas = gammas,
                      gammas_fixed_indices = gammas_fixed_indices,
                      P = P,
                      P_fixed_indices = P_fixed_indices,
                      B = B,
                      B_fixed_indices = B_fixed_indices,
                      X_tilde = X_tilde,
                      P_tilde = P_tilde,
                      P_tilde_fixed_indices = P_tilde_fixed_indices,
                      gamma_tilde = gamma_tilde,
                      gamma_tilde_fixed_indices = gamma_tilde_fixed_indices,
                      alpha_tilde = NULL,
                      Z_tilde_list = NULL,
                      barrier_t = 1, #starting value of reciprocal barrier penalty coef.
                      barrier_scale = 10, #increments for value of barrier penalty
                      max_barrier = 1e12, #maximum value of barrier_t
                      initial_conv_tol = 1000,
                      final_conv_tol = 0.1,
                      constraint_tolerance = 1e-10,
                      hessian_regularization = 0.01,
                      criterion = "Poisson",
                      profile_P = FALSE,
                      profiling_maxit = 25,
                      wts = NULL,
                      verbose = TRUE)
# 
# full_cis <- bootstrap_ci(W,
#                          fitted_model = full_reparam,
#                          n_boot=1000,
#                          m = NULL,
#                          alpha = 0.05,
#                          parallelize = TRUE,
#                          ncores = 5,
#                          seed = 3423,
#                          return_models = FALSE,
#                          verbose = FALSE
#                          
# )
# 
# taxa <- colnames(W)
# taxa <- lapply(taxa,
#                function(x) strsplit(x,"_") %>%
#                  (function(y) paste(substr(y[[1]][1],1,1),". ",
#                                     y[[1]][2],sep = "",collapse = ""))) %>%
#   do.call(c,.)
# 
# full_cis$ci %>%
#   filter(param == "B") %>%
#   mutate(Protocol = c("H","Q","W")[k],
#          Estimate = round(value,2)) %>%
#   mutate(Upper = round(upper_ci,2),
#          Lower = round(lower_ci,2),
#          Taxon = rep(taxa[1:9],3)) %>%
#   dplyr::select(Protocol, Taxon, Estimate, Lower, Upper) %>%
#   mutate(Estimate = apply(cbind(Estimate,Lower,Upper),
#                           1, function(x) paste(x[1]," (", x[2], " - ", x[3], ")",
#                                                sep = "",
#                                                collapse = ""))) %>%
#   select(c(Protocol, Taxon, Estimate)) %>%
#   pivot_wider(id_cols = Taxon, names_from = Protocol,
#               values_from = Estimate) %>%
#   knitr::kable(format = "latex")
# 

10-fold cross-validation

# 
# # construct folds
# folds <- vector(10, mode = "list")
# available <- 2:30
# unique_individuals <- c(1:8,"M")
# 
# for(i in 1:9){
#   folds[[i]] <- which(individuals == unique_individuals[i]) +1
# }
# 
# folds[[10]] <- which(individuals %in% c("A","B")) +1
# 
# full_cv <- vector(10, mode = "list")
# null_cv <- vector(10, mode = "list")
# for(whichfoldout in 1:10){
#   print(whichfoldout)
#   heldout <- folds[[whichfoldout]]
#   nheldout <- length(heldout)
#   Z_cv <- Z
#   Z_cv <- cbind(Z_cv - Z_cv*(1:30 %in% heldout))
#   P_cv <- P
#   P_fixed_indices_cv <- P_fixed_indices
#   for(k in 1:nheldout){
#     Z_cv <- cbind(Z_cv,as.numeric(1:30 == heldout[k]))
#     P_cv <- rbind(P_cv,P)
#     P_fixed_indices_cv <- rbind(P_fixed_indices_cv,
#                                 P_fixed_indices)
#   }
#   
#   full_cv[[whichfoldout]]  <-
#     estimate_parameters(W = W,
#                         X = X,
#                         Z = Z_cv,
#                         Z_tilde = Z_tilde,
#                         Z_tilde_gamma_cols = 1,
#                         gammas = gammas,
#                         gammas_fixed_indices = gammas_fixed_indices,
#                         P = P_cv,
#                         P_fixed_indices = P_fixed_indices_cv,
#                         B = B,
#                         B_fixed_indices = B_fixed_indices,
#                         X_tilde = X_tilde,
#                         P_tilde = P_tilde,
#                         P_tilde_fixed_indices = P_tilde_fixed_indices,
#                         gamma_tilde = gamma_tilde,
#                         gamma_tilde_fixed_indices = gamma_tilde_fixed_indices,
#                         alpha_tilde = NULL,
#                         Z_tilde_list = NULL,
#                         barrier_t = 1, #starting value of reciprocal barrier penalty coef.
#                         barrier_scale = 10, #increments for value of barrier penalty
#                         max_barrier = 1e12, #maximum value of barrier_t
#                         initial_conv_tol = 1000,
#                         final_conv_tol = 0.1,
#                         constraint_tolerance = 1e-10,
#                         hessian_regularization = 0.01,
#                         criterion = "Poisson",
#                         profile_P = FALSE,
#                         profiling_maxit = 25,
#                         wts = NULL,
#                         verbose = FALSE)
#   
#   null_cv[[whichfoldout]]  <-
#     estimate_parameters(W = W,
#                         X = X[,1,drop = FALSE],
#                         Z = Z_cv,
#                         Z_tilde = Z_tilde,
#                         Z_tilde_gamma_cols = 1,
#                         gammas = gammas,
#                         gammas_fixed_indices = gammas_fixed_indices,
#                         P = P_cv,
#                         P_fixed_indices = P_fixed_indices_cv,
#                         B = B[1,,drop = FALSE],
#                         B_fixed_indices = B_fixed_indices[1,,drop = FALSE],
#                         X_tilde = X_tilde[,1,drop = FALSE],
#                         P_tilde = P_tilde,
#                         P_tilde_fixed_indices = P_tilde_fixed_indices,
#                         gamma_tilde = gamma_tilde,
#                         gamma_tilde_fixed_indices = gamma_tilde_fixed_indices,
#                         alpha_tilde = NULL,
#                         Z_tilde_list = NULL,
#                         barrier_t = 1, #starting value of reciprocal barrier penalty coef.
#                         barrier_scale = 10, #increments for value of barrier penalty
#                         max_barrier = 1e12, #maximum value of barrier_t
#                         initial_conv_tol = 1000,
#                         final_conv_tol = 0.1,
#                         constraint_tolerance = 1e-10,
#                         hessian_regularization = 0.01,
#                         criterion = "Poisson",
#                         profile_P = FALSE,
#                         profiling_maxit = 25,
#                         wts = NULL,
#                         verbose = FALSE)
#   
# }
# 
# full_cv_predictions <- lapply(1:10,
#                               function(x)
#                                 full_cv[[x]]$varying[
#                                   full_cv[[x]]$varying$param == "P"&
#                                     full_cv[[x]]$varying$k>1,])
# 
# for(i in 1:10){
#   full_cv_predictions[[i]]$k <- sapply(full_cv_predictions[[i]]$k,
#                                        function(x) folds[[i]][x - 1])
#   
# }
# 
# full_cv_predictions <- do.call(rbind,full_cv_predictions)
# fc_values <- W[1,]/sum(W[1,])
# full_cv_predictions$fc_value <- sapply(full_cv_predictions$j,
#                                        function(d) fc_values[d])
# full_cv_predictions$protocol <-
#   sapply(full_cv_predictions$k,
#          function(d) protocols[d-1]) #  d - 1 bc k starts at 2
# #  (k = 1 is fc data)
# 
# 
# full_cv_predictions$specimen <-
#   sapply(full_cv_predictions$k,
#          function(d) individuals[d - 1]) #  d - 1 bc k starts at 2
# #  (k = 1 is fc data)
# 
# 
# null_cv_predictions <- lapply(1:10,
#                               function(x)
#                                 null_cv[[x]]$varying[
#                                   null_cv[[x]]$varying$param == "P"&
#                                     null_cv[[x]]$varying$k>1,])
# 
# for(i in 1:10){
#   null_cv_predictions[[i]]$k <- sapply(null_cv_predictions[[i]]$k,
#                                        function(x) folds[[i]][x - 1])
#   
# }
# 
# null_cv_predictions <- do.call(rbind,null_cv_predictions)
# fc_values <- W[1,]/sum(W[1,])
# null_cv_predictions$fc_value <- sapply(null_cv_predictions$j,
#                                        function(d) fc_values[d])
# null_cv_predictions$protocol <-
#   sapply(null_cv_predictions$k,
#          function(d) protocols[d -1 ]) #  d - 1 bc k starts at 2 (k = 1 is fc data)
# 
# null_cv_predictions$specimen <-
#   sapply(null_cv_predictions$k,
#          function(d) individuals[d - 1]) #  d - 1 bc k starts at 2 (k = 1 is fc data)
# null_cv_predictions$model <- "Null Model"
# full_cv_predictions$model <- "Full Model"
# 
# W_prop <- W[-1,]
# for(i in 1:nrow(W_prop)){
#   W_prop[i,] <- W_prop[i,]/sum(W_prop[i,])
# }
# 
# naive_predictions <- null_cv_predictions[numeric(0),]
# 
# for(i in 1:nrow(W_prop)){
#   protocol <- protocols[i]
#   specimen <- individuals[i]
#   for(j in 1:ncol(W)){
#     naive_predictions <- rbind(naive_predictions,
#                                data.frame(value = W_prop[i,j],
#                                           param = "P",
#                                           k = i,
#                                           j  = j,
#                                           fc_value = fc_values[j],
#                                           protocol = protocol,
#                                           specimen = specimen,
#                                           model = "Plug-in"))
#   }
# }
# 
# 
# 
# 
# rbind(null_cv_predictions,full_cv_predictions,
#       naive_predictions) %>%
#   mutate(protocol = sapply(protocol, function(x) paste("Protocol ",x,
#                                                        sep = "",
#                                                        collapse = ""))) %>%
#   # filter(!is.na(protocol)) %>% #why would protocol be NA? Check!
#   ggplot() +
#   geom_point(aes(x = fc_value, y = value, color= specimen),
#              size = .5) +
#   geom_line(aes(x = fc_value, y = value, color = specimen,
#                 group = as.factor(k)), size = .5) +
#   geom_abline(aes(intercept = 0, slope = 1), linetype = 2) +
#   facet_grid(model~protocol,scales = "free_y") +
#   scale_color_grey() +
#   scale_y_log10() +
#   scale_x_log10() +
#   theme_bw() +
#   guides(color=guide_legend(title="Specimen")) +
#   xlab("Relative Abundance Measured by Flow Cytometry") +
#   ylab("Cross-Validated Estimated Relative Abundance")

References



statdivlab/tinyvamp documentation built on July 28, 2023, 11:21 p.m.