knitr::opts_chunk$set(echo = TRUE, fig.width = 4.5, fig.height = 4, fig.align = "center") set.seed(11)
This document shows both the data wrangling to generate easily splinectomeR-friendly dataframes from a standard taxa table, and the subsequent analysis demonstrating the splinectomeR functions.
The data here come from a study on the microbiome of Danish babies in the first 36 months of life. The picked OTU tables and corresponding metadata are available via the BROAD institute here, and their findings were published in the following article:
Yassour, et al. (2016) Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability. Sci. Trans. Med. 8(343)
The analysis below uses the 16S rRNA sequencing-based taxa table, included as package data here under the name filtered_otu_table
. If you load the package with library(splinectomeR)
, this data file should be accessible as an R object.
I saved each sheet of the metadata Excel files as a separate .txt file (though you could use an Excel-reading R package instead). I also discovered that there are a couple samples in the taxa table without metadata; similarly there is a sample in the metadata that doesn't have taxonomy data. I removed these from the metadata and any analysis below.
# First, load the required packages library(dplyr) library(tibble) library(ggplot2) library(reshape2) library(tidyr) library(splinectomeR) library(vegan)
set.seed(11) # For reproducibility
First, we can extract the samples and their corresponding timepoints from the taxa table's header.
otus <- filtered_otu_table sample_names <- colnames(otus) sample_time <- lapply(sample_names, FUN = function(x) strsplit(x, split = '_')) timeseries <- data.frame(matrix(nrow = 1101, ncol = 2, data = 0)) j = 1 for (i in sample_time) { timeseries[j,1] <- i[[1]][1] timeseries[j,2] <- i[[1]][2] j = j + 1 } colnames(timeseries) <- c('sample_id', 'month') head(timeseries, n=4) # To save this table for reference: # write.csv(timeseries, file = 'samples_timepoints2.csv', row.names = F, quote = F)
We can quickly then confirm the study IDs, and that we have the right longitudinal series by comparing these numbers to those in the paper. Next, we want to generate a full metadata/mapping file with information about whether each baby had antibiotics (Y/N), their breastfeeding status, delivery type, sex, and anything else. I combined the feeding rows with the general demographics in Excel before importing, and made the headers more R friendly (replaced spaces with underscores, etc). This data is included with the package, and is named general_metadata
. The antibiotics sheet is likewise saved with the package as antibiotics_metadata
.
metadata <- general_metadata head(metadata, n = 4)
# Create the binary antibiotic exposure annotation abx <- antibiotics_metadata abx_pos <- unique(abx$Subject) abx_df <- data.frame(abx_pos, 'Y') abx_neg <- setdiff(timeseries$sample_id, abx_pos) abx_df2 <- setNames(data.frame(abx_neg, 'N'), c('abx_pos', 'X.Y.')) abx_df <- rbind(abx_df, abx_df2) colnames(abx_df) <- c('baby_id', 'antibiotics_y_n') dim(abx_df) head(abx_df, n=4) # write.csv(abx_df, file = 'antibiotics_samplenames.csv', row.names = F, quote = F)
# Now, merge OTUs with each piece of the metadata otus <- filtered_otu_table otus <- data.frame(t(otus)) otus <- tibble::rownames_to_column(otus, var = 'sampleID') otus_split <- otus %>% separate(sampleID, c('baby_id', 'timepoint'), sep = '_') row.names(otus_split) <- otus$sampleID otus_split <- rownames_to_column(otus_split, var = '#SampleID') metadata <- general_metadata colnames(metadata)[1] <- 'baby_id' metadata <- merge(metadata, abx_df, by = 'baby_id') # Merge the metadata with the OTU table to create a master metadata # file including the taxonomy abundances full_data <- merge(metadata, otus_split, by = 'baby_id') x <- '#SampleID' # Move the "#SampleID" column to the front: full_data <- full_data[c(x, setdiff(names(full_data), x))] # write_delim(full_data, 'full_taxa_metadata_yassour.txt', delim = '\t')
The paper focuses many of its analyses at the Family level of taxonomy. This next chunk parses out the family abundances from the OTU table.
otus <- filtered_otu_table otus <- tibble::rownames_to_column(otus, var = 'OTU_ID') otus_family <- otus %>% filter(!grepl('g__', otus$OTU_ID) & grepl('f__', otus$OTU_ID)) otus_family_sum <- otus_family otus_family_sum$f__abun <- rowSums(otus_family[, 2:ncol(otus_family)]) otus_family_sum <- otus_family_sum[c('f__abun', setdiff(names(otus_family_sum), 'f__abun'))] otus_family_sum <- otus_family_sum[order(-otus_family_sum$f__abun), ] # write_delim(otus_family_sum, 'family_level_OTUs_with_sums.txt', delim = '\t') # The top 10 Families in this dataset otus_family_sum[1:10, 2]
Check that the abundances roughly add up to 1 across the families in a sample:
colSums(otus_family_sum[, 3:10])
Now, we're going to take the Family summarized data and merge that with the metadata. We'll be doing this same process several times during this analysis; this is a great opportunity to write a custom function that we can apply over and over.
# Define a function for all the tedious flipping, splitting, and merging flip_split_merge <- function(otus_in, metadata) { row.names(otus_in) <- otus_in$OTU_ID otus_in$OTU_ID <- NULL otus_in <- data.frame(t(otus_in)) otus_in <- tibble::rownames_to_column(otus_in, var = 'sampleID') otus_split <- otus_in %>% separate(sampleID, c('baby_id', 'timepoint'), sep = '_') otus_split$timepoint <- as.numeric(otus_split$timepoint) # Trouble recognizing numbers row.names(otus_split) <- otus_in$sampleID otus_split <- rownames_to_column(otus_split, var = 'SampleID') otus_meta <- merge(metadata, otus_split, by = 'baby_id') otus_meta <- otus_meta[c('SampleID', setdiff(names(otus_meta), 'SampleID'))] return(otus_meta) }
Now use the new function on the Family-summarized taxa table
family_tax_metadata <- flip_split_merge(otus_family, metadata) # To save the table: # write_delim(family_tax_metadata, 'family_taxa_metadata_yassour.txt', delim = '\t')
With this tool ready to go, we can now do some analysis from the data.
First, let's see if we can regenerate their first plot with the top ~6 bacterial Families (Fig. 1B). Before that, we'll make an ultra-longform table that will be absurdly long but easier to work with in wrangling and ggplot.
top_10_families <- otus_family_sum[1:10, 2] top_10_families <- lapply(top_10_families, FUN = function(x) gsub(x, pattern = '|', replacement = '.', fixed = T)) top_family_meta <- family_tax_metadata %>% gather(key = 'family', value = 'relative_abundance', 13:as.numeric(ncol(family_tax_metadata))) %>% filter(family %in% top_10_families) # Double check that it worked: unique(top_family_meta$family)
Now we can plot! We're going to try to remake their Figure using ggplot2's geom_smooth
function.
plot.df <- top_family_meta plot.df <- separate(plot.df, col = family, sep = 'f__', remove = T, into = c('uplevel', 'family')) plot.df$uplevel <- NULL ggplot(plot.df, aes(color = family, x = timepoint, y = relative_abundance)) + theme_bw() + geom_smooth(method='loess', span = 0.5) + labs(x = 'months', y = 'relative abundance') + theme(legend.position = 'right') + scale_x_continuous(breaks = c(0,12,24,36)) + theme(panel.grid = element_blank())
We included a few more families, but the plot is a very close match to Figure 1B.
The splinectomeR
package allows us to do a large number of statistical tests from this datatable using the longitudinal data. In the paper, statistical tests aren't performed across the intact longitudinal data, instead relying on visual comparisons of stream plots, for example, or on factors derived from the data, such as comparing consecutive-sample beta-diversity. With splinectomeR
, we can quickly test whether abundance of a taxonomic unit (Family, Genus, etc) is significantly different across the time series between babies that had antibiotics and those who didn't, or male/female babies, birth mode, etc. We can also ask whether the abundance of a particular taxonomic unit is going up or down over time in a particular group (antibiotic exposed, female, etc).
set.seed(11) # For reproducibility
# Test the difference between Bacteroidaceae abundance in # antibiotic exposed vs. non-exposed conditions f__Bacteroidaceae.top_family_meta <- top_family_meta %>% filter(family == 'k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__Bacteroidaceae') bacteroidaceae_result <- permuspliner(data = f__Bacteroidaceae.top_family_meta, xvar = 'timepoint', yvar = 'relative_abundance', perms = 99, category = 'antibiotics_y_n', cases = 'baby_id', retain_perm = T) permuspliner.plot.permdistance(bacteroidaceae_result, xlabel = 'timepoint')
This tells us that over the time course, the difference in Bacteroidaceae abundance between antibiotic exposed and non-exposed infant microbiomes is not greater than expected by random chance. But, this is a very dynamic environment, and perhaps it is significant at some point along the time course. We can test that as well, using splinectomeR's sliding_spliner
test.
set.seed(11) # For reproducibility
slide_result <- sliding_spliner(data = f__Bacteroidaceae.top_family_meta, xvar = 'timepoint', yvar = 'relative_abundance', category = 'antibiotics_y_n', cases = 'baby_id', test_density = 10, cut_low = 7, set_spar = 0.5) sliding_spliner.plot.pvals(data = slide_result, xvar = 'timepoint')
This shows that there are actually some significant differences near the 30-month window. There's also some interesting periodicity to the trend.
We might also ask whether Bacteroidaceae abundance increases or decreases overall as the babies get older. The trendyspliner
test allows us to see if the overall change in Bacteroidaceae abundance is non-linear across all babies over the whole time course.
set.seed(11) # For reproducibility
trendy_result <- trendyspliner(data = f__Bacteroidaceae.top_family_meta, xvar = 'timepoint', quiet = F, cases = 'baby_id', mean_center = T, yvar = 'relative_abundance', perms = 999)
We can use the same test and ask a similar question about other categorical variables. For exaple, maybe there are different trends in the 'mixed' vs 'only breastmilk' in-hospital feeding categories.
set.seed(11) # For reproducibility
result_mixedfeed <- trendyspliner(data = f__Bacteroidaceae.top_family_meta, xvar = 'timepoint', quiet = T, cases = 'baby_id', mean_center = T, category = 'feeding_in_hospital', group = 'mixed', yvar = 'relative_abundance') cat(result_mixedfeed$pval) result_breastmilk <- trendyspliner(data = f__Bacteroidaceae.top_family_meta, xvar = 'timepoint', quiet = T, cases = 'baby_id', mean_center = T, category = 'feeding_in_hospital', group = 'OnlyBreastmilk', yvar = 'relative_abundance') cat(result_breastmilk$pval)
Both are significant, meaning the abundance of Bacteroidaceae is non-linear over this time course. But remember, we'd have to plot it to know what direction each is going.
Once we are analyzing in R, we can take a programmatic approach, and run the permutation test over all the top Families, and correct for false discovery rate on the resulting p-values.
set.seed(11) # For reproducibility
top_20_families <- otus_family_sum[1:20, 2] top_20_families <- lapply(top_20_families, FUN = function(x) gsub(x, pattern = '|', replacement = '.', fixed = T)) top20_family_meta <- family_tax_metadata %>% gather(key = 'family', value = 'relative_abundance', 13:as.numeric(ncol(family_tax_metadata))) %>% filter(family %in% top_20_families) # Double check that it worked: unique(top20_family_meta$family) # Create a list containing each family's OTU+metadata table f_top20_abx_pvals <- list() for (f in unique(top20_family_meta$family)) { f__df <- top20_family_meta %>% filter(family == f) cat(f) f.result <- permuspliner(data = f__df, xvar = 'timepoint', yvar = 'relative_abundance', perms = 99, category = 'antibiotics_y_n', cases = 'baby_id', cut_low = 6, quiet = T) cat(paste0(', p = ',f.result$pval,'\n')) f_top20_abx_pvals <- append(f_top20_abx_pvals, f.result$pval) } f_top20_abx_qvals <- p.adjust(f_top20_abx_pvals, method = 'fdr') # Adjusted p values
Let's see what the Porphyromonadaceae plot looks like. You could subset the table to this Family and plot with your favorite plotting package. Or, the permuspliner
test includes two plotting functions. One plots the permuted splines along with the interpolated spline fit for the real data. The other one plots the distances between the two variables, and the distances between the permuted versions. In the former, a significant result would show most of the permuted mass lying between the real data splines.
set.seed(11) # For reproducibility
Porphyromonadaceae.top_family_meta <- top20_family_meta %>% filter(family == 'k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__Porphyromonadaceae') porph.result <- permuspliner(data = Porphyromonadaceae.top_family_meta, xvar = 'timepoint', yvar = 'relative_abundance', category = 'antibiotics_y_n', cases = 'baby_id', cut_low = 6, quiet = T, retain_perm = T) cat(porph.result$pval) permuspliner.plot.permsplines(data = porph.result, xvar = 'timepoint', yvar = 'relative_abundance')
We can see from this plot that this taxon is becoming more discriminating between the groups over time. The translucent lines depict the permuted splines, which serves as a visual for how distinct the two groups are relative to random assortment. So, at the beginning, while the group splines are distinct, the permuted data covers a wide range above and below the observed data; this suggests that there is considerable noise and the difference is not strong here. In contrast, as time going on the two lines move farther apart and the permuted lines lay almost entirely between the real data; this indicates that the difference between these groups is stronger than you would see by random chance.
The second plotting function shows the distances between the two groups (N and Y antibiotics, here), directly visualizing the relationship between the observed difference and the random distribution of differences. So, a significant result here should show the bulk of permutation distances lying below the real data.
permuspliner.plot.permdistance(porph.result, xlabel = 'timepoint')
This confirms what we would expect from the previous plot, showing that our confidence of the group difference is higher once the infants pass the first year of life, and continues to become more striking through the end of the time series, where there are very few permuted lines above the observed distance.
Let's look at birth mode the same way.
set.seed(11) # For reproducibility
for (f in unique(top20_family_meta$family)) { f__df <- top20_family_meta %>% filter(family == f) cat(f) f.result <- permuspliner(data = f__df, xvar = 'timepoint', yvar = 'relative_abundance', perms = 99, category = 'birth_mode', cases = 'baby_id', cut_low = 6, quiet = T) cat(paste0(', p = ', f.result$pval, '\n')) }
There are several families near significance ( Bacteroidaceae, Porphyromonadaceae, Clostridiaceae, Veillonellaceae, Pasteurellaceae, Enterobacteriaceae, during the writing of this vignette) but will not astonishing after adjusting for multiple comparisons. Perhaps they are more so at a given time in the series. We can apply the same approach using the sliding spline function.
set.seed(11) # For reproducibility
birthmode_families <- list( 'k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__Bacteroidaceae', 'k__Bacteria.p__Proteobacteria.c__Gammaproteobacteria.o__Enterobacteriales.f__Enterobacteriaceae', 'k__Bacteria.p__Firmicutes.c__Clostridia.o__Clostridiales.f__Clostridiaceae') top20_family_meta$birth_mode <- gsub(x = top20_family_meta$birth_mode, pattern = ' ',replacement = '_', fixed = T) slideplots <- function(xx) { f__df <- top20_family_meta %>% filter(family == xx) print(xx) f.result <- sliding_spliner(data = f__df, xvar = 'timepoint', yvar = 'relative_abundance', category = 'birth_mode', cases = 'baby_id', test_density = 3, cut_low = 7, set_spar = 0.5, quiet = T) sliding_spliner.plot.pvals(f.result, xvar = 'timepoint') } lapply(birthmode_families, FUN = slideplots)
In the original paper, they also tested for differences in response to antibiotics at the genus level. For example, we already looked at the Family Bacteroidaceae; we can re-test their hypothesis at the Genus level, again making use of all the original time points. In the paper, this comparison is done visually by plotting individual splines over an entire page in the supplement. Using splinectomeR here allows us to summarize the groups and test whether any difference is statistically significant.
Toa test that here, first we'll prep the filtered dataframe.
set.seed(11) # For reproducibility
otus_genus <- otus %>% filter(!grepl('s__', otus$OTU_ID) & grepl('g__', otus$OTU_ID)) # Run the function defined above genus_tax_metadata <- flip_split_merge(otus_genus, metadata) dim(genus_tax_metadata) bacteroides_genus_meta <- genus_tax_metadata %>% gather(key = 'genus', value = 'relative_abundance', 13:as.numeric(ncol(genus_tax_metadata))) %>% filter(genus == 'k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__Bacteroidaceae.g__Bacteroides') dim(bacteroides_genus_meta) colnames(bacteroides_genus_meta)
Now we can run the splinectomeR test, first comparing antibiotic exposure:
set.seed(11) # For reproducibility
Bacteroides_abx_result <- permuspliner(data = bacteroides_genus_meta, xvar = 'timepoint', yvar = 'relative_abundance', cases = 'baby_id', perms = 99, category = 'antibiotics_y_n', retain_perm = T, quiet = T) cat('p-value =', Bacteroides_abx_result$pval)
This Genus is not significantly different between antibiotic exposure groups over the time series. But, let's check birth mode, since this was something of interest in the original report:
set.seed(11) # For reproducibility
Bacteroides_birthmode_result <- permuspliner(data = bacteroides_genus_meta, xvar = 'timepoint', yvar = 'relative_abundance', cases = 'baby_id', perms = 999, category = 'birth_mode', retain_perm = F, quiet = T) cat('p-value =', Bacteroides_birthmode_result$pval)
The test supports the conclusion that the Bacteroides genus is indeed significantly different in the vaginal vs. cesarean born infants over these three years. We can investigate if there is a particularly divergent region with the sliding spline test.
bacteroides_slider <- sliding_spliner(data = bacteroides_genus_meta, xvar = 'timepoint', yvar = 'relative_abundance', category = 'birth_mode', cases = 'baby_id', quiet = T, test_density = 3, # Only four C-section births total ints = 36, set_spar = 0.5, cut_low = 3) # Set ints to 36 to go ~monthly sliding_spliner.plot.pvals(bacteroides_slider, xvar = 'timepoint')
Interestingly, the biggest differences are in the first 1-2 years. But, we know from the permuspliner test that these group differences are extensive enough to be significant over the entire 36-month time series. It is worth noting that there were only four cesarean births, so the groups are weighted differently in these comparisons.
One parameter that is often studied in the microbiome is the ratio of Firmicutes to Bacteroidetes (Phlya) within individuals, and the correlations between changes in the F:B ratio and risk for health conditions like obesity ( Ley et al., 2006 ). Typically, infants have a low ratio (greater Bacteroidetes) but this is inverted in adults ( Mariat et al., 2009 ). Let's look at this ratio across all babies.
set.seed(11) # For reproducibility
# Filter and merge with mapping otus_phyla <- otus %>% filter(!grepl('c__', otus$OTU_ID) & grepl('p__', otus$OTU_ID)) phyla_tax_metadata <- flip_split_merge(otus_phyla, metadata) all_phyla_meta <- phyla_tax_metadata %>% gather(key = 'phyla', value = 'relative_abundance', 13:as.numeric(ncol(phyla_tax_metadata))) # Now, unfortunately the cases need to be unique factors for the tests to work. # So here we will just create a pseudonym for each baby's phylum abundances all_phyla_meta$baby_id_phylum <- paste(all_phyla_meta$baby_id, all_phyla_meta$phyla, sep = '_') head(all_phyla_meta$baby_id_phylum) # Now, let's use the Phyla as the category, instead of antibiotics firm_vs_oides_result <- permuspliner(data = all_phyla_meta, xvar = 'timepoint', yvar = 'relative_abundance', cases = 'baby_id_phylum', perms = 999, category = 'phyla', groups = c('k__Bacteria.p__Bacteroidetes','k__Bacteria.p__Firmicutes'), retain_perm = T, quiet = T) cat('p-value =', firm_vs_oides_result$pval)
But remember, these are relative abundances... So if studying two major phyla, when one goes down in proportion, the other expands in proportion. What we can do is either look at them individually (Do Bacteroidetes go up/down over time? Do Firmicutes?) or measure the ratio and then look at whether the change in the F:B ratio is non-zero over time. These questions can be answered using splinectomeR's trendyspliner
test.
set.seed(11) # For reproducibility
phyla_tax_metadata_ratio <- phyla_tax_metadata phyla_tax_metadata_ratio$firm_to_bact <- (phyla_tax_metadata$k__Bacteria.p__Firmicutes / phyla_tax_metadata_ratio$k__Bacteria.p__Bacteroidetes) # Test non-zero trend here firm_to_bact_result <- trendyspliner(data = phyla_tax_metadata_ratio, xvar = 'timepoint', yvar = 'firm_to_bact', cases = 'baby_id', mean_center = F) # Check the difference between birth modes # (antibiotic use not really different, not shown) firm_bact_birthmode_result <- permuspliner(data = phyla_tax_metadata_ratio, xvar = 'timepoint', yvar = 'firm_to_bact', cases = 'baby_id', category = 'birth_mode', quiet = T, perms = 99) cat('Permuspliner p-value =', firm_bact_birthmode_result$pval)
These results tell us that the Firmicutes/Bacteroidetes ratio is indeed changing over these three years, and presents as a significantly distinct trend between C-section and vaginally born babies (with the limitation of small C-section cohort). This demonstrates how this hypothesis is quickly and straightforwardly tested with these tools. Also, the plots still show how noisy the data is -- the large peaks out from the norm only result because there is considerable variability that, when shuffled in the right way, creates large differences by chance.
Let's look similarly at another ratio that is often studied in microbiome analyses: Bacteroides vs Prevotella (Genera), with the caveat that these have mostly focused on adult samples.
set.seed(11) # For reproducibility
genus_tax_metadata_ratio <- genus_tax_metadata genus_tax_metadata_ratio$prev_bact_ratio <- (genus_tax_metadata$k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__Prevotellaceae.g__Prevotella + genus_tax_metadata_ratio$k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__.Paraprevotellaceae..g__.Prevotella.) / genus_tax_metadata$k__Bacteria.p__Bacteroidetes.c__Bacteroidia.o__Bacteroidales.f__Bacteroidaceae.g__Bacteroides prev_bact_result <- trendyspliner(data = genus_tax_metadata_ratio, xvar = 'timepoint', yvar = 'prev_bact_ratio', cases = 'baby_id', mean_center = T, perms = 99, quiet = T) cat('p-value =', prev_bact_result$pval) # Check the difference between birth modes (antibiotic use not really different, not shown) prev_bact_birthmode_result <- permuspliner(data = genus_tax_metadata_ratio, xvar = 'timepoint', yvar = 'prev_bact_ratio', cases = 'baby_id', category = 'birth_mode', perms = 99, quiet = T) cat('p-value =', prev_bact_birthmode_result$pval)
If we look closely at the actual ratio values, they indicate that there is so little Prevotella present in these babies that the results are harder to interpret; Prevotella appears to be mostly abundant in the C-section babies, and this is what likely drives the permuspliner p value lower.
set.seed(11) # For reproducibility
bifidobacterium_genus_meta <- genus_tax_metadata %>% gather(key = 'genus', value = 'relative_abundance', 13:as.numeric(ncol(genus_tax_metadata))) %>% filter(genus == 'k__Bacteria.p__Actinobacteria.c__Actinobacteria.o__Bifidobacteriales.f__Bifidobacteriaceae.g__Bifidobacterium') head(bacteroides_genus_meta) bifido_slider <- sliding_spliner(data = bifidobacterium_genus_meta, xvar = 'timepoint', yvar = 'relative_abundance', category = 'feeding_in_hospital', quiet = T, cases = 'baby_id', test_density = 10, cut_low = 7, set_spar = 0.5, groups = c('OnlyBreastmilk','mixed')) sliding_spliner.plot.pvals(bifido_slider, xvar = 'timepoint')
This plot there's something about the period where many of the babies are beginning to wean that is different between these groups. The average age when weaning can be reported easily:
mean(as.numeric(metadata$months_end_breastfeeding))
Clostridium groups IV and XIVa are of interest in microbiome studies because they are butyrate producers, butyrate being one of the primary short chain fatty acids that may be mechanistically involved in some of the health consequences associated with microbiome composition. I'm including the Faecalibacterium genus as a member of cluster IV, though it may not be universally included, it appears.
In the paper, they use the relative abundance at a single timepoint to test the difference (Fig. S8). With splinectomeR, we can use all of the data across the entire time series to test whether antibiotic exposure status is associated with different levels of these butyrate producers.
set.seed(11) # For reproducibility
# Define the Genera in these Clostridium clusters (that are present in this dataset) clostrid_groups <- c( 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Coprococcus', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Dorea', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Lachnospira', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Roseburia', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__[Ruminococcus]', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Faecalibacterium') # Filter and merge otus_clostrid <- otus %>% filter(OTU_ID %in% clostrid_groups) clostrid_tax_metadata <- flip_split_merge(otus_clostrid, metadata) clostrid_tax_metadata$clostridiales_abundance <- rowSums(x = clostrid_tax_metadata[, 13:ncol(clostrid_tax_metadata)]) colnames(clostrid_tax_metadata) clostrid_result <- permuspliner(data = clostrid_tax_metadata, xvar = 'timepoint', yvar = 'clostridiales_abundance', category = 'antibiotics_y_n', quiet = T, cases = 'baby_id', retain_perm = T) cat('p-value =', clostrid_result$pval)
Noteably, this is a stronger result than when comparing relative abundance at 36 months, and provides even more support for their conclusions. We can also look at the plot to see where the differences mostly lie temporally.
permuspliner.plot.permsplines(clostrid_result, xvar = 'timepoint', yvar = 'clostridiales_abundance') permuspliner.plot.permdistance(clostrid_result, xlabel = 'timepoint')
Interesting - they nearly converge where the authors were testing the difference (36 months). Looking across the time series gives a more complete answer, and in this case it is statistically more powerful as well.
Side note: if you remove Faecalibacterium, the difference even slightly stronger overall but at 36 months they are much more similar; therefore, without testing across the full time course, as enabled by splinectomeR, you would not detect a difference with that timepoint.
First, calculate the Shannon diversity (using the vegan
package) from the OTU table, and add the metadata. We don't have the raw counts, so we will just filter to a Genus table and calculate the Shannon diversity from that.
set.seed(11) # For reproducibility
otus_genus <- otus %>% filter(!grepl('s__', otus$OTU_ID) & grepl('g__', otus$OTU_ID)) alphadiv <- as.data.frame(diversity(t(otus_genus[,2:ncol(otus_genus)]), 'shannon')) alphadiv <- tibble::rownames_to_column(alphadiv, var = 'sampleID') colnames(alphadiv)[2] <- 'shannon' alphadiv <- alphadiv %>% separate(sampleID, c('baby_id', 'timepoint'), sep = '_') alphadiv$timepoint <- as.numeric(alphadiv$timepoint) alphadiv$shannon <- as.numeric(alphadiv$shannon) alphadiv_meta <- merge(metadata, alphadiv, by = 'baby_id')
Is alpha diversity different over time between antibiotic status or birth mode?
set.seed(11) # For reproducibility
alpha_permresult_abx <- permuspliner(data = alphadiv_meta, xvar = 'timepoint', yvar = 'shannon', category = 'antibiotics_y_n', cases = 'baby_id', perms = 99, retain_perm = T, quiet = T) alpha_permresult_mode <- permuspliner(data = alphadiv_meta, xvar = 'timepoint', yvar = 'shannon', category = 'birth_mode', cases = 'baby_id', perms = 99, retain_perm = T, quiet = T) p_abx <- permuspliner.plot.permsplines(data = alpha_permresult_abx, xvar = 'timepoint', yvar = 'shannon') p_mode <- permuspliner.plot.permsplines(data = alpha_permresult_mode, xvar = 'timepoint', yvar = 'shannon') p_abx
p_mode
It looks like alpha diversity is increasing in these babies, regardless their antibiotic exposure or birth mode, and that increase is fairly similar in rate and shape between these environmental factors.
We can also look more finely at the differences at each interval of time.
set.seed(11) # For reproducibility
alpha_slide <- sliding_spliner(data = alphadiv_meta, xvar = 'timepoint', yvar = 'shannon', category = 'birth_mode', cases = 'baby_id', set_spar = 0.5, quiet = T) sliding_spliner.plot.pvals(alpha_slide, xvar = 'timepoint')
Another hypothesis surrounding babies in the first years of life is that their overall diversity should be increasing as they get older. We can confirm this with our tests, by asking whether alpha diversity is non-linear over time. We can also test this within the groups of antibiotic exposed and non-exposed babies.
set.seed(11) # For reproducibility
alpha_trendresult_all <- trendyspliner(data = alphadiv_meta, xvar = 'timepoint', yvar = 'shannon', cases = 'baby_id', perms = 99, quiet = T) cat('p-value =', alpha_trendresult_all$pval) # And to visualize the trend in the context of the permuted data trendyspliner.plot.perms(alpha_trendresult_all, xlabel = 'time', ylabel = 'Shannon diversity') # Now in antibiotic exposed infants alpha_trendresult_abx <- trendyspliner(data = alphadiv_meta, xvar = 'timepoint', yvar = 'shannon', cases = 'baby_id', perms = 99, category = 'antibiotics_y_n', group = 'Y', quiet = T) cat('p-value =', alpha_trendresult_abx$pval) # And in infants not exposed to antibiotics alpha_trendresult_noabx <- trendyspliner(data = alphadiv_meta, xvar = 'timepoint', yvar = 'shannon', cases = 'baby_id', perms = 99, quiet = T, category = 'antibiotics_y_n', group = 'N') cat('p-value =', alpha_trendresult_noabx$pval)
Alpha diversity increases significantly in both groups, but the trends are not significantly different over time between antibiotics (from the previous tests).
Yassour et al. present a monumental dataset with dense longitudinal data that details the human gut microbiome's complex dynamics over the first three years of life. They also have recorded data on several variables which the microbiome patterns can be investigated, including antibiotic exposure, delivery mode, and feeding patterns.
A remaining challenge in analyses like these is a lack of clear, simple methods to directly compare data across multiple individuals and many time points--in particular when those time points are not in sync--without averaging or adding across multiple points, or subsetting and dropping many of the data points. In both cases, the longitudinal density is not being leveraged to its full possible extent.
In this vignette, we've shown how these splinectomeR
tools can quickly assess statistical significance by summarizing group data with splines and generating randomly permuted distributions to evaluate the probability that the observed magnitude of difference is due to chance. By employing the three test functions, we can explore overall changes in abundances or metrics, and compare longitudinal trends between two categorical groups.
The approaches here provide statistical support for proposals the authors have drawn, and allow us to test supplemental hypotheses. Importantly, we are able to perform these tests in a way that takes full advantage of the longitudinal data and maintains the individual observations.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.