Beta Diversity
In ecodive: Parallel and Memory-Efficient Ecological Diversity Metrics

Input Matrix

Here we'll use the ex_counts feature table included with ecodive. It contains the number of observations of each bacterial genera in each sample. In the text below, you can substitute the word 'genera' for the feature of interest in your own data.

library(ecodive)

counts <- rarefy(ex_counts)

counts
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  309    6     1
#> Bacteroides            2    2    0   341
#> Corynebacterium        0    0  171     1
#> Haemophilus          180   34    0     1
#> Propionibacterium      1    0   82     0
#> Staphylococcus         0    0   86     1

Beta Diversity

Beta diversity is a measure of how different two samples are.

Looking at the counts matrix above, you can easily see that saliva and gums are similar, while saliva and stool are different. The different metrics described below quantify that difference, referred to as the "distance" or "dissimilarity" between a pair of samples. The distance is 0 for identical samples and 1 for completely different samples.

Weighted vs Unweighted

The classic algorithms all run in weighted mode by default. Specifying weighted = FALSE, e.g. canberra(counts, weighted = FALSE) will switch them to unweighted mode.

bray_curtis(), canberra(), euclidean(), gower(), jaccard(), kulczynski(), manhattan()

For the UniFrac algorithms, unweighted_unifrac() is unweighted and all the others are weighted.

Unweighted: unweighted_unifrac()
Weighted: weighted_unifrac(), weighted_normalized_unifrac(), generalized_unifrac(), variance_adjusted_unifrac()

Partial Calculation

The default value of pairs=NULL in ecodive's beta diversity functions results in the returned all-vs-all distance matrix being completely filled in.

bray_curtis(counts)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose  0.9797101 0.9826087          
#> Stool 0.9884058 0.9884058 0.9913043

If you are doing a reference-vs-all comparison, you can use the pairs parameter to skip unwanted calculations and save some CPU time. The larger the dataset, the more noticeable the improvement will be.

bray_curtis(counts, pairs = 1:3)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose  0.9797101        NA          
#> Stool 0.9884058        NA        NA

The pairs argument can be:

A numeric vector, giving the positions in the result to calculate.
A logical vector, indicating whether to calculate a position in the result.
A function(i,j) that returns whether columns i and j should be compared.

Therefore, all of the following are equivalent:

bray_curtis(counts, pairs = 1:3)
bray_curtis(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
bray_curtis(counts, pairs = function (i, j) i == 1)

The ordering of pairs follows the pairings produced by combn().

# Column index pairings
combn(ncol(counts), 2)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    1    2    2    3
#> [2,]    2    3    4    3    4    4

# Sample name pairings
combn(colnames(counts), 2)
#>      [,1]     [,2]     [,3]     [,4]   [,5]    [,6]   
#> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums"  "Nose" 
#> [2,] "Gums"   "Nose"   "Stool"  "Nose" "Stool" "Stool"

So, for instance, to use gums as the reference sample:

my_combn <- combn(colnames(counts), 2)
my_pairs <- my_combn[1,] == 'Gums' | my_combn[2,] == 'Gums'

my_pairs
#> [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

bray_curtis(counts, pairs = my_pairs)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose         NA 0.9826087          
#> Stool        NA 0.9884058        NA