library(ape)
library(landvR) # Making sure this is v >= 0.4

This vignette shows briefly how to rarefy some data (e.g. geometric morphometric data, PCA scores, etc...) along a tree. An empirical example of this work is currently in review but can be found through Ariel's repeatable GitHub page. If you use this method, please contact Thomas or Ariel to figure out how to cite it (while the paper is still in review).

Data

Note that the following section (all the way up to Rarefying a test along a tree) are not required if you have your own dataset already. This is just some simulated example data. Feel free to skip!

The data required for this vignette will be a tree (of class "phylo") with species names and a "meta" dataset of specimens with species names and some data (again, the data can be PCA scores, centroid size, etc...).

The tree

set.seed(42)
## The tree with five species ("t1", "t2", etc...)
my_tree <- rcoal(5)
## What it looks like
plot(my_tree)

Note that in our example the tree is ultrametric but it doesn't matter. Only topology will be used.

The data

Here we are going to use a simple random bi-variate data. The dimensions of your dataset of course depends on you dataset and question. Also, here we are going to have a variable (random) number of specimens per species between 5 and 15.

## Function for generating specimens for one species
simulate.specimens <- function(name, sample.range = c(5:15), n.var = 2, var.fun = rnorm, ...) {
    ## Number of specimens
    n_spec <- sample(sample.range, 1)
    ## Get the variables
    variables <- replicate(n.var, var.fun(n_spec, ...))
    ## Output the table
    data.frame(cbind(paste(name, sep = "_", 1:n_spec), rep(name, n_spec)), variables)
}

## Get all the specimens into a data.frame
my_data <- do.call(rbind,sapply(my_tree$tip.label, simulate.specimens, simplify = FALSE))

## Giving names for clarity
colnames(my_data) <- c("specimen", "species", "var1", "var2")

## This is what it looks like
head(my_data)

This step and the one above is of course just for the purpose of this demonstration. You can directly use your own tree and data

Rarefying a test along a tree {#rarefy}

Getting each clade

First we need to identify and isolate each element from each partition (clade) using a modified version from the ape::prop.part function.

## A modified version of the ape::prop.part function:
prop.part.names <- function(phy, singletons = FALSE) {
    ## Get the bipartitions
    clades <- ape::prop.part(phy)

    ## Get the tips names for each clades
    clades <- lapply(clades, function(clade, labels) labels[clade], labels = attr(clades, "labels"))

    ## Add the tip names
    if(singletons) {
        clades <- c(clades, as.list(phy$tip.label))
    }

    return(clades)
}

## Getting the tree partitions (with singletons)
clades <- prop.part.names(my_tree, singletons = TRUE)

## Function for getting the dataset for each clade
get.data.clade <- function(clade, data, species.col.n, data.cols) {
    return(data[which(data[ ,species.col.n] %in% clade), data.cols])
}

## Get all the datasets per clades
data_clades <- lapply(clades, get.data.clade, data = my_data,
                      species.col.n = 2, data.cols = c(3:4))
# here data.cols = c(3:4) correspond to the column numbers of interest

Applying each tests

First we need to define the function that will get us the statistic. In our case we want the slope from a linear model (but any statistic can be used):

## Function for getting the slope of a linear model
get.slope <- function(data) {
    return(lm(data)$coefficients[[2]])
}

We can then calculate the estimated slope difference (observed - rarefied slope) by using the lowest potential number of specimens in our dataset (5 - see above).

## Getting the observed slopes
observed_slopes <- unlist(lapply(data_clades, get.slope))

## Getting the rarefied statistics for each clade
estimated_slopes_differences <- lapply(data_clades, rarefy.stat,
                                       stat.fun = get.slope, rarefaction = 5)

We can also compare them to the slope of the the full dataset:

## Getting the rarefied statistics for each clade
estimated_slopes_differences_overall <- lapply(data_clades, rarefy.stat,
                                               stat.fun = get.slope, rarefaction = 5,
                                               observed = get.slope(my_data[,c(3,4)]))

Summarising the results

We can then summarise these results per clade size (that's already done in this example) and adding the clade names (if they exist):

## Adding the names to the slopes
names(estimated_slopes_differences) <- unlist(lapply(clades, paste, collapse = "+"))
names(estimated_slopes_differences_overall) <- names(estimated_slopes_differences)
names(observed_slopes) <- names(estimated_slopes_differences)

And then transform these statistics into tables to simply plot them with boxplot:

par(mfrow = c(2,1))
## Combining the results into tables and plotting them
boxplot(do.call(cbind, estimated_slopes_differences),
        ylab = "Estimated slopes", xlab = "Clade",
        main = "Slope difference for each clade")
abline(h = 0, lty = 3)
boxplot(do.call(cbind, estimated_slopes_differences_overall),
        ylab = "Estimated slopes", xlab = "Clade",
        main = "Slope difference for each clade (compared to the whole dataset)")
abline(h = get.slope(my_data[,c(3,4)]), lty = 3)

Of course this data is pointless (as it is simulated).

A significant point

Once we have all the slope differences, it could be interesting to figure out a statistical threshold above which we can consider slopes as different. Theoretically, the estimated allometric slope can only be different up to a maximum of 90˚ angle of the observed slope (an orthogonal allometric slope) so we can consider a conservative angle difference of 4.5˚ as our threshold (5% of 90˚): a median estimated angle difference below 4.5˚ would then mean that the angles between our estimated slopes and the observed slope would be not significantly different than 0. In other words the estimated slope is not different than the observed slope (of course, depending on our significance threshold value).

Now as you've probably noticed, we switched from slopes to angles in the paragraph above. This is actually pretty easy using some good old trigonometry:

$\text{Angle difference} = atan(\frac{\text{estimated slope} - \text{observed slope}}{1 + \text{estimated slope} \times \text{observed slope}}) \times \frac{180}{\pi}$

We can then calculate that for all the slopes.

## Getting the slope differences for each estimated slope
slope_angles <- list()
for(one_clade in 1:length(observed_slopes)) {
    ## Calculating the slope differences in angles per clade
    slope_angles[[one_clade]] <- slope.diff(estimated_slopes_differences[[one_clade]],
                                           observed_slopes[[one_clade]])
}
names(slope_angles) <- names(estimated_slopes_differences)

And visualise the results

## Displaying the results
boxplot(slope_angles, ylim = c(0, 90), ylab = "Slope difference (in degrees)", xlab = "Clade")
abline(h = 4.5, lty = 2)
text(2, 6, "Significance threshold (4.5 degrees)", cex = 0.75)

In this made up example, we can see that the slopes are hard to estimate for individual taxa (t1 to t5) but not for the higher clades. In other words the slope measured for the individual taxa is influenced by the number of observations (e.g. specimens) while this is not the case for the higher clades.



TGuillerme/landvR documentation built on July 4, 2025, 10:16 p.m.