An overview of AnthropMMD

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(digits = 3)
library(AnthropMMD)

Graphical user interface

AnthropMMD was primarily designed as an R-Shiny application [@Santos18], that can be launched with the following instruction:

start_mmd()

Starting with the version 3.0.0, it can also be used from the command line, to make it convenient in a context of reproducible research.

Using AnthropMMD from the command line

Data formatting

The MMD is a dissimilarity measure among groups of individuals described by binary (presence-absence) traits [@Sjovold73; @Sjovold77; @Harris04]. The MMD formula only requires to know the sample sizes and relative trait frequencies within each group: providing the raw individual data is not mandatory.

Usually, the user will have recorded the data in one of the the two following formats.

Raw binary data

Most of the time, the data will be formatted as a classical dataframe with $n$ rows (one per individual) and $p+1$ columns (the first column must be a group indicator; the $p$ other columns correspond to the $p$ binary traits observed). Each trait has two possible values: $1$ if present, $0$ if absent (missing values are allowed). Rownames are optional but colnames (i.e., headers) are mandatory.

An artificial dataset, toyMMD, is available in the package to give an example of such a format:

data(toyMMD)
head(toyMMD)

It is not necessary for the Traits to be manually converted as factors prior to the analysis. toyMMD includes nine traits with many missing values, and the individuals belong to five groups:

str(toyMMD)

If the data were recorded as a raw binary dataset, they must be converted into a table of relative frequencies prior to the MMD analysis. The R function binary_to_table with the argument relative = TRUE can perform this operation:

tab <- binary_to_table(toyMMD, relative = TRUE)
tab

Tables of sample sizes and absolute frequencies

Sometimes, in particular if the data were extracted from research articles rather than true exprimentation, the user will only know the sample sizes and absolute trait frequencies in each group. As previously stated, those data are perfectly sufficient to compute MMD values. Here is an example of such a table:

data(absolute_freqs)
print(absolute_freqs)

If there are $k$ groups in the data, the first $k$ rows (whose names must start with an N_ prefix, followed by the group labels) indicates the sample sizes of each trait in each group, and the last $k$ rows indicates the absolute trait frequencies. Such a table can be directly imported by the user, but must be converted to relative trait frequencies prior to the analysis, for instance by using the function table_relfreq:

tab <- table_relfreq(absolute_freqs)
print(tab)

Important remark

Due to rounding issues, it is clearly not advised for the user to submit directly a table of relative frequencies. If the absolute frequencies do not perfectly sum up to 1 for each trait, an error will be triggered.

A table of relative frequencies is the starting point of all subsequent analyses, but it should be computed from the data loaded in R; it should not be loaded itself.

A quick summary:

Trait selection

AnthropMMD proposes a built-in feature for trait selection, in order to discard the traits that could be observed on too few individuals, or that do not show enough variability among groups. The function select_traits allows such a selection according to several strategies [@Harris04; @Santos18].

For instance, with the following instruction, we can discard the traits that could be observed on at least 10 individuals per group, and that exhibit no significant variability in frequencies among groups according to Fisher's exact tests:

tab_selected <- select_traits(tab, k = 10, strategy = "keepFisher")
tab_selected$filtered

Trait1 (which could be observed on only 8 individuals in Group B), Trait5 and Trait8 (whose variation in frequencies was not significant among groups) were removed from the data.

Compute the mean measure of divergence

Once the trait selection has been performed, the MMD can be computed with the mmd function. With the following instruction, the MMD is computed using Anscombe's angular transformation of trait frequencies:

mmd.result <- mmd(tab_selected$filtered, angular = "Anscombe")
mmd.result

Graphical representations of MMD

Multidimensional scaling

Although mmd.result$MMDSym can perfectly be passed to your favourite function to produce a MDS plot, AnthropMMD also proposes a built-in generic function for such a graphical representation: plot.

The MDS can be computed using the classical stats::cmdscale function (and the produces a metric MDS), or several variants of MDS algorithms implemented in the R package smacof.

For instance, we plot here the MDS coordinates computed with one variant of SMACOF algorithms:

par(cex = 0.8)
plot(x = mmd.result, method = "interval", 
     gof = TRUE, axes = TRUE, xlim = c(-1.2, 0.75))

The argument gof = TRUE displays goodness of fits statistics for the MDS configurations directly on the plot.

Hierarchical clustering

AnthropMMD does not propose a built-in function for hierarchical clustering, but such a plot can easily be obtained with the usual R functions.

library(cluster)
par(cex = 0.8)
plot(agnes(mmd.result$MMDSym), which.plots = 2, main = "Dendrogram of MMD dissimilarities")

References



Try the AnthropMMD package in your browser

Any scripts or data that you put into this service are public.

AnthropMMD documentation built on July 20, 2020, 5:07 p.m.