# R/mantel.R In kevinstadler/cultevo: Tools, Measures and Statistical Tests for Cultural Evolution

#### Documented in mantel.testmantel.test.defaultmantel.test.formulamantel.test.listplot.mantel

#' Perform one or more Mantel permutation tests.
#'
#' Perform correlation tests between pairs of distance matrices. The Mantel
#' test is different from classical correlation tests (such as those
#' implemented by \code{\link[stats]{cor.test}}) in that the null distribution
#' (and significance level) are obtained through randomisation. The null
#' distribution is generated by shuffling the locations (matrix rows and
#' columns) of one of the matrices to calculate an empirical null distribution
#' for the given data set.
#'
#' If the number of possible permutations of the matrices is reasonably close
#' to the number of permutations specified by the trials parameter, a
#' deterministic enumeration of all the permutations will be carried out
#' instead of random sampling: such a deterministic test will return an exact
#' p-value.
#'
#' @references Dietz, E. J. 1983 “Permutation Tests for Association
#' Between Two Distance Matrices.” \emph{Systematic Biology} 32 (1): 21-–26.
#' <https://doi.org/10.1093/sysbio/32.1.21>.
#'
#' North, B. V., D. Curtis and P. C. Sham. 2002 “A Note on the Calculation of
#' Empirical P Values from Monte Carlo Procedures.” \emph{The American Journal of
#' Human Genetics} 71 (2): 439-–41. <https://doi.org/10.1086/341527>.
#'
#' @param x a formula, distance matrix, or list of distance matrices (see below)
#' @param y a data frame, distance matrix, or list of distance matrices of the
#'   same length as x
#' @param plot logical: immediately produce a plot of the test results (default:
#'   FALSE)
#' @param method correlation coefficient to be computed. Passed on to
#'   \link[stats]{cor}, so one of "spearman", "kendall", or, inadvisable
#'   in the case of ties: "pearson". Following Dietz (1983), "spearman" is
#'   used as a default that is both powerful and robust across different
#'   distance measures.
#' @param trials integer: maximum number of random permutations to be computed
#'   (see Details).
#' @param omitzerodistances logical: if TRUE, the calculation of the
#'   correlation coefficient omits pairs of off-diagonal cells which contain a
#'   0 in the \emph{second} distance matrix argument. (For the formula
#'   interface, this is the matrix which specifies the meaning distances.)
#' @param groups when x is a formula: column name by which the data in y is
#'   split into separate data sets to run several Mantel tests on
#' @param stringdistfun when x is a formula: edit distance function used to
#'   compute the distance matrix from the specified string column. Supports any
#'   edit distance function that returns a distance matrix from a vector or
#'   list of character strings. Default is Levenshtein distance
#'   [normalisedlevenshteindists()] and [orderinsensitivedists()].
#' @param meaningdistfun when x is a formula: meaning distance function used
#'   to compute the distance matrix from the specified meaning columns.
#'   Defaults to Hamming distances between meanings ([hammingdists()]), custom
#'   meaning functions can be created easily using
#'   [wrap.meaningdistfunction()].
#' @param ... further arguments which are passed on to the default method (in
#'   particular plot, method, trials and omitzerodistances)
#' @return A dataframe of class mantel, with one row per Mantel test carried
#'   out, containing the following columns:
#' \describe{
#'    \item{\code{method}}{Character string: type of correlation coefficient used}
#'    \item{\code{statistic}}{The veridical correlation coefficient between
#'          the entries in the two distance matrices}
#'    \item{\code{rsample}}{A list of correlation coefficients calculated
#'          from the permutations of the input matrices}
#'    \item{\code{mean}}{Average correlation coefficient produced by the permutations}
#'    \item{\code{sd}}{Standard deviation of the sampled correlation coefficients}
#'    \item{\code{p.value}}{Empirical p-value computed from the Mantel
#'          test: let \code{ngreater} be the number of correlation coefficients
#'          in \code{rsample} greater than or equal to \code{statistic}, then
#'          \code{p.value} is \code{(ngreater+1)/(length(rsample)+1}} (North,
#'          Curtis and Sham 2002).
#'    \item{\code{p.approx}}{The theoretical p-value that would correspond
#'          to the standard \code{z} score as calculated above.}
#'    \item{\code{is.unique.max}}{Logical, \code{TRUE} iff the veridical
#'          correlation coefficient is greater than any of the coefficients
#'          calculated for the permutations. If this is true, then
#'          \code{p.value == 1 / (length(rsample)+1)}}
#'  }
#' Multiple \code{mantel} objects can easily be combined by calling
#' \code{rbind(test1, test2, ...)}.
#' @examples
#' # small distance matrix, Mantel test run deterministically
#' mantel.test(dist(1:7), dist(1:7))
#'
#'
#' \dontrun{
#' # run test on smallest distance matrix which requires a random
#' # permutation test, and plot it
#' plot(mantel.test(dist(1:8), dist(1:8), method="kendall"))
#' }
#'
#' \dontrun{
#' # 2x2x2x2 design
#' mantel.test(hammingdists(enumerate.meaningcombinations(c(2, 2, 2, 2))),
#'   dist(1:16), plot=TRUE)
#' }
#' @export
mantel.test <- function(x, y, ...)
UseMethod("mantel.test")

#' @describeIn mantel.test Perform Mantel correlation test on two distance
#' matrices. The distance matrices can either be of type
#' \code{\link[stats]{dist}}, plain R matrices or any object that can be
#' interpreted by \code{\link{check.dist}}. The order of the two matrices does
#' not matter unless \code{omitzerodistances = TRUE}, in which case cells with
#' a 0 in the \emph{second} matrix are omitted from the calculation of the
#' correlation coefficient. For consistency it is therefore recommended to
#' always pass the string distance matrix first, meaning distance matrix second.
#' @importFrom stats cor
mantel.test.default <- function(x, y, plot=FALSE,
method=c("spearman", "kendall", "pearson"), trials=9999,
omitzerodistances=FALSE, ...) {
method <- match.arg(method)
m1 <- check.dist(x)
m2 <- check.dist(y)
d <- dim(m1)[1]
if (dim(m2)[1] != d)
stop("The two distance matrices have incompatible dimensions")

# figure out actual complexity by counting *unique* combinations in rows:
# check if a row i is equivalent to a row j which has i+j swapped (in which
# case the permutation of the two does not lead to different r)
#  swaps <- combinat::combn(d, 2)
#  inconsequentialswaps <- apply(swaps, 2, function(i) {
#    !any(m1[i[1],] != m2[i[2], c(seq_len(i[1]-1), i[2], i[1]+seq_len(i[2]-i[1]-1), i[1], i[2]+seq_len(d-i[2]))])
#    })
#  if (any(inconsequentialswaps))
#    print(swaps[,inconsequentialswaps])

# extract values relevent for correlation computation
indices <- which(lower.tri(m1))
if (omitzerodistances)
indices <- setdiff(indices, which(m2 == 0))
m1 <- m1[indices]
duration <- max(0.001, system.time(veridical <- cor(m1, m2[indices], method=method))[[3]])

# determine whether it would be more efficient to enumerate all permutations:
# assuming we take 'trials' samples, how many of the factorial(d) possible
# values haven't been selected yet compared to the maximum number of trials?
exact <- tryCatch(trials/((factorial(d)*stats::pgeom(trials, 1/factorial(d),
lower.tail=FALSE))) >= 1, warning=function(x) FALSE)
if (exact) {
trials <- factorial(d)
message("Permutation space is small, enumerating all ", trials, " possible permutations.")
}
if (duration*trials > 30)
message("Estimated time to evaluate all ", trials, " permutations is ",
round(duration*trials), " seconds, go get yourself a biscuit!")

if (exact) {
rsample <- sapply(combinat::permn(d), function(perm)
cor(m1, shuffle.locations(m2, perm)[indices], method=method))
} else {
rsample <- replicate(trials,
cor(m1, shuffle.locations(m2)[indices], method=method))
# calculate lower bound by sampling local permutations
# all pairwise swaps
# switches <- combinat::combn(d, 2)
# localrs <- apply(switches, 2, function(i) {
#   cor(m1,
#     shuffle.locations(m2,
#       c(seq_len(i[1]-1), i[2], i[1]+seq_len(i[2]-i[1]), i[1], i[2]+seq_len(d-i[2])))[indices],
#     method=method)})
# p.lowerbound <- sum(localrs >= veridical) / factorial(d)
}

greater <- sum(rsample >= veridical)
# empirical p value http://www.ncbi.nlm.nih.gov/pmc/articles/PMC379178/
p.empirical <- (greater + 1) / (length(rsample) + 1)

mn <- mean(rsample)
s <- stats::sd(rsample)

result <- data.frame(method=method, statistic=c(r=veridical),
N=length(indices), mean=mn, sd=s, p.value=p.empirical,
p.approx=ifelse(exact, NA,
stats::pnorm((veridical - mn) / s, lower.tail=FALSE)),
is.unique.max=(greater==0), alternative="greater",
rsample=I(list(rsample)), stringsAsFactors=FALSE)
if (plot)
plot.mantel(result)
structure(result, class=c("mantel", "htest", "data.frame"))
}

#' @export
mantel.test.dist <- mantel.test.default

#' @describeIn mantel.test This function can be called with raw experimental
#' result data frames, distance matrix calculation is taken care of internally.
#' \code{x} is a formula of the type \code{s ~ m1 + m2 + ...} where \code{s}
#' is the column name of the character strings in data frame or matrix \code{y},
#' while \code{m1} etc. are the column names specifying the different meaning
#' dimensions. To calculate the respective distances, the function
#' \code{stringdistfun} is applied to the strings, \code{meaningdistfun} to the
#' meaning columns.
#' @examples
#'
#' # using the formula interface in combination with a data frame:
#' print(data <- cbind(word=c("aa", "ab", "ba", "bb"),
#'   enumerate.meaningcombinations(c(2, 2))))
#'
#' mantel.test(word ~ Var1 + Var2, data)
#' @export
mantel.test.formula <- function(x, y, groups=NULL, stringdistfun=utils::adist,
meaningdistfun=hammingdists, ...) {
t <- stats::terms(x)
fields <- rownames(attr(t, "factors"))
lhs <- fields[attr(t, "response")]
rhs <- attr(t, "term.labels")
# TODO implement grouping/conditioning term parsing for formula interface
if (is.null(groups)) {
# pass meanings first
mantel.test.default(stringdistfun(y[,lhs]), meaningdistfun(y[,rhs]), ...)
} else {
levels <- sort(unique(y[,groups]))
mantel.test.list(sapply(levels, function(lvl)
stringdistfun(y[which(y[[groups]] == lvl), lhs]),
simplify=FALSE),
sapply(levels, function(lvl)
meaningdistfun(y[which(y[[groups]] == lvl), rhs]),
simplify=FALSE), ...)
}
}

#' @describeIn mantel.test When \code{x} is a list of distance matrices, and
#' \code{y} is either a single distance matrix or a list of distance matrices
#' the same length as \code{x}: runs a Mantel test for every pairwise
#' combination of distance matrices in \code{x} and \code{y} and returns a
#' \code{mantel} object with as many rows.
#' @examples
#'
#' \dontrun{
#' # pass a list of distance matrices as the first argument, but just one
#' # distance matrix as the second argument: this runs separate tests on
#' # the pairwise combinations of the first and second argument
#' result <- mantel.test(list(dist(1:8), dist(sample(8:1)), dist(runif(8))),
#'   hammingdists(enumerate.meaningcombinations(c(2, 2, 2))))
#'
#' # print the result of the three independently run permutation tests
#' print(result)
#'
#' # show the three test results in one plot
#' plot(result, xlab="group")
#' }
#' @export
mantel.test.list <- function(x, y, plot=FALSE, ...) {
# catch plot argument so that individual tests don't all produce a plot
result <- do.call(rbind, mapply(mantel.test.default, x,
# save single distance matrices y from being iterated into with list(y)
if (is.list(y)) y else list(y), plot=FALSE, MoreArgs=..., SIMPLIFY=FALSE))
result <- cbind(group=if (is.null(names(x))) seq_along(x) else names(x), result)
if (plot)
plot.mantel(result)
structure(result, class=c("mantel", "data.frame"))
}

#' @export
print.mantel <- function(x, ...) {
for (i in 1:nrow(x)) {
if (!is.null(x$group)) cat("Group '", as.character(x$group[i]), "':\n", sep='')
cat("Mantel permutation test (method: ", x$method[i], ")\nr = ", format(x$statistic[i], digits=3), ", N = ", x$N[i], "\n", length(x$rsample[[i]]), " permutations, mean = ",
format(x$mean[i], digits=3), ", sd = ", format(x$sd[i], digits=3),
"\np (", if(is.na(x$p.approx[i])) "exact" else "empirical", ") = ", format(x$p.value[i], digits=3),
ifelse(x$is.unique.max[i], " (veridical correlation is highest found)", ""), "\n\n", sep="") } # TODO print approximate p value iff not is.unique.max invisible(x) } # functions for visualising the outcome of Mantel tests blue.if.true <- function(x, default="black") mapply(function(x, d) if (x) "blue" else d, x, default) method.label <- list(pearson="r", kendall=expression(tau), spearman=expression(rho)) # grab the xaxt, xlab, xlim and ylim arguments to stop them from being passed on via ... # @importFrom graphics curve plot points segments text #' @importFrom stats dnorm plotmantelsample <- function(mantels, nbins=25, main="", xaxt=NULL, xlab=NULL, xlim=NULL, ylim=NULL, ...) { d <- list() if (nrow(mantels) > 1) graphics::par(mfrow=c(1, nrow(mantels)), mar=c(5.1, 2.5, 2, 1)) # layout(matrix(c(rep(1, nrow(mantels)), 2:(nrow(mantels)+1)), nrow=2, byrow=TRUE)) maxdensity <- 0 for (i in 1:nrow(mantels)) { d[[i]] <- graphics::hist(unlist(mantels$rsample[i]), breaks=nbins, plot=FALSE)
maxdensity <- max(maxdensity, d[[i]]$density, stats::dnorm(0, sd=mantels$sd[i]))
}
for (i in 1:nrow(mantels)) {
xlim <- range(mantels$rsample[[i]], mantels$statistic[i]) + c(-.05, .05)
graphics::plot(d[[i]], freq=FALSE, yaxs="i", xlim=xlim, ylim=c(0, maxdensity), xlab=method.label[[mantels$method[i]]], ylab="Density", border="gray", main=paste(method.label[[mantels$method[i]]], "=", format(mantels$statistic[i], digits=3), ", N=", mantels$N[i], ", ", length(mantels$rsample[[i]]), " permutations", sep=""), ...) # add fit used for z score estimation graphics::curve({stats::dnorm(d, mean=mantels$mean[i], sd=mantels$sd[i])}, xname="d", lty=3, add=TRUE) # mark veridical r col <- blue.if.true(mantels$is.unique.max[i])
level <- dnorm(mantels$statistic[i], mean=mantels$mean[i], sd=mantels$sd[i]) graphics::segments(mantels$statistic[i], 0, y1=level, col=col, lty=2)
graphics::points(mantels$statistic[i], y=0.15+level, pch=25, bg=col, col=col) graphics::text(mantels$statistic[i], 0.15+level, paste("p=", format(mantels$p.value[i], digits=3), sep=""), pos=3) } } #' @details #' \code{plot()} called on a data frame of class \code{mantel} plots a #' visualisation of the test results (in particular, the distribution of #' the permutated samples against the veridical correlation coefficient). If #' the veridical correlation coefficient is plotted in blue it means #' that it was higher than all other coefficients generated by random #' permutations of the data. When the argument contains the result of more than #' one Mantel tests, a side-by-side boxplot visualisation shows the mean and #' standard deviation of the randomised samples (see examples). Additional #' parameters ... to plot() are passed on to #' \code{\link[graphics]{plot.default}}. #' @param xlab the x axis label used when plotting the result of several Mantel #' tests next to each other #' @rdname mantel.test #' @importFrom graphics abline axis points text #' @export plot.mantel <- function(x, xlab="generation", ...) { if (nrow(x) == 1) { plotmantelsample(x) } else { # TODO replace with boxplots # TODO x axis factor labels Hmisc::errbar(1:nrow(x), x$mean, x$mean + x$sd, x$mean - x$sd, xlab=xlab,
xaxt="n", ylab=paste(method.label[[x$method[1]]], " (mean+-sd)", sep=""), xlim=c(0.5, nrow(x) + 0.5), ylim=range(0.1, x$statistic, x$mean + x$sd, x$mean - x$sd)+c(-0.1, 0.1),
main=paste("N=", length(x$rsample[[1]]), sep=""), ...) # plot correlation coefficient reference points graphics::abline(h=-1:1, lty=c(3, 2, 3), col="grey") # blue points signify that no single larger r value has been sampled graphics::points(1:nrow(x), x$statistic, col=blue.if.true(x$is.unique.max)) # label the veridical rs with their p values graphics::text(1:nrow(x), ifelse(x$statistic >= x$mean, pmax(x$statistic, x$mean + x$sd), pmin(x$statistic, x$mean - x$sd)), labels=paste("p", format(x$p.value, digits=2), sep="="), pos=2+sign(x$statistic)) graphics::axis(1, at=1:nrow(x), labels=x$group)
}
}

kevinstadler/cultevo documentation built on April 25, 2018, 9:05 a.m.