R/mantel.R

Defines functions plot.mantel plotmantelsample blue.if.true print.mantel mantel.test.list mantel.test.formula mantel.test.default mantel.test

Documented in mantel.test mantel.test.default mantel.test.formula mantel.test.list plot.mantel

#' Perform one or more Mantel permutation tests.
#' 
#' Perform correlation tests between pairs of distance matrices. The Mantel
#' test is different from classical correlation tests (such as those
#' implemented by \code{\link[stats]{cor.test}}) in that the null distribution
#' (and significance level) are obtained through randomisation. The null
#' distribution is generated by shuffling the locations (matrix rows and
#' columns) of one of the matrices to calculate an empirical null distribution
#' for the given data set.
#'
#' If the number of possible permutations of the matrices is reasonably close
#' to the number of permutations specified by the `trials` parameter, a
#' deterministic enumeration of all the permutations will be carried out
#' instead of random sampling: such a deterministic test will return an exact
#' p-value.
#'
#' @references Dietz, E. J. 1983 “Permutation Tests for Association
#' Between Two Distance Matrices.” \emph{Systematic Biology} 32 (1): 21-–26.
#' <https://doi.org/10.1093/sysbio/32.1.21>.
#'
#' North, B. V., D. Curtis and P. C. Sham. 2002 “A Note on the Calculation of
#' Empirical P Values from Monte Carlo Procedures.” \emph{The American Journal of
#' Human Genetics} 71 (2): 439-–41. <https://doi.org/10.1086/341527>.
#'
#' @param x a formula, distance matrix, or list of distance matrices (see below)
#' @param y a data frame, distance matrix, or list of distance matrices of the
#'   same length as `x`
#' @param plot logical: immediately produce a plot of the test results (default:
#'   `FALSE`)
#' @param method correlation coefficient to be computed. Passed on to
#'   `\link[stats]{cor}`, so one of `"spearman"`, `"kendall"`, or, inadvisable
#'   in the case of ties: `"pearson"`. Following Dietz (1983), `"spearman"` is
#'   used as a default that is both powerful and robust across different
#'   distance measures.
#' @param trials integer: maximum number of random permutations to be computed 
#'   (see Details).
#' @param omitzerodistances logical: if `TRUE`, the calculation of the
#'   correlation coefficient omits pairs of off-diagonal cells which contain a
#'   0 in the \emph{second} distance matrix argument. (For the formula
#'   interface, this is the matrix which specifies the meaning distances.)
#' @param groups when `x` is a formula: column name by which the data in `y` is
#'   split into separate data sets to run several Mantel tests on
#' @param stringdistfun when `x` is a formula: edit distance function used to
#'   compute the distance matrix from the specified string column. Supports any
#'   edit distance function that returns a distance matrix from a vector or
#'   list of character strings. Default is Levenshtein distance
#'   (\code{\link[utils]{adist}}), other options from this package include
#'   [normalisedlevenshteindists()] and [orderinsensitivedists()].
#' @param meaningdistfun when `x` is a formula: meaning distance function used
#'   to compute the distance matrix from the specified meaning columns.
#'   Defaults to Hamming distances between meanings ([hammingdists()]), custom
#'   meaning functions can be created easily using
#'   [wrap.meaningdistfunction()].
#' @param ... further arguments which are passed on to the default method (in
#'   particular `plot`, `method`, `trials` and `omitzerodistances`)
#' @return A dataframe of class `mantel`, with one row per Mantel test carried
#'   out, containing the following columns:
#' \describe{
#'    \item{\code{method}}{Character string: type of correlation coefficient used}
#'    \item{\code{statistic}}{The veridical correlation coefficient between
#'          the entries in the two distance matrices}
#'    \item{\code{rsample}}{A list of correlation coefficients calculated
#'          from the permutations of the input matrices}
#'    \item{\code{mean}}{Average correlation coefficient produced by the permutations}
#'    \item{\code{sd}}{Standard deviation of the sampled correlation coefficients}
#'    \item{\code{p.value}}{Empirical p-value computed from the Mantel
#'          test: let \code{ngreater} be the number of correlation coefficients
#'          in \code{rsample} greater than or equal to \code{statistic}, then
#'          \code{p.value} is \code{(ngreater+1)/(length(rsample)+1}} (North,
#'          Curtis and Sham 2002).
#'    \item{\code{p.approx}}{The theoretical p-value that would correspond
#'          to the standard \code{z} score as calculated above.}
#'    \item{\code{is.unique.max}}{Logical, \code{TRUE} iff the veridical
#'          correlation coefficient is greater than any of the coefficients
#'          calculated for the permutations. If this is true, then
#'          \code{p.value == 1 / (length(rsample)+1)}}
#'  }
#' Multiple \code{mantel} objects can easily be combined by calling
#' \code{rbind(test1, test2, ...)}. 
#' @examples
#' # small distance matrix, Mantel test run deterministically
#' mantel.test(dist(1:7), dist(1:7))
#'
#'
#' \dontrun{
#' # run test on smallest distance matrix which requires a random
#' # permutation test, and plot it
#' plot(mantel.test(dist(1:8), dist(1:8), method="kendall"))
#' }
#'
#' \dontrun{
#' # 2x2x2x2 design
#' mantel.test(hammingdists(enumerate.meaningcombinations(c(2, 2, 2, 2))),
#'   dist(1:16), plot=TRUE)
#' }
#' @seealso \code{\link[stats]{cor}},
#'   \code{\link[utils]{adist}}, \code{\link{hammingdists}},
#'   \code{\link{normalisedlevenshteindists}},
#'   \code{\link{orderinsensitivedists}}
#' @export
mantel.test <- function(x, y, ...)
  UseMethod("mantel.test")

#' @describeIn mantel.test Perform Mantel correlation test on two distance
#' matrices. The distance matrices can either be of type
#' \code{\link[stats]{dist}}, plain R matrices or any object that can be
#' interpreted by \code{\link{check.dist}}. The order of the two matrices does
#' not matter unless \code{omitzerodistances = TRUE}, in which case cells with
#' a 0 in the \emph{second} matrix are omitted from the calculation of the
#' correlation coefficient. For consistency it is therefore recommended to
#' always pass the string distance matrix first, meaning distance matrix second.
#' @importFrom stats cor
mantel.test.default <- function(x, y, plot=FALSE,
    method=c("spearman", "kendall", "pearson"), trials=9999,
    omitzerodistances=FALSE, ...) {
  method <- match.arg(method)
  m1 <- check.dist(x)
  m2 <- check.dist(y)
  d <- dim(m1)[1]
  if (dim(m2)[1] != d)
    stop("The two distance matrices have incompatible dimensions")

  # figure out actual complexity by counting *unique* combinations in rows:
  # check if a row i is equivalent to a row j which has i+j swapped (in which
  # case the permutation of the two does not lead to different r)
#  swaps <- combinat::combn(d, 2)
#  inconsequentialswaps <- apply(swaps, 2, function(i) {
#    !any(m1[i[1],] != m2[i[2], c(seq_len(i[1]-1), i[2], i[1]+seq_len(i[2]-i[1]-1), i[1], i[2]+seq_len(d-i[2]))])
#    })
#  if (any(inconsequentialswaps))
#    print(swaps[,inconsequentialswaps])

  # extract values relevent for correlation computation
  indices <- which(lower.tri(m1))
  if (omitzerodistances)
    indices <- setdiff(indices, which(m2 == 0))
  m1 <- m1[indices]
  duration <- max(0.001, system.time(veridical <- cor(m1, m2[indices], method=method))[[3]])

  # determine whether it would be more efficient to enumerate all permutations:
  # assuming we take 'trials' samples, how many of the factorial(d) possible
  # values haven't been selected yet compared to the maximum number of trials?
  exact <- tryCatch(trials/((factorial(d)*stats::pgeom(trials, 1/factorial(d),
    lower.tail=FALSE))) >= 1, warning=function(x) FALSE)
  if (exact) {
    trials <- factorial(d)
    message("Permutation space is small, enumerating all ", trials, " possible permutations.")
  }
  if (duration*trials > 30)
    message("Estimated time to evaluate all ", trials, " permutations is ",
      round(duration*trials), " seconds, go get yourself a biscuit!")

  if (exact) {
    rsample <- sapply(combinat::permn(d), function(perm)
                 cor(m1, shuffle.locations(m2, perm)[indices], method=method))
  } else {
    rsample <- replicate(trials,
      cor(m1, shuffle.locations(m2)[indices], method=method))
    # calculate lower bound by sampling local permutations
    # all pairwise swaps
    # switches <- combinat::combn(d, 2)
    # localrs <- apply(switches, 2, function(i) {
    #   cor(m1,
    #     shuffle.locations(m2,
    #       c(seq_len(i[1]-1), i[2], i[1]+seq_len(i[2]-i[1]), i[1], i[2]+seq_len(d-i[2])))[indices],
    #     method=method)})
    # p.lowerbound <- sum(localrs >= veridical) / factorial(d)
  }

  greater <- sum(rsample >= veridical)
  # empirical p value http://www.ncbi.nlm.nih.gov/pmc/articles/PMC379178/
  p.empirical <- (greater + 1) / (length(rsample) + 1)

  mn <- mean(rsample)
  s <- stats::sd(rsample)

  result <- data.frame(method=method, statistic=c(r=veridical),
    N=length(indices), mean=mn, sd=s, p.value=p.empirical,
    p.approx=ifelse(exact, NA,
      stats::pnorm((veridical - mn) / s, lower.tail=FALSE)),
    is.unique.max=(greater==0), alternative="greater",
    rsample=I(list(rsample)), stringsAsFactors=FALSE)
  if (plot)
    plot.mantel(result)
  structure(result, class=c("mantel", "htest", "data.frame"))
}

#' @export
mantel.test.dist <- mantel.test.default

#' @describeIn mantel.test This function can be called with raw experimental
#' result data frames, distance matrix calculation is taken care of internally.
#' \code{x} is a formula of the type \code{s ~ m1 + m2 + ...} where \code{s}
#' is the column name of the character strings in data frame or matrix \code{y},
#' while \code{m1} etc. are the column names specifying the different meaning
#' dimensions. To calculate the respective distances, the function
#' \code{stringdistfun} is applied to the strings, \code{meaningdistfun} to the
#' meaning columns.
#' @examples
#'
#' # using the formula interface in combination with a data frame:
#' print(data <- cbind(word=c("aa", "ab", "ba", "bb"),
#'   enumerate.meaningcombinations(c(2, 2))))
#'
#' mantel.test(word ~ Var1 + Var2, data)
#' @export
mantel.test.formula <- function(x, y, groups=NULL, stringdistfun=utils::adist,
  meaningdistfun=hammingdists, ...) {
  t <- stats::terms(x)
  fields <- rownames(attr(t, "factors"))
  lhs <- fields[attr(t, "response")]
  rhs <- attr(t, "term.labels")
  # TODO implement grouping/conditioning term parsing for formula interface
  if (is.null(groups)) {
    # pass meanings first
    mantel.test.default(stringdistfun(y[,lhs]), meaningdistfun(y[,rhs]), ...)
  } else {
    levels <- sort(unique(y[,groups]))
    mantel.test.list(sapply(levels, function(lvl)
          stringdistfun(y[which(y[[groups]] == lvl), lhs]),
        simplify=FALSE),
      sapply(levels, function(lvl)
          meaningdistfun(y[which(y[[groups]] == lvl), rhs]),
        simplify=FALSE), ...)
  }
}

#' @describeIn mantel.test When \code{x} is a list of distance matrices, and
#' \code{y} is either a single distance matrix or a list of distance matrices
#' the same length as \code{x}: runs a Mantel test for every pairwise
#' combination of distance matrices in \code{x} and \code{y} and returns a
#' \code{mantel} object with as many rows.
#' @examples
#'
#' \dontrun{
#' # pass a list of distance matrices as the first argument, but just one
#' # distance matrix as the second argument: this runs separate tests on
#' # the pairwise combinations of the first and second argument
#' result <- mantel.test(list(dist(1:8), dist(sample(8:1)), dist(runif(8))),
#'   hammingdists(enumerate.meaningcombinations(c(2, 2, 2))))
#'
#' # print the result of the three independently run permutation tests
#' print(result)
#'
#' # show the three test results in one plot
#' plot(result, xlab="group")
#' }
#' @export
mantel.test.list <- function(x, y, plot=FALSE, ...) {
  # catch plot argument so that individual tests don't all produce a plot
  result <- do.call(rbind, mapply(mantel.test.default, x,
    # save single distance matrices y from being iterated into with list(y)
    if (is.list(y)) y else list(y), plot=FALSE, MoreArgs=..., SIMPLIFY=FALSE))
  result <- cbind(group=if (is.null(names(x))) seq_along(x) else names(x), result)
  if (plot)
    plot.mantel(result)
  structure(result, class=c("mantel", "data.frame"))
}

#' @export
print.mantel <- function(x, ...) {
  for (i in 1:nrow(x)) {
    if (!is.null(x$group))
      cat("Group '", as.character(x$group[i]), "':\n", sep='')
    cat("Mantel permutation test (method: ", x$method[i], ")\nr = ",
      format(x$statistic[i], digits=3), ", N = ", x$N[i], "\n",
      length(x$rsample[[i]]), " permutations, mean = ",
      format(x$mean[i], digits=3), ", sd = ", format(x$sd[i], digits=3),
      "\np (", if(is.na(x$p.approx[i])) "exact" else "empirical",
      ") = ", format(x$p.value[i], digits=3),
      ifelse(x$is.unique.max[i], " (veridical correlation is highest found)",
        ""), "\n\n", sep="")
  }
  # TODO print approximate p value iff not is.unique.max
  invisible(x)
}

# functions for visualising the outcome of Mantel tests
blue.if.true <- function(x, default="black")
  mapply(function(x, d) if (x) "blue" else d, x, default)

method.label <- list(pearson="r", kendall=expression(tau), spearman=expression(rho))

# grab the xaxt, xlab, xlim and ylim arguments to stop them from being passed on via ...
# @importFrom graphics curve plot points segments text
#' @importFrom stats dnorm
plotmantelsample <- function(mantels, nbins=25, main="", xaxt=NULL, xlab=NULL, xlim=NULL, ylim=NULL, ...) {
  d <- list()
  if (nrow(mantels) > 1)
    graphics::par(mfrow=c(1, nrow(mantels)), mar=c(5.1, 2.5, 2, 1))
    # layout(matrix(c(rep(1, nrow(mantels)), 2:(nrow(mantels)+1)), nrow=2, byrow=TRUE))

  maxdensity <- 0
  for (i in 1:nrow(mantels)) {
    d[[i]] <- graphics::hist(unlist(mantels$rsample[i]), breaks=nbins, plot=FALSE)
    maxdensity <- max(maxdensity, d[[i]]$density, stats::dnorm(0, sd=mantels$sd[i]))
  }
  for (i in 1:nrow(mantels)) {
    xlim <- range(mantels$rsample[[i]], mantels$statistic[i]) + c(-.05, .05)
    graphics::plot(d[[i]], freq=FALSE, yaxs="i", xlim=xlim, ylim=c(0, maxdensity), xlab=method.label[[mantels$method[i]]], ylab="Density", border="gray", main=paste(method.label[[mantels$method[i]]], "=", format(mantels$statistic[i], digits=3), ", N=", mantels$N[i], ", ", length(mantels$rsample[[i]]), " permutations", sep=""), ...)
    # add fit used for z score estimation
    graphics::curve({stats::dnorm(d, mean=mantels$mean[i], sd=mantels$sd[i])}, xname="d", lty=3, add=TRUE)
    # mark veridical r
    col <- blue.if.true(mantels$is.unique.max[i])
    level <- dnorm(mantels$statistic[i], mean=mantels$mean[i], sd=mantels$sd[i])
    graphics::segments(mantels$statistic[i], 0, y1=level, col=col, lty=2)
    graphics::points(mantels$statistic[i], y=0.15+level, pch=25, bg=col, col=col)
    graphics::text(mantels$statistic[i], 0.15+level, paste("p=", format(mantels$p.value[i], digits=3), sep=""), pos=3)
  }
}

#' @details
#' \code{plot()} called on a data frame of class \code{mantel} plots a
#' visualisation of the test results (in particular, the distribution of
#' the permutated samples against the veridical correlation coefficient). If
#' the veridical correlation coefficient is plotted in blue it means
#' that it was higher than all other coefficients generated by random
#' permutations of the data. When the argument contains the result of more than
#' one Mantel tests, a side-by-side boxplot visualisation shows the mean and
#' standard deviation of the randomised samples (see examples). Additional
#' parameters `...` to `plot()` are passed on to
#' \code{\link[graphics]{plot.default}}.
#' @param xlab the x axis label used when plotting the result of several Mantel
#'   tests next to each other
#' @rdname mantel.test
#' @importFrom graphics abline axis points text
#' @export
plot.mantel <- function(x, xlab="generation", ...) {
  if (nrow(x) == 1) {
    plotmantelsample(x)
  } else {
    # TODO replace with boxplots
    # TODO x axis factor labels
    Hmisc::errbar(1:nrow(x), x$mean, x$mean + x$sd, x$mean - x$sd, xlab=xlab,
      xaxt="n", ylab=paste(method.label[[x$method[1]]], " (mean+-sd)", sep=""),
      xlim=c(0.5, nrow(x) + 0.5),
      ylim=range(0.1, x$statistic, x$mean + x$sd, x$mean - x$sd)+c(-0.1, 0.1), 
      main=paste("N=", length(x$rsample[[1]]), sep=""), ...)
    # plot correlation coefficient reference points
    graphics::abline(h=-1:1, lty=c(3, 2, 3), col="grey")
    # blue points signify that no single larger r value has been sampled
    graphics::points(1:nrow(x), x$statistic, col=blue.if.true(x$is.unique.max))
    # label the veridical rs with their p values
    graphics::text(1:nrow(x), ifelse(x$statistic >= x$mean,
      pmax(x$statistic, x$mean + x$sd), pmin(x$statistic, x$mean - x$sd)),
      labels=paste("p", format(x$p.value, digits=2), sep="="), pos=2+sign(x$statistic))
    graphics::axis(1, at=1:nrow(x), labels=x$group)
  }
}
kevinstadler/cultevo documentation built on April 25, 2018, 9:05 a.m.