#' @title Pairwise similarities of phenotype spaces
#'
#' @description \code{space_similarity} estimates pairwise similarities of phenotype spaces
#' @inheritParams template_params
#' @param method Character vector of length 1. Controls the (dis)similarity metric used to compare the phenotypic sub-spaces of two groups at a time. Seven built-in metrics are available, which quantify either pairwise sub-space overlap ('similarity') or pairwise distance between bi-dimensional sub-spaces ('dissimilarity'):
#' \itemize{
#' \item \code{density.overlap}: proportion of the phenotypic sub-space areas that overlap, taking into account the irregular densities of the sub-spaces. Two groups that share their higher density areas will be more similar than two groups that only share their lower density areas. Two values are supplied, as the proportion of the space of A that overlaps B is not necessarily the same as the proportion of B that overlaps A. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 6 observations.
#' \item \code{mean.density.overlap}: similar to 'density.overlap' but the two values are merged into a single pairwise mean overlap. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 6 observations.
#' \item \code{mcp.overlap}: proportion of the phenotypic sub-space areas that overlap, in which areas are calculated as the minimum convex polygon of all observations in each sub-space. Two values are supplied, as the proportion of the space of A that overlaps B is not necessarily the same as the proportion of B that overlaps A. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 5 observations.
#' \item \code{mean.mcp.overlap}: similar to 'mcp.overlap' but the two values are merged into a single pairwise mean overlap. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 5 observations.
#' \item \code{proportional.overlap}: proportion of the joint area of both sub-spaces that overlaps (overlapped area / total area of both groups). Sub-space areas are calculated as the minimum convex polygon. Similarity metric (higher values mean more similar). The minimum sample size (per group) is 5 observations.
#' \item \code{distance}: mean Euclidean pairwise distance between all observations of the compared sub-spaces. Dissimilarity metric (higher values mean less similar). The minimum sample size (per group) is 1 observation.
#' \item \code{centroid.distance}: Euclidean distance between the centroids of the compared sub-spaces. Dissimilarity metric (higher values mean less similar). The minimum sample size (per group) is 1 observation.
#' \item \code{probability}: Bayesian probability of the observations of one group being classified as belonging to the other group. Similarity metric (higher values mean more similar). The minimum sample size (per group) must be higher than the number of dimensions. Probabilities are calculated using the function \code{\link[nicheROVER]{overlap}} from the nicheROVER package. The following values are used internally by \code{\link[nicheROVER]{overlap}}: nreps = 1000, nprob = 1000, kappa = 0, Psi = 0, nu = number of predictors + 1. Random draws are taken from the posterior distribution with a Normal-Inverse-Wishart (NIW) prior using the function \code{\link[nicheROVER]{niw.post}}. See the nicheROVER package for further details on this method.
#' }
#' In addition, machine learning classification models can be used to quantify dissimilarity as a measure of how discriminable two groups are. These models can use more than two dimensions to represent phenotypic spaces. The following classification models can be used: "AdaBag", "avNNet", "bam", "C5.0", "C5.0Cost", "C5.0Rules", "C5.0Tree", "gam", "gamLoess", "glmnet", "glmStepAIC", "kernelpls", "kknn", "lda", "lda2", "LogitBoost", "msaenet", "multinom", "nnet", "null", "ownn", "parRF", "pcaNNet", "pls", "plsRglm", "pre", "qda", "randomGLM", "rf", "rFerns", "rocc", "rotationForest", "rotationForestCp", "RRF", "RRFglobal", "sda", "simpls", "slda", "smda", "snn", "sparseLDA", "svmLinear2", "svmLinearWeights", "treebag", "widekernelpls" and "wsrf". See \url{https://topepo.github.io/caret/train-models-by-tag.html} for details on each of these models. Additional arguments can be passed using \code{...}. Note that some machine learning methods can significantly affect computational efficiency (i.e. take a long time to compute).
#' @param outliers Numeric vector of length 1. A value between 0 and 1 controlling the proportion of outlier observations to be excluded. Outliers are determined as those farthest away from the sub-space centroid. Ignored when using machine learning methods.
#' @param pairwise.scale Logical argument to control whether pairwise phenotypic spaces are scaled (i.e. z-transformed) prior to similarity estimation. If so (\code{TRUE}), similarities are decoupled from the size of the global phenotypic space, which is useful for comparing similarities coming from different phenotypic spaces. Default is \code{FALSE}. Not available for 'density.overlap', 'mean.density.overlap' or any machine learning model.
#' @param distance.method Character vector of length 1 indicating the method used for measuring distances (hence only applicable when distances are calculated). Available distance measures are: "Euclidean" (default), "Manhattan", "supremum", "Canberra", "Wave", "divergence", "Bray", "Soergel", "Podani", "Chord", "Geodesic" and "Whittaker". If a similarity measure is supplied, similarities are converted to distances.
#' @param seed Integer number containing the random number generator (RNG) state for random number generation in order to make results from the machine learning stochastic methods replicable.
#' @param ... Additional arguments to be passed to \code{\link[caret]{train}}.
#' @return A data frame containing the similarity metric for each pair of groups. If the similarity metric is not symmetric (e.g. the proportional area of A that overlaps B is not necessarily the same as the area of B that overlaps A, see \code{\link{space_similarity}}), separate columns are supplied for the two comparisons.
#' @export
#' @name space_similarity
#' @details The function quantifies pairwise similarity between phenotypic sub-spaces. The built-in methods quantify similarity as either the overlap between groups (similarity, or machine learning based discriminability) or the distance between groups (dissimilarity). Machine learning methods implemented in the caret package function \code{\link[caret]{train}} are available to assess the similarity of spaces as the proportion of observations that are incorrectly classified. In this case group overlaps are the class-wise errors (if available) while the mean overlap is calculated as \code{1 - model accuracy}.
#' @examples {
#' # load data
#' data("example_space")
#'
#' # get proportion of space that overlaps
#' prop_overlaps <- space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = example_space,
#' method = "proportional.overlap")
#'
#' # get symmetric triangular matrix
#' rectangular_to_triangular(prop_overlaps)
#'
#' # get minimum convex polygon overlap for each group (non-symmetric)
#' mcp_overlaps <- space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = example_space,
#' method = "mcp.overlap")
#'
#' # convert to non-symmetric triangular matrix
#' rectangular_to_triangular(mcp_overlaps, symmetric = FALSE)
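#'
#' # density-weighted overlap (non-symmetric); illustrative sketch that
#' # assumes each group in 'example_space' has at least 6 observations
#' dens_overlaps <- space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = example_space,
#' method = "density.overlap")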
#'
#' # check available distance measures
#' summary(proxy::pr_DB)
#'
#' # get Euclidean distances (default)
#' area_dist <- space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = example_space,
#' method = "distance",
#' distance.method = "Euclidean")
#'
#' # get Canberra distances
#' area_dist <- space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = example_space,
#' method = "distance",
#' distance.method = "Canberra")
#'
#' ## using machine learning classification methods
#'
#' # check if caret package and needed dependencies are available
#' rlang::check_installed("caret")
#' rlang::check_installed("randomForest")
#'
#' # random forest on 3-dimensional data, using repeated CV resampling with 5 repeats
#' # extract data subset
#' sub_data <- example_space[example_space$group %in% c("G1", "G2", "G3"), ]
#'
#' # set method parameters
#' ctrl <- caret::trainControl(method = "repeatedcv", repeats = 5)
#'
#' # get similarities ("overlap")
#' space_similarity(
#' formula = group ~ dimension_1 + dimension_2 + dimension_3,
#' data = sub_data,
#' method = "rf",
#' trControl = ctrl,
#' tuneLength = 4,
#' seed = 123
#' )
#'
#' # Single C5.0 Tree using boot resampling
#' ctrl <- caret::trainControl(method = "boot")
#'
#' space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = sub_data,
#' method = "C5.0Tree",
#' trControl = ctrl,
#' tuneLength = 3
#' )
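#'
#' ## Bayesian probability of group membership (nicheROVER)
#'
#' # check if nicheROVER is available
#' rlang::check_installed("nicheROVER")
#'
#' # illustrative sketch; assumes each group in 'sub_data' has more
#' # observations than dimensions
#' space_similarity(
#' formula = group ~ dimension_1 + dimension_2,
#' data = sub_data,
#' method = "probability")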
#' }
#' @seealso \code{\link{rarefact_space_similarity}}, \code{\link{space_size_difference}}
#' @author Marcelo Araya-Salas \email{marcelo.araya@@ucr.ac.cr}
#'
#' @references
#' Araya-Salas, M., & K. Odom. 2022. PhenotypeSpace: an R package to quantify and compare phenotypic trait spaces. R package version 0.1.0.
space_similarity <-
function(formula,
data,
cores = 1,
method = "mcp.overlap",
pb = TRUE,
outliers = 0.95,
pairwise.scale = FALSE,
distance.method = "Euclidean",
seed = NULL,
...) {
# supported machine learning methods
ml_methods <- c("AdaBag",
"avNNet",
"bam",
"C5.0",
"C5.0Cost",
"C5.0Rules",
"C5.0Tree",
"gam",
"gamLoess",
"glmnet",
"glmStepAIC",
"kernelpls",
"kknn",
"lda",
"lda2",
"LogitBoost",
"msaenet",
"multinom",
"nnet",
"null",
"ownn",
"parRF",
"pcaNNet",
"pls",
"plsRglm",
"pre",
"qda",
"randomGLM",
"rf",
"rFerns",
"rocc",
"rotationForest",
"rotationForestCp",
"RRF",
"RRFglobal",
"sda",
"simpls",
"slda",
"smda",
"snn",
"sparseLDA",
"svmLinear2",
"svmLinearWeights",
"treebag",
"widekernelpls",
"wsrf")
if (!method %in% c(
# phenotype space methods
"density.overlap",
"mean.density.overlap",
"mcp.overlap",
"mean.mcp.overlap",
"proportional.overlap",
"distance",
"centroid.distance",
"probability",
ml_methods
))
stop2("Unsupported 'method' declared")
if (!distance.method %in% c("Euclidean", "Manhattan", "supremum", "Canberra", "Wave", "divergence", "Bray", "Soergel", "Podani", "Chord", "Geodesic", "Whittaker"))
stop2("Unsupported 'distance.method' declared")
# check if model package is installed and offer user to install it if not
if (method %in% ml_methods) {
rlang::check_installed("caret")
# install dependencies for the machine learning method
caret::checkInstall(caret::getModelInfo(method, regex = FALSE)[[1]]$library)
}
# get term names from formula
form_terms <- terms(formula, data = data)
dimensions <- attr(form_terms, "term.labels")
group <- as.character(form_terms[[2]])
# center data
# if (method == "centroid.distance") {
#
# data[, dimensions[1]] <- scale(data[, dimensions[1]])
# data[, dimensions[2]] <- scale(data[, dimensions[2]])
#
# } else {
# # force dimensions to range 0-1
# data[, dimensions[1]] <- (data[, dimensions[1]] - min( data[, dimensions[1]])) / max((data[, dimensions[1]] - min( data[, dimensions[1]])))
# data[, dimensions[2]] <- (data[, dimensions[2]] - min( data[, dimensions[2]])) / max((data[, dimensions[2]] - min( data[, dimensions[2]])))
#
# }
# split in a list of vectors
split_data_list <- split(x = data, f = data[, group])
# stop if too small sample sizes
if (min(sapply(split_data_list, nrow)) < 2 & method != "probability")
stop2(
"There is at least one group with less than 2 observations which is the minimum needed for similarity estimation"
)
if (min(sapply(split_data_list, nrow)) <= length(dimensions) & method == "probability")
stop2(
"There is at least one group with less observations than the dimensions which is the minimum needed for probability estimation"
)
# get densities
if (method %in% c("density.overlap", "mean.density.overlap")) {
total_coors <-
spatstat.geom::as.ppp(as.matrix(data[, dimensions]), c(range(data[, dimensions[1]]), range(data[, dimensions[2]])))
total_space <-
raster::raster(spatstat.explore::density.ppp(total_coors))
input_data <-
lapply(split_data_list, function(Y, dims = dimensions) {
coors <-
spatstat.geom::as.ppp(as.matrix(Y[, dimensions]), c(range(Y[, dimensions[1]]), range(Y[, dimensions[2]])))
raster_dens <-
raster::raster(spatstat.explore::density.ppp(coors))
raster_dens <-
raster::crop(raster::extend(raster_dens, total_space), total_space)
# convert to range 0-1
raster::values(raster_dens) <-
(raster::values(raster_dens) - min(raster::values(raster_dens), na.rm = TRUE))
raster::values(raster_dens) <-
raster::values(raster_dens) / max(raster::values(raster_dens), na.rm = TRUE)
# keep only the density interval defined by 'outliers' (default 95%)
raster::values(raster_dens)[raster::values(raster_dens) < 1 - outliers] <-
NA
# same resolution and extent as total space
raster_dens <- raster::resample(raster_dens, total_space)
raster::extent(raster_dens) <- c(0, 1, 0, 1)
return(raster_dens)
})
names(input_data) <- names(split_data_list)
} else {
input_data <- split_data_list
}
# internal function to calculate pairwise overlaps/distances between two groups
ovlp_fun <- function(W, Z, WZ, tp, dims, dist.meth, seed) {
# get area
# raster density
if (tp %in% c("density.overlap", "mean.density.overlap")) {
# convert NAs to 0
raster::values(W)[is.na(raster::values(W))] <- 0
raster::values(Z)[is.na(raster::values(Z))] <- 0
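# weighted overlap: (W * Z) / Z keeps the density of W in cells where Z is
# non-zero (NaN elsewhere, dropped by na.rm), so each sum is one group's
# density restricted to the other's sub-space, standardized by the total
# density of the other group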
# 1 vs 2
wgt_ovlp.WvZ <-
sum((raster::values(W) * raster::values(Z)) / raster::values(Z),
na.rm = TRUE) / sum(raster::values(Z), na.rm = TRUE)
# 2 vs 1
wgt_ovlp.ZvW <-
sum((raster::values(Z) * raster::values(W)) / raster::values(W),
na.rm = TRUE) / sum(raster::values(W), na.rm = TRUE)
# convert to 1 if higher than 1
if (wgt_ovlp.WvZ > 1)
wgt_ovlp.WvZ <- 1
if (wgt_ovlp.ZvW > 1)
wgt_ovlp.ZvW <- 1
# put results in a matrix
out <- matrix(c(wgt_ovlp.WvZ, wgt_ovlp.ZvW, mean(c(wgt_ovlp.WvZ, wgt_ovlp.ZvW))), nrow = 1)
}
if (tp %in% c("mcp.overlap", "mean.mcp.overlap")) {
sp::coordinates(W) <-
stats::as.formula(paste("~", paste(dims, collapse = "+")))
sp::coordinates(Z) <-
stats::as.formula(paste("~", paste(dims, collapse = "+")))
# Create the MCPs (Minimum Convex Polygons) for W and Z
mcp_W <- sf::st_as_sf(adehabitatHR::mcp(W))
mcp_Z <- sf::st_as_sf(adehabitatHR::mcp(Z))
# Perform the intersection
intrsct <- sf::st_intersection(mcp_W, mcp_Z)
if (nrow(intrsct) == 0) {
ovlp1in2 <- ovlp2in1 <- 0
} else {
# Calculate the area of the intersection (in squared units of the phenotype space)
intrsctspace_size <- sf::st_area(intrsct)
ovlp1in2 <-
intrsctspace_size / sf::st_area(mcp_W)
ovlp2in1 <-
intrsctspace_size / sf::st_area(mcp_Z)
}
out <- matrix(c(ovlp2in1, ovlp1in2, mean(c(ovlp2in1, ovlp1in2))), nrow = 1)
}
if (tp == "proportional.overlap") {
sp::coordinates(W) <-
stats::as.formula(paste("~", paste(dims, collapse = "+")))
sp::coordinates(Z) <-
stats::as.formula(paste("~", paste(dims, collapse = "+")))
# both combined
Y <- rbind(W, Z)
# get intersect
mcp_W <- sf::st_as_sf(adehabitatHR::mcp(W, percent = outliers * 100))
mcp_Z <- sf::st_as_sf(adehabitatHR::mcp(Z, percent = outliers * 100))
# Perform the intersection
intrsct <- sf::st_intersection(mcp_W, mcp_Z)
if (nrow(intrsct) == 0) {
ovlp1in2 <- ovlp2in1 <- 0
} else {
# intrsctspace_size <- raster::area(intrsct)
intrsctspace_size <- sf::st_area(intrsct)
# totalspace_size <-
# raster::area(adehabitatHR::mcp(Y, percent = outliers * 100))
mcp_Y <- sf::st_as_sf(adehabitatHR::mcp(Y, percent = outliers * 100))
totalspace_size <- sf::st_area(mcp_Y)
ovlp1in2 <-
ovlp2in1 <- intrsctspace_size / totalspace_size
}
out <- matrix(c(ovlp2in1, ovlp1in2, mean(c(ovlp2in1, ovlp1in2))), nrow = 1)
}
# distance among points
if (tp == "distance") {
U <- rbind(W, Z)
dists <-
as.matrix(proxy::dist(U[, dims], method = dist.meth, convert_similarities = TRUE))
dists <-
dists[U[, group] == W[, group][1], U[, group] == Z[, group][1]]
out <- matrix(mean(dists), nrow = 1)
}
if (tp == "centroid.distance") {
# both combined
Y <- rbind(as.data.frame(W), as.data.frame(Z))
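# compute each group's centroid (per-dimension means via aggregate) and
# measure the distance between the two centroids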
frmla <-
stats::as.formula(paste("cbind(", paste(dims, collapse = ","), ") ~ ", group))
centroid.dist <-
proxy::dist(
stats::aggregate(
x = frmla,
data = Y,
FUN = mean
)[, -1],
method = dist.meth,
convert_similarities = TRUE
)
out <- matrix(centroid.dist, nrow = 1)
}
# if any machine learning model
if (tp %in% ml_methods) {
if (!is.null(seed))
set.seed(seed)
capture.output(train_caret <-
caret::train(form = formula,
data = WZ,
method = tp,
verbose = FALSE,
...
))
out <- matrix(1 - train_caret$results$Accuracy[nrow(train_caret$results)], nrow = 1)
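# if the final model provides a confusion matrix, replace the single value
# with the class-wise errors (third column in a two-class comparison) plus
# 1 - accuracy as the mean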
if (!is.null(train_caret$finalModel$confusion)){
conf_matrix <- train_caret$finalModel$confusion
out <- matrix(c(conf_matrix[rownames(conf_matrix) == W[1, group], 3], conf_matrix[rownames(conf_matrix) == Z[1, group], 3], 1 - train_caret$results$Accuracy[nrow(train_caret$results)]), nrow = 1)
}
}
if (tp == "probability") {
wz.par <- tapply(1:nrow(WZ), WZ[, group],
function(ii) nicheROVER::niw.post(nsamples = 1000, X = WZ[ii, dims]))
# overlap calculation
over <- nicheROVER::overlap(wz.par, nreps = 1000, nprob = 1000,
alpha = c(outliers))
# summarize reps
mat <- do.call(rbind, lapply(1:dim(over)[3],function(i) c(over[, , i])))
# extract similarity
mean_mat <- apply(mat,2, mean)
ovlp1in2 <- mean_mat[3]
ovlp2in1 <- mean_mat[2]
# put it in matrix format
out <- matrix(c(ovlp2in1, ovlp1in2, mean(c(ovlp2in1, ovlp1in2))), nrow = 1)
}
return(out)
}
# get all combinations to get pairwise overlaps
group_combs <- t(utils::combn(sort(unique(data[, group])), 2))
# calculate all similarities
similarities_l <-
pblapply_phtpspc_int(1:nrow(group_combs), pbar = pb, cl = cores, function(i,
dat = input_data,
gc = group_combs,
dims = dimensions,
typ = method,
pair.scale = pairwise.scale,
dist.meth = distance.method,
sed = seed
) {
if (typ %in% c("density.overlap", "mean.density.overlap")) {
W <- dat[[which(names(dat) == gc[i, 1])]]
Z <- dat[[which(names(dat) == gc[i, 2])]]
}
if (typ %in% c(
"distance",
"mcp.overlap",
"mean.mcp.overlap",
"proportional.overlap",
"centroid.distance",
"probability",
ml_methods)) {
W <- split_data_list[[which(names(split_data_list) == gc[i, 1])]]
Z <-
split_data_list[[which(names(split_data_list) == gc[i, 2])]]
# pairwise scale
if (pair.scale) {
nrow_W <- nrow(W)
# bind and scale
U <- rbind(W, Z)
U[, dimensions[1]] <- scale(U[, dimensions[1]])
U[, dimensions[2]] <- scale(U[, dimensions[2]])
# split back
W <- U[1:nrow_W, ]
Z <- U[(nrow_W + 1):nrow(U), ]
}
if (typ %in% c(ml_methods, "probability")) {
WZ <- rbind(W, Z)
WZ <- droplevels(WZ)
} else {
WZ <- NULL
}
}
suppressWarnings(similarities <-
ovlp_fun(W, Z, WZ, typ, dims, dist.meth, seed = sed))
# put in a data frame
out_df <-
if (typ %in% c(
"mean.density.overlap",
"mean.mcp.overlap",
"centroid.distance",
"proportional.overlap",
"distance"
) | ncol(similarities) == 1) {
data.frame(
group.1 = gc[i, 1],
group.2 = gc[i, 2],
similarity = mean(similarities)
)
} else
data.frame(
group.1 = gc[i, 1],
group.2 = gc[i, 2],
similarity.1in2 = similarities[2],
similarity.2in1 = similarities[1],
mean.similarity = similarities[3]
)
return(out_df)
})
# put in a data frame
space_similarities <- do.call(rbind, similarities_l)
# rename columns
if (grepl("distance", method) & method != "probability") {
names(space_similarities) <-gsub("similarity", "distance", names(space_similarities))
}
if (!grepl("distance", method) & method != "probability") {
names(space_similarities) <- gsub("similarity", "overlap", names(space_similarities))
}
if (method == "probability") {
names(space_similarities) <-gsub("similarity", "probability", names(space_similarities))
}
return(space_similarities)
}