Nothing
#' Brainerd-Robinson similarity coefficient matrix calculation, with permutation-based p-values and optional clustering
#'
#' @description This function calculates the Brainerd-Robinson (BR) similarity coefficient for each pair of row
#' of the input table (Robinson-Brainerd 1951, 1952).
#' It also performs a permutation test to assess the significance of each BR coefficient (DeBoer-Kintigh-Rostoker 1996), and allows to carry
#' out a hierarchical agglomerative clustering. An optimal cluster solution can be established using the silhouette method
#' (see details provided below). The function produces a correlation matrix in tabular form, which is also visually plotted as an heatmap.
#' In the heatmap (which is built using the \code{corrplot} package), the size and the color of the
#' squares are proportional to the Brainerd-Robinson coefficients. Optionally, the heatmap can be reordered on the basis
#' of the hierachical clustering, with clusters enclosed by red rectangles.\cr
#' Visit this \href{https://drive.google.com/file/d/1LSC5VE_QZNM2KOCPkItn--1zsjfXnHbH/view?usp=share_link}{LINK} to access the package's vignette.\cr
#'
#' @details \strong{Permutation-based p-values}\cr
#'
#' The rationale behind the p-value calculation is as follows: for each pair of assemblages in the input data,
#' the function first calculates the observed Brainerd-Robinson (BR) coefficient. This is a measure of the similarity
#' between the two assemblages.
#'
#' The function then performs a certain number of permutations (default is 1000).
#' In each permutation, it generates two new assemblages (each featuring a sample size corresponding to the size of
#' each assemblage being compared) by randomly sampling from the global pool (the combined data of all assemblages),
#' and calculates the BR coefficient for this new pair of assemblages (see DeBoer-Kintigh-Rostoker 1996). This creates
#' a distribution of BR coefficients that we would expect to see by chance.
#'
#' The p-value is then calculated as the proportion of the permuted BR coefficients that are less than or equal
#' to the observed BR coefficient. A small p-value (typically < 0.05) suggests that the observed similarity
#' between the two assemblages is statistically significant; it is unlikely to have occurred just by chance.\cr
#'
#' In simple terms, the p-value calculation is trying to answer the question: if there were no real
#' similarity between these two assemblages, what is the probability that I would observe a similarity as extreme as
#' (or more extreme than) the one I actually observed, just by chance?\cr
#'
#' The p-values are returned in two matrices: in the first, the p-values are reported as they are, whereas in the second
#' they are classified as <0.05, <0.01, <0.001, or not significant.\cr
#'
#' \strong{Hierarchical agglomerative clustering}\cr
#'
#' By setting the parameter \code{clust} to \code{TRUE}, the units (rows) for which the BR coefficients have been
#' calculated will be clustered. Note that the clustering is based on a dissimilarity matrix which
#' is internally calculated as the maximum value of the BR coefficient (200) minus the observed BR coefficient.
#'
#' This allows a simpler reading of the dendrogram which is produced by the function, where the less dissimilar
#' (i.e., more similar) units will be placed at lower levels, while more dissimilar (i.e., less similar) units
#' will be placed at higher levels within the dendrogram. The latter depicts the hierarchical clustering based
#' (by default) on the Ward's agglomeration method; rectangles identify the selected cluster partition.
#'
#' Optionally, by setting the \code{sort.map} to \code{TRUE}, the heatmap can be reordered on the basis of the hierarchical clustering,
#' with clusters indicated by red rectangles. The number of clusters indicated depends on what requested by the user (see
#' the next section). Note that, internally, the reordering is based on the same agglomeration method
#' selected by the user via the \code{aggl.method} parameter, which is set to \code{ward.D2} by default.\cr
#'
#' \strong{Number of clusters and silhouette method}\cr
#'
#' Besides the dendrogram, a silhouette plot is produced, which allows to measure how 'good' is the selected cluster solution.
#' If the parameter \code{part} is left empty (default), an optimal cluster solution is
#' obtained.
#'
#' The optimal partition is selected via an iterative procedure which identifies at which
#' cluster solution the highest average silhouette width is achieved. The cluster solution ranges from a minimum of
#' 2 to a maximum which is equal to the number of units (i.e., the rows of the input dataset) minus 1.
#' The number of units essentially represents the maximum number of clusters that could potentially be formed
#' if each row were its own cluster. However, since a cluster solution requires at least two groups,
#' the maximum number of meaningful clusters is one less than the number of rows.
#'
#' If a user-defined partition is needed, the user can input the desired number of clusters using the parameter \code{part}.
#'
#' In either case, an additional plot is returned besides the cluster dendrogram and the silhouette plot; it
#' displays a scatterplot in which the cluster solution (x-axis) is plotted against the average
#' silhouette width (y-axis). A black dot represents the partition selected either by the iterative
#' procedure or by the user.
#'
#' Note that in the silhouette plot, the labels on the left-hand side of the
#' chart show the units' names and the cluster number to which each unit is closer.\cr
#'
#' The silhouette plot is obtained from the \code{silhouette()} function out from the \code{cluster} package.
#' For a detailed description of the silhouette plot, its rationale, and its interpretation, see
#' Rousseeuw 1987.
#'
#' \strong{Descriptive by-cluster dotplots}\cr
#'
#' The function also provides a Cleveland's dotplots that represent by-cluster proportions. The
#' clustered units are grouped according to their cluster membership, the frequencies are summed, and
#' then expressed as percentages. The latter are represented by the dotplots, along with the average
#' percentage. The latter provides a frame of reference to understand which percentage is below,
#' above, or close to the average. The raw data on which the plots are based are stored in the
#' list returned by the function (see below).\cr
#'
#' @param df A table (dataframe format) where each row represents an assemblage and each column represents an item.
#' @param num.perm A numeric value indicating the number of permutations to perform in each test (default is 1000).
#' @param clust TRUE (default) or FALSE if the user does or does not want a agglomerative hierarchical clustering to be performed.
#' @param part Desired number of clusters; if NULL (default), an optimal partition is calculated (see Details).
#' @param aggl.meth Agglomeration method ("ward.D2" by default) to be used; the selected method is internally used for the reordering
#' of the heatmap if \code{order.map} is set to \code{TRUE}; for other methods see \code{\link[stats]{hclust}}.
#' @param sort.map TRUE or FALSE (default) if the user does or does not want the rendered heatmap to be ordered on the basis of
#' the selected hierachical clustering.
#' @param number.cex Numeric. Set the size of the labels used for the coefficients displayed in the rendered heatmap.
#' @param cex.dndr.lab Numeric. Set the size of the labels used in the dendrogram.
#' @param cex.sil.lab Numeric. Set the size of the labels used in the silhouette plot.
#' @param cex.dot.plt.lab Numeric. Set the size of the labels used in the Cleveland's dotplots representing by-cluster proportions.
#' @param oneplot TRUE (default) or FALSE if the user wants or does not want the plots to be visualized in a single window.
#'
#'@return The function returns a list storing the following components
#'\itemize{
##' \item \code{BR.similarity.matrix:} similarity matrix reporting the BR coefficients.
##' \item \code{P-value.matrix:} matrix reporting the permuted p-values.
##' \item \code{classified.P-values.matrix:} matrix reporting the permuted p-value classified as <0.05, <0.01, <0.001, or not significant.
##' \item \code{BR.distance_matrix:} distance matrix on which the hierarchical clustering is performed (returned if clustering is selected).
##' \item \code{avr.silh.width.by.n.of.clusters:} average silhouette width by number of clusters (returned if clustering is selected).
##' \item \code{partition.silh.data:} silhouette data for the selected partition (returned if clustering is selected).
##' \item \code{data.with.cluster.membership:} copy of the input data table with an additional column storing the cluster membership for each row (returned if clustering is selected).
##' \item \code{by.cluster.proportion:} table reporting the proportion of column categories across each cluster; rows sum to 100 percent (returned if clustering is selected).
##' }
#'
#' @keywords similarity
#'
#' @export
#'
#' @importFrom corrplot corrplot
#' @importFrom stats as.dist hclust cutree aggregate rect.hclust
#' @importFrom cluster silhouette
#' @importFrom RcmdrMisc assignCluster
#' @importFrom grDevices colorRampPalette
#' @importFrom graphics axis layout par
#'
#' @references Robinson, W. S. (1951). A Method for Chronologically Ordering Archaeological Deposits.
#' In American Antiquity (Vol. 16, Issue 4, pp. 293–301). Cambridge University Press.
#'
#' @references Robinson, W. S., & Brainerd, G. W. (1952). Robinson’s Coefficient of Agreement – A Rejoinder.
#' In American Antiquity (Vol. 18, Issue 1, pp. 60–61). Cambridge University Press.
#'
#' @references Rousseeuw P J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster
#' analysis. In Journal of Computational and Applied Mathematics 20, 53-65.
#'
#' @references DeBoer, W. R., Kintigh, K., & Rostoker, A. G. (1996). Ceramic Seriation and Site Reoccupation in Lowland South America.
#' In Latin American Antiquity (Vol. 7, Issue 3, pp. 263–278). Cambridge University Press.
#'
#'
#' @examples
#' # build a toy dataset (subset of the 'Nelson' dataset out of the 'archdata' package )
#'
#' mytable <- structure(list(Biscuit = c(10, 17, 2, 10, 2, 1),
#' Type_I = c(2,2, 10, 40, 118, 107),
#' Type_II_Red = c(24, 64, 68, 91, 45, 3),
#' Type_II_Yellow = c(23, 90, 18, 20, 1, 0),
#' Type_II_Gray = c(34,76, 48, 15, 5, 0)),
#' row.names = c("1", "2", "3", "7", "8", "9"),
#' class = "data.frame")
#'
#' # run the function and store the results in the 'result' object
#'
#' result <- brsim(mytable, clust=TRUE)
#'
#' # same as above, but with an user-defined cluster partition
#'
#' result <- brsim(mytable, clust=TRUE, part=3)
#'
#' # same as above, but rendering with a reordered heatmap
#'
#' result <- brsim(mytable, clust=TRUE, part=3, sort.map=TRUE)
#'
#'
#' @seealso \code{\link[corrplot]{corrplot}} , \code{\link[cluster]{silhouette}}
#'
#'
brsim <- function(df, num.perm=1000, clust=FALSE, part=NULL, aggl.meth="ward.D2", sort.map=FALSE, number.cex = 0.70, cex.dndr.lab = 0.70, cex.sil.lab = 0.70, cex.dot.plt.lab = 0.70, oneplot=TRUE){
# Save current par settings
oldpar <- par(no.readonly = TRUE)
# Ensure settings are restored when function exits
on.exit(par(oldpar))
# Convert df to matrix
matrix <- as.matrix(df)
# Get the row names to be later attached to the rows and columns of the output matrices
df.row.names <- rownames(df)
# Local function for Brainerd-Robinson Similarity Coefficient
brainard_robinson <- function(assemblage1, assemblage2){
prop_assemblage1 <- assemblage1 / sum(assemblage1)
prop_assemblage2 <- assemblage2 / sum(assemblage2)
min_sum <- sum(pmin(prop_assemblage1, prop_assemblage2))
return(200 * min_sum)
}
# Local function for permutation test
perm_test_br <- function(assemblage1, assemblage2, num.perm=1000){
observed_br <- brainard_robinson(assemblage1, assemblage2)
permuted_br <- numeric(num.perm)
# Calculate the global pool of all assemblages
global_pool <- sum(matrix)
for(i in 1:num.perm){
# Draw two random samples from the global pool
permuted_assemblage1 <- sample(global_pool, size = length(assemblage1), replace = TRUE)
permuted_assemblage2 <- sample(global_pool, size = length(assemblage2), replace = TRUE)
permuted_br[i] <- brainard_robinson(permuted_assemblage1, permuted_assemblage2)
}
# Calculate the p-value as the proportion of permuted BR coefficients
# that are less than or equal to the observed BR coefficient
p_value <- mean(permuted_br <= observed_br) # notice: smaller or equal as per https://www.mattpeeples.net/BR.html
return(p_value)
}
num_assemblages <- nrow(matrix)
br_matrix <- matrix(0, nrow=num_assemblages, ncol=num_assemblages)
p_matrix <- matrix(0, nrow=num_assemblages, ncol=num_assemblages)
p_class_matrix <- matrix("-", nrow=num_assemblages, ncol=num_assemblages)
for(i in 1:(num_assemblages-1)){
for(j in (i+1):num_assemblages){
br_matrix[i,j] <- brainard_robinson(matrix[i,], matrix[j,])
br_matrix[j,i] <- br_matrix[i,j]
p_value <- perm_test_br(matrix[i,], matrix[j,], num.perm)
p_matrix[i,j] <- p_value
p_matrix[j,i] <- p_value
# Classify the p-values
if(p_value < 0.001) {
p_class <- "<0.001"
} else if(p_value < 0.01) {
p_class <- "<0.01"
} else if(p_value < 0.05) {
p_class <- "<0.05"
} else {
p_class <- "n.s."
}
p_class_matrix[i,j] <- p_class
p_class_matrix[j,i] <- p_class
}
}
# Set the diagonal elements of br_matrix to 200
diag(br_matrix) <- 200
# Attach the input table's row names to the output matrices
rownames(br_matrix) <- df.row.names
colnames(br_matrix) <- df.row.names
rownames(p_matrix) <- df.row.names
colnames(p_matrix) <- df.row.names
rownames(p_class_matrix) <- df.row.names
colnames(p_class_matrix) <- df.row.names
# Define the color palette to be used later on in the corrplot
col1 <- colorRampPalette(c("lightblue", "blue", "darkblue"))
if(clust==TRUE & oneplot==TRUE){
m <- rbind(c(1,2), c(3,4), c(5,6))
layout(m)
}
if(clust==TRUE){
dist.matrix <- stats::as.dist(200 - br_matrix, diag=TRUE, upper=TRUE)
fit <- stats::hclust(dist.matrix, method = aggl.meth)
# Calculate the max number of clusters to be later used in the loop
# which iterates the calculation of the average silhouette width
rd <- dim(matrix)[1]
max.ncl <- rd-1
# Create a slot to store the values of the average silhouette value at increasing numbers of cluster solutions
# this will be used inside the loop
sil.width.val <- numeric(max.ncl - 1)
# Create a slot to store the increasing number of clusters whose silhouette is iteratively computed
sil.width.step <- c(2:max.ncl)
# Calculate the average silhouette width at each increasing cluster solution
for (i in min(sil.width.step):max(sil.width.step)) {
counter <- i - 1
clust.sol <- cluster::silhouette(cutree(fit, k = i), dist.matrix)
sil.width.val[counter] <- mean(clust.sol[, 3])
}
# Create a dataframe that stores the number of cluster solutions and the corresponding average silhouette values
sil.res <- as.data.frame(cbind(sil.width.step, sil.width.val))
# If the user does not enter the desired partition, the latter is equal to the optimal one,
# ie the one with the largest average silhouette; otherwise use the user-defined partition
ifelse(is.null(part)==TRUE,
select.clst.num <- sil.res$sil.width.step[sil.res$sil.width.val == max(sil.res$sil.width.val)],
select.clst.num <- part)
# Create the final silhouette data using the partition established at the preceding step
final.sil.data <- cluster::silhouette(stats::cutree(fit, k = select.clst.num), dist.matrix)
# Add the cluster membership to a copy of the input dataframe
data.w.cluster.membership <- df
data.w.cluster.membership$clust<- assignCluster(df, df, cutree(fit, k = select.clst.num))
# Aggregate the rows of the input dataframe by summing by cluster membership
# x[-ncol(x)] tells R to work on all the dataframe excluding the last columnn since, as per the preceding step, it contains the
# cluster membership
sum.by.clust <- stats::aggregate(data.w.cluster.membership[-ncol(data.w.cluster.membership)], list(data.w.cluster.membership$clust), sum)[,-1]
# Number of rows of the preceding dataframe
nrows <- nrow(sum.by.clust)
# Add the columns total as a new row of the preceding dataframe
sum.by.clust[nrows+1,] <- colSums(sum.by.clust, dims=1)
# Turn the counts to percentages
prop.by.clust <- as.data.frame(round(prop.table(as.matrix(sum.by.clust), 1)*100,3))
# Add row names to the above dataframe:
# add reference to cluster membership to all the rows but the last one
rownames(prop.by.clust)[seq(1,nrows,1)] <- paste0("cluster ", seq(1,select.clst.num, 1))
# Add reference to the average percentage as name of the last row of the dataframe
rownames(prop.by.clust)[nrows+1] <- "average"
# Plot the Brainerd-Robinson matrix, upon condition
if(sort.map==TRUE){
corrplot::corrplot(br_matrix,
method="square",
addCoef.col="red",
number.cex = number.cex,
is.corr=FALSE,
col = col1(100),
tl.col="black",
tl.cex=0.8,
order="hclust",
hclust.method=aggl.meth,
addrect = select.clst.num,
rect.col="red")
} else {
corrplot::corrplot(br_matrix,
method="square",
addCoef.col="red",
number.cex = number.cex,
is.corr=FALSE,
col = col1(100),
tl.col="black",
tl.cex=0.8)
}
# Plot the dendrogram
graphics::plot(fit, main = "Clusters Dendrogram based on dissimilarity (1-BR sim. coeff.)",
sub = paste0("\nAgglomeration method: ", aggl.meth),
xlab = "",
cex = cex.dndr.lab,
cex.main = 0.85,
cex.sub = 0.75)
# Add the rectangles representing the selected partitions
solution <- stats::rect.hclust(fit, k = select.clst.num, border="black")
# Modify the row names of the final silhouette dataframe to also store the cluster to which each feature is closest
rownames(final.sil.data) <- paste(rownames(df), final.sil.data[, 2], sep = "_")
# Plot the silhouette chart
graphics::plot(final.sil.data,
cex.names = cex.sil.lab,
max.strlen = 30,
nmax.lab = rd + 1,
main = "Silhouette plot")
# Add a line representing the average silhouette width
graphics::abline(v = mean(final.sil.data[, 3]), lty = 2)
# Plot the average silhouette value at each cluster solution
graphics::plot(sil.res, xlab = "number of clusters",
ylab = "average silhouette width",
ylim = c(0, 1),
xaxt = "n",
type = "b",
main = "Average silhouette width vs. number of clusters",
sub = paste0("values on the y-axis represent the average silhouette width at each cluster solution"),
cex.main=0.85,
cex.sub = 0.75)
axis(1, at = 0:max.ncl, cex.axis = 0.7)
# Draw a black dot corresponding to the selected partition
graphics::points(x=select.clst.num,
y=sil.res[select.clst.num-1,2],
pch=20)
# Draw the label selecting the row from the sil.res dataframe corresponding to the selected partition
graphics::text(x = select.clst.num,
y = sil.res[select.clst.num-1,2],
labels= round(sil.res[select.clst.num-1,2],3),
cex = 0.65,
pos = 3,
offset = 1.2,
srt = 90)
# Plot the Cleveland's dotcharts of the proportions by cluster
graphics::dotchart(as.matrix(prop.by.clust),
main="By-cluster proportions",
xlab="%",
xlim=c(0,100),
pt.cex=0.85,
cex=cex.dot.plt.lab,
pch=20)
graphics::dotchart(as.matrix(t(prop.by.clust)),
main="By-cluster proportions",
xlab="%",
xlim=c(0,100),
pt.cex=0.85,
cex=cex.dot.plt.lab,
pch=20)
# Rename the columns of the sil.res dataframe to give
# more meaningful labels before returning it
colnames(sil.res)[1] <- "cluster solution"
colnames(sil.res)[2] <- "average silhouette width"
}
if(clust==FALSE) {
corrplot::corrplot(br_matrix,
method="square",
addCoef.col="red",
number.cex = number.cex,
is.corr=FALSE,
col = col1(100),
tl.col="black",
tl.cex=0.8)
dist.matrix <- NULL
sil.res <- NULL
final.sil.data <- NULL
prop.by.clust <- NULL
data.w.cluster.membership <- NULL
}
return (list("BR.similarity.matrix"=br_matrix,
"P-value.matrix"=p_matrix,
"classified.P-values.matrix"=p_class_matrix,
"BR.distance_matrix"=dist.matrix,
"avr.silh.width.by.n.of.clusters"=sil.res,
"partition.silh.data"=final.sil.data,
"data.with.cluster.membership"=data.w.cluster.membership,
"by.cluster.proportion"=prop.by.clust))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.