#' Identify studies contributing to heterogeneity patterns found in GOSH plots
#'
#' This function uses three unsupervised learning algorithms
#' (\emph{k}-means, DBSCAN and Gaussian Mixture Models) to identify studies
#' contributing to the heterogeneity-effect size patterns found in GOSH
#' (graphic display of study heterogeneity) plots.
#'
#' @usage gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
#' km.params = list(centers = 3,
#' iter.max = 10, nstart = 1,
#' algorithm = c("Hartigan-Wong",
#' "Lloyd", "Forgy","MacQueen"),
#' trace = FALSE),
#' db.params = list(eps = 0.15, MinPts = 5,
#' method = c("hybrid", "raw", "dist")),
#' gmm.params = list(G = NULL, modelNames = NULL,
#' prior = NULL, control = emControl(),
#' initialization = list(hcPairs = NULL,
#' subset = NULL,
#' noise = NULL),
#' Vinv = NULL,
#' warn = mclust.options("warn"),
#' x = NULL, verbose = FALSE),
#' seed = 123,
#' verbose = TRUE)
#'
#' @param data An object of class \code{gosh.rma} created through the
#' \code{\link[metafor]{gosh}} function.
#' @param km Logical. Should the \emph{k}-Means algorithm be used to identify
#' patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param db Logical. Should the DBSCAN algorithm be used to identify patterns
#' in the GOSH plot matrix? \code{TRUE} by default.
#' @param gmm Logical. Should a bivariate Gaussian Mixture Model be used to
#' identify patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param km.params A list containing the parameters for the \emph{k}-Means algorithm
#' as implemented in \code{kmeans}. Run \code{?kmeans} for further details.
#' @param db.params A list containing the parameters for the DBSCAN algorithm
#' as implemented in \code{\link[fpc]{dbscan}}. Run \code{?fpc::dbscan} for further details.
#' @param gmm.params A list containing the parameters for the Gaussian Mixture Models
#' as implemented in \code{\link[mclust]{mclustBIC}}. Run \code{?mclust::mclustBIC} for further details.
#' @param seed Seed used for reproducibility. Default seed is \code{123}.
#' @param verbose Logical. Should a progress bar be printed in the console during clustering?
#'
#' @details
#'
#' \strong{GOSH Plots}
#'
#' GOSH (\emph{graphic display of study
#' heterogeneity}) plots were proposed by Olkin, Dahabreh and Trikalinos
#' (2012) as diagnostic plots for assessing effect size heterogeneity. GOSH plots
#' facilitate the detection of both (i) outliers and (ii) distinct homogeneous
#' subgroups within the modeled data.
#'
#' Data for the plots is generated by fitting a random-effects model with the
#' same specifications as in the meta-analysis to all \eqn{2^k-1} possible
#' non-empty subsets of the \eqn{k} studies in an analysis, provided that
#' \eqn{2^k-1 \leq 10^6}. When the number of possible subsets exceeds
#' \eqn{10^6}, the \code{\link[metafor]{gosh}} function randomly samples one
#' million subsets and uses them for model fitting.
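#'
#' A minimal sketch of how such a \code{gosh.rma} object may be created with
#' \pkg{metafor} and passed to \code{gosh.diagnostics} (the data set and
#' effect size measure below are purely illustrative):
#'
#' \preformatted{
#' library(metafor)
#' dat <- escalc(measure = "RR", ai = tpos, bi = tneg,
#'               ci = cpos, di = cneg, data = dat.bcg)
#' m.rma <- rma(yi, vi, data = dat)
#' m.gosh <- gosh(m.rma, subsets = 10000)  # cap the number of fitted subsets
#' res <- gosh.diagnostics(m.gosh)
#' }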
#'
#'
#' \strong{GOSH Plot Diagnostics}
#'
#' Although GOSH plots make it possible to detect heterogeneity patterns and
#' distinct subgroups within the data, determining which studies contribute to
#' a certain subgroup or pattern is often difficult or computationally
#' intensive. To facilitate the detection of studies responsible for specific
#' patterns in a GOSH plot, this function randomly samples \eqn{10^4}
#' data points from the GOSH plot data (to speed up computation). Of each data
#' point, only the \eqn{z}-transformed \eqn{I^2} and effect size values are
#' used (as the other heterogeneity metrics produced for the GOSH plot data by
#' the \code{\link[metafor]{gosh}} function are linear combinations of
#' \eqn{I^2}). Three clustering algorithms are then applied to this data (a
#' tuning sketch follows the list).
#' \itemize{ \item The first algorithm is \emph{k}-Means clustering using the
#' algorithm by Hartigan & Wong (1979) and \eqn{m_k = 3} cluster centers by
#' default. The function uses the \code{\link[stats]{kmeans}} implementation
#' to perform \emph{k}-Means clustering. \item As \emph{k}-Means does not
#' perform well in the presence of distinct arbitrary subclusters and noise,
#' the function also applies \strong{DBSCAN} (\emph{density-based spatial
#' clustering of applications with noise}; Schubert et al., 2017). The
#' hyperparameters \eqn{\epsilon} and \eqn{MinPts} can be tuned for each
#' analysis to maintain a reasonable amount of granularity while not
#' producing too many subclusters. The function uses the
#' \code{\link[fpc]{dbscan}} implementation to perform DBSCAN clustering.
#' \item Lastly, as a clustering approach based on a probabilistic model,
#' Gaussian Mixture Models (GMM; Fraley & Raftery, 2002) are integrated into
#' the function through an internal call to the \code{\link[mclust]{mclustBIC}}
#' implementation. Clustering hyperparameters can be tuned by providing a
#' list of parameters for the \code{mclustBIC} function in the \code{mclust}
#' package.}
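#'
#' A hedged sketch of how these hyperparameters can be tuned via the
#' \code{km.params}, \code{db.params} and \code{gmm.params} arguments (the
#' values shown are purely illustrative; list entries that are omitted are
#' assumed to fall back to the defaults of the underlying functions):
#'
#' \preformatted{
#' res <- gosh.diagnostics(m.gosh,
#'                         km.params  = list(centers = 4),
#'                         db.params  = list(eps = 0.10, MinPts = 10),
#'                         gmm.params = list(G = 1:4))
#' }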
#'
#' To assess which studies predominantly contribute to a detected cluster,
#' the function calculates the cluster imbalance of each study as the
#' difference between (i) the expected share of subsets containing that study
#' if the cluster makeup were purely random (i.e., representative of the full
#' sample) and (ii) the actual share of subsets containing that study within
#' the cluster. Cook's distance is then calculated for each study based on an
#' intercept-only linear model to determine the study's leverage on each
#' cluster makeup. Studies whose leverage exceeds three times the mean in any
#' of the generated clusters (across all clustering algorithms used) are
#' returned as potentially influential cases, and the GOSH plot is redrawn
#' with these studies highlighted (see the sketch below for the flagging rule).
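#'
#' A minimal sketch (with toy numbers) of the flagging rule: an intercept-only
#' model is fit to the per-study imbalance values of one cluster, and studies
#' whose Cook's distance exceeds three times the mean distance are flagged.
#'
#' \preformatted{
#' imbalance <- c(0.30, -0.02, 0.01, 0.03, -0.05, -0.04)  # one value per study
#' cd <- cooks.distance(lm(imbalance ~ 1))
#' which(cd > 3 * mean(cd))  # studies flagged as potentially influential
#' }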
#'
#' @references
#' Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis,
#' and density estimation. \emph{Journal of the American Statistical Association, 97}(458), 611-631.
#'
#' Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm.
#' \emph{Journal of the Royal Statistical Society. Series C (Applied Statistics), 28}(1), 100-108.
#'
#' Olkin, I., Dahabreh, I. J., & Trikalinos, T. A. (2012). GOSH - a graphical display of study heterogeneity.
#' \emph{Research Synthesis Methods, 3}(3), 214-223.
#'
#' Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited:
#' Why and how you should (still) use DBSCAN. \emph{ACM Transactions on Database Systems (TODS), 42}(3), Article 19.
#'
#'
#' @author Mathias Harrer & David Daniel Ebert
#'
#' @import grid gridExtra ggplot2
#' @importFrom fpc dbscan
#' @importFrom stats kmeans cooks.distance lm complete.cases
#' @importFrom mclust mclustBIC Mclust emControl mclust.options predict.Mclust
#' @importFrom grDevices rainbow
#'
#'
#' @export gosh.diagnostics
#'
#' @seealso \code{\link{InfluenceAnalysis}}
#'
#' @examples
#' # Example: load gosh data (created with metafor's 'gosh' function),
#' # then use function
#' \dontrun{
#' data("m.gosh")
#' res <- gosh.diagnostics(m.gosh)
#'
#' # Look at results
#' summary(res)
#'
#' # Plot detected clusters
#' plot(res, which = "cluster")
#'
#' # Plot outliers
#' plot(res, which = "outlier")
#' }
gosh.diagnostics = function(data,
km = TRUE,
db = TRUE,
gmm = TRUE,
km.params = list(centers = 3,
iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong", "Lloyd", "Forgy","MacQueen"),
trace = FALSE),
db.params = list(eps = 0.15, MinPts = 5,
method = c("hybrid", "raw", "dist")),
gmm.params = list(G = NULL, modelNames = NULL,
prior = NULL, control = emControl(),
initialization = list(hcPairs = NULL,
subset = NULL,
noise = NULL),
Vinv = NULL, warn = mclust.options("warn"),
x = NULL, verbose = FALSE),
seed = 123,
verbose = TRUE) {
# Keep a copy of the input data
sav = data
do.km = km; rm(km)
do.db = db; rm(db)
do.gmm = gmm; rm(gmm)
# set seed
set.seed(seed)
# Check input
if (!inherits(data, "gosh.rma")){
stop("Argument 'data' provided does not have class 'gosh.rma'.")
}
if (do.km == FALSE & do.db == FALSE & do.gmm == FALSE){
stop("At least one of 'km', 'db', or 'gmm' must be set to TRUE.")
}
# Start loading bar
if (verbose == TRUE){
cat(" ", "\n", "Perform Clustering...", "\n")
cat(" |")
}
# Create full dataset from gosh output
dat.full = sav$res[complete.cases(sav$res),]
dat.full = cbind(dat.full, sav$incl[complete.cases(sav$res),])
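# dat.full: the first columns contain the subset-level results from the
# 'gosh' output (e.g. I2 and the pooled estimate); the remaining columns are
# 0/1 indicators showing whether a study was included in a given subset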
# Create dataset for k-Means
dat.km = data.frame(scale(dat.full$I2, center = TRUE, scale = TRUE),
scale(dat.full$estimate, center = TRUE,
scale = TRUE))
colnames(dat.km) = c("I2", "estimate")
# Create dataset for DBSCAN
# DBSCAN can become too computationally intensive for very large GOSH data.
# If there are more than 10,000 subsets, 10,000 data points are therefore
# randomly sampled.
if (nrow(dat.full) < 10000) {
dat.db.full = dat.full
} else {
dat.db.full = dat.full[sample(1:nrow(dat.full), 10000), ] # Sample 10,000 rows
}
dat.db = data.frame(scale(dat.db.full$I2, center = TRUE, scale = TRUE),
scale(dat.db.full$estimate, center = TRUE, scale = TRUE))
colnames(dat.db) = c("I2", "estimate")
if (verbose == TRUE){
cat("==========")
}
# K-Means
# Fit k-means to the scaled I2 / effect size data using the supplied parameters
km.params$x = dat.km
km = do.call(stats::kmeans, km.params)
# Only use 5000 rows for plotting to increase speed
if (length(as.numeric(km$cluster)) > 5000){
km.plot.mask = sample(1:length(as.numeric(km$cluster)), 5000)
km.plot = km
km.plot$cluster = km$cluster[km.plot.mask]
dat.km.plot = dat.km[km.plot.mask,]
} else {
km.plot = km
dat.km.plot = dat.km
}
levels.km = unique(km.plot$cluster)[order(unique(km.plot$cluster))]
dat.km.plot$cluster = factor(km.plot$cluster, levels = levels.km)
km.clusterplot = ggplot(data = dat.km.plot,
aes(x = estimate, y = I2, color = cluster)) +
geom_point(cex = 0.5, alpha = 0.8) +
ylab(expression(italic(I)^2~(z-score))) +
xlab("Effect Size (z-score)") +
theme_minimal() +
ggtitle("K-means Algorithm") +
labs(color = "Cluster")
# DBSCAN
db.params$data = dat.db
db = do.call(fpc::dbscan, db.params)
# Only use 5000 rows for plotting to increase speed
if (length(as.numeric(db$cluster)) > 5000){
db.plot.mask = sample(1:length(as.numeric(db$cluster)), 5000)
db.plot = db
db.plot$cluster = db$cluster[db.plot.mask]
dat.db.plot = dat.db[db.plot.mask,]
} else {
db.plot = db
dat.db.plot = dat.db
}
if (verbose == TRUE){
cat("==========")
}
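# DBSCAN labels noise points as cluster 0; relabel them as "Outlier"
# and draw them in black in the cluster plot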
levels.db = unique(db.plot$cluster)[order(unique(db.plot$cluster))]
dat.db.plot$cluster = factor(db.plot$cluster, levels = levels.db)
levels(dat.db.plot$cluster)[levels(dat.db.plot$cluster) == "0"] = "Outlier"
color.db = rainbow(nlevels(dat.db.plot$cluster))
color.db[1] = "#000000"
db.clusterplot = ggplot(data = dat.db.plot,
aes(x = estimate, y = I2, color = cluster)) +
geom_point(cex = 0.5, alpha = 0.7) +
ylab(expression(italic(I)^2~(z-score))) +
xlab("Effect Size (z-score)") +
theme_minimal() +
ggtitle("DBSCAN Algorithm (black dots are outliers)") +
scale_color_manual(values = color.db) +
labs(color = "Cluster")
# GMM
# Use same data as used for DBSCAN
dat.gmm.full = dat.db.full
dat.gmm = dat.db
gmm.params$data = dat.gmm
# Search for optimal solution
gmm.bic = do.call(mclust::mclustBIC, gmm.params)
gmm = mclust::Mclust(data = dat.gmm, x = gmm.bic)
# Only use 5000 rows for plotting to increase speed
if (length(as.numeric(gmm$classification)) > 5000){
gmm.plot.mask = sample(1:length(as.numeric(gmm$classification)), 5000)
dat.gmm.plot = dat.gmm[gmm.plot.mask,]
dat.gmm.plot$cluster = predict.Mclust(gmm)$classification[gmm.plot.mask]
} else {
dat.gmm.plot = dat.gmm
dat.gmm.plot$cluster = predict.Mclust(gmm)$classification
}
if (verbose == TRUE){
cat("==========")
}
levels.gmm = unique(dat.gmm.plot$cluster)[order(unique(dat.gmm.plot$cluster))]
dat.gmm.plot$cluster = factor(dat.gmm.plot$cluster, levels = levels.gmm)
gmm.clusterplot = ggplot(data = dat.gmm.plot,
aes(x = estimate, y = I2, color = cluster)) +
geom_point(cex = 0.5, alpha = 0.8) +
ylab(expression(italic(I)^2~(z-score))) +
xlab("Effect Size (z-score)") +
theme_minimal() +
ggtitle("Gaussian Mixture Model") +
labs(color = "Cluster")
if (verbose == TRUE){
cat("==========")
}
# Append the cluster assignments to the full data sets
dat.km.full = dat.full
dat.km.full$cluster = km$cluster
dat.db.full$cluster = db$cluster
dat.gmm.full$cluster = gmm$classification
####################################################
# Extract the Percentages###########################
# K-Means############################################
dat.km.full$cluster = as.factor(dat.km.full$cluster)
n.levels.km = nlevels(dat.km.full$cluster)
# Total number of subsets containing each study
# (keep only the study-inclusion columns; drop model results and cluster)
dat.km.full.total = dat.km.full[, -c(1:6, ncol(dat.km.full))]
n.cluster.tots = apply(dat.km.full.total, 2, sum)
n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
colnames(n.cluster.tots) = c("n.tots")
if (verbose == TRUE){
cat("==========")
}
# Number of subsets containing each study, per cluster
n = sapply(split(dat.km.full.total, dat.km.full$cluster), function(x) apply(x, 2, sum))
# Cluster imbalance: share of a study's subsets falling into each cluster,
# centered at the mean share across all studies
deltas = as.data.frame(apply(n, 2,
function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))
# Generate the plot
Cluster = factor(rep(1:n.levels.km, each = nrow(deltas)))
Study = rep(1:nrow(deltas), n.levels.km)
Delta_Percentage = unlist(deltas)
delta.df = data.frame(Cluster, Delta_Percentage, Study)
km.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) +
geom_line(aes(color = Cluster)) +
geom_point(aes(color = Cluster)) +
scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
scale_y_continuous(name = "Delta Percentage") +
theme(axis.text = element_text(size = 5)) +
ggtitle("Cluster imbalance (K-Means algorithm)") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
####################################################
# Cook's Distance Plot###########################
# K-Means############################################
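# Fit an intercept-only model to each cluster's imbalance values; a study's
# Cook's distance then quantifies its leverage on that cluster's makeup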
m.cd.km = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
m.cd.km$`0` = NULL
m.cd.km = lapply(m.cd.km, cooks.distance)
m.cd.km.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.km)))
m.cd.km.df$Cluster = as.factor(rep(1:(n.levels.km), each = nrow(deltas)))
m.cd.km.df$Study = rep(1:nrow(deltas), times = (n.levels.km))
outlier.cd.km = 3 * mean(m.cd.km.df$Cooks.Distance)
# With only two cluster levels, add a small vertical offset to cluster 2 so
# that overlapping Cook's distance lines remain distinguishable in the plot
if (n.levels.km <= 2){
m.cd.km.df[m.cd.km.df$Cluster=="2", "Cooks.Distance"] = m.cd.km.df[m.cd.km.df$Cluster=="2", "Cooks.Distance"] + 0.01
}
km.cd.plot = ggplot(data = m.cd.km.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
geom_line(aes(color=Cluster), alpha = 0.5) +
geom_point(aes(color = Cluster)) +
scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
scale_y_continuous(name = "Cook's Distance") +
theme(axis.text = element_text(size = 5)) +
ggtitle("Cluster imbalance (Cook's Distance)") +
geom_hline(yintercept = outlier.cd.km, linetype = "dashed") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
if (verbose == TRUE){
cat("==========")
}
####################################################
# Extract the Percentages###########################
# DBSCAN############################################
dat.db.full$cluster = as.factor(dat.db.full$cluster)
n.levels.db = nlevels(dat.db.full$cluster)
# Total number of subsets containing each study
# (keep only the study-inclusion columns; drop model results and cluster)
dat.db.full.total = dat.db.full[, -c(1:6, ncol(dat.db.full))]
n.cluster.tots = apply(dat.db.full.total, 2, sum)
n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
colnames(n.cluster.tots) = c("n.tots")
# Number of subsets containing each study, per cluster
n = sapply(split(dat.db.full.total, dat.db.full$cluster), function(x) apply(x, 2, sum))
# Cluster imbalance: share of a study's subsets falling into each cluster,
# centered at the mean share across all studies
deltas = as.data.frame(apply(n, 2, function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))
# Generate the plot
Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
Study = rep(1:nrow(deltas), n.levels.db)
Delta_Percentage = unlist(deltas)
delta.df = data.frame(Cluster, Delta_Percentage, Study)
delta.df = delta.df[delta.df$Cluster != 0,] # Cluster 0 (DBSCAN outliers) is removed
db.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) +
geom_line(aes(color = Cluster)) +
geom_point(aes(color = Cluster)) +
scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
scale_y_continuous(name = "Delta Percentage") +
theme(axis.text = element_text(size = 5)) +
ggtitle("Cluster imbalance (Density-Based Clustering)") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
if (verbose == TRUE){
cat("==========")
}
####################################################
# Cook's Distance Plot###########################
# DBSCAN############################################
m.cd.db = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
m.cd.db$`0` = NULL
m.cd.db = lapply(m.cd.db, cooks.distance)
m.cd.db.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.db)))
m.cd.db.df$Cluster = as.factor(rep(1:(n.levels.db - 1), each = nrow(deltas)))
m.cd.db.df$Study = rep(1:nrow(deltas), times = (n.levels.db - 1))
outlier.cd.db = 3 * mean(m.cd.db.df$Cooks.Distance)
# With only two cluster levels, add a small vertical offset to cluster 2 so
# that overlapping Cook's distance lines remain distinguishable in the plot
if (n.levels.db <= 2){
m.cd.db.df[m.cd.db.df$Cluster=="2", "Cooks.Distance"] = m.cd.db.df[m.cd.db.df$Cluster=="2", "Cooks.Distance"] + 0.01
}
db.cd.plot = ggplot(data = m.cd.db.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
geom_line(aes(color=Cluster), alpha = 0.5) +
geom_point(aes(color = Cluster)) +
scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
scale_y_continuous(name = "Cook's Distance") +
theme(axis.text = element_text(size = 5)) +
ggtitle("Cluster imbalance (Cook's Distance)") +
geom_hline(yintercept = outlier.cd.db, linetype = "dashed") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
if (verbose == TRUE){
cat("==========")
}
####################################################
# Extract the Percentages###########################
# GMM ############################################
dat.gmm.full$cluster = as.factor(dat.gmm.full$cluster)
n.levels.gmm = nlevels(dat.gmm.full$cluster)
# Total number of subsets containing each study
# (keep only the study-inclusion columns; drop model results and cluster)
dat.gmm.full.total = dat.gmm.full[, -c(1:6, ncol(dat.gmm.full))]
n.cluster.tots = apply(dat.gmm.full.total, 2, sum)
n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
colnames(n.cluster.tots) = c("n.tots")
# Number of subsets containing each study, per cluster
n = sapply(split(dat.gmm.full.total, dat.gmm.full$cluster), function(x) apply(x, 2, sum))
# Cluster imbalance: share of a study's subsets falling into each cluster,
# centered at the mean share across all studies
deltas = as.data.frame(apply(n, 2, function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))
# Generate the plot
Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
Study = rep(1:nrow(deltas), n.levels.gmm)
Delta_Percentage = unlist(deltas)
delta.df = data.frame(Cluster, Delta_Percentage, Study)
gmm.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) +
geom_line(aes(color = Cluster)) +
geom_point(aes(color = Cluster)) +
scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
scale_y_continuous(name = "Delta Percentage") +
theme(axis.text = element_text(size = 5)) +
ggtitle("Cluster imbalance (GMM)") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
####################################################
# Cook's Distance Plot###########################
# GMM ############################################
m.cd.gmm = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
m.cd.gmm$`0` = NULL
m.cd.gmm = lapply(m.cd.gmm, cooks.distance)
m.cd.gmm.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.gmm)))
m.cd.gmm.df$Cluster = as.factor(rep(1:(n.levels.gmm), each = nrow(deltas)))
m.cd.gmm.df$Study = rep(1:nrow(deltas), times = (n.levels.gmm))
outlier.cd.gmm = 3 * mean(m.cd.gmm.df$Cooks.Distance)
# With only two cluster levels, add a small vertical offset to cluster 2 so
# that overlapping Cook's distance lines remain distinguishable in the plot
if (n.levels.gmm <= 2){
m.cd.gmm.df[m.cd.gmm.df$Cluster=="2", "Cooks.Distance"] = m.cd.gmm.df[m.cd.gmm.df$Cluster=="2", "Cooks.Distance"] + 0.01
}
gmm.cd.plot = ggplot(data = m.cd.gmm.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
geom_line(aes(color=Cluster), alpha = 0.5) +
geom_point(aes(color = Cluster)) +
scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
scale_y_continuous(name = "Cook's Distance") +
theme(axis.text = element_text(size = 5)) +
ggtitle("Cluster imbalance (Cook's Distance)") +
geom_hline(yintercept = outlier.cd.gmm, linetype = "dashed") +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_minimal()
#######################
# Generate Output Plot#
#######################
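# Assemble the return list: number of detected clusters and, for each
# algorithm, the cluster plot combined with its imbalance and Cook's
# distance panels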
returnlist = list()
if (do.km == TRUE){
returnlist$km.clusters = n.levels.km
km.sideplots = gridExtra::arrangeGrob(km.plot, km.cd.plot, nrow=2)
returnlist$km.plot = gridExtra::arrangeGrob(km.clusterplot, km.sideplots, ncol = 2)
}
if (do.db == TRUE){
returnlist$db.clusters = n.levels.db-1
db.sideplots = gridExtra::arrangeGrob(db.plot, db.cd.plot, nrow=2)
returnlist$db.plot = gridExtra::arrangeGrob(db.clusterplot, db.sideplots, ncol = 2)
}
if (do.gmm == TRUE){
returnlist$gmm.clusters = n.levels.gmm
gmm.sideplots = gridExtra::arrangeGrob(gmm.plot, gmm.cd.plot, nrow=2)
returnlist$gmm.plot = gridExtra::arrangeGrob(gmm.clusterplot, gmm.sideplots, ncol = 2)
}
############################################
# Plot GOSH for potential outlying studies #
############################################
# Get outlying studies
# Kmeans
outlier.studies.km.df = m.cd.km.df[m.cd.km.df$Cooks.Distance>outlier.cd.km,]
outlier.studies.km = unique(outlier.studies.km.df$Study)
# DBSCAN
outlier.studies.db.df = m.cd.db.df[m.cd.db.df$Cooks.Distance>outlier.cd.db,]
outlier.studies.db = unique(outlier.studies.db.df$Study)
# GMM
outlier.studies.gmm.df = m.cd.gmm.df[m.cd.gmm.df$Cooks.Distance>outlier.cd.gmm,]
outlier.studies.gmm = unique(outlier.studies.gmm.df$Study)
# Use all identified outliers
outlier.studies.all = unique(c(outlier.studies.km, outlier.studies.db, outlier.studies.gmm))
outlier.studies.all.mask = outlier.studies.all + 6 # Offset by 6 to skip the model-result columns (study i is column i + 6)
# Get plotting dataset and only choose outlier studies as mask, use db data
if (length(as.numeric(db$cluster)) > 5000){
dat.all.outliers = dat.db.full[db.plot.mask, c(3,6, outlier.studies.all.mask)]
} else {
dat.all.outliers = dat.db.full[,c(3,6, outlier.studies.all.mask)]
}
if (length(outlier.studies.all) > 0){
# Loop through all identified outliers
for (i in 1:length(outlier.studies.all)){
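# Redraw the GOSH scatter plot, coloring subsets by whether the flagged study
# is included, and add marginal density plots for effect size and I2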
outlier.plot = ggplot(data = dat.all.outliers, aes(x = estimate,
y = I2,
color = dat.all.outliers[,i+2])) +
geom_point(alpha=0.8) +
scale_color_manual(values = c("lightgrey", "#00BFC4")) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold")) +
xlab("Effect Size") +
ylab("I-squared")
density.db.upper = ggplot(data = dat.all.outliers, aes(x = estimate,
fill = dat.all.outliers[,i+2])) +
geom_density(alpha = 0.5) +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_blank(),
axis.line.x = element_blank(),
axis.title.y = element_text(color="white"),
axis.text.y = element_text(color="white"),
axis.line.y = element_line(color="white")
) +
scale_fill_manual(values = c("lightgrey", "#00BFC4"))
blankPlot = ggplot()+geom_blank(aes(1,1))+
theme(plot.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.line.x = element_blank(),
axis.line.y = element_blank()
)
density.db.right = ggplot(data = dat.all.outliers, aes(x = I2,
fill = dat.all.outliers[,i+2])) +
geom_density(alpha = 0.5) +
theme_classic() +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "none",
plot.background = element_blank(),
axis.line.y = element_blank(),
axis.title.x = element_text(color="white"),
axis.text.x = element_text(color="white"),
axis.line.x = element_line(color="white")
) +
scale_fill_manual(values = c("lightgrey", "#00BFC4")) +
coord_flip()
returnlist[[paste0("plot.study", outlier.studies.all[i], ".removed")]] =
gridExtra::arrangeGrob(density.db.upper,
blankPlot,
outlier.plot,
density.db.right,
nrow = 2,
ncol = 2,
heights = c(1,5),
widths = c(4,1),
top = paste("Study ", outlier.studies.all[i]))
}
}
if (do.km == TRUE){
returnlist$outlier.studies.km = outlier.studies.km
}
if (do.db == TRUE){
returnlist$outlier.studies.db = outlier.studies.db
}
if (do.gmm == TRUE){
returnlist$outlier.studies.gmm = outlier.studies.gmm
}
if (verbose == TRUE){
cat("==========| DONE \n")
cat("\n")
}
class(returnlist) = c("gosh.diagnostics", "list")
return(returnlist)
}