# inst/shiny/funcs/gosh.diagnostics.R

#' Identify studies contributing to heterogeneity patterns found in GOSH plots
#'
#' This function uses three unsupervised learning algorithms
#' (\emph{k}-means, DBSCAN and Gaussian Mixture Models) to identify studies
#' contributing to the heterogeneity-effect size patterns found in GOSH (graphic
#' display of study heterogeneity) plots.
#'
#' @usage gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
#'                  km.params = list(centers = 3,
#'                                   iter.max = 10, nstart = 1,
#'                                   algorithm = c("Hartigan-Wong",
#'                                   "Lloyd", "Forgy","MacQueen"),
#'                                   trace = FALSE),
#'                  db.params = list(eps = 0.15, MinPts = 5,
#'                                   method = c("hybrid", "raw", "dist")),
#'                  gmm.params = list(G = NULL, modelNames = NULL,
#'                                    prior = NULL, control = emControl(),
#'                                    initialization = list(hcPairs = NULL,
#'                                    subset = NULL,
#'                                    noise = NULL),
#'                                    Vinv = NULL,
#'                                    warn = mclust.options("warn"),
#'                                    x = NULL, verbose = FALSE),
#'                  seed = 123,
#'                  verbose = TRUE)
#'
#' @param data An object of class \code{gosh.rma} created through the
#'   \code{\link[metafor]{gosh}} function.
#' @param km Logical. Should the \emph{k}-Means algorithm be used to identify
#'   patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param db Logical. Should the DBSCAN algorithm be used to identify patterns
#'   in the GOSH plot matrix? \code{TRUE} by default.
#' @param gmm Logical. Should a bivariate Gaussian Mixture Model be used to
#'   identify patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param km.params A list containing the parameters for the \emph{k}-Means algorithm
#' as implemented in \code{\link[stats]{kmeans}}. Run \code{?kmeans} for further details.
#' @param db.params A list containing the parameters for the DBSCAN algorithm
#' as implemented in \code{\link[fpc]{dbscan}}. Run \code{?fpc::dbscan} for further details.
#' @param gmm.params A list containing the parameters for the Gaussian Mixture Models
#' as implemented in \code{\link[mclust]{mclustBIC}}. Run \code{?mclust::mclustBIC} for further details.
#' @param seed Seed used for reproducibility. Default seed is \code{123}.
#' @param verbose Logical. Should a progress bar be printed in the console
#'   during clustering? \code{TRUE} by default.
#'
#' @details
#'
#' \strong{GOSH Plots}
#'
#'   GOSH (\emph{graphic display of study
#'   heterogeneity}) plots were proposed by Olkin, Dahabreh and Trikalinos
#'   (2012) as a diagnostic plot to assess effect size heterogeneity. GOSH plots
#'   facilitate the detection of both (i) outliers and (ii) distinct homogeneous
#'   subgroups within the modeled data.
#'
#'   Data for the plots is generated by fitting a random-effects model with the
#'   same specifications as in the meta-analysis to all \eqn{2^k-1} possible
#'   non-empty subsets of the \eqn{k} studies included in the analysis,
#'   provided that \eqn{2^k-1 \le 10^6}. When the number of possible subsets
#'   exceeds \eqn{10^6}, the \code{\link[metafor]{gosh}} function randomly
#'   samples \eqn{10^6} subsets and uses those for model fitting.
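#'
#'   A minimal sketch of how such input data might be created with
#'   \code{metafor} (using its built-in \code{dat.bcg} example data; all
#'   specifications here are illustrative):
#'
#'   \preformatted{
#'   dat <- metafor::escalc(measure = "RR", ai = tpos, bi = tneg,
#'                          ci = cpos, di = cneg, data = metafor::dat.bcg)
#'   m   <- metafor::rma(yi, vi, data = dat)
#'   sav <- metafor::gosh(m)   # object of class 'gosh.rma'
#'   }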
#'
#'
#'   \strong{GOSH Plot Diagnostics}
#'
#'   Although GOSH plots make it possible to detect heterogeneity patterns and
#'   distinct subgroups within the data, determining which studies contribute
#'   to a specific subgroup or pattern is often difficult or computationally
#'   intensive. To facilitate the detection of studies responsible for specific
#'   patterns within the GOSH plots, this function randomly samples \eqn{10^4}
#'   data points from the GOSH plot data (to speed up computation). Of each
#'   data point, only the \eqn{z}-transformed \eqn{I^2} and effect size values
#'   are used (as the other heterogeneity metrics produced for the GOSH plot
#'   data by the \code{\link[metafor]{gosh}} function are linear combinations
#'   of \eqn{I^2}). Three clustering algorithms are applied to this data.
#'   \itemize{ \item The first algorithm is \emph{k}-Means clustering using the
#'   algorithm by Hartigan & Wong (1979) and \eqn{m_k = 3} cluster centers by
#'   default. The function uses the \code{\link[stats]{kmeans}} implementation
#'   to perform \emph{k}-Means clustering. \item As \emph{k}-Means does not
#'   perform well in the presence of distinct arbitrary subclusters and noise,
#'   the function also applies \strong{DBSCAN} (\emph{density reachability and
#'   connectivity clustering}; Schubert et al., 2017). The hyperparameters
#'   \eqn{\epsilon} and \eqn{MinPts} can be tuned for each analysis to maintain
#'   a reasonable amount of granularity while not producing too many
#'   subclusters. The function uses the \code{\link[fpc]{dbscan}} implementation
#'   to perform the DBSCAN clustering. \item Lastly, as a clustering approach
#'   using a probabilistic model, Gaussian Mixture Models (GMM; Fraley & Raftery, 2002)
#'   are integrated in the function using an internal call to the
#'   \code{\link[mclust]{mclustBIC}} implementation. Clustering hyperparameters
#'   can be tuned by providing a list of parameters of the \code{mclustBIC}
#'   function in the \code{mclust} package (see the sketch below).}
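#'
#'   For orientation, a bare-bones sketch of the three underlying calls on a
#'   generic two-column data frame \code{df} (parameter values are purely
#'   illustrative, not recommendations):
#'
#'   \preformatted{
#'   km  <- stats::kmeans(df, centers = 3)
#'   db  <- fpc::dbscan(df, eps = 0.15, MinPts = 5)
#'   gmm <- mclust::Mclust(df)
#'   }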
#'
#'   To assess which studies predominantly contribute to a detected cluster,
#'   the function calculates the cluster imbalance of a specific study as the
#'   difference between (i) the expected share of subsets containing that
#'   study if the cluster makeup were purely random (viz., representative of
#'   the full sample) and (ii) the actual share of subsets containing that
#'   study within the cluster. Cook's distance for each study is then
#'   calculated based on a linear intercept model to determine the leverage of
#'   a specific study for each cluster makeup. Studies with a leverage value
#'   three times above the mean in any of the generated clusters (across all
#'   clustering algorithms used) are returned as potentially influential cases,
#'   and the GOSH plot is redrawn with these studies highlighted.
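#'
#'   A conceptual sketch of this leverage heuristic (simplified; here,
#'   \code{incl} denotes the logical subset-by-study inclusion matrix returned
#'   by \code{gosh}, and \code{cluster} the cluster assignments):
#'
#'   \preformatted{
#'   # Share of each study's subsets falling into cluster c,
#'   # centered around the mean share across studies:
#'   share <- colSums(incl[cluster == c, ]) / colSums(incl)
#'   delta <- share - mean(share)
#'   D <- cooks.distance(lm(delta ~ 1))  # leverage via intercept-only model
#'   which(D > 3 * mean(D))              # potentially influential studies
#'   }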
#'
#' @references
#' Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis
#' and density estimation, \emph{Journal of the American Statistical Association},
#' 97/458, pp. 611-631.
#'
#' Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm.
#' \emph{Journal of the Royal Statistical Society. Series C (Applied Statistics), 28} (1). 100–108.
#'
#' Olkin, I., Dahabreh, I. J., Trikalinos, T. A. (2012). GOSH–a Graphical Display of Study Heterogeneity.
#' \emph{Research Synthesis Methods 3}, (3). 214–23.
#'
#' Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. (2017). DBSCAN Revisited, Revisited:
#' Why and How You Should (Still) Use DBSCAN. \emph{ACM Transactions on Database Systems (TODS) 42}, (3). ACM: 19.
#'
#'
#' @author Mathias Harrer & David Daniel Ebert
#'
#' @import grid gridExtra ggplot2
#' @importFrom fpc dbscan
#' @importFrom stats kmeans cooks.distance lm complete.cases
#' @importFrom mclust mclustBIC Mclust emControl mclust.options predict.Mclust
#' @importFrom grDevices rainbow
#'
#'
#' @export gosh.diagnostics
#'
#' @seealso \code{\link{InfluenceAnalysis}}
#'
#' @examples
#' # Example: load gosh data (created with metafor's 'gosh' function),
#' # then use function
#' \dontrun{
#' data("m.gosh")
#' res <- gosh.diagnostics(m.gosh)
#'
#' # Look at results
#' summary(res)
#'
#' # Plot detected clusters
#' plot(res, which = "cluster")
#'
#' # Plot outliers
#' plot(res, which = "outlier")
#' }



gosh.diagnostics = function(data,
                            km = TRUE,
                            db = TRUE,
                            gmm = TRUE,
                            km.params = list(centers = 3,
                                             iter.max = 10, nstart = 1,
                                             algorithm = c("Hartigan-Wong", "Lloyd", "Forgy","MacQueen"),
                                             trace = FALSE),
                            db.params = list(eps = 0.15, MinPts = 5,
                                             method = c("hybrid", "raw", "dist")),
                            gmm.params = list(G = NULL, modelNames = NULL,
                                              prior = NULL, control = emControl(),
                                              initialization = list(hcPairs = NULL,
                                                                    subset = NULL,
                                                                    noise = NULL),
                                              Vinv = NULL, warn = mclust.options("warn"),
                                              x = NULL, verbose = FALSE),
                            seed = 123,
                            verbose = TRUE) {
  
  # Keep a copy of the input; rename the logical flags so that the names
  # 'km', 'db' and 'gmm' can be reused for the fitted clustering objects
  sav = data
  do.km = km; rm(km)
  do.db = db; rm(db)
  do.gmm = gmm; rm(gmm)
  
  # set seed
  set.seed(seed)
  
  # Check input
  if (!inherits(data, "gosh.rma")){
    stop("Argument 'data' provided does not have class 'gosh.rma'.")
  }
  if (do.km == FALSE & do.db == FALSE & do.gmm == FALSE){
    stop("At least one of 'km', 'db', or 'gmm' must be set to TRUE.")
  }
  
  
  # Start loading bar
  if (verbose == TRUE){
    cat(" ", "\n", "Perform Clustering...", "\n")
    
    cat(" |")
  }
  
  
  
  # Create full dataset from gosh output
  dat.full = sav$res[complete.cases(sav$res),]
  dat.full = cbind(dat.full, sav$incl[complete.cases(sav$res),])
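  # (Columns 1-6 hold the subset-level model/heterogeneity statistics;
  # the study inclusion indicators follow from column 7 onward.)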
  
  
  
  # Create dataset for k-Means
  dat.km = data.frame(scale(dat.full$I2, center = TRUE, scale = TRUE),
                      scale(dat.full$estimate, center = TRUE,
                            scale = TRUE))
  colnames(dat.km) = c("I2", "estimate")
  
  # Create dataset for DBSCAN.
  # DBSCAN can become too computationally intensive for very large GOSH data,
  # so if there are more than 10,000 GOSH data points, 10,000 rows are
  # randomly sampled.
  
  if (nrow(dat.full) < 10000) {
    dat.db.full = dat.full
  } else {
    dat.db.full = dat.full[sample(1:nrow(dat.full), 10000), ]  # Sample 10,000 rows
  }
  
  dat.db = data.frame(scale(dat.db.full$I2, center = TRUE, scale = TRUE),
                      scale(dat.db.full$estimate, center = TRUE, scale = TRUE))
  colnames(dat.db) = c("I2", "estimate")
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  
  # K-Means
  km.params$x = dat.km
  km = do.call(stats::kmeans, km.params)
  
  # Only use 5000 rows for plotting to increase speed
  if (length(as.numeric(km$cluster)) > 5000){
    km.plot.mask = sample(1:length(as.numeric(km$cluster)), 5000)
    km.plot = km
    km.plot$cluster = km$cluster[km.plot.mask]
    dat.km.plot = dat.km[km.plot.mask,]
  } else {
    km.plot = km
    dat.km.plot = dat.km
  }
  
  levels.km = unique(km.plot$cluster)[order(unique(km.plot$cluster))]
  dat.km.plot$cluster = factor(km.plot$cluster, levels = levels.km)
  
  km.clusterplot = ggplot(data = dat.km.plot,
                          aes(x = estimate, y = I2, color = cluster)) +
    geom_point(cex = 0.5, alpha = 0.8) +
    ylab(expression(italic(I)^2~(z-score))) +
    xlab("Effect Size (z-score)") +
    theme_minimal() +
    ggtitle("K-means Algorithm") +
    labs(color = "Cluster")
  
  
  # DBSCAN
  db.params$data = dat.db
  db = do.call(fpc::dbscan, db.params)
  
  # Only use 5000 rows for plotting to increase speed
  if (length(as.numeric(db$cluster)) > 5000){
    db.plot.mask = sample(1:length(as.numeric(db$cluster)), 5000)
    db.plot = db
    db.plot$cluster = db$cluster[db.plot.mask]
    dat.db.plot = dat.db[db.plot.mask,]
  } else {
    db.plot = db
    dat.db.plot = dat.db
  }
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  levels.db = unique(db.plot$cluster)[order(unique(db.plot$cluster))]
  dat.db.plot$cluster = factor(db.plot$cluster, levels = levels.db)
  levels(dat.db.plot$cluster)[levels(dat.db.plot$cluster) == "0"] = "Outlier"
  color.db = rainbow(nlevels(dat.db.plot$cluster))
  color.db[1] = "#000000"
  
  db.clusterplot = ggplot(data = dat.db.plot,
                          aes(x = estimate, y = I2, color = cluster)) +
    geom_point(cex = 0.5, alpha = 0.7) +
    ylab(expression(italic(I)^2~(z-score))) +
    xlab("Effect Size (z-score)") +
    theme_minimal() +
    ggtitle("DBSCAN Algorithm (black dots are outliers)") +
    scale_color_manual(values = color.db) +
    labs(color = "Cluster")
  
  
  
  # GMM
  # Use same data as used for DBSCAN
  dat.gmm.full = dat.db.full
  dat.gmm = dat.db
  
  gmm.params$data = dat.gmm
  
  # Select the optimal mixture model via BIC, then fit the chosen model
  gmm.bic = do.call(mclust::mclustBIC, gmm.params)
  gmm = mclust::Mclust(data = dat.gmm, x = gmm.bic)
  
  
  # Only use 5000 rows for plotting to increase speed
  if (length(as.numeric(gmm$classification)) > 5000){
    gmm.plot.mask = sample(1:length(as.numeric(gmm$classification)), 5000)
    dat.gmm.plot = dat.gmm[gmm.plot.mask,]
    dat.gmm.plot$cluster = predict.Mclust(gmm)$classification[gmm.plot.mask]
  } else {
    dat.gmm.plot = dat.gmm
    dat.gmm.plot$cluster = predict.Mclust(gmm)$classification
  }
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  levels.gmm = unique(dat.gmm.plot$cluster)[order(unique(dat.gmm.plot$cluster))]
  dat.gmm.plot$cluster = factor(dat.gmm.plot$cluster, levels = levels.gmm)
  
  gmm.clusterplot = ggplot(data = dat.gmm.plot,
                           aes(x = estimate, y = I2, color = cluster)) +
    geom_point(cex = 0.5, alpha = 0.8) +
    ylab(expression(italic(I)^2~(z-score))) +
    xlab("Effect Size (z-score)") +
    theme_minimal() +
    ggtitle("Gaussian Mixture Model") +
    labs(color = "Cluster")
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  
  # Add to dfs
  dat.km.full = dat.full
  dat.km.full$cluster = km$cluster
  dat.db.full$cluster = db$cluster
  dat.gmm.full$cluster = gmm$classification
  
  
  ####################################################
  # Extract the Percentages###########################
  # K-Means############################################
  
  dat.km.full$cluster = as.factor(dat.km.full$cluster)
  n.levels.km = nlevels(dat.km.full$cluster)
  
  # Total number of GOSH subsets containing each study
  
  dat.km.full.total = dat.km.full[, -c(1:6, ncol(dat.km.full))] # inclusion indicators only
  n.cluster.tots = apply(dat.km.full.total, 2, sum)
  n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
  colnames(n.cluster.tots) = c("n.tots")
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  
  # Number of subsets containing each study, split by cluster
  n = sapply(split(dat.km.full.total, dat.km.full$cluster), function(x) apply(x, 2, sum))
  
  
  # Cluster imbalance: for each cluster, the share of a study's subsets that
  # fall into the cluster, centered around the mean share across studies
  deltas = as.data.frame(apply(n, 2,
                               function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))
  
  # Generate the plot
  Cluster = factor(rep(1:n.levels.km, each = nrow(deltas)))
  Study = rep(1:nrow(deltas), n.levels.km)
  Delta_Percentage = unlist(deltas)
  delta.df = data.frame(Cluster, Delta_Percentage, Study)
  
  km.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) +
    geom_line(aes(color = Cluster)) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Delta Percentage") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (K-Means algorithm)") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()
  
  
  ####################################################
  # Cook's Distance Plot###########################
  # K-Means############################################
  
  # Fit an intercept-only model to each cluster's imbalance values;
  # Cook's distance then quantifies each study's leverage on the cluster makeup
  m.cd.km = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
  m.cd.km$`0` = NULL # no '0' cluster exists for k-means; kept for symmetry with DBSCAN
  m.cd.km = lapply(m.cd.km, cooks.distance)
  m.cd.km.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.km)))
  m.cd.km.df$Cluster = as.factor(rep(1:(n.levels.km), each = nrow(deltas)))
  m.cd.km.df$Study = rep(1:nrow(deltas), times = (n.levels.km))
  outlier.cd.km = 3 * mean(m.cd.km.df$Cooks.Distance) # influence threshold: 3 x mean
  
  # With only two clusters, the Cook's distance profiles can coincide; shift
  # cluster 2 slightly so that overlapping lines remain visible in the plot
  if (n.levels.km <= 2){
    m.cd.km.df[m.cd.km.df$Cluster=="2", "Cooks.Distance"] = m.cd.km.df[m.cd.km.df$Cluster=="2", "Cooks.Distance"] + 0.01
  }
  
  km.cd.plot = ggplot(data = m.cd.km.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
    geom_line(aes(color=Cluster), alpha = 0.5) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Cook's Distance") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Cook's Distance)") +
    geom_hline(yintercept = outlier.cd.km, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  ####################################################
  # Extract the Percentages###########################
  # DBSCAN############################################
  
  dat.db.full$cluster = as.factor(dat.db.full$cluster)
  n.levels.db = nlevels(dat.db.full$cluster)
  
  # Total number of GOSH subsets containing each study
  
  dat.db.full.total = dat.db.full[, -c(1:6, ncol(dat.db.full))] # inclusion indicators only
  
  n.cluster.tots = apply(dat.db.full.total, 2, sum)
  n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
  colnames(n.cluster.tots) = c("n.tots")
  
  
  # Number of subsets containing each study, split by cluster
  
  n = sapply(split(dat.db.full.total, dat.db.full$cluster), function(x) apply(x, 2, sum))
  
  
  # Cluster imbalance (centered share per cluster, as above)
  
  deltas = as.data.frame(apply(n, 2, function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))
  
  # Generate the plot
  
  Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
  Study = rep(1:nrow(deltas), n.levels.db)
  Delta_Percentage = unlist(deltas)
  delta.df = data.frame(Cluster, Delta_Percentage, Study)
  delta.df = delta.df[delta.df$Cluster != 0,] # The zero class (DBSCAN outliers) is removed
  
  db.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) +
    geom_line(aes(color = Cluster)) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Delta Percentage") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Density-Based Clustering)") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()
  
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  ####################################################
  # Cook's Distance Plot###########################
  # DBSCAN############################################
  
  m.cd.db = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
  m.cd.db$`0` = NULL # drop the DBSCAN outlier class (cluster 0)
  m.cd.db = lapply(m.cd.db, cooks.distance)
  m.cd.db.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.db)))
  m.cd.db.df$Cluster = as.factor(rep(1:(n.levels.db - 1), each = nrow(deltas)))
  m.cd.db.df$Study = rep(1:nrow(deltas), times = (n.levels.db - 1))
  outlier.cd.db = 3 * mean(m.cd.db.df$Cooks.Distance)
  
  # With only two clusters, the Cook's distance profiles can coincide; shift
  # cluster 2 slightly so that overlapping lines remain visible in the plot
  if (n.levels.db <= 2){
    m.cd.db.df[m.cd.db.df$Cluster=="2", "Cooks.Distance"] = m.cd.db.df[m.cd.db.df$Cluster=="2", "Cooks.Distance"] + 0.01
  }
  
  db.cd.plot = ggplot(data = m.cd.db.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
    geom_line(aes(color=Cluster), alpha = 0.5) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Cook's Distance") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Cook's Distance)") +
    geom_hline(yintercept = outlier.cd.db, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()
  
  if (verbose == TRUE){
    cat("==========")
  }
  
  ####################################################
  # Extract the Percentages###########################
  # GMM   ############################################
  
  dat.gmm.full$cluster = as.factor(dat.gmm.full$cluster)
  n.levels.gmm = nlevels(dat.gmm.full$cluster)
  
  # Total number of GOSH subsets containing each study
  
  dat.gmm.full.total = dat.gmm.full[, -c(1:6, ncol(dat.gmm.full))] # inclusion indicators only
  n.cluster.tots = apply(dat.gmm.full.total, 2, sum)
  n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
  colnames(n.cluster.tots) = c("n.tots")
  
  
  # Number of subsets containing each study, split by cluster
  
  n = sapply(split(dat.gmm.full.total, dat.gmm.full$cluster), function(x) apply(x, 2, sum))
  
  
  # Cluster imbalance (centered share per cluster, as above)
  
  deltas = as.data.frame(apply(n, 2, function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))
  
  # Generate the plot
  
  Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
  Study = rep(1:nrow(deltas), n.levels.gmm)
  Delta_Percentage = unlist(deltas)
  delta.df = data.frame(Cluster, Delta_Percentage, Study)
  
  
  gmm.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) +
    geom_line(aes(color = Cluster)) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Delta Percentage") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (GMM)") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()
  
  
  ####################################################
  # Cook's Distance Plot###########################
  # GMM ############################################
  
  m.cd.gmm = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
  m.cd.gmm$`0` = NULL
  m.cd.gmm = lapply(m.cd.gmm, cooks.distance)
  m.cd.gmm.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.gmm)))
  m.cd.gmm.df$Cluster = as.factor(rep(1:(n.levels.gmm), each = nrow(deltas)))
  m.cd.gmm.df$Study = rep(1:nrow(deltas), times = (n.levels.gmm))
  outlier.cd.gmm = 3 * mean(m.cd.gmm.df$Cooks.Distance)
  
  # With only two clusters, the Cook's distance profiles can coincide; shift
  # cluster 2 slightly so that overlapping lines remain visible in the plot
  if (n.levels.gmm <= 2){
    m.cd.gmm.df[m.cd.gmm.df$Cluster=="2", "Cooks.Distance"] = m.cd.gmm.df[m.cd.gmm.df$Cluster=="2", "Cooks.Distance"] + 0.01
  }
  
  gmm.cd.plot = ggplot(data = m.cd.gmm.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
    geom_line(aes(color=Cluster), alpha = 0.5) +
    geom_point(aes(color = Cluster)) +
    scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
    scale_y_continuous(name = "Cook's Distance") +
    theme(axis.text = element_text(size = 5)) +
    ggtitle("Cluster imbalance (Cook's Distance)") +
    geom_hline(yintercept = outlier.cd.gmm, linetype = "dashed") +
    geom_hline(yintercept = 0, linetype = "dashed") +
    theme_minimal()
  
  
  #######################
  # Generate Output Plot#
  #######################
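  # Assemble one composite panel per algorithm: the cluster scatter plot on
  # the left, the imbalance and Cook's distance plots stacked on the right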
  
  returnlist = list()
  
  if (do.km == TRUE){
    returnlist$km.clusters = n.levels.km
    km.sideplots = gridExtra::arrangeGrob(km.plot, km.cd.plot, nrow=2)
    returnlist$km.plot = gridExtra::arrangeGrob(km.clusterplot, km.sideplots, ncol = 2)
  }
  
  if (do.db == TRUE){
    returnlist$db.clusters = n.levels.db-1
    db.sideplots = gridExtra::arrangeGrob(db.plot, db.cd.plot, nrow=2)
    returnlist$db.plot = gridExtra::arrangeGrob(db.clusterplot, db.sideplots, ncol = 2)
  }
  
  if (do.gmm == TRUE){
    returnlist$gmm.clusters = n.levels.gmm
    gmm.sideplots = gridExtra::arrangeGrob(gmm.plot, gmm.cd.plot, nrow=2)
    returnlist$gmm.plot = gridExtra::arrangeGrob(gmm.clusterplot, gmm.sideplots, ncol = 2)
  }
  
  ############################################
  # Plot GOSH for potential outlying studies #
  ############################################
  
  # Get outlying studies
  
  # Kmeans
  outlier.studies.km.df = m.cd.km.df[m.cd.km.df$Cooks.Distance>outlier.cd.km,]
  outlier.studies.km = unique(outlier.studies.km.df$Study)
  
  # DBSCAN
  outlier.studies.db.df = m.cd.db.df[m.cd.db.df$Cooks.Distance>outlier.cd.db,]
  outlier.studies.db = unique(outlier.studies.db.df$Study)
  
  # GMM
  outlier.studies.gmm.df = m.cd.gmm.df[m.cd.gmm.df$Cooks.Distance>outlier.cd.gmm,]
  outlier.studies.gmm = unique(outlier.studies.gmm.df$Study)
  
  # Pool all studies flagged by any algorithm; add 6 to skip the statistic
  # columns and index the matching inclusion-indicator columns
  outlier.studies.all = unique(c(outlier.studies.km, outlier.studies.db, outlier.studies.gmm))
  outlier.studies.all.mask = outlier.studies.all + 6
  
  
  # Build the plotting dataset from the (possibly subsampled) DBSCAN data,
  # keeping the estimate and I2 columns plus the flagged studies' indicators
  if (length(as.numeric(db$cluster)) > 5000){
    
    dat.all.outliers = dat.db.full[db.plot.mask, c(3,6, outlier.studies.all.mask)]
    
  } else {
    
    dat.all.outliers = dat.db.full[,c(3,6, outlier.studies.all.mask)]
    
  }
  
  if (length(outlier.studies.all) > 0){
    
    # Loop through all identified outliers
    for (i in 1:length(outlier.studies.all)){
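      # Scatter plot of the GOSH sample, colored by whether each subset
      # includes the flagged study; marginal densities are added on both axes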
      
      outlier.plot = ggplot(data = dat.all.outliers, aes(x = estimate,
                                                         y = I2,
                                                         color = dat.all.outliers[,i+2])) +
        geom_point(alpha=0.8) +
        scale_color_manual(values = c("lightgrey", "#00BFC4")) +
        theme_minimal() +
        theme(legend.position = "none",
              plot.title = element_text(hjust = 0.5, face = "bold")) +
        xlab("Effect Size") +
        ylab("I-squared")
      
      
      
      density.db.upper = ggplot(data = dat.all.outliers, aes(x = estimate,
                                                             fill = dat.all.outliers[,i+2])) +
        geom_density(alpha = 0.5) +
        theme_classic() +
        theme(axis.title.x = element_blank(),
              axis.text.x = element_blank(),
              axis.ticks = element_blank(),
              legend.position = "none",
              plot.background = element_blank(),
              axis.line.x = element_blank(),
              axis.title.y = element_text(color="white"),
              axis.text.y = element_text(color="white"),
              axis.line.y = element_line(color="white")
        ) +
        scale_fill_manual(values = c("lightgrey", "#00BFC4"))
      
      blankPlot = ggplot()+geom_blank(aes(1,1))+
        theme(plot.background = element_blank(),
              panel.grid.major = element_blank(),
              panel.grid.minor = element_blank(),
              panel.border = element_blank(),
              panel.background = element_blank(),
              axis.title.x = element_blank(),
              axis.title.y = element_blank(),
              axis.text.x = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks = element_blank(),
              axis.line.x = element_blank(),
              axis.line.y = element_blank()
        )
      
      density.db.right = ggplot(data = dat.all.outliers, aes(x = I2,
                                                             fill = dat.all.outliers[,i+2])) +
        geom_density(alpha = 0.5) +
        theme_classic() +
        theme(axis.title.y = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks = element_blank(),
              legend.position = "none",
              plot.background = element_blank(),
              axis.line.y = element_blank(),
              axis.title.x = element_text(color="white"),
              axis.text.x = element_text(color="white"),
              axis.line.x = element_line(color="white")
        ) +
        scale_fill_manual(values = c("lightgrey", "#00BFC4")) +
        coord_flip()
      
      returnlist[[paste0("plot.study", outlier.studies.all[i], ".removed")]] =
        gridExtra::arrangeGrob(density.db.upper,
                               blankPlot,
                               outlier.plot,
                               density.db.right,
                               nrow = 2,
                               ncol = 2,
                               heights = c(1,5),
                               widths = c(4,1),
                               top = paste("Study ", outlier.studies.all[i]))
      
    }
    
  }
  
  
  if (do.km == TRUE){
    returnlist$outlier.studies.km = outlier.studies.km
  }
  if (do.db == TRUE){
    returnlist$outlier.studies.db = outlier.studies.db
  }
  if (do.gmm == TRUE){
    returnlist$outlier.studies.gmm = outlier.studies.gmm
  }
  
  if (verbose == TRUE){
    cat("==========| DONE \n")
    cat("\n")
  }
  
  class(returnlist) = c("gosh.diagnostics", "list")
  
  return(returnlist)
  
}