R/gosh.diagnostics.R

Defines functions gosh.diagnostics

Documented in gosh.diagnostics

#' Identify studies contributing to heterogeneity patterns found in GOSH plots
#'
#' This function uses three unsupervised learning learning algorithms
#' (\emph{k}-means, DBSCAN and Gaussian Mixture Models) to identify studies
#' contributing to the heterogeneity-effect size patterns found in GOSH (graphic
#' display of study heterogeneity) plots.
#'
#' @usage gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
#'                  km.params = list(centers = 3,
#'                                   iter.max = 10, nstart = 1,
#'                                   algorithm = c("Hartigan-Wong",
#'                                   "Lloyd", "Forgy","MacQueen"),
#'                                   trace = FALSE),
#'                  db.params = list(eps = 0.15, MinPts = 5,
#'                                   method = c("hybrid", "raw", "dist")),
#'                  gmm.params = list(G = NULL, modelNames = NULL,
#'                                    prior = NULL, control = emControl(),
#'                                    initialization = list(hcPairs = NULL,
#'                                    subset = NULL,
#'                                    noise = NULL),
#'                                    Vinv = NULL,
#'                                    warn = mclust.options("warn"),
#'                                    x = NULL, verbose = FALSE),
#'                  seed = 123,
#'                  verbose = TRUE)
#'
#' @param data An object of class \code{gosh.rma} created through the
#'   \code{\link[metafor]{gosh}} function.
#' @param km Logical. Should the \emph{k}-Means algorithm be used to identify
#'   patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param db Logical. Should the DBSCAN algorithm be used to identify patterns
#'   in the GOSH plot matrix? \code{TRUE} by default.
#' @param gmm Logical. Should a bivariate Gaussian Mixture Model be used to
#'   identify patterns in the GOSH plot matrix? \code{TRUE} by default.
#' @param km.params A list containing the parameters for the \emph{k}-Means algorithm
#' as implemented in \code{kmeans}. Run \code{?kmeans} for further details.
#' @param db.params A list containing the parameters for the DBSCAN algorithm
#' as implemented in \code{\link[fpc]{dbscan}}. Run \code{?fpc::dbscan} for further details.
#' @param gmm.params A list containing the parameters for the Gaussian Mixture Models
#' as implemented in \code{\link[mclust]{mclustBIC}}. Run \code{?mclust::mclustBIC} for further details.
#' @param seed Seed used for reproducibility. Default seed is \code{123}.
#' @param verbose Logical. Should a progress bar be printed in the console during clustering?
#'
#' @details
#'
#' \strong{GOSH Plots}
#'
#'   GOSH (\emph{graphic display of study
#'   heterogeneity}) plots were proposed by Olkin, Dahabreh and Trikalinos
#'   (2012) as a diagnostic plot to assess effect size heterogeneity. GOSH plots
#'   facilitate the detection of both (i) outliers and (ii) distinct homogeneous
#'   subgroups within the modeled data.
#'
#'   Data for the plots is generated by fitting a random-effects-model with the
#'   same specifications as in the meta-analysis to all \eqn{\mathcal{P}(k),
#'   \emptyset \notin \mathcal{P}(k), \forall 2^{k-1} \leq 10^6} possible
#'   subsets of studies in an analysis. For \eqn{|\mathcal{P}(k)| > 10^6}, 1
#'   million subsets are randomly sampled and used for model fitting when using
#'   the \code{\link[metafor]{gosh}} function.
#'
#'
#'   \strong{GOSH Plot Diagnostics}
#'
#'   Although GOSH plots allow to detect heterogeneity patterns and distinct
#'   subgroups within the data, interpretation which studies contribute to a
#'   certain subgroup or pattern is often difficult or computationally
#'   intensive. To facilitate the detection of studies responsible for specific
#'   patterns within the GOSH plots, this function randomly samples \eqn{10^4}
#'   data points from the GOSH Plot data (to speed up computation). Of the data
#'   points, only the \eqn{z}-transformed \eqn{I^2} and effect size value is
#'   used (as other heterogeneity metrics produced for the GOSH plot data using
#'   the \code{\link[metafor]{gosh}} function are linear combinations of
#'   \eqn{I^2}). To this data, three clustering algorithms are applied.
#'   \itemize{ \item The first algorithm is \emph{k}-Means clustering using the
#'   algorithm by Hartigan & Wong (1979) and \eqn{m_k = 3} cluster centers by
#'   default. The functions uses the \code{\link[stats]{kmeans}} implementation
#'   to perform \emph{k}-Means clustering. \item As \emph{k}-Means does not
#'   perform well in the presence of distinct arbitrary subclusters and noise,
#'   the function also applies \strong{DBSCAN} (\emph{density reachability and
#'   connectivity clustering}; Schubert et al., 2017). The hyperparameters
#'   \eqn{\epsilon} and \eqn{MinPts} can be tuned for each analysis to maintain
#'   a reasonable amount of granularity while not producing too many
#'   subclusters. The function uses the \code{\link[fpc]{dbscan}} implementation
#'   to perform the DBSCAN clustering. \item Lastly, as a clustering approach
#'   using a probabilistic model, Gaussian Mixture Models (GMM; Fraley & Raftery, 2002)
#'   are integrated in the function using an internal call to the
#'   \code{\link[mclust]{mclustBIC}} implementation. Clustering hyperparameters can
#'   be tuned by providing a list of parameters of the \code{mclustBIC}
#'   function in the \code{mclust} package.}
#'
#'   To assess which studies predominantly contribute to a detected cluster,
#'   the function calculates the cluster imbalance of a specific study using the
#'   difference between (i) the expected share of subsets containing a specific
#'   study if the cluster makeup was purely random (viz., representative for the
#'   full sample), and the (ii) actual share of subsets containing a specific
#'   study within a cluster. Cook's distance for each study is then calculated
#'   based on a linear intercept model to determine the leverage of a specific
#'   study for each cluster makeup. Studies with a leverage value three times
#'   above the mean in any of the generated clusters (for all used clustering
#'   algorithms) are returned as potentially influential cases and the GOSH plot
#'   is redrawn highlighting these specific studies.
#'
#' @references
#' Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis
#' and density estimation, \emph{Journal of the American Statistical Association},
#' 97/458, pp. 611-631.
#'
#' Hartigan, J. A., & Wong, M. A. (1979). Algorithm as 136: A K-Means Clustering Algorithm.
#' \emph{Journal of the Royal Statistical Society. Series C (Applied Statistics), 28} (1). 100–108.
#'
#' Olkin, I., Dahabreh, I. J., Trikalinos, T. A. (2012). GOSH–a Graphical Display of Study Heterogeneity.
#' \emph{Research Synthesis Methods 3}, (3). 214–23.
#'
#' Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. (2017). DBSCAN Revisited, Revisited:
#' Why and How You Should (Still) Use DBSCAN. \emph{ACM Transactions on Database Systems (TODS) 42}, (3). ACM: 19.
#'
#'
#' @author Mathias Harrer & David Daniel Ebert
#'
#' @import grid gridExtra ggplot2
#' @importFrom fpc dbscan
#' @importFrom stats kmeans cooks.distance lm complete.cases
#' @importFrom mclust mclustBIC Mclust emControl mclust.options predict.Mclust
#' @importFrom grDevices rainbow
#'
#'
#' @export gosh.diagnostics
#'
#' @seealso \code{\link{InfluenceAnalysis}}
#'
#' @examples
#' # Example: load gosh data (created with metafor's 'gosh' function),
#' # then use function
#' \dontrun{
#' data("m.gosh")
#' res <- gosh.diagnostics(m.gosh)
#'
#' # Look at results
#' summary(res)
#'
#' # Plot detected clusters
#' plot(res, which = "cluster")
#'
#' # Plot outliers
#' plot(res, which = "outlier")
#' }



gosh.diagnostics = function(data,
                            km = TRUE,
                            db = TRUE,
                            gmm = TRUE,
                            km.params = list(centers = 3,
                                             iter.max = 10, nstart = 1,
                                             algorithm = c("Hartigan-Wong", "Lloyd", "Forgy","MacQueen"),
                                             trace = FALSE),
                            db.params = list(eps = 0.15, MinPts = 5,
                                             method = c("hybrid", "raw", "dist")),
                            gmm.params = list(G = NULL, modelNames = NULL,
                                              prior = NULL, control = emControl(),
                                              initialization = list(hcPairs = NULL,
                                                                    subset = NULL,
                                                                    noise = NULL),
                                              Vinv = NULL, warn = mclust.options("warn"),
                                              x = NULL, verbose = FALSE),
                            seed = 123,
                            verbose = TRUE) {

    # Redefine Variables
    data = data
    sav = data
    do.km = km; rm(km)
    do.db = db; rm(db)
    do.gmm = gmm; rm(gmm)

    # set seed
    set.seed(seed)

    # Check input
    if (class(data) != "gosh.rma"){
      stop("Argument 'data' provided does not have class 'gosh.rma'.")
    }
    if (do.km == FALSE & do.db == FALSE & do.gmm == FALSE){
      stop("At least one of 'km', 'db', or 'gmm' must be set to TRUE.")
    }


    # Start loading bar
    if (verbose == TRUE){
      cat(" ", "\n", "Perform Clustering...", "\n")

      cat(" |")
    }



    # Create full dataset from gosh output
    dat.full = sav$res[complete.cases(sav$res),]
    dat.full = cbind(dat.full, sav$incl[complete.cases(sav$res),])



    # Create dataset for k-Means
    dat.km = data.frame(scale(dat.full$I2, center = TRUE, scale = TRUE),
                        scale(dat.full$estimate, center = TRUE,
                              scale = TRUE))
    colnames(dat.km) = c("I2", "estimate")

    # Create dataset for DBSCAN
    # DBSCAN can become too computationally intensive
    # for very large GOSH data.  For N_gosh > 10.000, N = 10.000 data points are
    # therefore randomly sampled.

    if (nrow(dat.full) < 10000) {
        dat.db.full = dat.full
    } else {
        dat.db.full = dat.full[sample(1:nrow(dat.full), 10000), ]  #Sample 10.000 rows
    }

    dat.db = data.frame(scale(dat.db.full$I2, center = TRUE, scale = TRUE),
                        scale(dat.db.full$estimate, center = TRUE, scale = TRUE))
    colnames(dat.db) = c("I2", "estimate")

    if (verbose == TRUE){
      cat("==========")
    }


    # K-Means
    km.params$x = dat.km
    do.call(stats::kmeans, km.params)
    km = do.call(stats::kmeans, km.params)

    # Only use 5000 rows for plotting to increase speed
    if (length(as.numeric(km$cluster)) > 5000){
      km.plot.mask = sample(1:length(as.numeric(km$cluster)), 5000)
      km.plot = km
      km.plot$cluster = km$cluster[km.plot.mask]
      dat.km.plot = dat.km[km.plot.mask,]
    } else {
      km.plot = km
      dat.km.plot = dat.km
    }

    levels.km = unique(km.plot$cluster)[order(unique(km.plot$cluster))]
    dat.km.plot$cluster = factor(km.plot$cluster, levels = levels.km)

    km.clusterplot = ggplot(data = dat.km.plot,
                            aes(x = estimate, y = I2, color = cluster)) +
      geom_point(cex = 0.5, alpha = 0.8) +
      ylab(expression(italic(I)^2~(z-score))) +
      xlab("Effect Size (z-score)") +
      theme_minimal() +
      ggtitle("K-means Algorithm") +
      labs(color = "Cluster")


    # DBSCAN
    db.params$data = dat.db
    db = do.call(fpc::dbscan, db.params)

    # Only use 5000 rows for plotting to increase speed
    if (length(as.numeric(db$cluster)) > 5000){
      db.plot.mask = sample(1:length(as.numeric(db$cluster)), 5000)
      db.plot = db
      db.plot$cluster = db$cluster[db.plot.mask]
      dat.db.plot = dat.db[db.plot.mask,]
    } else {
      db.plot = db
      dat.db.plot = dat.db
    }

    if (verbose == TRUE){
      cat("==========")
    }

    levels.db = unique(db.plot$cluster)[order(unique(db.plot$cluster))]
    dat.db.plot$cluster = factor(db.plot$cluster, levels = levels.db)
    levels(dat.db.plot$cluster)[levels(dat.db.plot$cluster) == "0"] = "Outlier"
    color.db = rainbow(nlevels(dat.db.plot$cluster))
    color.db[1] = "#000000"

    db.clusterplot = ggplot(data = dat.db.plot,
                            aes(x = estimate, y = I2, color = cluster)) +
      geom_point(cex = 0.5, alpha = 0.7) +
      ylab(expression(italic(I)^2~(z-score))) +
      xlab("Effect Size (z-score)") +
      theme_minimal() +
      ggtitle("DBSCAN Algorithm (black dots are outliers)") +
      scale_color_manual(values = color.db) +
      labs(color = "Cluster")



    # GMM
    # Use same data as used for DBSCAN
    dat.gmm.full = dat.db.full
    dat.gmm = dat.db

    gmm.params$data = dat.gmm

    # Search for optimal solution
    gmm.bic = do.call(mclust::mclustBIC, gmm.params)
    gmm = mclust::Mclust(data = dat.gmm, x = gmm.bic)


    # Only use 5000 rows for plotting to increase speed
    if (length(as.numeric(gmm$classification)) > 5000){
      gmm.plot.mask = sample(1:length(as.numeric(gmm$classification)), 5000)
      dat.gmm.plot = dat.gmm[gmm.plot.mask,]
      dat.gmm.plot$cluster = predict.Mclust(gmm)$classification[gmm.plot.mask]
    } else {
      dat.gmm.plot = dat.gmm
      dat.gmm.plot$cluster = predict.Mclust(gmm)$classification
    }

    if (verbose == TRUE){
      cat("==========")
    }

    levels.gmm = unique(dat.gmm.plot$cluster)[order(unique(dat.gmm.plot$cluster))]
    dat.gmm.plot$cluster = factor(dat.gmm.plot$cluster, levels = levels.gmm)

    gmm.clusterplot = ggplot(data = dat.gmm.plot,
                            aes(x = estimate, y = I2, color = cluster)) +
      geom_point(cex = 0.5, alpha = 0.8) +
      ylab(expression(italic(I)^2~(z-score))) +
      xlab("Effect Size (z-score)") +
      theme_minimal() +
      ggtitle("Gaussian Mixture Model") +
      labs(color = "Cluster")

    if (verbose == TRUE){
      cat("==========")
    }


    # Add to dfs
    dat.km.full = dat.full
    dat.km.full$cluster = km$cluster
    dat.db.full$cluster = db$cluster
    dat.gmm.full$cluster = gmm$classification


    ####################################################
    # Extract the Percentages###########################
    # K-Means############################################

    dat.km.full$cluster = as.factor(dat.km.full$cluster)
    n.levels.km = nlevels(dat.km.full$cluster)

    # Loop for the total n of studies

    dat.km.full.total = dat.km.full[, -c(1:6, ncol(dat.km.full))]
    n.cluster.tots = apply(dat.km.full.total, 2, sum)
    n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
    colnames(n.cluster.tots) = c("n.tots")

    if (verbose == TRUE){
      cat("==========")
    }


    # Loop for the cluster-wise n of studies
    n = sapply(split(dat.km.full.total, dat.km.full$cluster), function(x) apply(x, 2, sum))


    # Calculate Percentages
    deltas = as.data.frame(apply(n, 2,
                                 function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))

    # Generate the plot
    Cluster = factor(rep(1:n.levels.km, each = nrow(deltas)))
    Study = rep(1:nrow(deltas), n.levels.km)
    Delta_Percentage = unlist(deltas)
    delta.df = data.frame(Cluster, Delta_Percentage, Study)

    km.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) + geom_line(aes(color = Cluster)) +
        geom_point(aes(color = Cluster)) + scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas),
        1)) + scale_y_continuous(name = "Delta Percentage") + theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (K-Means algorithm)") + geom_hline(yintercept = 0, linetype = "dashed") +
      theme_minimal()


    ####################################################
    # Cook's Distance Plot###########################
    # K-Means############################################

    m.cd.km = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
    m.cd.km$`0` = NULL
    m.cd.km = lapply(m.cd.km, cooks.distance)
    m.cd.km.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.km)))
    m.cd.km.df$Cluster = as.factor(rep(1:(n.levels.km), each = nrow(deltas)))
    m.cd.km.df$Study = rep(1:nrow(deltas), times = (n.levels.km))
    outlier.cd.km = 3 * mean(m.cd.km.df$Cooks.Distance)

    if (n.levels.km <= 2){

      m.cd.km.df[m.cd.km.df$Cluster=="2", "Cooks.Distance"] = m.cd.km.df[m.cd.km.df$Cluster=="2", "Cooks.Distance"] + 0.01

      km.cd.plot = ggplot(data = m.cd.km.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
        geom_line(aes(color=Cluster), alpha = 0.5) +
        geom_point(aes(color = Cluster)) +
        scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
        scale_y_continuous(name = "Cook's Distance") +
        theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Cook's Distance)") +
        geom_hline(yintercept = outlier.cd.km, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        theme_minimal()


    } else {

      km.cd.plot = ggplot(data = m.cd.km.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
        geom_line(aes(color=Cluster), alpha = 0.5) +
        geom_point(aes(color = Cluster)) +
        scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
        scale_y_continuous(name = "Cook's Distance") +
        theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Cook's Distance)") +
        geom_hline(yintercept = outlier.cd.km, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        theme_minimal()

    }

    if (verbose == TRUE){
      cat("==========")
    }

    ####################################################
    # Extract the Percentages###########################
    # DBSCAN############################################

    dat.db.full$cluster = as.factor(dat.db.full$cluster)
    n.levels.db = nlevels(dat.db.full$cluster)

    # Loop for the total n of studies

    dat.db.full.total = dat.db.full[, -c(1:6, ncol(dat.db.full))]

    n.cluster.tots = apply(dat.db.full.total, 2, sum)
    n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
    colnames(n.cluster.tots) = c("n.tots")


    # Loop for the cluster-wise n of studies

    n = sapply(split(dat.db.full.total, dat.db.full$cluster), function(x) apply(x, 2, sum))


    # Calculate Percentages

    deltas = as.data.frame(apply(n, 2, function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))

    # Generate the plot

    Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
    Study = rep(1:nrow(deltas), n.levels.db)
    Delta_Percentage = unlist(deltas)
    delta.df = data.frame(Cluster, Delta_Percentage, Study)
    delta.df = delta.df[delta.df$Cluster != 0,] #Zero Class (Outliers are removed)

    db.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) + geom_line(aes(color = Cluster)) +
        geom_point(aes(color = Cluster)) + scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas),
        1)) + scale_y_continuous(name = "Delta Percentage") + theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Density-Based Clustering)") + geom_hline(yintercept = 0, linetype = "dashed") +
      theme_minimal()


    if (verbose == TRUE){
      cat("==========")
    }

    ####################################################
    # Cook's Distance Plot###########################
    # DBSCAN############################################

    m.cd.db = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
    m.cd.db$`0` = NULL
    m.cd.db = lapply(m.cd.db, cooks.distance)
    m.cd.db.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.db)))
    m.cd.db.df$Cluster = as.factor(rep(1:(n.levels.db - 1), each = nrow(deltas)))
    m.cd.db.df$Study = rep(1:nrow(deltas), times = (n.levels.db - 1))
    outlier.cd.db = 3 * mean(m.cd.db.df$Cooks.Distance)

    if (n.levels.db <= 2){

      m.cd.db.df[m.cd.db.df$Cluster=="2", "Cooks.Distance"] = m.cd.db.df[m.cd.db.df$Cluster=="2", "Cooks.Distance"] + 0.01

      db.cd.plot = ggplot(data = m.cd.db.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
        geom_line(aes(color=Cluster), alpha = 0.5) +
        geom_point(aes(color = Cluster)) +
        scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
        scale_y_continuous(name = "Cook's Distance") +
        theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Cook's Distance)") +
        geom_hline(yintercept = outlier.cd.db, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        theme_minimal()


    } else {

      db.cd.plot = ggplot(data = m.cd.db.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
        geom_line(aes(color=Cluster), alpha = 0.5) +
        geom_point(aes(color = Cluster)) +
        scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
        scale_y_continuous(name = "Cook's Distance") +
        theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Cook's Distance)") +
        geom_hline(yintercept = outlier.cd.db, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        theme_minimal()

    }

    if (verbose == TRUE){
      cat("==========")
    }

    ####################################################
    # Extract the Percentages###########################
    # GMM   ############################################

    dat.gmm.full$cluster = as.factor(dat.gmm.full$cluster)
    n.levels.gmm = nlevels(dat.gmm.full$cluster)

    # Loop for the total n of studies

    dat.gmm.full.total = dat.gmm.full[, -c(1:6, ncol(dat.gmm.full))]
    n.cluster.tots = apply(dat.gmm.full.total, 2, sum)
    n.cluster.tots = data.frame(unlist(as.matrix(n.cluster.tots)))
    colnames(n.cluster.tots) = c("n.tots")


    # Loop for the cluster-wise n of studies

    n = sapply(split(dat.gmm.full.total, dat.gmm.full$cluster), function(x) apply(x, 2, sum))


    # Calculate Percentages

    deltas = as.data.frame(apply(n, 2, function(x) (x/n.cluster.tots$n.tots) - mean(x/n.cluster.tots$n.tots)))

    # Generate the plot

    Cluster = factor(rep(colnames(deltas), each = nrow(deltas)))
    Study = rep(1:nrow(deltas), n.levels.gmm)
    Delta_Percentage = unlist(deltas)
    delta.df = data.frame(Cluster, Delta_Percentage, Study)


    gmm.plot = ggplot(data = delta.df, aes(x = Study, y = Delta_Percentage, group = Cluster)) + geom_line(aes(color = Cluster)) +
      geom_point(aes(color = Cluster)) + scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas),
                                                                                         1)) + scale_y_continuous(name = "Delta Percentage") + theme(axis.text = element_text(size = 5)) +
      ggtitle("Cluster imbalance (GMM)") + geom_hline(yintercept = 0, linetype = "dashed") +
      theme_minimal()


    ####################################################
    # Cook's Distance Plot###########################
    # GMM ############################################

    m.cd.gmm = by(delta.df, as.factor(delta.df$Cluster), function(x) lm(Delta_Percentage ~ 1, data = x))
    m.cd.gmm$`0` = NULL
    m.cd.gmm = lapply(m.cd.gmm, cooks.distance)
    m.cd.gmm.df = data.frame(Cooks.Distance = matrix(unlist(m.cd.gmm)))
    m.cd.gmm.df$Cluster = as.factor(rep(1:(n.levels.gmm), each = nrow(deltas)))
    m.cd.gmm.df$Study = rep(1:nrow(deltas), times = (n.levels.gmm))
    outlier.cd.gmm = 3 * mean(m.cd.gmm.df$Cooks.Distance)

    if (n.levels.gmm <= 2){

      m.cd.gmm.df[m.cd.gmm.df$Cluster=="2", "Cooks.Distance"] = m.cd.gmm.df[m.cd.gmm.df$Cluster=="2", "Cooks.Distance"] + 0.01

      gmm.cd.plot = ggplot(data = m.cd.gmm.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
        geom_line(aes(color=Cluster), alpha = 0.5) +
        geom_point(aes(color = Cluster)) +
        scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
        scale_y_continuous(name = "Cook's Distance") +
        theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Cook's Distance)") +
        geom_hline(yintercept = outlier.cd.gmm, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        theme_minimal()


    } else {

      gmm.cd.plot = ggplot(data = m.cd.gmm.df, aes(x = Study, y = Cooks.Distance, group = Cluster)) +
        geom_line(aes(color=Cluster), alpha = 0.5) +
        geom_point(aes(color = Cluster)) +
        scale_x_continuous(name = "Study", breaks = seq(0, nrow(deltas), 1)) +
        scale_y_continuous(name = "Cook's Distance") +
        theme(axis.text = element_text(size = 5)) +
        ggtitle("Cluster imbalance (Cook's Distance)") +
        geom_hline(yintercept = outlier.cd.gmm, linetype = "dashed") +
        geom_hline(yintercept = 0, linetype = "dashed") +
        theme_minimal()

    }


    #######################
    # Generate Output Plot#
    #######################

    returnlist = list()

    if (do.km == TRUE){
      returnlist$km.clusters = n.levels.km
      km.sideplots = gridExtra::arrangeGrob(km.plot, km.cd.plot, nrow=2)
      returnlist$km.plot = gridExtra::arrangeGrob(km.clusterplot, km.sideplots, ncol = 2)
    }

    if (do.db == TRUE){
      returnlist$db.clusters = n.levels.db-1
      db.sideplots = gridExtra::arrangeGrob(db.plot, db.cd.plot, nrow=2)
      returnlist$db.plot = gridExtra::arrangeGrob(db.clusterplot, db.sideplots, ncol = 2)
    }

    if (do.gmm == TRUE){
      returnlist$gmm.clusters = n.levels.gmm
      gmm.sideplots = gridExtra::arrangeGrob(gmm.plot, gmm.cd.plot, nrow=2)
      returnlist$gmm.plot = gridExtra::arrangeGrob(gmm.clusterplot, gmm.sideplots, ncol = 2)
    }

    ############################################
    # Plot GOSH for potential outlying studies #
    ############################################

    # Get outlying studies

    # Kmeans
    outlier.studies.km.df = m.cd.km.df[m.cd.km.df$Cooks.Distance>outlier.cd.km,]
    outlier.studies.km = unique(outlier.studies.km.df$Study)

    # DBSCAN
    outlier.studies.db.df = m.cd.db.df[m.cd.db.df$Cooks.Distance>outlier.cd.db,]
    outlier.studies.db = unique(outlier.studies.db.df$Study)

    # GMM
    outlier.studies.gmm.df = m.cd.gmm.df[m.cd.gmm.df$Cooks.Distance>outlier.cd.gmm,]
    outlier.studies.gmm = unique(outlier.studies.gmm.df$Study)

    # Use all identified outliers
    outlier.studies.all = unique(c(outlier.studies.km, outlier.studies.db, outlier.studies.gmm))
    outlier.studies.all.mask = outlier.studies.all + 6 # Add 6 to use as mask


    # Get plotting dataset and only choose outlier studies as mask, use db data
    if (length(as.numeric(db$cluster)) > 5000){

      dat.all.outliers = dat.db.full[db.plot.mask, c(3,6, outlier.studies.all.mask)]

    } else {

      dat.all.outliers = dat.db.full[,c(3,6, outlier.studies.all.mask)]

    }

    if (length(outlier.studies.all) > 0){

      # Loop through all identified outliers
      for (i in 1:length(outlier.studies.all)){

        outlier.plot = ggplot(data = dat.all.outliers, aes(x = estimate,
                                                           y = I2,
                                                           color = dat.all.outliers[,i+2])) +
          geom_point(alpha=0.8) +
          scale_color_manual(values = c("lightgrey", "#00BFC4")) +
          theme_minimal() +
          theme(legend.position = "none",
                plot.title = element_text(hjust = 0.5, face = "bold")) +
          xlab("Effect Size") +
          ylab("I-squared")



        density.db.upper = ggplot(data = dat.all.outliers, aes(x = estimate,
                                                               fill = dat.all.outliers[,i+2])) +
          geom_density(alpha = 0.5) +
          theme_classic() +
          theme(axis.title.x = element_blank(),
                axis.text.x = element_blank(),
                axis.ticks = element_blank(),
                legend.position = "none",
                plot.background = element_blank(),
                axis.line.x = element_blank(),
                axis.title.y = element_text(color="white"),
                axis.text.y = element_text(color="white"),
                axis.line.y = element_line(color="white")
          ) +
          scale_fill_manual(values = c("lightgrey", "#00BFC4"))

        blankPlot = ggplot()+geom_blank(aes(1,1))+
          theme(plot.background = element_blank(),
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.border = element_blank(),
                panel.background = element_blank(),
                axis.title.x = element_blank(),
                axis.title.y = element_blank(),
                axis.text.x = element_blank(),
                axis.text.y = element_blank(),
                axis.ticks = element_blank(),
                axis.line.x = element_blank(),
                axis.line.y = element_blank()
          )

        density.db.right = ggplot(data = dat.all.outliers, aes(x = I2,
                                                               fill = dat.all.outliers[,i+2])) +
          geom_density(alpha = 0.5) +
          theme_classic() +
          theme(axis.title.y = element_blank(),
                axis.text.y = element_blank(),
                axis.ticks = element_blank(),
                legend.position = "none",
                plot.background = element_blank(),
                axis.line.y = element_blank(),
                axis.title.x = element_text(color="white"),
                axis.text.x = element_text(color="white"),
                axis.line.x = element_line(color="white")
          ) +
          scale_fill_manual(values = c("lightgrey", "#00BFC4")) +
          coord_flip()

        returnlist[[paste0("plot.study", outlier.studies.all[i], ".removed")]] =
        gridExtra::arrangeGrob(density.db.upper,
                        blankPlot,
                        outlier.plot,
                        density.db.right,
                        nrow = 2,
                        ncol = 2,
                        heights = c(1,5),
                        widths = c(4,1),
                        top = paste("Study ", outlier.studies.all[i]))

      }

    }


      if (do.km == TRUE){
        returnlist$outlier.studies.km = outlier.studies.km
      }
      if (do.db == TRUE){
        returnlist$outlier.studies.db = outlier.studies.db
      }
      if (do.gmm == TRUE){
        returnlist$outlier.studies.gmm = outlier.studies.gmm
      }

    if (verbose == TRUE){
      cat("==========| DONE \n")
      cat("\n")
    }

    class(returnlist) = c("gosh.diagnostics", "list")

    return(returnlist)

}
MathiasHarrer/dmetar documentation built on April 4, 2024, 6:57 p.m.