gosh.diagnostics | R Documentation |
This function uses three unsupervised learning algorithms (k-means, DBSCAN and Gaussian Mixture Models) to identify studies contributing to the heterogeneity-effect size patterns found in GOSH (graphic display of study heterogeneity) plots.
gosh.diagnostics(data, km = TRUE, db = TRUE, gmm = TRUE,
km.params = list(centers = 3,
iter.max = 10, nstart = 1,
algorithm = c("Hartigan-Wong",
"Lloyd", "Forgy","MacQueen"),
trace = FALSE),
db.params = list(eps = 0.15, MinPts = 5,
method = c("hybrid", "raw", "dist")),
gmm.params = list(G = NULL, modelNames = NULL,
prior = NULL, control = emControl(),
initialization = list(hcPairs = NULL,
subset = NULL,
noise = NULL),
Vinv = NULL,
warn = mclust.options("warn"),
x = NULL, verbose = FALSE),
seed = 123,
verbose = TRUE)
data |
An object of class |
km |
Logical. Should the k-Means algorithm be used to identify
patterns in the GOSH plot matrix? |
db |
Logical. Should the DBSCAN algorithm be used to identify patterns
in the GOSH plot matrix? |
gmm |
Logical. Should a bivariate Gaussian Mixture Model be used to
identify patterns in the GOSH plot matrix? |
km.params |
A list containing the parameters for the k-Means algorithm
as implemented in |
db.params |
A list containing the parameters for the DBSCAN algorithm
as implemented in |
gmm.params |
A list containing the parameters for the Gaussian Mixture Models
as implemented in |
seed |
Seed used for reproducibility. Default seed is 123. |
verbose |
Logical. Should a progress bar be printed in the console during clustering? |
GOSH Plots
GOSH (graphic display of study heterogeneity) plots were proposed by Olkin, Dahabreh and Trikalinos (2012) as a diagnostic plot to assess effect size heterogeneity. GOSH plots facilitate the detection of both (i) outliers and (ii) distinct homogeneous subgroups within the modeled data.
Data for the plots is generated by fitting a random-effects model with the same specifications as in the meta-analysis to all non-empty subsets of the k studies in an analysis, i.e. to all elements of the power set \mathcal{P}(k) with \emptyset \notin \mathcal{P}(k), provided that |\mathcal{P}(k)| = 2^k - 1 \leq 10^6. For |\mathcal{P}(k)| > 10^6, 1 million subsets are randomly sampled and used for model fitting when using the gosh function.
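The subset count above can be checked with a few lines of base R. This is only an illustrative sketch: `n_subsets`, `incl` and `max_k` are hypothetical names introduced here, not part of the package.

```r
# Number of non-empty subsets of k studies: 2^k - 1
n_subsets <- function(k) 2^k - 1

k <- 5

# Enumerate all non-empty subsets for a small k as a logical inclusion
# matrix (rows = subsets, columns = studies)
incl <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), k)))
incl <- incl[rowSums(incl) > 0, , drop = FALSE]
nrow(incl) == n_subsets(k)

# Largest k for which all subsets still fit under the 10^6 cap;
# beyond this, subsets would have to be sampled at random
max_k <- max(which(n_subsets(1:30) <= 1e6))
max_k
```

Under this count, analyses with more than `max_k` studies fall into the random-sampling case described above.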
GOSH Plot Diagnostics
Although GOSH plots make it possible to detect heterogeneity patterns and distinct subgroups within the data, determining which studies contribute to a certain subgroup or pattern is often difficult or computationally intensive. To facilitate the detection of studies responsible for specific patterns within the GOSH plots, this function randomly samples 10^4 data points from the GOSH plot data (to speed up computation). Of these data points, only the z-transformed I^2 and effect size values are used (as other heterogeneity metrics produced for the GOSH plot data using the gosh function are linear combinations of I^2). Three clustering algorithms are then applied to this data.
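The sampling and z-transformation step can be sketched in base R as follows. The data frame `gosh_dat` and its column names are hypothetical stand-ins for the GOSH plot data, and the sketch assumes that "z-transformed" means standardizing each column to mean 0 and standard deviation 1.

```r
set.seed(123)

# Hypothetical stand-in for GOSH plot data: one row per fitted subset
gosh_dat <- data.frame(
  estimate = rnorm(50000, mean = 0.4, sd = 0.1),
  I2       = runif(50000, min = 0, max = 90)
)

# Randomly sample 10^4 rows (or all rows, if fewer are available)
n_sample <- min(1e4, nrow(gosh_dat))
idx <- sample(nrow(gosh_dat), n_sample)
sub <- gosh_dat[idx, c("I2", "estimate")]

# z-transform both columns (assumed: center to mean 0, scale to SD 1)
z <- scale(sub, center = TRUE, scale = TRUE)
head(z)
```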
The first algorithm is k-Means clustering using the algorithm by Hartigan & Wong (1979) and m_k = 3 cluster centers by default. The function uses the kmeans implementation to perform k-Means clustering.
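A minimal sketch of this step, using kmeans from the stats package with the default values listed under km.params in the usage section. The data in `x` are simulated for illustration and do not come from a GOSH analysis.

```r
set.seed(123)

# Simulated two-dimensional data with three separated groups
x <- rbind(
  cbind(rnorm(100, -3, 0.3), rnorm(100, -3, 0.3)),
  cbind(rnorm(100,  0, 0.3), rnorm(100,  3, 0.3)),
  cbind(rnorm(100,  3, 0.3), rnorm(100, -3, 0.3))
)

# k-Means with three centers and the Hartigan-Wong algorithm
km <- kmeans(x, centers = 3, iter.max = 10, nstart = 1,
             algorithm = "Hartigan-Wong")

# Cluster sizes and cluster centers
table(km$cluster)
km$centers
```

With nstart = 1 the result depends on a single random initialization, so increasing nstart may give more stable clusters.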
As k-Means does not perform well in the presence of distinct arbitrary subclusters and noise, the function also applies DBSCAN (density reachability and connectivity clustering; Schubert et al., 2017). The hyperparameters \epsilon and MinPts can be tuned for each analysis to maintain a reasonable amount of granularity while not producing too many subclusters. The function uses the dbscan implementation to perform the DBSCAN clustering.
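The effect of \epsilon and MinPts can be explored on simulated data. The sketch below assumes the dbscan function from the fpc package, whose arguments (eps, MinPts, method) match the db.params defaults shown in the usage section. Which package the function calls internally is an assumption here, and the code only runs if fpc is installed.

```r
if (requireNamespace("fpc", quietly = TRUE)) {
  set.seed(123)

  # Simulated data: two dense groups
  x <- rbind(
    cbind(rnorm(200, -2, 0.2), rnorm(200, -2, 0.2)),
    cbind(rnorm(200,  2, 0.2), rnorm(200,  2, 0.2))
  )

  # DBSCAN with the default values from db.params
  db <- fpc::dbscan(x, eps = 0.15, MinPts = 5, method = "hybrid")

  # Cluster label 0 marks points treated as noise
  print(table(db$cluster))
}
```

Smaller eps or larger MinPts values generally yield more noise points and more fragmented clusters, so both are best tuned against the plotted result.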
Lastly, as a clustering approach using a probabilistic model, Gaussian Mixture Models (GMM; Fraley & Raftery, 2002) are integrated in the function using an internal call to the mclustBIC implementation. Clustering hyperparameters can be tuned by providing a list of parameters of the mclustBIC function in the mclust package.
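As a rough illustration of this kind of model-based clustering, the sketch below calls mclustBIC directly on simulated data and then fits the best model with Mclust. The data in `dat` and the choice G = 1:4 are assumptions for the example. This is not the function's internal code, and it only runs if mclust is installed.

```r
if (requireNamespace("mclust", quietly = TRUE)) {
  # mclust is attached because some of its functions look up helpers by name
  suppressPackageStartupMessages(library(mclust))
  set.seed(123)

  # Simulated bivariate data with two Gaussian components
  dat <- rbind(
    cbind(rnorm(200, -2, 0.5), rnorm(200, -2, 0.5)),
    cbind(rnorm(200,  2, 0.5), rnorm(200,  2, 0.5))
  )

  # BIC for mixture models with 1 to 4 components
  bic <- mclustBIC(dat, G = 1:4, verbose = FALSE)

  # Fit the model with the best BIC and inspect the classification
  fit <- Mclust(dat, x = bic, verbose = FALSE)
  print(table(fit$classification))
}
```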
To assess which studies predominantly contribute to a detected cluster, the function calculates the cluster imbalance of a specific study using the difference between (i) the expected share of subsets containing a specific study if the cluster makeup was purely random (viz., representative of the full sample), and (ii) the actual share of subsets containing a specific study within a cluster. Cook's distance for each study is then calculated based on a linear intercept model to determine the leverage of a specific study for each cluster makeup. Studies with a leverage value three times above the mean in any of the generated clusters (for all clustering algorithms used) are returned as potentially influential cases, and the GOSH plot is redrawn highlighting these specific studies.
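One possible reading of this procedure is sketched below in base R. It is not the package's internal code. The inclusion matrix `incl`, the `cluster` labels and the helper `flag_cluster` are hypothetical, and "three times above the mean" is interpreted as a Cook's distance greater than three times the mean Cook's distance within a cluster.

```r
set.seed(123)
n_sub <- 2000
k <- 8

# Hypothetical inclusion matrix: TRUE if a study is part of a subset
incl <- matrix(runif(n_sub * k) < 0.5, nrow = n_sub,
               dimnames = list(NULL, paste0("study", 1:k)))

# Hypothetical cluster labels: subsets containing study3 mostly land in
# cluster 2, all other subsets mostly in cluster 1
cluster <- ifelse(incl[, "study3"],
                  sample(1:2, n_sub, replace = TRUE, prob = c(0.1, 0.9)),
                  sample(1:2, n_sub, replace = TRUE, prob = c(0.9, 0.1)))

# (i) Expected share: proportion of all subsets containing each study
expected <- colMeans(incl)

flag_cluster <- function(cl) {
  # (ii) Actual share of subsets containing each study within cluster cl
  actual <- colMeans(incl[cluster == cl, , drop = FALSE])
  delta  <- actual - expected            # cluster imbalance per study
  # Cook's distance from an intercept-only linear model of the imbalance
  cd <- as.vector(cooks.distance(lm(delta ~ 1)))
  colnames(incl)[cd > 3 * mean(cd)]
}

# Studies flagged in any cluster
flagged <- unique(unlist(lapply(sort(unique(cluster)), flag_cluster)))
flagged
```

In this simulation only the study that was deliberately tied to one cluster shows a large imbalance, so it is the one expected to be flagged.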
Mathias Harrer & David Daniel Ebert
Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97/458, pp. 611-631.
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28 (1). 100–108.
Olkin, I., Dahabreh, I. J., Trikalinos, T. A. (2012). GOSH–a Graphical Display of Study Heterogeneity. Research Synthesis Methods 3, (3). 214–23.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P. & Xu, X. (2017). DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Transactions on Database Systems (TODS) 42, (3). ACM: 19.
InfluenceAnalysis
# Example: load gosh data (created with metafor's 'gosh' function),
# then use function
## Not run:
data("m.gosh")
res <- gosh.diagnostics(m.gosh)
# Look at results
summary(res)
# Plot detected clusters
plot(res, which = "cluster")
# Plot outliers
plot(res, which = "outlier")
## End(Not run)