library("papaja")

Systematic reviews and quantitative research syntheses now lie at the heart of debates about scientific theories and guidance about evidence-based policy.

It has been argued that tests of publication bias are irrelevant and unnecessary because there is overwhelming evidence that selective publication is at work across multiple scientific fields [@morey2013ConsistencyTestDoes]. Even granting this argument, there remains a need for powerful tests of selective publication that can signal whether a particular body of evidence has been affected.

The remainder of the article proceeds as follows. In the next section, I briefly review the test of excess significance (TES) and the Vevea-Hedges selection model before developing the connection between the two methods and proposing a refinement to TES. The following section reports a small Monte Carlo simulation evaluating the size and power of the proposed method. A brief discussion section highlights limitations and future directions.

# A selection of tests for selective publication {#tests}

Consider a meta-analysis of $k$ studies, where a conventional random effects model might be applied. Let $T_i$ denote the effect size estimate from study $i$, with standard error $\sigma_i$, each for $i = 1,...,k$. Let $\theta_i$ denote the true effect size parameter from study $i$. I shall assume that the included studies are large enough that it is reasonable to treat $T_i$ as following a normal distribution: $T_i \sim N(\theta_i, \sigma_i^2)$. Under a random effects model, and in the absence of selective publication, the true effects are assumed to follow a normal distribution with mean $\mu$ and standard deviation $\tau$.

Let $\alpha$ denote the conventional Type I error level used for one-sided hypothesis tests; in areas of research that typically use two-sided tests and the .05 level to determine significance, the one-sided level would be $\alpha = .025$. Let $\Phi(x)$ and $\phi(x)$ denote the standard normal cumulative distribution function and density function, respectively. Let $z_\alpha$ denote the standard normal $\alpha$-level critical value, that is, $z_\alpha = \Phi^{-1}(1 - \alpha)$. Let $O_i$ be an indicator for the statistical significance of study $i$, so that $O_i = 1$ when $T_i / \sigma_i > z_\alpha$ and $O_i = 0$ otherwise.
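
To fix notation, the following minimal R sketch (my own illustration, not code from an accompanying package) simulates effect size estimates from the random effects model and computes the significance indicators $O_i$. The standard errors and parameter values are arbitrary choices; the simulated objects are reused in later sketches.

```r
# Simulate k studies from the random effects model:
# T_i ~ N(theta_i, sigma_i^2) with theta_i ~ N(mu, tau^2)
set.seed(20190625)
k     <- 20
alpha <- .025                                     # one-sided significance level
z_a   <- qnorm(1 - alpha)                         # critical value z_alpha
sigma <- sqrt(4 / runif(k, min = 20, max = 200))  # hypothetical standard errors
theta <- rnorm(k, mean = 0.2, sd = 0.1)           # true effects: mu = 0.2, tau = 0.1
T_i   <- rnorm(k, mean = theta, sd = sigma)       # effect size estimates
O_i   <- as.integer(T_i / sigma > z_a)            # significance indicators
```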

## Test of Excess Significance

As a test of selective publication, @ioannidis2007ExploratoryTestExcess proposed to compare the number of statistically significant effects among the set of $k$ studies to the number of significant effects expected if there were no selection. The observed number of significant effects is $O = \sum_{i=1}^k O_i$. In the absence of selection, the expected number of significant effects is the sum of each study's power to detect a true effect of a given size. Let $P_i(\mu,\tau^2)$ denote the power of study $i$ under a random effects model, given by \begin{equation} P_i(\mu,\tau^2) = 1 - \Phi\left( \frac{\sigma_i z_\alpha - \mu}{\sqrt{\tau^2 + \sigma_i^2}}\right). (\#eq:power) \end{equation} The expected number of significant effects is then $E(\mu, \tau^2) = \sum_{i=1}^k P_i(\mu, \tau^2)$, a quantity that depends on the unknown average effect $\mu$ and between-study heterogeneity $\tau$. @ioannidis2007ExploratoryTestExcess suggested estimating expected power based on a fixed effect meta-analysis, taking $\hat{E} = E(\hat\mu_F, 0)$, where $\hat\mu_F$ is the usual fixed effect average. They justify this approach by arguing that heterogeneity is often negligible and that, if there is selective publication, the fixed effect average is less biased than the random effects average. Subsequent applications of TES appear to have typically followed this approach.
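
As a concrete illustration, Equation \@ref(eq:power) and $\hat{E}$ translate directly into code. The sketch below continues the simulated data from above; function and variable names are mine.

```r
# Power of study i (Equation \@ref(eq:power))
P_i <- function(mu, tau2, sigma, alpha = .025) {
  1 - pnorm((sigma * qnorm(1 - alpha) - mu) / sqrt(tau2 + sigma^2))
}

# Expected number of significant effects, E(mu, tau^2)
E_sig <- function(mu, tau2, sigma, alpha = .025) {
  sum(P_i(mu, tau2, sigma, alpha))
}

O     <- sum(O_i)                                # observed significant count
mu_F  <- sum(T_i / sigma^2) / sum(1 / sigma^2)   # fixed effect average
E_hat <- E_sig(mu_F, tau2 = 0, sigma)            # Ioannidis-Trikalinos version
```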

@ioannidis2007ExploratoryTestExcess proposed two approximate tests for drawing an inference about whether the set of included studies has been selected for statistical significance. First, they suggested using the test statistic \begin{equation} A = \frac{(O - \hat{E})^2}{\hat{E}(k - \hat{E}) / k}, (\#eq:chisq-stat) \end{equation} compared to a $\chi^2_1$ reference distribution. Alternatively, they suggested using a binomial test, comparing $O$ to a binomial reference distribution with size $k$ and probability $\hat{E} / k$. Applications of TES appear to have typically used the latter approach.
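
Both variants are straightforward to compute, continuing the sketch above. I use a one-sided alternative for the binomial version on the assumption that the test targets an *excess* of significant results.

```r
# Chi-square variant (Equation \@ref(eq:chisq-stat))
A_stat  <- (O - E_hat)^2 / (E_hat * (k - E_hat) / k)
p_chisq <- pchisq(A_stat, df = 1, lower.tail = FALSE)

# Binomial variant: compare O against Binomial(k, E_hat / k)
p_binom <- binom.test(O, n = k, p = E_hat / k, alternative = "greater")$p.value
```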

Both variants of TES involve approximations. One approximation arises from treating the expected number of significant effects as known with certainty, whereas in practice it must be estimated. Further, using a binomial reference distribution amounts to assuming that power is constant across studies, which will not be the case unless all included studies are equally precise. Calculating $\hat{E}$ under a fixed effect model entails the further assumption that between-study heterogeneity is negligible [@johnson2007CommentsExploratoryTest]. One might therefore expect the Type I error rates of the tests to be distorted when studies vary in precision or are truly heterogeneous. Indeed, @vanassen2015MetaanalysisUsingEffect reported simulation results in which TES had below-nominal Type I error, even when all studies were equally precise.

TES is an exploratory test for selective publication, intended to be used as a signal that a body of evidence may be unrepresentative [@ioannidis2013ClarificationsApplicationInterpretation]. It does not, however, invoke any particular model of the selection process. In contrast, other approaches are based on specific models of selective publication.

## Weight function selection models

Many meta-analytic models have been developed that make specific assumptions about the process of selective publication. One such class of models, often called "weight function" models, assumes that the probability of publication depends on a piece-wise constant function of the statistical significance of the effect size estimate. Building on earlier work by @iyengar1988SelectionModelsFile, @hedges1992ModelingPublicationSelection and @dear1992ApproachAssessingPublication proposed weight function models that allow for heterogeneity in true effect sizes. @vevea1995GeneralLinearModel further developed the approach to allow for moderators of effect size through a meta-regression model. On the basis of several extensive Monte Carlo simulations, a very simple, three-parameter version of the weight function model has recently been highlighted as a promising technique for dealing with selective publication [@mcshane2016AdjustingPublicationBias; @carter2018CorrectingBiasPsychology]. I limit consideration to this three-parameter model because it is most directly connected to TES. I discuss a more general form of the weight function model in the Appendix.

The weight function model involves two components: a sampling model and a selection model. Following @hedges1992ModelingPublicationSelection, the sampling model assumes that effect size estimates follow a basic random effects model, as outlined previously. However, not all effect sizes are published (or, more generally, not all effect sizes are available for inclusion in the meta-analysis). Rather, the probability that an effect size estimate is included is proportional to the weight function \begin{equation} w(T_i, \sigma_i) = \begin{cases} 1 & \text{if} \quad T_i > \sigma_i z_\alpha \\ \pi & \text{if} \quad T_i \leq \sigma_i z_\alpha, \end{cases} (\#eq:weight-function) \end{equation} where $\pi \geq 0$ is the probability that a statistically insignificant effect size is included, relative to the inclusion probability for an equally precise, statistically significant effect size. Note that if $\pi = 1$, then all studies appear with equal probability and there is no selective publication.
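
The weight function in Equation \@ref(eq:weight-function) is a one-liner in code. The following sketch (again my own illustration) also simulates selective publication by thinning the studies generated earlier, using $\pi = 0.3$ as a hypothetical selection parameter; the resulting "observed" studies are reused in later sketches.

```r
# Weight function (Equation \@ref(eq:weight-function))
w_fun <- function(T_i, sigma, pi, alpha = .025) {
  ifelse(T_i > sigma * qnorm(1 - alpha), 1, pi)
}

# Simulate selection: retain each study with probability w(T_i, sigma_i)
keep      <- runif(k) < w_fun(T_i, sigma, pi = 0.3)
T_obs     <- T_i[keep]
sigma_obs <- sigma[keep]
```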

The weight function model and tests associated with it are usually based on maximum likelihood estimation. Assuming that studies are mutually independent, the joint likelihood of the weight function model is \begin{equation} \mathcal{L}(\mu, \tau^2, \pi) = \prod_{i=1}^k \frac{w(T_i, \sigma_i) \phi\left(\frac{T_i - \mu}{\sqrt{\tau^2 + \sigma_i^2}}\right)}{\sqrt{\tau^2 + \sigma_i^2} \, A_i(\mu, \tau^2, \pi)}, (\#eq:Likelihood) \end{equation} where $A_i$ is a normalizing constant given by $$ A_i(\mu, \tau^2, \pi) = 1 - (1 - \pi)\Phi\left( \frac{\sigma_i z_\alpha - \mu}{\sqrt{\tau^2 + \sigma_i^2}}\right) = P_i(\mu, \tau^2) + \pi \left[1 - P_i(\mu, \tau^2)\right], $$ with $P_i(\mu, \tau^2)$ as given in Equation \@ref(eq:power). The log likelihood is thus (up to a constant): \begin{equation} l(\mu, \tau^2, \pi) = \sum_{i=1}^k \ln w(T_i, \sigma_i) - \frac{1}{2} \sum_{i=1}^k \left[\frac{(T_i - \mu)^2}{\tau^2 + \sigma_i^2} + \ln\left(\tau^2 + \sigma_i^2\right)\right] - \sum_{i=1}^k \ln A_i(\mu, \tau^2, \pi). (\#eq:log-likelihood) \end{equation} Let $\hat\mu$, $\hat\tau^2$, and $\hat\pi$ denote the values that maximize Equation \@ref(eq:log-likelihood)---that is, the maximum likelihood estimates of the model parameters.
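
A sketch of the log likelihood in Equation \@ref(eq:log-likelihood) and its maximization with `optim()` follows. The parameterization, bounds, and starting values are my own choices, not details taken from any particular implementation.

```r
# Log likelihood of the three-parameter weight function model
loglik_3psm <- function(par, T_i, sigma, alpha = .025) {
  mu <- par[1]; tau2 <- par[2]; pi <- par[3]
  z_a  <- qnorm(1 - alpha)
  eta2 <- tau2 + sigma^2
  P    <- 1 - pnorm((sigma * z_a - mu) / sqrt(eta2))  # power (Equation \@ref(eq:power))
  A    <- P + pi * (1 - P)                            # normalizing constants A_i
  lw   <- ifelse(T_i > sigma * z_a, 0, log(pi))       # log w(T_i, sigma_i)
  sum(lw) - 0.5 * sum((T_i - mu)^2 / eta2 + log(eta2)) - sum(log(A))
}

# Unrestricted maximum likelihood estimates
fit <- optim(
  par = c(mu = 0.1, tau2 = 0.01, pi = 0.5),
  fn = loglik_3psm, T_i = T_obs, sigma = sigma_obs,
  method = "L-BFGS-B", lower = c(-Inf, 0, 1e-8),
  control = list(fnscale = -1)   # fnscale = -1 makes optim() maximize
)
```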

@hedges1992ModelingPublicationSelection proposed a test of the null hypothesis that $\pi = 1$ (i.e., no selective publication) using a likelihood ratio criterion. Let $\hat\mu_R$ and $\hat\tau^2_R$ denote the values that maximize Equation \@ref(eq:log-likelihood) when $\pi$ is set equal to 1. The likelihood ratio test statistic is then \begin{equation} G^2 = 2 \left[l(\hat\mu, \hat\tau^2, \hat\pi) - l(\hat\mu_R, \hat\tau_R^2, 1)\right], (\#eq:LRT) \end{equation} which is compared to a $\chi^2_1$ reference distribution.
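
Continuing the sketch above, the likelihood ratio statistic in Equation \@ref(eq:LRT) compares the unrestricted fit to a fit with $\pi$ fixed at 1.

```r
# Restricted fit: maximize over (mu, tau2) with pi = 1
fit_R <- optim(
  par = c(mu = 0.1, tau2 = 0.01),
  fn = function(par, ...) loglik_3psm(c(par, 1), ...),
  T_i = T_obs, sigma = sigma_obs,
  method = "L-BFGS-B", lower = c(-Inf, 0),
  control = list(fnscale = -1)
)

G2    <- 2 * (fit$value - fit_R$value)           # Equation \@ref(eq:LRT)
p_LRT <- pchisq(G2, df = 1, lower.tail = FALSE)
```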

In practice, a difficulty with the three-parameter weight function model is that maximum likelihood estimation breaks down when all of the included studies are statistically significant at level $\alpha$, or when none of them are. If one of these conditions occurs, a researcher might choose to adjust the $\alpha$ level defining statistical significance so that at least one study is statistically significant and at least one is statistically insignificant. I implement this ad hoc modification when evaluating the operating characteristics of the likelihood ratio test in the Monte Carlo simulations.
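
The text above does not pin down a specific adjustment rule, so the following sketch shows one possible implementation (entirely my own construction, not the rule used in the simulations): move $\alpha$ to the midpoint of adjacent one-sided $p$-values so that exactly one study changes significance status.

```r
# One possible ad hoc adjustment (a hypothetical rule, assuming k >= 2)
p_one <- pnorm(T_obs / sigma_obs, lower.tail = FALSE)  # one-sided p-values
if (min(p_one) > alpha) {
  # no significant studies: make exactly one significant
  alpha_adj <- mean(sort(p_one)[1:2])
} else if (max(p_one) <= alpha) {
  # no insignificant studies: make exactly one insignificant
  alpha_adj <- mean(sort(p_one, decreasing = TRUE)[1:2])
} else {
  alpha_adj <- alpha
}
```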

## Refined excess significance tests

TES and the three-parameter weight function model are closely connected, in that TES is based on the null score of the weight function model. The score function is the derivative of the log likelihood with respect to its parameters. Note that the derivative of Equation \@ref(eq:log-likelihood) with respect to $\pi$ is \begin{equation} S^\pi(\mu, \tau^2, \pi) = \frac{\partial l}{\partial \pi} = \frac{1}{\pi} \sum_{i=1}^k (1 - O_i) - \sum_{i=1}^k \frac{1}{A_i} \Phi\left( \frac{\sigma_i z_\alpha - \mu}{\sqrt{\tau^2 + \sigma_i^2}}\right). (\#eq:score-pi) \end{equation} Under the null hypothesis that $\pi = 1$, $A_i = 1$ for $i = 1,...,k$ and the score simplifies to $$ \begin{aligned} S^\pi(\mu, \tau^2, 1) &= \sum_{i=1}^k (1 - O_i) - \sum_{i=1}^k \Phi\left( \frac{\sigma_i z_\alpha - \mu}{\sqrt{\tau^2 + \sigma_i^2}}\right) \\ &= (k - O) - \sum_{i=1}^k \left[1 - P_i(\mu, \tau^2)\right] \\ &= E(\mu,\tau^2) - O. \end{aligned} $$ Thus, the score of the three-parameter weight function model, evaluated under the null, is equivalent to the discrepancy between the expected and observed numbers of significant effects---the same statistic that is used to construct TES. TES is typically calculated under a fixed effect model, in which case the discrepancy is $O - \hat{E} = - S^\pi(\hat\mu_F, 0, 1)$. If expected power is instead calculated using the maximum likelihood estimates under a random effects model, then $O - E(\hat\mu_R, \hat\tau^2_R) = - S^\pi(\hat\mu_R, \hat\tau^2_R, 1)$.
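
The identity can be checked numerically by differentiating the log likelihood sketch above at $\pi = 1$. The check below uses the numDeriv package, an extra dependency of this illustration only; the evaluation point is arbitrary.

```r
library(numDeriv)

mu_0 <- 0.1; tau2_0 <- 0.05   # arbitrary evaluation point
S_pi <- grad(
  function(pi) loglik_3psm(c(mu_0, tau2_0, pi), T_obs, sigma_obs),
  x = 1
)
O_obs <- sum(T_obs / sigma_obs > qnorm(1 - alpha))

# The two quantities should agree up to numerical error
c(score = S_pi, E_minus_O = E_sig(mu_0, tau2_0, sigma_obs) - O_obs)
```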

This connection to the weight function model suggests that TES could be refined using score tests, a standard tool from mathematical statistics. Score tests [@rao1948LargeSampleTests] are asymptotically equivalent to likelihood ratio tests, but have the advantage that they do not require obtaining maximum likelihood estimates under the unrestricted model [@boos1992GeneralizedScoreTests]. This feature is advantageous in the present context because it allows one to circumvent potential convergence problems with the weight function model. Two forms of score tests are available---a parametric form and a robust form---which use different approaches to estimating the variance of the null score.

The parametric score test [@rao1948LargeSampleTests] uses the Fisher information matrix to estimate the variance of the null score. Denote the full parameter vector of the weight function model as $\bs\theta = (\mu, \tau^2, \pi)$. Let $\bm{S}$ be the full score vector, so that $\bm{S}(\bs\theta) = \partial l(\bs\theta) / \partial \bs\theta$, with $\bm{\tilde{S}}_R = \bm{S}(\hat\mu_R, \hat\tau^2_R, 1)$. Let $\mathcal{I}$ denote the Fisher information matrix, $$ \mathcal{I}(\bs\theta) = - \mathbb{E}\left(\frac{\partial^2 l(\bs\theta)}{\partial \bs\theta \partial \bs\theta'} \right), $$ where $\mathbb{E}$ denotes the expectation over $T_1,...,T_k$, and denote $\tilde{\mathcal{I}}_R = \mathcal{I}(\hat\mu_R, \hat\tau^2_R, 1)$. The parametric score test statistic is then given by $$ X^2_P = \bm{\tilde{S}}_R'\tilde{\mathcal{I}}_R^{-1} \bm{\tilde{S}}_R. $$ This quantity can be expressed more directly as \begin{equation} X^2_P = \frac{\left[E(\hat\mu_R, \hat\tau^2_R) - O\right]^2}{V_\pi - U_\mu - U_{\tau^2}}, (\#eq:parametric-score) \end{equation} where $$ \begin{aligned} V_\pi &= \sum_{i=1}^k P_i(\hat\mu_R, \hat\tau^2_R) \left[1 - P_i(\hat\mu_R, \hat\tau^2_R)\right], \\ U_\mu &= \left[\sum_{i=1}^k \frac{1}{\hat\tau_R^2 + \sigma_i^2}\right]^{-1}\left[\sum_{i=1}^k \frac{\phi(c_i)}{\sqrt{\hat\tau^2_R + \sigma_i^2}}\right]^2, \\ U_{\tau^2} &= \frac{1}{2}\left[\sum_{i=1}^k \frac{1}{\left(\hat\tau_R^2 + \sigma_i^2\right)^2}\right]^{-1}\left[\sum_{i=1}^k \frac{c_i\phi(c_i)}{\hat\tau^2_R + \sigma_i^2}\right]^2, \end{aligned} $$ and $c_i = \left(\sigma_i z_\alpha - \hat\mu_R\right) / \sqrt{\hat\tau^2_R + \sigma_i^2}$. The null hypothesis that $\pi = 1$ is rejected if $X^2_P$ exceeds a $\chi^2_1$ critical value. Alternatively, a one-sided, level-$\alpha_S$ test can be carried out by taking \begin{equation} Z_P = \frac{E(\hat\mu_R, \hat\tau^2_R) - O}{\sqrt{V_\pi - U_\mu - U_{\tau^2}}} \end{equation} and rejecting if $Z_P < \Phi^{-1}(\alpha_S)$.
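
Equation \@ref(eq:parametric-score) can be computed directly from the restricted fit obtained earlier. A sketch follows; the function and argument names are mine.

```r
# Parametric score test (Equation \@ref(eq:parametric-score)),
# evaluated at the restricted MLEs (mu_R, tau2_R)
param_score_test <- function(mu, tau2, T_i, sigma, alpha = .025) {
  z_a  <- qnorm(1 - alpha)
  eta2 <- tau2 + sigma^2
  c_i  <- (sigma * z_a - mu) / sqrt(eta2)
  P    <- 1 - pnorm(c_i)                        # power of each study
  O    <- sum(T_i / sigma > z_a)                # observed significant count
  V_pi <- sum(P * (1 - P))
  U_mu <- sum(dnorm(c_i) / sqrt(eta2))^2 / sum(1 / eta2)
  U_t2 <- 0.5 * sum(c_i * dnorm(c_i) / eta2)^2 / sum(1 / eta2^2)
  Z_P  <- (sum(P) - O) / sqrt(V_pi - U_mu - U_t2)
  c(X2_P = Z_P^2, p_onesided = pnorm(Z_P))
}

param_score_test(fit_R$par[1], fit_R$par[2], T_obs, sigma_obs)
```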

This parametric score test requires using the maximum likelihood estimates (under the restriction of the null hypothesis). Alternative forms of the score test have been described by Kent (1982), White (1982), and Engle (1984), among others [see @boos1992GeneralizedScoreTests for a review]. These generalized, or robust, forms provide more flexibility in how $\mu$ and $\tau^2$ may be estimated under the assumption that there is no selective publication.

Consider estimators $\acute\mu$ and $\acute\tau^2$ that are defined as the solutions of estimating equations $$ \begin{aligned} S^\mu\left(\mu, \tau^2\right) &= \sum_{i=1}^k S^\mu_i \left(\mu, \tau^2\right) = 0, \\ S^{\tau^2}\left(\mu, \tau^2\right) &= \sum_{i=1}^k S^{\tau^2}_i \left(\mu, \tau^2\right) = 0, \end{aligned} $$ where the estimating equation for $\mu$ has the form $$ S_i^\mu(\mu, \tau^2) = w_i (T_i - \mu) $$ for some set of weights $w_1,...,w_k$, which may depend on $\tau^2$. The usual fixed effect estimator uses $w_i = \sigma_i^{-2}$; the random effects estimator uses $w_i = \left(\tau^2 + \sigma_i^2\right)^{-1}$. For the between-study heterogeneity, many of the available estimators of $\tau^2$ can be expressed as solutions to estimating equations. For example, the maximum likelihood estimator uses $$ S_i^{\tau^2}(\mu,\tau^2) = \frac{1}{\tau^2 + \sigma_i^2} \left[\frac{(T_i - \mu)^2}{\tau^2 + \sigma_i^2} - 1\right] $$ and the restricted maximum likelihood estimator uses $$ S_i^{\tau^2}(\mu,\tau^2) = \frac{1}{\tau^2 + \sigma_i^2} \left[\frac{(T_i - \mu)^2}{\tau^2 + \sigma_i^2} + \frac{1}{\left(\tau^2 + \sigma_i^2\right)\sum_{j=1}^k \left(\tau^2 + \sigma_j^2\right)^{-1}}- 1\right]. $$
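
As an illustration, the maximum likelihood versions of these estimating equations can be solved with a simple fixed-point iteration. This is a sketch of my own; production code would instead rely on an established package such as metafor.

```r
# Solve S_mu = 0 and S_tau2 = 0 (ML versions) by fixed-point iteration
solve_ee_ml <- function(T_i, sigma, tol = 1e-10, max_iter = 500) {
  mu   <- mean(T_i)
  tau2 <- max(0, var(T_i) - mean(sigma^2))       # crude starting value
  for (iter in seq_len(max_iter)) {
    w        <- 1 / (tau2 + sigma^2)
    mu_new   <- sum(w * T_i) / sum(w)            # root of S_mu given w
    tau2_new <- max(0, sum(w^2 * ((T_i - mu_new)^2 - sigma^2)) / sum(w^2))
    if (abs(mu_new - mu) + abs(tau2_new - tau2) < tol) break
    mu <- mu_new; tau2 <- tau2_new
  }
  c(mu = mu_new, tau2 = tau2_new)
}

est <- solve_ee_ml(T_obs, sigma_obs)
```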

In conjunction with estimating equations for $\mu$ and $\tau^2$, a robust score test can be defined based on the null score function from the three-parameter weight function model, $$ S^\pi\left(\mu, \tau^2\right) = E(\mu, \tau^2) - O = \sum_{i=1}^k S^\pi_i\left(\mu, \tau^2\right), $$ where $S^\pi_i\left(\mu, \tau^2\right) = P_i(\mu, \tau^2) - O_i$. Let $\mathcal{I}^\mu_\mu(\mu, \tau^2) = -\mathbb{E}\left(\partial S^\mu / \partial \mu\right)$, $\mathcal{I}^\pi_\mu(\mu, \tau^2) = -\mathbb{E}\left(\partial S^\pi / \partial \mu\right)$, etc., where all expectations are taken under the null hypothesis that $\pi = 1$. Let $F_i$ denote the efficient score function [@tsiatis2006GeometryInfluenceFunctions], given by $$ F_i = S^\pi_i - \left(\frac{\mathcal{I}^\pi_\mu}{\mathcal{I}^\mu_\mu}\right) S^\mu_i - \left(\frac{\mathcal{I}^\pi_{\tau^2}}{\mathcal{I}^{\tau^2}_{\tau^2}}\right) S^{\tau^2}_i, $$ with all quantities evaluated at the parameter estimates $\acute\mu$ and $\acute\tau^2$. A robust score test statistic is then \begin{equation} X^2_R = \frac{\left[E(\acute\mu, \acute\tau^2) - O\right]^2}{\sum_{i=1}^k F_i^2}, (\#eq:robust-score) \end{equation} compared to a $\chi^2_1$ reference distribution. Alternatively, a one-sided test can be obtained by taking \begin{equation} Z_R = \frac{E(\acute\mu, \acute\tau^2) - O}{\sqrt{\sum_{i=1}^k F_i^2}} \end{equation} and rejecting if $Z_R < \Phi^{-1}(\alpha_S)$.
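
A sketch of the robust test follows, using weights `v` for the $\mu$ estimating equation and the maximum likelihood estimating equation for $\tau^2$. The closed-form information ratios are my own derivations under the null model, so this should be read as an illustration under those assumptions rather than a definitive implementation.

```r
# Robust score test (Equation \@ref(eq:robust-score)); v defines the
# estimating equation for mu (default: random effects weights)
robust_score_test <- function(mu, tau2, T_i, sigma, alpha = .025,
                              v = 1 / (tau2 + sigma^2)) {
  z_a  <- qnorm(1 - alpha)
  eta2 <- tau2 + sigma^2
  w    <- 1 / eta2
  c_i  <- (sigma * z_a - mu) / sqrt(eta2)
  P    <- 1 - pnorm(c_i)

  S_pi <- P - as.integer(T_i / sigma > z_a)   # S^pi_i = P_i - O_i
  S_mu <- v * (T_i - mu)                      # S^mu_i
  S_t2 <- w * ((T_i - mu)^2 * w - 1)          # S^tau2_i (ML version)

  # Ratios of null information quantities (my own derivations)
  r_mu <- -sum(dnorm(c_i) / sqrt(eta2)) / sum(v)        # I^pi_mu / I^mu_mu
  r_t2 <- -sum(w * c_i * dnorm(c_i)) / (2 * sum(w^2))   # I^pi_tau2 / I^tau2_tau2

  F_i <- S_pi - r_mu * S_mu - r_t2 * S_t2     # efficient scores
  Z_R <- sum(S_pi) / sqrt(sum(F_i^2))         # sum(S_pi) = E - O
  c(X2_R = Z_R^2, p_onesided = pnorm(Z_R))
}

robust_score_test(est["mu"], est["tau2"], T_obs, sigma_obs)
```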

The robust form of the score test is attractive because it can be applied with any of an array of estimators for $\mu$ and $\tau^2$. For instance, an analyst might prefer to use the fixed effect estimator $\hat\mu_F$ for the average effect size because it is less biased than the random effects estimator under selective publication, along with the restricted maximum likelihood estimator for $\tau^2$ because it is less biased than the maximum likelihood estimator [cf. @henmi2010ConfidenceIntervalsRandom; @stanley2015NeitherFixedRandom]. The parametric score test and the likelihood ratio test are known to have asymptotically equivalent power against local alternatives [@rao2005ScoreTestHistorical]. Beyond that, however, it is difficult to determine which test is optimal. To investigate whether either the parametric or robust score test offers advantages over the existing TES or the likelihood ratio test from the three-parameter weight function model, and whether $\hat\mu_F$ offers any advantage over random effects maximum likelihood estimation, I turn to Monte Carlo simulations.

# Size and power comparisons {#simulations}

In this section, I report simulations examining the size (Type I error rate) and power of these tests under the random effects model and a basic form of selective publication, consistent with the three-parameter weight function model described in the previous section.

# Discussion

## Limitations

\newpage

# References

```{r create_r-references}
r_refs(file = "r-references.bib")
```

\begingroup \setlength{\parindent}{-0.5in} \setlength{\leftskip}{0.5in}

\endgroup


