Intended Audience:

No technical background presumed, but if you know what Cohen's D is and why it's used, you can skip this.

Motivating Example:

Let's say you're interested in the auto-immune disease scleroderma, and you want to know which therapies work against it. You begin by searching academic databases for high-quality studies on the subject, i.e. double-blinded, randomized controlled trials. You find 15 of these, and luckily for you, each presents its main result as a positive effect on quality-adjusted life years (QALY) -- so it's straightforward to make an apples-to-apples comparison between them.

Scanning the abstracts, you find that 13 of the 15 papers claim that their therapies increase QALY by a statistically significant amount; the remaining two find significant results for women under 50. This sounds like good news! But because you're wary of publication bias, which is a particular problem for meta-analyses, you resolve to take a close look at each one rather than taking its results at face value.

To help with this, you put your notes about the 15 papers into a spreadsheet, where each row represents one study and each column is a piece of information about that study: the average age of the patients, the reported effect on QALY, how many people were in the treatment group, etc.
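In R, that spreadsheet might look something like the data frame below (the column names and values are just placeholders, not real studies):

```r
# A sketch of the study-level spreadsheet: one row per study, one column per
# piece of information about that study (all values here are made up).
studies <- data.frame(
  study_id     = c("Study A", "Study B", "Study C"),
  mean_age     = c(52, 48, 55),
  n_treated    = c(120, 50, 140),
  qaly_effect  = c(0.30, 0.85, 0.25),
  double_blind = c(TRUE, TRUE, FALSE)
)
studies
```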

Once you've read all the studies, you see that one therapy improved QALY by almost twice as much as the next closest -- but that study had just 50 patients in the treatment group, while most other studies had more than 100 treated patients. You also read a troubling footnote in a different paper alluding to difficulties in maintaining a double-blinded procedure (the pills it looked at are very distinctive, so the doctors might know on sight the difference between treatment and placebo).

When you write up your results, you argue that one therapy is the most promising, with the caveat that we'd be more confident about its effects if they were replicated with a larger sample. You also show readers a scatterplot, with a fitted line, of the reported QALY effect against sample size, showing that larger effects tend to be found in smaller studies (this is one way people check for publication bias). Finally, you end with an 'extensions and limitations' section that lays out your concerns about double-blinding and calls on future researchers to address them forthrightly.
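If you want to produce that kind of plot yourself, here's a minimal sketch in base R (the effect sizes and sample sizes are made up; in practice each point would be one study from your spreadsheet):

```r
# A sketch of plotting effect size against sample size, with a simple trend line
# (illustrative values only).
sample_size <- c(50, 80, 110, 150, 200, 260, 340)
qaly_effect <- c(0.90, 0.55, 0.50, 0.35, 0.30, 0.25, 0.20)

plot(sample_size, qaly_effect,
     xlab = "Treatment group size", ylab = "Reported QALY effect")
abline(lm(qaly_effect ~ sample_size))  # linear trend line
```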

And that's the gist of writing a meta-analysis.

Simplifying assumptions

In this example, each study:

- looked at the same, well-defined disease (scleroderma);
- measured the same outcome (QALY); and
- used the same high-quality design (a double-blinded, randomized controlled trial).

If you're looking at a literature where all three conditions hold, you're in luck -- making apples-to-apples comparisons between studies is going to be relatively easy. But none of these has been true for the three meta-analyses I've worked on.[^1] If you want to find what interventions reduce racism, for instance, you might find a lot of scholarly dispute about both what racism is and how to measure it (especially in a cross-country context). You'll also probably find studies with really different designs and aims -- one paper might be a longitudinal study with 5,000 adults but no randomization, and another could be a randomized study of 300 students in 10 classrooms. Synthesizing evidence on a question like this requires judgment calls about, e.g., which papers to include in your search (based on criteria like target population or study design) and which outcomes to report from them.

When you've done so, the next problem is creating a framework for comparing a lot of different outcomes. Returning to the example of 'what interventions reduce racism', a reviewer is likely to find a mix of Implicit Association Test (IAT) scores (which some researchers think are "not good predictors of ethnic or racial discrimination"), explicit attitude measures, and behavioral outcomes (e.g. cooperation in a prisoner's dilemma game). If one study produces an average change of 1 point on a 7-point attitude scale, and another shows the treatment group was 10% more likely to cooperate in a game, how do you average these, or say whether one effect is 'bigger' than the other?

Converting to a common statistical framework

Enter Cohen's D: take the mean difference in an outcome between two groups (sometimes called the Average Treatment Effect, or ATE) and divide it by the pooled standard deviation of that outcome. To make this concrete:[^2]
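Here is a minimal sketch in R, using made-up attitude-scale scores for a treatment and a control group (the numbers are purely illustrative, not taken from any study):

```r
# Cohen's D for a two-group comparison: mean difference / pooled standard deviation.
# Suppose both groups rated their attitudes on a 7-point scale (made-up data).
treat   <- c(5.8, 6.1, 5.5, 6.4, 5.9, 6.0, 5.7, 6.2)
control <- c(5.1, 5.4, 4.9, 5.6, 5.0, 5.3, 5.2, 4.8)

mean_diff <- mean(treat) - mean(control)   # the Average Treatment Effect (ATE)

# Pooled standard deviation of the outcome across both groups
n1 <- length(treat); n2 <- length(control)
sd_pooled <- sqrt(((n1 - 1) * var(treat) + (n2 - 1) * var(control)) / (n1 + n2 - 2))

cohens_d <- mean_diff / sd_pooled
cohens_d
```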

You then do this for every outcome of each study you look at.[^3] This enables you to aggregate all studies into estimates of, e.g., which interventions work best, the average treatment effect of a literature as a whole, the relationship between magnitude of effect size and the precision of each estimate, and anything else you want to know (and think the data can tell you).
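For example, one standard way (though not the only way) to combine studies once they're all on the Cohen's D scale is an inverse-variance weighted average, where each study's estimate is weighted by the inverse of its squared standard error (see footnote 3). A sketch with made-up effect sizes and sample sizes:

```r
# Pooling standardized effects across studies (illustrative values only).
# d: Cohen's D from each study; n1, n2: treatment and control group sizes.
d  <- c(0.45, 0.20, 0.60, 0.10)
n1 <- c(50, 120, 80, 200)
n2 <- c(50, 115, 75, 210)

# Approximate standard error of Cohen's D (a standard large-sample formula)
se <- sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))

# Inverse-variance (fixed-effect) weighted average across studies
w <- 1 / se^2
pooled_d  <- sum(w * d) / sum(w)
pooled_se <- sqrt(1 / sum(w))
c(pooled_d = pooled_d, pooled_se = pooled_se)
```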

What if a study doesn't report means and standard deviations?

Then you use the statistical information available to you to try to derive them. But this process is more fraught than I once realized, and one particular challenge of doing so -- estimating Cohen's D when all results are reported after control variables are accounted for -- is the subject of the next essay.
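Before getting to that harder case, here's the simplest kind of conversion: if a study reports a t-statistic from an independent two-group comparison along with the group sizes, Cohen's D can be recovered as d = t * sqrt(1/n1 + 1/n2). A sketch with made-up numbers:

```r
# Recovering Cohen's D from a reported t-statistic, assuming an independent
# two-group comparison (illustrative values only).
t_stat <- 2.4   # t-statistic reported by the study
n1 <- 60        # treatment group size
n2 <- 55        # control group size

d_from_t <- t_stat * sqrt(1 / n1 + 1 / n2)
d_from_t
```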

[^1]: The contact hypothesis re-evaluated (published July 2018); A systematic review and meta-analysis of primary prevention strategies for sexual violence perpetration (in the data-gathering phase); and another I'm currently working on as an RA.

[^2]: This example is a simplified version of work by Sohad Murrar, profiled nicely here.

[^3]: You would also calculate the standard error of each Cohen's D estimate, which measures the precision of that estimate.


