10.1 Introduction

It is often the case in science that we have several studies asking more or less the same question, and, hence, several estimates of some parameter such as the strength of association between two variables, or the causal effect of some intervention on some outcome. At this point, we want to synthesize the evidence available to us to come to a conclusion. What, on the basis of all we know so far, is our best current belief on the topic?

In the past, people wrote review articles (often known these days, disparagingly, as ‘narrative reviews’). In these reviews, which were often taken as settling a question, an author wrote some text about a number of related studies, compared and contrasted their findings, and informally came to some kind of conclusion about the state of knowledge. They often based their conclusion on what proportion of the papers they had found reported a ‘significant’ result. This is known as vote counting: 3 significant results, 2 non-significant, significant is the winner, this is probably a thing.

This is a terrible way of synthesizing evidence, for a number of reasons. First, the practice of narrative reviewing is unsystematic and probably biased. The author of the narrative review talked about studies they knew about. But what if the ones they knew about were not all the studies that had been done? What if they knew about ones that tended to favour their prejudice? To include studies in the review, the review author had to find them. It’s hard to find studies unless they are published in a journal; and, traditionally, studies reporting a ‘significant’ finding were more likely to be accepted by a journal because they were somehow deemed more interesting. This is known as publication bias. The combination of non-systematic study selection and publication bias means that the set of studies included in a review is really unlikely to be representative of all the studies that have been done on a topic; and therefore any conclusions about the state of knowledge are pretty unsafe. These days, to be credible, review articles have to be systematic, something we will come back to in section 10.4.6. Systematic reviewing helps, though it does not completely solve the problem of publication bias.

Second, adding up how many studies find a ‘significant’ effect of X on Y (and how many do not) tells you very little about what the balance of evidence is for an effect of X on Y. The reasons are to do with things we have already encountered: a non-significant null hypothesis test of an effect is not the same as evidence that the effect is equivalent to zero (section 4.3); and difference of significance is not significance of difference (section 5.6). In the next section, I will show with examples why vote counting is bad, and introduce the alternative approach, quantitative meta-analysis.
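To anticipate that point, here is a minimal sketch of how vote counting can mislead. The five study results below are invented for illustration: each study estimates the same effect with the same standard error, and each is non-significant on its own, yet pooling them with the standard fixed-effect inverse-variance formula (weighting each estimate by 1/SE²) gives strong overall evidence for the effect.

```python
# Hypothetical example: the estimates and standard errors are invented numbers,
# chosen so each study is individually underpowered.
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic, via the standard normal CDF."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

estimates = [0.30] * 5   # five studies, each estimating an effect of 0.30
ses = [0.18] * 5         # each with standard error 0.18

# Vote counting: how many studies are individually 'significant' at p < .05?
votes = sum(two_sided_p(est / se) < 0.05 for est, se in zip(estimates, ses))
# Each study has p of about .10, so votes == 0: vote counting says 'no effect'.

# Fixed-effect inverse-variance pooling: weight each study by 1 / SE^2.
weights = [1 / se**2 for se in ses]
pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
pooled_p = two_sided_p(pooled / pooled_se)
# The pooled estimate is 0.30 with SE of about 0.08, and p < .001:
# the combined evidence for the effect is strong, despite zero 'votes'.
```

The pooled standard error shrinks roughly with the square root of the number of studies, which is exactly the information that counting significant results throws away.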