Understanding Statistics (BMJ Learning 2004)

BMJ Learning

There is an on-line resource called ‘BMJ Learning’. This was initially a free service, targeted particularly at Primary Care in the UK, but of general educational interest. You now have to take out a subscription, but it is free if you are a member of the British Medical Association, and your institution might also have a subscription. I have written a module called ‘Understanding Statistics’, which can be found under the interactive cases section and is due to be updated at the end of 2016. This covers the topic of presenting results in relative and absolute terms using Clopidogrel as an example. There is also a more advanced module (Understanding Statistics 2) covering confidence intervals and statistical significance.

The web address is http://www.bmjlearning.com

Delayed Antibiotic Prescriptions

Those interested in the use of deferred antibiotic prescriptions may be interested in two articles in the November 2003 issue of the British Journal of General Practice and an accompanying editorial outlining our ongoing experience with children who have acute otitis media.

The references are:

  • Cates C. Delayed prescriptions in primary care. Br J Gen Pract 2003;53(496):836-7.
  • Edwards M, Dennison J, Sedgwick P. Patients’ responses to delayed antibiotic prescription for acute upper respiratory tract infections. Br J Gen Pract 2003;53(496):845-50.
  • Arroll B, Kenealy T, Kerse N. Do delayed prescriptions reduce antibiotic use in respiratory tract infections? A systematic review. Br J Gen Pract 2003;53(496):871-7.
  • There is also an interesting editorial in the BMJ on the same topic by Bruce Arroll published in December of this year. This can be found at http://bmj.bmjjournals.com/cgi/content/full/327/7428/1361 or, in the print journal, the reference is:
  • Arroll B, Kenealy T, Goodyear-Smith F, Kerse N. Delayed prescriptions. BMJ 2003;327(7428):1361-1362.
  • Finally a paper from the USA has coined a new acronym for delayed prescriptions as SNAP (safety net antibiotic prescription). I wonder if this will catch on?
  • Siegel RM, Kiely M, Bien JP, Joseph EC, Davis JB, Mendel SG, et al. Treatment of Otitis Media With Observation and a Safety-Net Antibiotic Prescription. Pediatrics 2003;112(3):527-531.

Systematic Reviews and Meta-analyses (Prescriber 2003)

Information Overload

The sheer volume of material published in medical journals each week is well beyond any of us to keep up with. To save us from drowning in information, the writers of systematic reviews aim to collect together and appraise all the evidence from appropriate studies addressing a focussed clinical question. The Cochrane Collaboration has been working at this task for the past twenty years and, in September 2016, there were 7038 completed reviews on the Cochrane Database of Systematic Reviews and a further 2520 protocols that will become reviews in the future.

The File-Drawer Problem

So what was wrong with the traditional narrative review from an expert in the field? The previous emphasis has been on understanding the mechanisms of disease and combining this with clinical experience to guide practice.(1) The main problem with this approach is that we all have our preferred way of doing things, and there is a natural tendency to take note of articles that fit in with our view. We may cut these out and keep them in our filing cabinet, whilst articles that do not agree are filed in the rubbish bin. This means that when asked to review a topic it is natural for an expert to go to the drawer and quote all the data that support their favoured approach.

What is a Systematic Review?

So how is a systematic review different? Let’s start with a definition:

Systematic review (synonym: systematic overview): A review of a clearly formulated question that uses systematic and explicit methods to identify, select and critically appraise relevant research, and to collect and analyse data from the studies that are included in the review. Statistical methods (meta-analysis) may or may not be used to analyse and summarise the results of the included studies.

The difference here is that the way the papers were found and analysed is clearly stated. The reader still needs to be satisfied that the search for papers was wide enough to obtain all the relevant data. Searching Medline alone is rarely enough, and if only English language papers are included this may leave out potentially important evidence.

All Cochrane reviews start as a published protocol; this states in advance how the review will be carried out (searching for data, appraising and combining study data). There is therefore some protection against the danger of post-hoc analysis, in which reviewers find that by dividing up the trials in a particular way spurious statistical significance can be generated in sub-groups of patients or treatment types.

Is the Question focussed?

But we have moved on to thinking about how the review was carried out before checking whether the question being addressed is an important one. The PICO structure set out in the first article in this series(2) can be used here to check that the Patient groups, Interventions used, Comparator treatment and Outcomes are sensible. Watch out in particular for surrogate outcomes that may not relate well to the outcome that matters to the patient. One example of this can be found in trials relating influenza vaccine to the prevention of asthma exacerbations. Some trials measure antibody levels to the flu-vaccine given, but what really matters is whether asthmatics have fewer exacerbations or admissions to hospital, and there is precious little data from randomised controlled trials about this (3).

What was the quality of the trials found?

A further issue to think about in Systematic Reviews is whether the type of included studies is appropriate to the question being asked. In a previous article in this series(4) the problem of bias was discussed. In general, in questions related to treatment I would expect the review to focus on randomised controlled trials, as this will minimise the bias present in the included studies. Whilst Meta-analysis can be used to combine the results of observational studies, this is unreliable because they may all suffer from the same bias, and this will be carried through into the pooled result from all the studies.

When looking at randomised controlled trials the reviewers should report whether the allocation of patients to the treatment and control groups was adequately concealed (allocation concealment). Allocation is best decided remotely after the patient is entered into the trial; even opaque sealed envelopes can be held up to a bright light by trialists who want to check which treatment the next patient will receive. Poor allocation concealment, failure to blind and poor reporting quality in reviews have all been shown to be associated with overoptimistic results of randomised controlled trials.(5)

Publication bias remains a problem, in that studies that may happen to produce results that are statistically significant are more likely to be published than ones that do not, since editors of medical journals like to have a story to present. This will never be fully overcome until all trials are registered in advance and the publication of results becomes mandatory (whether they show significant differences or not).

Forest Plots

The results of a Systematic Review are often shown graphically as a Forest plot (6). An example from the 2013 update of a Cochrane Review comparing Spacers with Nebulisers for delivery of Beta-agonists(7) is shown below.

Figure 1 Forest Plot of Hospital Admissions for Adults and Children with Acute Asthma when treated with Beta-agonist delivered by Holding Chamber (Spacer) compared to Nebuliser (edited to include data in the 2013 update of the review).

The left hand column lists the included studies, which have been sub-grouped into those relating to adults and children. The columns listed ‘Holding Chamber’ and ‘Nebuliser’ list the proportion of patients in each group admitted to hospital and the Relative Risk of admission is shown next to them as a graphical display. Admission is undesirable so the squares and diamonds to the left of the vertical line favour the spacer group. The size of the blue square relates to the weight given to each study in the analysis; this is listed in the next column and generally increases for larger studies. The width of the horizontal line is the 95% confidence interval for each study and this is reported in text in the final column.

The pooled results from adults are shown in the top diamond, and for children in the lower diamond. This shows that by combining all the studies in children we can be 95% sure that the true relative risk of admission when using a spacer lies between 0.47 and 1.08 in comparison with using a nebuliser. There is no significant difference between the two methods and the confidence interval suggests that, in children, nebulisers are at best no more than 8% better than spacers and may be up to 53% worse.
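The step from the confidence limits to the "8% better" and "53% worse" figures is simple arithmetic, and a minimal sketch may make it explicit. The function name is my own, chosen for illustration; the limits are those quoted above for children.

```python
def rr_to_percent_change(relative_risk):
    """Percentage change in risk implied by a relative risk.

    Negative values are reductions, positive values are increases.
    """
    return (relative_risk - 1.0) * 100.0


# 95% confidence limits for admission in children (spacer versus nebuliser),
# as quoted in the text.
lower_limit, upper_limit = 0.47, 1.08

best_case = rr_to_percent_change(lower_limit)   # spacers at their best
worst_case = rr_to_percent_change(upper_limit)  # spacers at their worst

print(f"Spacer admissions could be {abs(best_case):.0f}% lower "
      f"or {worst_case:.0f}% higher than with a nebuliser")
```

Because the interval includes a relative risk of 1 (no change), the data are compatible with either device being the better one.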

So how can these results be translated into clinical practice? This question will be the focus of the next article in this series.


1. Haynes RB. What kind of evidence is it that Evidence-Based Medicine advocates want health care providers and consumers to pay attention to? BMC Health Serv Res 2002;2(1):3 http://www.biomedcentral.com/1472-6963/2/3

2. Cates C. Evidence-based medicine: asking the right question. Prescriber 2002;13(6):105-9.

3. Cates CJ, Jefferson TO, Bara AI, Rowe BH. Vaccines for preventing influenza in people with asthma (Cochrane Review). In: Cochrane Library: Update Software (Oxford); 2000.

4. Po A. Hierarchy of evidence: data from different trials. Prescriber 2002;13(12):18-23.

5. Bandolier. Bias. Bandolier 2000;80-2:1-5 http://www.jr2.ox.ac.uk/bandolier/band80/b80-2.html

6. Lewis S, Clarke M. Forest plots: trying to see the wood and the trees. BMJ 2001;322(7300):1479-1480.

7. Cates CJ, Welsh EJ, Rowe BH. Holding chambers (spacers) versus nebulisers for beta-agonist treatment of acute asthma. Cochrane Database of Systematic Reviews 2013, Issue 9. Art. No.: CD000052. DOI: 10.1002/14651858.CD000052.pub3. (added in 2016)

Reproduced with permission and edited October 2016.

Subgroups compared (BMJ 2003)

Statistical notes http://bmj.com/cgi/content/full/326/7382/219 in the BMJ in January 2003 contain a nice article by Altman and Bland demonstrating a way of testing whether there is a statistically significant difference between subgroups in a clinical trial or meta-analysis.

The key question to answer is whether the effect in one sub-group of patients is significantly different from another group, and the point estimate and confidence interval for each group can be used to test this. The P-value for each group should NOT be compared because this addresses the wrong question. This was discussed in a previous set of statistical notes in 1996 http://bmj.com/cgi/content/full/313/7060/808 .

The problem is that the individual P-values for each subgroup merely tell us the likelihood of results at least as extreme as those seen in that subgroup occurring if the null hypothesis is true. Now for both subgroups the null hypothesis is the same, namely that there is no difference between the experimental and control treatments. However, the chance of any given result occurring is highly dependent upon the number of patients in the group. Exactly the same point estimate will have a very different P-value as the size of the group gets larger. Thus a small group may have a non-significant P-value for the same point estimate and a large group with the same point estimate can have a much smaller significant P-value.

This is not surprising when likened to tossing a coin, where the null hypothesis is that you will toss a head as often as a tail. If you obtain 60% heads after tossing a coin ten times you will not be surprised, but if it is still 60% after a thousand tosses the coin becomes decidedly suspect. The random chance of tossing 6/10 heads with an unbiased coin is much higher than the random chance of tossing 600/1000 heads.
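The coin-tossing comparison can be checked directly. The sketch below is an illustration of my own rather than anything from the Altman and Bland notes: it computes an exact two-sided P-value for a fair coin, counting every outcome that is at least as far from 50% heads as the one observed.

```python
from math import comb


def two_sided_binomial_p(heads, tosses):
    """Exact two-sided P-value under the null hypothesis of a fair coin.

    Adds up the probability of every result at least as far from
    50% heads as the observed one.
    """
    distance = abs(heads - tosses / 2)
    extreme_outcomes = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= distance
    )
    return extreme_outcomes / 2 ** tosses


p_ten = two_sided_binomial_p(6, 10)
p_thousand = two_sided_binomial_p(600, 1000)

print(f"60% heads in 10 tosses:   P = {p_ten:.2f}")
print(f"60% heads in 1000 tosses: P = {p_thousand:.1e}")
```

The point estimate is 60% heads in both cases, yet the first result is entirely unremarkable and the second is overwhelming evidence against a fair coin. Only the number of tosses has changed.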

This demonstrates one of the weaknesses of being over-reliant on P-values when looking at the results of clinical trials, as they are very much influenced by the number of patients included in each group. The confidence interval (CI) is much more informative, as it gives an idea of where the trial suggests that the true effect of the treatment lies (technically, if the trial were repeated 100 times the 95% confidence interval would be expected to include the true population effect in about 95 of those trials). In a simplified form we can be 95% sure that the true population effect of the treatment is within the 95% confidence interval.

The width of the confidence interval is also affected by the number of patients included, and will get narrower for larger groups, but using the confidence intervals from two subgroups is much more informative than just comparing the P-values. If the two confidence intervals from the subgroups do not overlap, you can start to wonder if there is an important difference between the two subgroups. The Altman paper shows how to assess this more accurately.

A word of warning on subgroups: before attaching too much importance to differences between subgroups you need to check if the groups were defined a priori and whether the division is based on good biological or other grounds. Richard Horton reminded us of the danger of relying on subgroup analysis in his Lancet Editorial (From star signs to trial guidelines. Lancet 2000;355:1033-4). There is also an entertaining paper with a useful set of questions to ask about subgroups by Freemantle in the BMJ in 2001 entitled “Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic?” http://bmj.com/cgi/content/full/322/7292/989

Whilst as clinicians we would like to know how well a treatment would work in the patient sitting in front of us, we have data from clinical trials outlining the average effect of the treatment on a population of patients. It takes very large numbers of patients to tease out whether individual groups benefit more or less than the overall average. One recent example of this comes from the ALLHAT trial (Major outcomes in high-risk hypertensive patients randomised to angiotensin-converting enzyme inhibitor or calcium channel blocker vs diuretic: The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). JAMA 2002;288(23):2981-97). In this trial a total of 33,357 participants aged 55 years or older with hypertension and at least 1 other CHD risk factor from 623 North American centers were randomised to different drug regimes. The overall conclusion was that chlorthalidone produced cardiovascular outcomes that were at least as good as lisinopril or amlodipine, but is this true in diabetics as well as in non-diabetics? There were enough patients in the trial to compare the findings in diabetic and non-diabetic patients.

The relative risk of combined cardiovascular disease in non-diabetics in ALLHAT comparing lisinopril with chlorthalidone was 1.12 (95% CI 1.05 to 1.19) whilst in diabetics it was 1.08 (95% CI 1.00 to 1.17). The diabetics are a smaller group so do not reach statistical significance, as the 95% CI just includes no difference (relative risk 1.0). It is tempting to conclude from this that diabetics do not show the same advantage for chlorthalidone that is seen in the non-diabetics. However, this conclusion is unreliable because it is based on the wrong question. We do not want to know what is the chance (P-value) that chlorthalidone matches lisinopril in the diabetics, because this is dependent upon how many diabetics are included in the trial.

What we want to test is whether the diabetics differed significantly from non-diabetics in their response to the different treatments. This can be tested using the method outlined in the recent Altman paper, and the results in diabetics compared to the non-diabetics show that both are very similar. The Risk Ratio between the subgroups is 1.04 (95% CI 0.86 to 1.25). In other words the difference between the patient groups is not statistically significant, nor is the confidence interval very wide, so I would therefore take the overall trial result to apply to diabetics as well. The data from this trial suggest that thiazides may be back in first place for hypertension in diabetics as well as non-diabetics.
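For readers who want to try this kind of comparison themselves, here is a minimal sketch of a test for interaction along the lines described by Altman and Bland, working on the log scale. The function names are my own. The standard errors are recovered from the rounded confidence limits, and the diabetic interval is entered as 1.00 to 1.17 (the upper limit that is symmetric about 1.08 on the log scale), so the interval produced need not match the one quoted above exactly.

```python
from math import erf, exp, log, sqrt

Z_95 = 1.96  # normal deviate for a 95% confidence interval


def se_from_ci(lower, upper):
    """Standard error of a log relative risk, recovered from its 95% CI."""
    return (log(upper) - log(lower)) / (2 * Z_95)


def compare_subgroups(rr_a, ci_a, rr_b, ci_b):
    """Ratio of two relative risks with its 95% CI and two-sided P-value."""
    difference = log(rr_a) - log(rr_b)
    se_difference = sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    ratio = exp(difference)
    interval = (exp(difference - Z_95 * se_difference),
                exp(difference + Z_95 * se_difference))
    z = abs(difference) / se_difference
    p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal tail
    return ratio, interval, p_value


# Lisinopril versus chlorthalidone, combined cardiovascular disease.
ratio, interval, p_value = compare_subgroups(
    1.12, (1.05, 1.19),   # non-diabetics
    1.08, (1.00, 1.17),   # diabetics (upper limit assumed, see note above)
)
print(f"Ratio of relative risks: {ratio:.2f} "
      f"(95% CI {interval[0]:.2f} to {interval[1]:.2f}), P = {p_value:.2f}")
```

The question being tested here is whether the two subgroups differ from each other, not whether each differs from no effect. An interval for the ratio that includes 1 means there is no good evidence that diabetics respond differently.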

Measuring the costs and effectiveness of treatment (Prescriber 2003)

The Prescriber series on evidence-based medicine aims to provide the reader with an easy-to-follow guide to a complex topic. Using practical examples, the articles will help you apply evidence-based medicine to daily practice. In this final article, we look at how cost-effectiveness is calculated.

Previous articles in this series have described the statistical methods used to find out whether treatments are effective in clinical trials and, before embarking on cost-effectiveness analysis, it is wise to check first that there is good evidence that the treatment works. There are many extra levels of uncertainty when costs are considered, as this article demonstrates, and it is important to ensure that the foundational evidence of clinical benefit is in place before building a cost-effectiveness analysis that may rest on a treatment that has not reliably been shown to be better than placebo.

Let us take as an example the recent report on the benefit of ramipril (Tritace) in the secondary prevention of stroke from the Heart Outcomes Prevention Evaluation (HOPE) investigations.(1) This was a large study of 9297 high-risk patients over 55 who were treated with either 10mg ramipril daily or placebo for an average of 4.5 years. The study showed a highly statistically significant 32 per cent reduction in the risk of stroke – 95 per cent confidence interval (CI) of 16-44 per cent reduction. The risk of fatal stroke was reduced by 61 per cent (95 per cent CI of 33-78 per cent reduction) over the 4.5 years of the study.

So here we have convincing evidence that ramipril was better than placebo. Subsequent correspondence, however, has pointed out that the presentation of the results concentrates on relative rather than absolute benefits and there is no mention of the potential costs involved in preventing strokes with ramipril.(2) Various points are made in the letters, both about the way the results are presented and about the remaining uncertainty in relation to whether this effect is specific to ramipril (or ACE inhibitors in general) or is a general benefit of blood pressure reduction.

Cost-effectiveness analysis

In order to carry out a cost-effectiveness analysis, the consequences of a treatment must be measurable in suitable units (those that measure an important outcome) (3), so in this case the unit could be one stroke. By making the unit of analysis ‘one stroke prevented’, the costs of caring for stroke can be set on one side and the costs of different treatments can be calculated. There is debate about whether non-NHS costs should be included, but for simplicity we will restrict ourselves to the ingredient costs of the drugs used for stroke prevention and ignore the costs of blood tests for monitoring.

We find that because strokes only occurred in 4.9 per cent of the patients in the placebo group, the impressive 32 per cent relative risk reduction actually translates into an absolute risk reduction of 1.5 per cent and a number needed to treat (NNT) of 66 people (95 per cent CI of 49 to 128) for 4.5 years to prevent one stroke.

The cost of the drug for 66 people for this length of time is about £58 000 (£196 per year each), and the confidence intervals of the NNT translate into a range of £43 000 to £113 000 to prevent one stroke. The temptation to make direct cost comparisons with the results of other drugs in reducing stroke is strong but care needs to be exercised.
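The arithmetic behind these figures can be laid out as a short sketch. The function names are my own; the inputs (an absolute risk reduction of 1.5 per cent, an annual drug cost of 196 pounds, 4.5 years of treatment and the NNT limits of 49 and 128) are the ones quoted in the text.

```python
def nnt_from_arr(absolute_risk_reduction):
    """Number needed to treat is the reciprocal of the absolute risk reduction."""
    return 1.0 / absolute_risk_reduction


def cost_per_event_prevented(nnt, annual_cost, years):
    """Drug cost of treating `nnt` people for `years` to prevent one event."""
    return nnt * annual_cost * years


ANNUAL_COST = 196  # pounds per patient per year, as quoted in the text
YEARS = 4.5        # average duration of treatment in the trial

print(f"NNT from a 1.5% absolute risk reduction: {nnt_from_arr(0.015):.1f}")

costs = {}
for label, nnt in [("point estimate", 66), ("lower limit", 49), ("upper limit", 128)]:
    costs[label] = cost_per_event_prevented(nnt, ANNUAL_COST, YEARS)
    print(f"NNT {nnt:>3} ({label}): about GBP {round(costs[label], -3):,.0f}")
```

Note that the cost range comes entirely from the uncertainty in the NNT; the drug price and duration are treated as fixed in this simple calculation.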

There is a recognised difficulty in comparing NNTs for different treatments that do not have the same duration(4) and this can be overcome by looking at cost-effectiveness, as the duration of treatment is taken into account. This is because more events will be prevented with longer trial durations but the costs of treatment will go up in parallel, so the cost per event should stay the same whatever the duration considered.

There is, however, a residual problem in relation to any kind of absolute treatment effect (including cost-effectiveness). The size of absolute benefit is closely related to the baseline risk of the patient being treated, so high-risk patients will tend to show lower NNT and lower costs per event saved. This is because the relative risk reduction tends to be fairly consistent across different levels of baseline risk. This is demonstrated in the ramipril results where the relative risk reduction is very similar for patients with high and normal blood pressure (1), but those with higher blood pressure have higher absolute risks of stroke and therefore derive more benefit from treatment.
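A small sketch shows how strongly the NNT depends on baseline risk when the relative risk reduction is held constant. The 32 per cent relative risk reduction and the 4.9 per cent placebo risk come from the ramipril trial; the other two baseline risks are hypothetical values chosen purely for illustration.

```python
def nnt(baseline_risk, relative_risk_reduction):
    """NNT for a given baseline risk, assuming a constant relative risk reduction."""
    absolute_risk_reduction = baseline_risk * relative_risk_reduction
    return 1.0 / absolute_risk_reduction


RRR = 0.32  # relative risk reduction for stroke reported in the trial

# 4.9% is the placebo group risk in the trial; 2% and 10% are hypothetical.
for baseline in (0.02, 0.049, 0.10):
    print(f"Baseline risk {baseline:.1%}: NNT about {nnt(baseline, RRR):.0f}")
```

With unrounded inputs the trial baseline gives an NNT of about 64 rather than the 66 quoted earlier, which was derived from the absolute risk reduction rounded to 1.5 per cent. The cost per stroke prevented scales in the same way, so it is lowest in the highest-risk patients.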

A further example of this relates to the cost of using statins. To prevent one cardiovascular event, fewer patients need to be given a statin when they are used for secondary prevention (where the baseline risk is high) in comparison with primary prevention (lower baseline risk). For this reason, before comparing costs between trials or meta-analyses of different treatments against placebo, it is important to check that the baseline risk of the patients in the placebo group is similar. In fact the patients included in the Heart Protection Study(5) did have similar baseline risks of stroke and a similar duration of treatment.

Here it is reasonable to compare the costs of using a statin and this works out as more expensive, at around £100 000 to prevent one stroke – this allows for the fact that some placebo arm patients ended up on a statin and not all the active patients stayed on treatment. Aspirin is far cheaper at around £500 per stroke prevented, but hopefully most patients will be receiving this already.

Head-to-head comparisons of different interventions in a single trial can overcome the above difficulties, but in order to generate the power required to reliably detect small differences, prohibitively large numbers of patients need to be recruited. This in turn raises a further question about whether the costs of finding the answer outweigh the benefits of knowing it!

Cost minimisation

It is a mistake to think that economic analysis is only about minimising the costs of the treatment itself; if this were the only concern all asthmatics would be treated with oral steroids (the cheapest option). Clearly this ignores the known risks of long-term systemic treatment with oral steroids and would be entirely unethical.

In some situations, however, there is enough reliable information to persuade us that different treatments lead to similar outcomes, and in this instance a cost minimisation approach can be used. An example of this is the use of different delivery devices in asthma. A systematic search of the literature(6) found that there is little evidence for any of the devices producing superior outcomes in clinical trials, so a cost minimisation analysis was carried out in which the costs of the devices were directly compared. Since a metered-dose inhaler with spacer is the cheapest method available this is the preferred first-line delivery method to try, but of course this does not mean that some patients will not need dry powder devices or breath-activated inhalers.

Cost-utility analysis

In some cases treatments cannot be directly compared using one of the simpler methods above as the treatments alter quality and quantity of life. Many of the treatments used in cancer fall into this category and assessments have to be made that incorporate both mortality and quality of life (QoL).

One way of judging how much people value their current health status is by using a standard gamble technique. Patients are asked to consider the theoretical possibility of having a treatment for their condition that had a chance of leaving them in perfect health or causing death; the odds of each outcome are adjusted until they are unsure whether to accept the treatment or not, and this can be used to rate their current QoL. This information can then be turned into quality-adjusted-life-years (QALYs) to allow the results of treatments for different diseases to be compared.
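As a rough illustration of how a utility value becomes a cost per QALY, here is a minimal sketch. It assumes the usual convention that the utility is the chance of perfect health at which the patient is indifferent in the standard gamble, on a scale where 0 is death and 1 is perfect health. All of the numbers and function names are hypothetical.

```python
def qalys(utility, years):
    """Quality-adjusted life years: years lived weighted by their utility (0 to 1)."""
    return utility * years


def cost_per_qaly(extra_cost, qalys_without, qalys_with):
    """Extra cost of a treatment divided by the QALYs it gains."""
    gained = qalys_with - qalys_without
    if gained <= 0:
        raise ValueError("the treatment gains no QALYs, so the ratio is undefined")
    return extra_cost / gained


# Hypothetical patient: utility 0.6 without treatment and 0.8 with it,
# over the same 10 years of life, at an extra cost of 20,000 pounds.
without_treatment = qalys(0.6, 10)
with_treatment = qalys(0.8, 10)
ratio = cost_per_qaly(20_000, without_treatment, with_treatment)

print(f"QALYs gained: {with_treatment - without_treatment:.1f}")
print(f"Cost per QALY gained: about GBP {ratio:,.0f}")
```

Because the result is expressed per QALY rather than per stroke or per admission, treatments for quite different conditions can be placed on the same scale.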

Sensitivity analysis

Since all economic analysis requires assumptions to be made about the cost of treatments and the value of outcomes, it is usual to carry out a sensitivity analysis to see how much the results of the analysis vary when the assumptions are altered. In particular, it may be necessary to predict what would happen beyond the timescale of the trials by using modelling techniques. If the results are very unstable when the assumptions are adjusted, this should be made clear and the reader will need to interpret the analysis with more caution.
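A one-way sensitivity analysis can be sketched in a few lines: keep a base case, change one assumption at a time and see how far the answer moves. The base case uses the ramipril figures quoted earlier (NNT 66, 196 pounds per year, 4.5 years); the halved drug price is a hypothetical scenario of my own, as are the function and variable names.

```python
def cost_per_stroke_prevented(nnt, annual_cost, years=4.5):
    """Drug cost of preventing one stroke under a given set of assumptions."""
    return nnt * annual_cost * years


base_case = {"nnt": 66, "annual_cost": 196}

# Each scenario overrides a single assumption in the base case.
scenarios = {
    "base case": {},
    "NNT at lower confidence limit": {"nnt": 49},
    "NNT at upper confidence limit": {"nnt": 128},
    "drug price halved (hypothetical)": {"annual_cost": 98},
}

results = {
    name: cost_per_stroke_prevented(**{**base_case, **change})
    for name, change in scenarios.items()
}

for name, cost in results.items():
    print(f"{name:<34} GBP {cost:>9,.0f}")
```

In this simple example the result is much more sensitive to the uncertainty in the NNT than to a halving of the drug price, which is the kind of finding a sensitivity analysis is meant to bring out.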

Decisions have to be made

In the real world medical needs will always exceed the ability of any healthcare system to meet them. Hard choices have to be made every day about how best to use the resources that are available to us. The best available evidence of treatment efficacy (usually from systematic review of the results of randomised controlled trials) has to be combined with an economic analysis before those choices are made.

These are the processes used by the National Institute for Clinical Excellence (NICE), and they should be as transparent as possible so that we can see how the decisions were reached, even if we do not agree with all of them.

Table 1. Glossary of terms

Cost-effectiveness analysis

A form of economic study design in which consequences of different interventions may vary but can be expressed in identical natural units; competing interventions are compared in terms of cost per unit of consequence

Cost-minimisation analysis

An economic study design in which the consequences of competing interventions are the same and in which only inputs are taken into consideration; the aim is to decide which is the cheapest way of achieving the same outcome

Cost-utility analysis

A form of economic study design in which interventions producing different consequences in both quality and quantity of life are expressed as utilities; the best known utility measure is the quality-adjusted-life-year or QALY; competing interventions can be compared in terms of cost per QALY

Sensitivity analysis

A technique that repeats the comparison between inputs and consequences, varying the assumptions underlying the estimates – in doing so, sensitivity analysis tests the robustness of the conclusions by varying the items around which there is uncertainty

I would like to thank Professor Miranda Mugford for permission to use the glossary of terms from Elementary Economic Evaluation in Health Care and for helpful comments on this article.


1. Bosch J, Yusuf S, Pogue J, et al. Use of ramipril in preventing stroke: double blind randomised trial. BMJ 2002; 324:699-702.
2. Badrinath P, Wakeman AP, Wakeman JG, et al. Preventing stroke with ramipril. BMJ 2002;325:439.
3. Jefferson TO, Demicheli V, Mugford M. Elementary economic evaluation in health care. 2nd ed. London: BMJ Books, 2000;132.
4. Smeeth L, Haines A, Ebrahim S. Numbers needed to treat derived from meta-analyses – sometimes informative, usually misleading. BMJ 1999;318:1548-51.
5. MRC/BHF Heart Protection Study of cholesterol lowering with simvastatin in 20 536 high-risk individuals: a placebo-controlled randomised controlled trial. Lancet 2002;360:7-22.
6. Brocklebank D, Ram F, Wright J, et al. Comparison of the effectiveness of inhaler devices in asthma and chronic obstructive airways disease; a systematic review of the literature. Health Technol Assess 2001;5:1-149.

Combining the results from Clinical Trials (Pulse Article 2001)

This article is part of a series on Critical Reading.

In an article on sub-group comparisons I warned about the danger of paying too much attention to results from patients in particular sub-groups of a trial, arguing that the overall treatment effect is usually the best measure for all the patients.

In the same way, when the results of all available clinical trials are combined in a Systematic Review (for example in a Cochrane review) care is still required in the interpretation of the results from each individual trial, and the main focus is on the pooled result giving the average from all the trials. The results are often displayed in a forest plot as demonstrated below. The result of each trial is represented by a rectangle (which is larger for the bigger trials) and the horizontal lines indicate the 95% confidence interval of each trial. The diamond at the bottom is the pooled result and its confidence interval is the width of the diamond.1

As hospital admissions for acute asthma were rare in each trial (shown in the columns of data for Holding Chamber and Nebuliser), the uncertainty of the individual trials is seen in wide confidence intervals, but when these are pooled together the uncertainty shrinks to a much narrower estimate. The pooled odds ratio of one indicates no difference shown between delivery methods for beta-agonists in acute asthma as far as admission rates are concerned, but the estimate is still imprecise and compatible with either a halving or a doubling of the odds of being admitted to hospital. So we have to say that we do not know whether there is a difference in the rate of admissions between the two delivery methods.
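To show why pooling narrows the confidence interval, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis of odds ratios. The three trials are hypothetical and are not taken from the review, and the review itself may well have used a different pooling method; the function names are my own.

```python
from math import exp, log, sqrt


def log_odds_ratio(events_a, total_a, events_b, total_b):
    """Log odds ratio and its standard error from a two-by-two table."""
    a, b = events_a, total_a - events_a
    c, d = events_b, total_b - events_b
    if 0 in (a, b, c, d):
        # Continuity correction so that empty cells do not break the formula.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return log((a * d) / (b * c)), sqrt(1 / a + 1 / b + 1 / c + 1 / d)


def pool_fixed_effect(trials):
    """Inverse-variance pooled odds ratio, 95% CI and pooled standard error."""
    estimates = [log_odds_ratio(*trial) for trial in trials]
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return (exp(pooled),
            exp(pooled - 1.96 * pooled_se),
            exp(pooled + 1.96 * pooled_se),
            pooled_se)


# Hypothetical trials: (admissions, patients) for spacer, then for nebuliser.
trials = [(3, 30, 4, 30), (5, 50, 4, 48), (2, 40, 3, 42)]

for trial in trials:
    est, se = log_odds_ratio(*trial)
    print(f"Trial OR {exp(est):.2f} "
          f"(95% CI {exp(est - 1.96 * se):.2f} to {exp(est + 1.96 * se):.2f})")

pooled_or, lower, upper, pooled_se = pool_fixed_effect(trials)
print(f"Pooled OR {pooled_or:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Each trial is weighted by the inverse of its variance, so larger trials count for more, and the pooled standard error is always smaller than that of any single trial. With few events the pooled interval can still be wide, which is exactly the situation described above.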

Before all the results are combined it is wise to carry out statistical tests to look for Publication Bias. There is evidence that positive results from Clinical trials are more likely to be published in major journals, and in the English language than similar trials that report negative results. When published studies are combined this leads to a tendency to overestimate the benefits of treatment. The easiest way to look for this is using a funnel plot of the results from the trials, where the results of each trial are plotted against the size of each study. Chance variations mean that small studies should show more random scatter in both directions around the pooled result. If all the small studies are showing positive results there is a suspicion that other small studies exist with negative results but were not published. The funnel plot shown below is taken from a Cochrane review of the use of Nicotine gum for smoking cessation and is reasonably symmetrical.


A further important check is to look for Heterogeneity. The individual trials will again show chance variation in their results and in a Systematic Review it is usual to test whether the differences are larger than those expected by chance alone. The Forest plot above shows that the Heterogeneity in this set of trials is quite low. However if significant Heterogeneity is shown (in other words the results are more diverse than expected) it is recommended to explore the reasons why this may be. Although statistical adjustments can be made to incorporate such Heterogeneity (using a so-called Random Effects Model) this should not be accepted uncritically. It may be more sensible not to try to combine the trial results at all.
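One common way of quantifying this is Cochran's Q, which adds up the weighted squared distances of each trial from the pooled result, together with the I-squared statistic derived from it. I-squared is not mentioned in the article and is included here only as an illustration; the trial results below are hypothetical and the function name is my own.

```python
def heterogeneity(estimates, standard_errors):
    """Cochran's Q, its degrees of freedom and I-squared (as a percentage).

    `estimates` are trial results on the log scale (for example log odds
    ratios) and `standard_errors` are their standard errors.
    """
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i_squared


# Hypothetical log odds ratios: one consistent set and one diverse set.
consistent = heterogeneity([0.20, 0.20, 0.20], [0.3, 0.3, 0.3])
diverse = heterogeneity([-0.8, 0.7, -0.1, 0.9], [0.2, 0.2, 0.2, 0.2])

for label, (q, df, i2) in [("consistent", consistent), ("diverse", diverse)]:
    print(f"{label:<10} Q = {q:.1f} on {df} df, I-squared = {i2:.0f}%")
```

If the trials differ only by chance, Q should be close to its degrees of freedom and I-squared close to zero. A large value is a prompt to look at how the trials differ, not a licence to pool them regardless.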

An example of this can be found in the BMJ in October 1999 in which a group from Toronto published a meta-analysis of Helicobacter eradication (1). The statistical tests showed considerable Heterogeneity between the trials that was largely ignored by the authors. Inspection of the trials shows that there were two types; some with outcomes measured at six weeks using single treatments and others using triple therapy and measuring dyspepsia at one year. There is no good clinical reason to put these together and this may well explain the diversity of the results (2).

The message is to use your common sense when deciding whether the differences between the outcomes measured and the treatments used in each trial mean that it is safer not to calculate a single average result (not least because the average is not easy to interpret and apply to clinical practice).

1. Jaakkimainen RL, Boyle E, Tudiver F. Is Helicobacter pylori associated with non-ulcer dyspepsia and will eradication improve symptoms? A meta-analysis. BMJ 1999;319:1040-4

2. Cates C. Studies included in meta-analysis had heterogeneous, not homogeneous, results. BMJ 2000;320:1208

The perils and pitfalls of sub-group analysis (Pulse Article 2001)

This article is part of a series on Critical Reading.

Controlled clinical trials are designed to investigate the effect of a treatment in a given population of patients, for example aspirin is given to patients with ischaemic heart disease. Inevitably there will be differences between the patients included in the trial (men versus women, older versus younger, hypertensive versus non-hypertensive).

It is tempting to look at the effects of treatment separately in different types of patient in order to decide who will benefit most from being given the treatment. Although this analysis of the sub-groups of patients is widely carried out in the medical literature, it is not very reliable. The ISIS-2 trial gives a clear example of how this can be misleading [1]. The trial looked at the effect of aspirin given after acute myocardial infarction, and when the results were reported the editorial team at the Lancet wished to publish a table of sub-group analyses. The authors agreed as long as the first line in the table compared the effects in patients with different birth signs [2].

The analysis showed that aspirin was beneficial in all patients except those with the star signs of Libra and Gemini. This served as a warning against the over-interpretation of the results of the other sub-groups reported in the paper. The problem is that the play of chance can lead to apparently significant differences between sub-groups, and sub-group results are only helpful in very large trials that show big overall differences between the treatment and control groups.
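The play of chance described here can be sketched with a small simulation. The figures below are illustrative assumptions rather than the trial's data: twelve equal sub-groups (one per star sign) of 700 patients per arm, a death rate of 12% on control and 9.5% on treatment, and a treatment that works equally well in every sub-group. The function names are invented for the example.

```python
import random


def count_deaths(n_patients, risk, rng):
    """Count deaths among n_patients who each have the same risk of dying."""
    return sum(1 for _ in range(n_patients) if rng.random() < risk)


def trial_has_misleading_subgroup(rng, n_subgroups=12, n_per_arm=700,
                                  control_risk=0.12, treated_risk=0.095):
    """Return True if any sub-group shows no benefit purely by chance.

    The treatment truly works equally well in every sub-group, so a
    sub-group with as many or more deaths on treatment is a chance finding.
    """
    for _ in range(n_subgroups):
        control_deaths = count_deaths(n_per_arm, control_risk, rng)
        treated_deaths = count_deaths(n_per_arm, treated_risk, rng)
        if treated_deaths >= control_deaths:
            return True
    return False


rng = random.Random(1988)  # fixed seed so the sketch is repeatable
n_simulations = 100
misleading = sum(trial_has_misleading_subgroup(rng)
                 for _ in range(n_simulations))
fraction_misleading = misleading / n_simulations

print(f"Simulated trials with at least one 'no benefit' sub-group: "
      f"{fraction_misleading:.0%}")
```

Under these assumptions a substantial share of the simulated trials contain at least one sub-group in which the treatment appears not to work, even though the benefit is identical in all of them by construction.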

Two examples of the use of sub-group analysis are somewhat contentious. The first was reported in the Lancet and looked at the evidence from different trials of mammography to try to reduce deaths from breast cancer [3]. The overall result from all the trials together showed mammography to be of significant benefit, but the authors looked at the characteristics of the trials and felt that some were more reliable than others. The data from these selected trials did not show a benefit from mammography. On this basis the authors concluded that screening for breast cancer was unjustified.

Use of aspirin

Similarly, a recent paper in the BMJ suggested that aspirin may not be useful for primary prevention in patients with mildly elevated blood pressure, on the basis of the results of patients in this sub-group [4]. I would suggest that before deciding about aspirin for such patients you ask yourself whether you would still treat those with the Libra and Gemini birth signs with aspirin following an MI. Moreover, if patients on aspirin for secondary prevention of ischaemic heart disease ask whether they should stop if their blood pressure is up a bit, my answer would be no.

The bottom line is that the best overall estimate of the effect of a treatment comes from the average effect on all the patients and not from the individual sub-groups [5]. Sub-group analysis is generally best restricted to the realm of generating hypotheses for further testing rather than evidence that should change practice.


1. Horton R. From star signs to trial guidelines. Lancet 2000;355:1033-34

2. ISIS-2 Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction. Lancet 1988;ii:349-60

3. Gotzsche PC, Olsen O. Is screening for breast cancer with mammography justifiable? Lancet 2000;355:129-34

4. Meade TW, Brennan PJ, on behalf of the MRC General Practice Research Framework. Determination of who may derive the most benefit from aspirin in primary prevention: subgroup results from a randomised controlled trial. BMJ 2000;321:13-7

5. Gotzsche PC. Why we need a broad perspective on meta-analysis. BMJ 2000; 321:585-6