Critical Reading Articles Overview

This section contains a series of articles on critical reading. Six of these were originally written for Pulse magazine in 2001 and were edited in 2016. There are also articles from a series in Update in 2005. Other articles highlight bias that can occur in the way that research is reported and draw attention to the sort of problems that may be worth looking for when reading the medical literature.

The perils and pitfalls of sub-group analysis (Pulse Article 2001)

This article is part of a series on Critical Reading.

Controlled clinical trials are designed to investigate the effect of a treatment in a given population of patients, for example aspirin is given to patients with ischaemic heart disease. Inevitably there will be differences between the patients included in the trial (men versus women, older versus younger, hypertensive versus non-hypertensive).

It is tempting to look at the effects of treatment separately in different types of patient in order to decide who will benefit most from being given the treatment. Although this analysis of the sub-groups of patients is widely carried out in the medical literature, it is not very reliable, and the ISIS-2 trial gives a clear example of how it can be misleading [1]. The trial looked at the effect of aspirin given after acute myocardial infarction, and when the results were reported the editorial team at the Lancet wished to publish a table of sub-group analyses. The authors agreed as long as the first line in the table compared the effects in patients with different birth signs [2].

The analysis showed that aspirin was beneficial in all patients except those with the star signs of Libra and Gemini. This served as a warning against the over-interpretation of the results of the other sub-groups reported in the paper. The problem is that the play of chance can lead to apparently significant differences between sub-groups, and sub-group analyses are really only helpful in very large trials that show big overall differences between the treatment and control groups.
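The play of chance described above can be seen in a small simulation. This is only a sketch: the trial size, the event rates and the function names are assumptions made for illustration and are not the ISIS-2 data. Every simulated patient has the same true benefit from treatment, yet when the trial is split into twelve arbitrary sub-groups (standing in for birth signs) some sub-groups often appear to be harmed.

```python
import random

def simulate_subgroups(n_per_arm=3000, control_risk=0.12, treated_risk=0.10,
                       n_subgroups=12, seed=1):
    """Simulate one trial in which treatment helps everyone equally, then
    return the observed risk difference (treated minus control) in each of
    n_subgroups arbitrary sub-groups. All numbers are hypothetical."""
    rng = random.Random(seed)
    # Each entry holds [treated patients, treated deaths, control patients, control deaths]
    counts = [[0, 0, 0, 0] for _ in range(n_subgroups)]
    for _ in range(n_per_arm):
        g = rng.randrange(n_subgroups)          # sub-group of a treated patient
        counts[g][0] += 1
        counts[g][1] += rng.random() < treated_risk
        g = rng.randrange(n_subgroups)          # sub-group of a control patient
        counts[g][2] += 1
        counts[g][3] += rng.random() < control_risk
    return [t_d / t_n - c_d / c_n for t_n, t_d, c_n, c_d in counts]

def share_of_trials_with_a_reversed_subgroup(n_trials=100):
    """Proportion of simulated trials in which at least one sub-group shows
    more deaths on treatment than on control, purely by chance."""
    reversed_trials = 0
    for seed in range(n_trials):
        if any(diff > 0 for diff in simulate_subgroups(seed=seed)):
            reversed_trials += 1
    return reversed_trials / n_trials

print("Share of trials with an apparently harmed sub-group:",
      share_of_trials_with_a_reversed_subgroup())
```

With these assumed numbers the overall trial result is stable, but individual sub-groups are small enough for chance to reverse the apparent direction of effect.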

Two examples of the use of sub-group analysis are somewhat contentious. The first was reported in the Lancet and looked at the evidence from different trials of mammography to try to reduce deaths from breast cancer [3]. The overall result from all the trials together showed mammography to be of significant benefit, but the authors looked at the characteristics of the trials and felt that some were more reliable than others. The data from these selected trials did not show a benefit from mammography. On this basis the authors concluded that screening for breast cancer was unjustified.

Use of aspirin

Similarly a recent paper in the BMJ suggested that aspirin may not be useful for primary prevention in patients with mildly elevated blood pressure on the basis of the results of patients in this sub-group [4]. I would suggest that before deciding about aspirin for such patients you ask yourself whether you would still treat those with the Libra and Gemini birth signs with aspirin following an MI. Moreover if patients on aspirin for secondary prevention of ischaemic heart disease ask whether they should stop if their blood pressure is up a bit, my answer would be no.

The bottom line is that the best overall estimate of the effect of a treatment comes from the average effect on all the patients and not from the individual sub-groups [5]. Sub-group analysis is generally best restricted to the realm of generating hypotheses for further testing rather than evidence that should change practice.


1. Horton R. From star signs to trial guidelines. Lancet 2000;355:1033-34

2. ISIS-2 Collaboration group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected myocardial infarction. Lancet 1988; ii:39-60

3. Gotzsche PC, Olsen O. Is screening for breast cancer with mammography justifiable? Lancet 2000;355:129-34

4. Meade TW, Brennan PJ, on behalf of the MRC General Practice Research framework. Determination of who may derive the most benefit from aspirin in primary prevention; subgroup results from a randomised controlled trial. BMJ 2000; 321:13-7.

5. Gotzsche PC. Why we need a broad perspective on meta-analysis. BMJ 2000; 321:585-6

Relative or Absolute measures of effect (Pulse Article 2001)

This article is part of a series on Critical Reading.

Measuring outcomes in clinical trials can be done in a variety of ways, and presentation of the results may influence the way that readers respond. For any trial that reports dichotomous outcomes (that is where patients can only be in one of two categories, such as dead or alive, pregnant or not) the results can be shown simply as a two-by-two table.

             Non-pregnant   Pregnant
Levonelle        976           11
Yuzpe            997           31

The data shown indicate that 1% of patients given Levonelle-2 for post-coital contraception become pregnant in comparison with 3% of those who are given the older Yuzpe method (1). This can be reported in different ways. The Relative Risk of becoming pregnant is obtained by dividing the risk of pregnancy with Levonelle-2 by the risk with Yuzpe and comes out as 0.33 (or in other words your risk of becoming pregnant with Levonelle-2 is one third of the risk with Yuzpe). This sounds impressive in comparison to the Risk Difference, which is obtained by subtracting the risk in one group from the risk in the other and is only 0.02 because the pregnancy rate is low in both groups. The Number Needed to Treat (NNT) with Levonelle-2 rather than Yuzpe to avoid one extra pregnancy is the inverse of the Risk Difference and in this case works out as 63 patients (2).
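The three measures can be worked through from a two-by-two table in a few lines. This is a minimal sketch; the function name and layout are my own assumptions, and it takes the counts exactly as printed in the table above. Using exact counts rather than rounded percentages gives figures that differ somewhat from the rounded values quoted in the text, so treat the output as illustrative of the method rather than as a replacement for the published results.

```python
import math

def two_by_two_measures(events_treated, total_treated, events_control, total_control):
    """Relative risk, risk difference and number needed to treat for a
    dichotomous outcome (hypothetical helper, not from the cited papers)."""
    risk_treated = events_treated / total_treated
    risk_control = events_control / total_control
    relative_risk = risk_treated / risk_control
    risk_difference = risk_control - risk_treated
    # NNT is the inverse of the risk difference; undefined when there is no difference
    nnt = 1 / risk_difference if risk_difference != 0 else math.inf
    return {
        "risk_treated": risk_treated,
        "risk_control": risk_control,
        "relative_risk": relative_risk,
        "risk_difference": risk_difference,
        "nnt": nnt,
    }

# Counts as printed in the table: pregnant, and pregnant + non-pregnant
result = two_by_two_measures(11, 976 + 11, 31, 997 + 31)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```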

Each measure has its advantages and disadvantages. The relative risk of a given treatment (such as statins for the prevention of ischaemic heart disease) tends to be independent of the risk of the patients being treated. This makes it a good measure to use when combining the results of different trials in a meta-analysis (3).

Risk difference on the other hand is helpful when considering treatments for individual patients, as the amount of difference a treatment will make to them depends on their level of risk. A good example of this comes from the comparison of different oral contraceptive pills in terms of the risk of deep vein thrombosis. Safety studies have indicated that third generation oral contraceptive pills carry twice the risk of venous thromboembolism of the older pills; this is a relative risk of 2 and caused a great deal of alarm amongst pill takers. However the absolute risks are very low, 1 in 10,000 with the older pills and 2 in 10,000 with the third generation ones; the risk difference is 0.0001. Put another way, 10,000 women would have to take the third generation pills for one year before one of them suffered thromboembolic disease as a consequence, giving a Number Needed to Harm (NNH) of 10,000.
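The way a fixed relative risk translates into very different absolute effects can be sketched as follows. The pill figures are those given above; the relative risk of 0.7 and the three baseline risks in the second part are hypothetical values chosen only to show the pattern, and the function name is my own.

```python
def absolute_effect(baseline_risk, relative_risk):
    """Return (risk difference, number needed to treat or harm), assuming the
    relative risk is the same at every level of baseline risk."""
    risk_on_treatment = baseline_risk * relative_risk
    risk_difference = abs(risk_on_treatment - baseline_risk)
    number_needed = 1 / risk_difference if risk_difference != 0 else float("inf")
    return risk_difference, number_needed

# Pill example from the text: 1 in 10,000 on the older pills, relative risk of 2
rd, nnh = absolute_effect(1 / 10_000, 2)
print(f"Pills: risk difference {rd:.4f}, NNH {nnh:,.0f}")

# Hypothetical treatment with a relative risk of 0.7 at three baseline risks
for baseline in (0.30, 0.10, 0.02):
    rd, nnt = absolute_effect(baseline, 0.7)
    print(f"Baseline risk {baseline:.2f}: risk difference {rd:.3f}, NNT {nnt:.0f}")
```

The relative risk is identical in every row of the second loop, but the number needed to treat rises steeply as the baseline risk falls.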

Of course the interpretation of Numbers Needed to Treat may be dependent on how important the consequences are and some women opted to change pill to minimise their risks whilst others were happy to continue, as the individual risk to them was so low.

Those of you who are interested in seeing some examples of graphical displays of Numbers Needed to Treat in different clinical scenarios related to primary care will find examples in the Cates plots in other articles on this site (such as Vitamin D for asthma).

There are related articles on this topic (Relatively Absolute and Communicating Risk).


1. Task Force on Postovulatory Methods of Fertility Regulation. Randomised controlled trial of levonorgestrel versus Yuzpe regimen of combined oral contraceptives for emergency contraception. Lancet 1998;352:428-33.

2. Cates C. Which postcoital contraceptive? BMJ 2000;321:664

3. Egger M, Davey Smith G, Phillips AN. Meta-analysis: principles and procedures. BMJ 1997; 315: 1533-1537.

Evidence from Randomised Trials and Systematic Reviews (Pulse Article 2001)

This article is part of a series on Critical Reading.

The main threat to validity in non-randomised studies is bias due to differences in the populations of patients who do and do not receive the experimental treatment. Randomisation should overcome this problem because the random allocation of patients to the treatment or control groups should create an equal spread of known and unknown risk factors between the two groups. Whilst statistical techniques can be used to adjust for known confounding factors in non-randomised studies, by definition the unknown ones can only be overcome with randomisation.

Allocation concealment

Even in randomised controlled trials it is important to check that the allocation of patients to the active and comparison groups is well concealed. The quality of allocation concealment is routinely used by the Cochrane Collaboration in grading trials included in systematic reviews because empirical research has shown that studies which do not have well concealed allocation tend to show more inflated results than those that do. Why should this be? The problem is selection bias: if I was carrying out a randomised trial of my favourite wart paint it is important that I do not know which treatment the next patient will receive, otherwise I can influence the results by choosing the milder wart infections for treatment with the paint. This is quite easy in practice as I would only have to find an excuse to rule the next patient out of the trial if they were due to have the paint and had a horrendous crop of warts!

Similarly if I know the treatment used I may be more optimistic in deciding that a wart has completely gone if the patient had my special paint than if they did not. This is a form of detection bias. Secure double blinding (using an identical wart paint substitute prepared by an outside agency) will overcome both problems, and again has been shown to reduce the size of treatment effects compared with the results of unblinded (open) studies.
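The selection bias in the wart paint example can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the paint has no effect at all, the chance of cure depends only on how severe the warts are, and an investigator who can see the next allocation quietly excludes severe cases that were due to receive the paint.

```python
import random

ARMS = ("paint", "control")

def run_trial(concealed, n_patients=20000, seed=0):
    """Return cure rate on paint minus cure rate on control for a paint that
    truly does nothing. Hypothetical model for illustration only."""
    rng = random.Random(seed)
    treated = {"paint": 0, "control": 0}
    cured = {"paint": 0, "control": 0}
    for _ in range(n_patients):
        severity = rng.random()      # 0 = mild, 1 = horrendous crop of warts
        arm = rng.choice(ARMS)       # the next allocation in the random sequence
        if not concealed and arm == "paint" and severity > 0.7:
            continue                 # an excuse is found to rule the patient out
        treated[arm] += 1
        # Milder warts are more likely to clear, whatever the treatment
        cured[arm] += rng.random() < (1 - severity)
    return cured["paint"] / treated["paint"] - cured["control"] / treated["control"]

print(f"Concealed allocation:   apparent benefit {run_trial(True):+.3f}")
print(f"Unconcealed allocation: apparent benefit {run_trial(False):+.3f}")
```

With concealed allocation the apparent benefit hovers around zero, as it should for an inert paint, whereas the unconcealed version manufactures a benefit out of patient selection alone.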

What about other trials?

Finally after checking all these quality measures for the paper do not forget that the study being reported is only one of a larger group of other studies that have been carried out on the same topic all over the world. It is for this reason that the Cochrane Collaboration has set out to collect together all the evidence from controlled clinical trials that has a bearing on questions related to clinical practice and to publish the results in the Cochrane Database. Systematic reviews of this kind are one way to combat the increasing volume of papers published each year, but I am often asked what exactly a systematic review is and how it differs from a meta-analysis.

Narrative reviews

Traditionally reviews of interesting topics have been commissioned by journals that ask an expert in the area to give a viewpoint; the problem is that all experts have their favourite approach to a topic and will tend to be most familiar with those papers that support their own view. (How often do you keep a copy of something that you have read that you think is wrong?) This type of narrative review is therefore inherently likely to be biased.

It is helpful to think of a review as being a scientific investigation but of papers rather than patients. Would you trust a trial that reported the results of a new drug where only a few of those treated have their data for you to see and the choice of which ones is entirely up to the investigator? I certainly would not, and in the same way caution is needed when reading the results of narrative reviews.

Systematic Reviews

So what exactly is a Systematic Review? Mulrow has defined a Systematic review as “an efficient scientific technique to identify and summarise evidence on the effectiveness of interventions and to allow the generalisability and consistency of research findings to be assessed and data inconsistencies to be explored.” (1)

The difference is that the review sets out to find all the appropriate evidence on a topic, not just the bits that suit the writer. Ideally the review should start with a protocol that is decided in advance, and for Cochrane reviews these are also published on the Cochrane database. This helps to avoid data-dredging for results that happen to show ‘statistical significance’. Post hoc analysis done after the data is collected is equivalent to firing an arrow into a large wooden wall and then drawing a target around the place the arrow lands, which is much easier than drawing the target first and then hitting the bulls-eye!

The methods section of the systematic review should make clear how the search for evidence was carried out, how the identified trials were selected for inclusion or exclusion from the review, and how the data from the trials was combined. The data pooling is termed Meta-analysis and is no more than using mathematical techniques to combine the results from two or more individual trials. A systematic review sometimes does not include Meta-analysis if the data is not suitable for pooling, and nor does a Meta-analysis mean that all the data has been systematically searched out.
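To give a feel for what combining trial results involves, here is a sketch of one common approach, a fixed-effect inverse-variance pooling of risk ratios. This is an assumption on my part about which technique to show, not a statement of how any particular review was analysed, and the three trials are invented numbers.

```python
import math

def log_risk_ratio(events_t, n_t, events_c, n_c):
    """Log risk ratio and its standard error for a single trial."""
    log_rr = math.log((events_t / n_t) / (events_c / n_c))
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return log_rr, se

def pool_fixed_effect(trials):
    """Weight each trial by the inverse of its variance, so that larger and
    more precise trials count for more. Returns (risk ratio, lower, upper)."""
    total_weight = 0.0
    weighted_sum = 0.0
    for trial in trials:
        log_rr, se = log_risk_ratio(*trial)
        weight = 1 / se ** 2
        total_weight += weight
        weighted_sum += weight * log_rr
    pooled = weighted_sum / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    lower = pooled - 1.96 * pooled_se
    upper = pooled + 1.96 * pooled_se
    return math.exp(pooled), math.exp(lower), math.exp(upper)

# Hypothetical trials: (events on treatment, n on treatment, events on control, n on control)
trials = [(15, 100, 25, 100), (30, 250, 45, 250), (8, 60, 10, 60)]
rr, low, high = pool_fixed_effect(trials)
print(f"Pooled risk ratio {rr:.2f} (95% CI {low:.2f} to {high:.2f})")
```

The pooled interval is narrower than that of any single trial, which is the main attraction of pooling, but the arithmetic says nothing about whether the trials were found systematically in the first place.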

In other articles I unpack some of the techniques used in Meta-analysis and explore the use of meta-analysis in systematic reviews.


1. Mulrow CD. Rationale for systematic reviews. BMJ 1994;309:597-9

Do I need to change my practice (Pulse Article 2001)?

This article is part of a series on Critical Reading.

When speaking to registrars about critical appraisal, one of the commonest questions is “How do I decide whether the paper is good enough to warrant a change in my current practice?” In the article on asking a good question I described how to break down the question addressed by a research paper into its four components, and having done this you next have to decide whether the findings of the paper are likely to be important to you and especially to your patients.

Is it valid?

In particular is the approach being described in the paper worth trying on the next patient who presents with the relevant condition? To answer this we need to look at issues relating to the validity of the paper in question. Two types of validity have been described: internal validity, which relates to the mechanisms of the study itself, and external validity, which is more to do with whether the results of the paper can be extrapolated to the patients in our own practice. In the rest of this article I will concentrate on issues of internal validity using as an example an imaginary study of olive oil for children with acute otitis media.

Choosing controls

The key issue to think about in relation to internal validity is to look at how a comparison group is chosen in relation to the patients who are given the experimental treatment. In a case-series (for example a set of 6 patients who are given a new treatment in routine practice) there may be no comparison group at all, so the immediate concern is that they might have achieved a good result anyway. For example I might tell you that I have treated a series of 100 children with acute otitis media with warm olive oil and that 85 were better in a few days. This sounds impressive until you look at the results of placebo treatment in antibiotic trials for this condition and find a similar recovery rate.

Better than a case series would be a case-control study in which the records of patients who had prolonged pain following ear infections were checked to see how many had been given olive oil; this proportion receiving olive oil could then be compared to the proportion of olive oil use in other patients who did not have prolonged pain. The problem now is being sure that the children do not have other differences influencing the olive oil usage, and this is rarely possible.

Better still a group of children could be compared by offering parents the choice of whether they use the oil or not; this would constitute a prospective cohort study but uncertainty remains about possible important differences between those who chose to have the oil and those who refuse it.

Overcoming Bias

In both the case-control study and the cohort study design the threat to internal validity is related to bias in the choice of the comparison group (selection bias), as well as other possible biases which may be present because both the patient and the doctor are well aware of the treatment that they have received. It will be no surprise to you that the only secure way around these biases is to use a randomised controlled trial that is preferably double-blind, and these will be addressed in the next article.

HRT and heart disease

So are any of these biases important? They certainly can be and a couple of examples may help to show how. In the early non-randomised studies of Hormone Replacement Therapy the results suggested that women on HRT had lower rates of heart disease, and HRT has therefore been advocated as a measure to reduce risks of ischaemic heart disease (1). Some of the authors of these early studies did point out that there were some problems, particularly as the rates of road traffic accident deaths were also lower in the group receiving HRT. The more recent evidence from randomised controlled trials (such as the HERS study [2]) has not confirmed the protective effect and it is probable that the women who opted for HRT had other differences from the control group and may have had generally lower risk factors for heart disease.

Preventing Teenage Pregnancy

Another example of this was a cross-sectional survey in the BMJ reporting the association between teenage pregnancies and practice characteristics in different areas (3). The results include this statement: “On multivariate analysis, practices with at least one female doctor, a young doctor, or more practice nurse time had significantly lower teenage pregnancy rates. Deprivation and fundholding remained significantly associated with higher teenage pregnancy rates.” The problem here is that we have no evidence that the age or sex of the doctors caused the lower rates of pregnancy, and the unexplained association with fund-holding practices having higher pregnancy rates should perhaps ring some alarm bells. No one suggested that the end of fundholding would solve the teenage pregnancy problem!

A fuller discussion of association and causation can be found in Follies and Fallacies in Medicine (Tarragon Press) [4] which I would recommend as both amusing and informative background reading for all registrars.


1. Barrett-Connor E, Grady D. Hormone replacement therapy, heart disease and other considerations. Annu Rev Public Health 1998;19:55-72

2. Hulley S, Grady D, Bush T et al. Randomised trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women. JAMA 1998;280:605-13.

3. Association between teenage pregnancy rates and the age and sex of general practitioners: cross sectional survey in Trent 1994-7. Julia Hippisley-Cox, Jane Allen, Mike Pringle, Dave Ebdon, Marion McPhearson, Dick Churchill, and Sue Bradley. BMJ 2000; 320: 842-845.

4. Follies and Fallacies in Medicine. Skrabanek and McCormick. Tarragon Press.

Asking a good question (Pulse Article 2001)

This article is part of a series on Critical Reading.

Where do you start when trying to judge papers in medical journals? All too often we are in a hurry and glance briefly at the title and then the conclusion of the abstract. However I would suggest that you try to get inside the mind of the writer of the article; try to work out why they carried out this piece of work. It is easier to do this if you have a structure to work to and I suggest using a four part question at this point.

  1. What are the characteristics of the Patients in the trial?
  2. What is the Intervention being studied?
  3. What is it Compared with?
  4. What Outcomes are measured?

Take a piece of paper and jot down the answers to the four questions shown in the box and you will have a neat summary of the question that your paper is trying to answer. You should have a note of the characteristics of the patients in the trial, the main intervention studied, what it was compared with and what outcomes were measured. You can remember the headings using the acronym PICO (Patient, Intervention, Comparison, and Outcome).

Is this an important question?

If you have been able to identify the four parts of the question that the paper is trying to answer the next thing to ask yourself is whether the answer is going to be relevant to you and the patients that you are looking after. Much research is driven by academic or industrial interest and the question may not be relevant to you.

All too often the outcomes chosen are surrogates that are easy to measure but may not reliably indicate whether the treatment will be of real benefit to the patient. Also the comparison may be with the wrong alternative treatment, or the patients in the trial may not be representative of those seen in your practice. Two examples may help to illustrate the point.

Antibiotics for Acute Otitis Media

There is no shortage of randomised controlled trials that have compared one antibiotic with another for the treatment of acute otitis media, and this is an important issue for pharmaceutical companies introducing new antibiotics. However the first question to answer is whether any antibiotic is needed at all, and this cannot be assessed from comparing two antibiotics with each other. What is needed is evidence from trials comparing antibiotic with placebo to decide how much overall difference they make, and indeed the evidence from all identified trials of this type showed limited benefit of antibiotics balanced by side effects from the treatment (1).

Nebulised Steroids in Asthma

Here again the crucial question is what nebulised steroids are compared with; the obvious alternative delivery method is using a spacer and metered-dose inhaler since the two delivery methods appear to be equally effective when used for delivery of beta-agonists in acute asthma (2). In spite of this there are very few randomised controlled trials that compare these two delivery methods for steroids. Nebulised fluticasone has been shown to reduce the requirements for oral steroids in severe asthmatics when compared with placebo, but to my mind this is not really the key issue. The costs of nebulised steroids are considerably more than using spacer delivery after all, so we need clear evidence of superiority against spacers not placebos in this instance.

In a nutshell

So in summary use the 4 part question to summarise what the paper is about and then decide if it is a question that is worth spending the time to read in more detail. Consider if the question is an important one and if it is you will then need to think about the validity of the research method used before taking too much notice of the results; this will be the subject of the next article in this series.


1. Del Mar C, Glasziou P, Hayem M. Are antibiotics indicated as initial treatment for children with acute otitis media? A meta-analysis. BMJ 1997;314:1526-1529

2. Cates C J, Rowe BH. Holding chambers versus nebulisers for beta-agonist treatment of acute asthma (Cochrane Review). In: The Cochrane Library, Issue 2, 2000. Oxford: Update Software.

Antibiotics ‘no use’ for acute cough: an example of biased reporting (Pulse Article 1999)

In a previous article I explained the advantages of reporting results of studies as an effect size with Confidence Intervals (usually 95%). The interval defines how certain the study result is in terms of its ability to predict the true average value of a treatment if it were to be given to everyone in the world with a certain condition. In the same issue of the BMJ in which Simon Chapman eloquently exposed the misuse of the lower confidence interval of data presented about the risk of passive smoking, a systematic review of evidence relating to antibiotics and acute cough was published.

Numbers Needed to Treat (NNT)

The review “Quantitative systematic review of randomised controlled trials comparing antibiotic with placebo for acute cough in adults” (Fahey T, Stocks N, Thomas T. BMJ 1998; 316: 906-910) carefully collected together the data from trials which addressed this important question. Nine trials were found but one was excluded because it did not fit the inclusion criteria, leaving eight trials with around 700 patients with results that could be analysed. The results are clearly presented as numbers needed to treat and harm in the Implications section of the paper and the authors calculated that “for every 100 people treated with antibiotic nine would report an improvement after 7-10 days if they visited their general practitioner but at the expense of seven who would have side-effects from the antibiotic. The resolution of illness in the remaining 84 people would not be affected by treatment with antibiotic.”

This information could be extremely useful in discussing with patients whether they need an antibiotic for their acute cough, although it should be noted that the majority of trials used doxycycline or Co-trimoxazole, which are perhaps not first choice antibiotics in this group of patients now. Unfortunately the reporting of the results earlier in the paper is not quite so elegant, and I wonder if the authors have been striving to push the figures into the form they want in order to obtain statistical significance. The diagram below shows the results from the meta-analysis for clinical improvement at day 7-11 and side effects of antibiotics.


To my mind these two effects are quite well balanced and fit with the description of the results for numbers needed to treat above. The authors however take a different view. They report the benefit of giving an antibiotic as being none (presumably because the 95% Confidence Interval includes the possibility of no difference as shown), whilst the possibility of side effects is reported as a non-significant increase. In view of the symmetry shown above this is not exactly even-handed. Moreover they then proceed to adjust the data by removing the only trial that showed an excess of side effects in the placebo group (which might be expected by chance in some trials with small numbers), and suddenly the non-significant trend reaches statistical significance!

All this makes me suspicious that the authors were keen to deliver the message that antibiotics are not much use in acute cough, and perhaps they have been a bit biased in the way that the results are displayed. This may not always be easy to spot in a paper, but it is certainly worth looking at the way results are reported when they take the form of a trend which does not reach significance as this may give clues about the authors’ views on the data.

Sensitivity Analysis, Sub-group analysis and Heterogeneity.

Sensitivity analysis is an expected part of meta-analysis and it involves excluding the data of lower quality to see whether the overall result is changed. It is also possible to carry out sub-group analysis to look for differences between different groups of patients or treatments, so for example the data could have been divided into trials which used Erythromycin as one sub-group, Doxycycline as a second group and Co-trimoxazole as a third. There are however dangers in data dredging and it is safest when specified in advance for a small number of sub-groups. It should also be pointed out that the sub-groups do not randomise one treatment against another and the protection against bias is lost in this type of comparison.

A final reason to split up the data is if significant heterogeneity is shown between the trials; normally this would be presented as a Chi-squared statistic for each outcome and hopefully will be accompanied by its p value. A simple shortcut when looking at the graphical display for the trials is to see whether the 95% Confidence Intervals all overlap; if they do not there are probably significant differences between the trials.
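Both the Chi-squared test for heterogeneity and the visual shortcut can be sketched in code. I am assuming here that the statistic reported is Cochran's Q (the weighted squared deviations of each trial from the pooled result, compared with a Chi-squared distribution on one fewer degrees of freedom than the number of trials). The effect sizes and standard errors are invented log risk ratios, and the function names are my own.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a Chi-squared variable, using the series
    expansion of the regularised incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2, x / 2
    term = total = 1 / a
    n = 0
    while term > total * 1e-12:
        n += 1
        term *= half_x / (a + n)
        total += term
    lower_tail = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1 - lower_tail)

def cochran_q(effects, standard_errors):
    """Return (Q, degrees of freedom, p value) for a set of trial results."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    return q, df, chi2_sf(q, df)

def intervals_all_overlap(effects, standard_errors):
    """The shortcut: do all the 95% Confidence Intervals share common ground?"""
    lows = [e - 1.96 * se for e, se in zip(effects, standard_errors)]
    highs = [e + 1.96 * se for e, se in zip(effects, standard_errors)]
    return max(lows) <= min(highs)

# Hypothetical log risk ratios and standard errors
consistent = ([-0.30, -0.25, -0.35], [0.10, 0.12, 0.15])
inconsistent = ([-0.60, 0.30, -0.10], [0.10, 0.10, 0.10])
for label, (effects, ses) in (("consistent", consistent), ("inconsistent", inconsistent)):
    q, df, p = cochran_q(effects, ses)
    print(f"{label}: Q = {q:.2f} on {df} df, p = {p:.3g}, "
          f"intervals overlap: {intervals_all_overlap(effects, ses)}")
```

The shortcut is cruder than the formal test, so the two will not always agree, but in clear-cut cases like these invented ones they point the same way.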

Reporting Results of Studies: can passive smoking really be good for you? (Pulse Article 1999)

Passive smoking and health risks.

“Passive smoking may be good for you” or so the tobacco companies would like us to believe! This idea arose from a misrepresentation of the confidence interval for data on passive smoking, and provides a good example of why we need a working knowledge of some statistics to deal with the propaganda that comes our way in General Practice. Sadly statistics is reported to be one of the subjects least liked by medical students, and those of us who have been in practice for more than a few years may be unfamiliar with some of the ways that results of studies are now reported. There has been a shift away from the use of p values towards Confidence Intervals (CI) in many medical journals, and the British Medical Journal now expects authors of papers to present data in this way.

Don’t forget common sense

Before going into more detail about the use of Confidence Intervals, it is worth noting that the claim about passive smoking quoted above may be swallowed by the public, and even in some cases by journalists, but hopefully most GPs would be suspicious that such a finding just does not make sense. It does not fit with all the other data that has emerged in the past 20 years, and therefore needs some further looking at. Never leave common sense behind when looking at statistical reports!

Confidence Intervals or P values

So what are Confidence Intervals all about and how did they get misused in this example? In general when research is undertaken the results are analysed with two separate questions in mind. The first is how big is the effect being studied (in this case how big is the risk of lung cancer for passive smokers)? The second question is how likely is it that the result is due to chance alone? The two issues are connected, because a very large effect is much less likely to have arisen purely by chance, but the statistical approach used is different depending on which question you are trying to answer. The “p” value will only answer the question “what is the chance that the study could show its result if the true effect was no different from placebo?” The Confidence Interval describes how sure we are about the accuracy of the trial in predicting the true size of the effect.

Both questions relate to the fact that we cannot know what the effect would be of a treatment or risk factor on everyone in the world; any study can only look at a sample of people who are treated or exposed to the risk. We then have to assume that if, say, one hundred identical studies were carried out in the same way on different groups of patients the results found would be normally distributed around the average effect size of the treatment. The larger the number of patients included in the trial, the closer the result of that trial is likely to be to the true effect in the whole population. The result of any particular trial can therefore be presented as showing an effect of a certain size, and the Confidence Interval describes the range of values between which you can be 95% certain that the true value lies.
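The idea of one hundred identical studies can be tried out directly. In this sketch the true risk of 30%, the study sizes and the function names are all assumptions chosen for illustration; the interval is the simple normal-approximation interval for a proportion.

```python
import math
import random
import statistics

def simulate_studies(true_risk, patients_per_study, n_studies=100, seed=0):
    """Observed event rate in each of n_studies studies of the same size."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_studies):
        events = sum(rng.random() < true_risk for _ in range(patients_per_study))
        results.append(events / patients_per_study)
    return results

def ci95(observed_risk, n):
    """Approximate 95% Confidence Interval for a proportion."""
    se = math.sqrt(observed_risk * (1 - observed_risk) / n)
    return observed_risk - 1.96 * se, observed_risk + 1.96 * se

TRUE_RISK = 0.30  # hypothetical true value in the whole population
for n in (50, 500, 5000):
    observed = simulate_studies(TRUE_RISK, n)
    spread = statistics.stdev(observed)
    covered = sum(low <= TRUE_RISK <= high
                  for low, high in (ci95(p, n) for p in observed))
    print(f"{n:>5} patients per study: spread of results {spread:.3f}, "
          f"{covered} of {len(observed)} intervals contain the true risk")
```

The spread of results shrinks as the studies get larger, and most (though not all) of the individual intervals contain the true value.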

The data on Passive Smoking

Perhaps this can be illustrated with the passive smoking data. A study of passive smoking in seven European countries showed that there was an extra risk of developing lung cancer of around 16% for non-smokers who were exposed to smoke in the workplace or who had a spouse who smoked. This was based on comparing 650 lung cancer cases with 1542 controls in Europe and was accompanied by an estimate that 1100 deaths occurred each year in the European Union as a result of passive smoking.

The 95% Confidence Interval associated with this data is shown in the diagram and the tobacco industry had just chosen to highlight the lower end of the Confidence Interval, which shows a small chance that passive smoking could be associated with a 7% lower rate of lung cancer! Unsurprisingly they did not report the equal chance that the risk may be as high as 44% more lung cancer in passive smokers, and the Sunday Telegraph swallowed the story whole. More details are provided in the excellent article by Simon Chapman in the BMJ 1998;316:945.

Gardner and Altman mention this danger in their book “Statistics with Confidence”, and they suggest that results should be presented with the effect size, confidence interval and p value to prevent this kind of misunderstanding. The first two chapters are well worth reading if you want a fuller understanding of the rationale behind the use of Confidence Intervals. A final point about the Confidence Interval is that when it crosses the no-difference line (as shown in the diagram above) then the results do not reach significance at the level chosen (usually 5%).

Simon Chapman points out, however, that a meta-analysis in the BMJ in the 18 October 1997 issue compared 4626 cases with 477924 controls and showed a 24% excess risk of lung cancer in non-smokers living with smokers. The 95% Confidence Interval was 13% to 36%, which is well clear of the no-difference line and hence highly statistically significant, with a p value of <0.001. Again this data was conveniently ignored.
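The link between a Confidence Interval and a p value can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the percentages quoted in the text are treated as risk ratios (16% extra risk becomes 1.16, 7% lower becomes 0.93, and so on), the interval is assumed to be symmetrical on the logarithmic scale, and the function names p_value_from_ci and crosses_no_difference are invented for this example. It is not the calculation used by the authors of either study.

```python
import math


def p_value_from_ci(estimate, lower, upper):
    """Approximate two-sided p value from a ratio and its 95% Confidence Interval.

    Assumes the interval is symmetrical on the log scale (an assumption).
    """
    std_error = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    z = math.log(estimate) / std_error
    # Two-sided tail area of the normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def crosses_no_difference(lower, upper):
    """True if the interval includes a ratio of 1 (the no-difference line)."""
    return lower <= 1.0 <= upper


# Figures quoted in the text, re-expressed as ratios (an assumption).
european_study = (1.16, 0.93, 1.44)
meta_analysis = (1.24, 1.13, 1.36)

for name, (est, lo, hi) in [("European study", european_study),
                            ("Meta-analysis", meta_analysis)]:
    print(name, "crosses no-difference line:", crosses_no_difference(lo, hi),
          "approximate p value:", p_value_from_ci(est, lo, hi))
```

On these assumptions the European study gives a p value well above 0.05, which fits an interval that crosses the no-difference line, while the meta-analysis gives a p value far below 0.001.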

The moral of the story is that you cannot believe something just because you read it in the newspaper. As far as the advantages of passive smoking are concerned, they can join the other myths and misunderstandings documented in one of my favourite books, Follies and Fallacies in Medicine by Skrabanek and McCormick.

Statistics with Confidence. MJ Gardner and D Altman. BMJ Publishing, 1989.

Follies and Fallacies in Medicine. Skrabanek and McCormick. Tarragon, 1998.


Can you trust what you read? Why we need Randomised Trials (Pulse Article 1999)

How can you tell if a paper is reliable? This was the question that many of the registrars wanted to have answered at a recent half-day release session on critical reading.

The Challenge of Archie Cochrane

Before he died Archie Cochrane expressed his sadness that no-one had gathered together the most reliable data available so that it could be used as a basis for practice and research in Health Care. In response to this challenge the Cochrane Collaboration has emerged as a group of dedicated individual doctors and health care professionals who have set out to collect together data from controlled clinical trials, and summarise what they have found in the form of systematic reviews. The Collaboration is an international organisation and is structured by health problem areas to avoid duplication of effort. Many journals have been hand-searched to identify controlled trials, and the reviews are structured so as to reduce bias at each stage of the process (which includes a ban on drug companies sponsoring individual reviews). In the UK many of the editorial bases are funded through the NHS Research and Development Programme.

The output of the Collaboration, which includes a database of over 250,000 controlled trials and over 600 systematic reviews, is published in electronic form in the Cochrane Library, and a future article in this series will give an example of how it can be used.

The place of RCTs

There is considerable misunderstanding at this point about the place of Randomised Controlled Trials (RCTs). It would be unfair to say that you should not bother to read anything that is not an RCT, but it is also true to say that the most reliable way to study causation is with a systematic review of randomised controlled trials.

The way I like to look at the issue of randomisation is as follows; ask yourself the question “Could this trial have been randomised?” If you decide that it could have been randomised but it was not, then a large question mark should be placed over conclusions about whether the paper can reliably answer any questions related to the intervention causing good or bad outcomes.

Evidence for HRT

Hormone Replacement Therapy (HRT) is a good current example of this. Most of the current evidence relating to the purported benefits of HRT comes from non-randomised studies, and the results are therefore likely to be biased by differences between the type of women who opt for HRT and those who do not. An excellent editorial in the British Journal of General Practice in 1998 presents the current state of play in this area and is recommended reading. Randomised Controlled Trials are currently under way to assess the effects of using HRT, but these will not report findings for a few years yet.

Sometimes you cannot randomise

There are of course some areas in which Randomisation is either impossible or unethical; you could not carry out a trial in which patients were randomised into cigarette-smoking or not! The very strong evidence on the dangers of smoking comes from large well conducted cohort studies, which are quite enough to leave little doubt about the size of the dangers involved.

Does it matter how you do it?

Whilst on the subject of Randomisation, how it is done matters too. The technical term for keeping the next treatment allocation hidden from whoever is recruiting patients is “allocation concealment”. If you read reports of older trials, the allocation often used to be decided by using the patient’s hospital number, or even by alternating between treatments, neither of which conceals anything. It has been shown that trials with inadequate allocation concealment of this sort tend to show larger benefits of the intervention under study and it is not too difficult to imagine why.

Magic cure for warts?

Imagine you have developed a new treatment for removing warts and you arrange a trial to test it against one of the current methods. A patient walks into your surgery with a whole mass of horrible looking large warts, which you think that no treatment on earth will remove, and you can tell from the unconcealed alternate or random allocation that they would be next in turn to receive your new technique. What will you do? Human nature is such that you will find some reason that this person will not quite fit into the trial and you will move on to another patient who has a nice small wart to treat next time. Obviously in this instance the advantage of randomisation in removing bias in the allocation process has been lost.
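The wart story can be turned into a small simulation to show how much damage this sort of selection can do. The sketch below is written in Python and every number in it is made up for illustration: half the imaginary patients have severe warts, the cure rates are 20% for severe and 70% for mild warts, and the new treatment is given no real advantage at all. The function name run_trial and the exclusion rule are assumptions chosen to mimic the scenario described above, not a description of any actual trial.

```python
import random


def run_trial(concealed, n_patients=20000, seed=0):
    """Return the observed cure rate in each arm of an imaginary wart trial.

    Both treatments are equally effective by construction (an assumption),
    so any difference in cure rates comes from who was let into the trial.
    """
    rng = random.Random(seed)
    cured = {"new": 0, "standard": 0}
    treated = {"new": 0, "standard": 0}
    for _ in range(n_patients):
        severe = rng.random() < 0.5            # invented: half have severe warts
        arm = rng.choice(["new", "standard"])  # the next allocation in the sequence
        # Without concealment the clinician can see the next allocation and
        # quietly leaves severe cases out when the new treatment is due.
        if not concealed and severe and arm == "new":
            continue
        cure_chance = 0.2 if severe else 0.7   # invented, identical in both arms
        treated[arm] += 1
        if rng.random() < cure_chance:
            cured[arm] += 1
    return {arm: cured[arm] / treated[arm] for arm in cured}


print("Concealed allocation:  ", run_trial(concealed=True))
print("Unconcealed allocation:", run_trial(concealed=False))
```

With concealed allocation the two cure rates should come out almost identical, as they ought to for two equally good treatments. With unconcealed allocation the new treatment appears clearly better, purely because the difficult patients never reached it.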

A better way to do it

When assessing the allocation concealment in Randomised trials I would look for at least opaque sealed envelopes which contain a random sequence of numbers to determine the next patient’s treatment. Even better would be a separate centre (such as the hospital pharmacy) to randomly allocate the treatment to the patient after the decision has been made to include the patient in the trial.
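The idea of a separate centre can be sketched in Python as follows. This is a hypothetical illustration only: the class name CentralAllocator, the treatment labels and the patient identifiers are invented, and simple randomisation (the equivalent of tossing a coin for each patient) is an assumption chosen for brevity. The point it tries to show is that the allocation is generated only after the patient has been registered in the trial, so there is no list for the recruiting doctor to look at in advance.

```python
import random


class CentralAllocator:
    """Hypothetical stand-in for a separate randomisation centre."""

    def __init__(self, treatments=("new", "standard"), seed=None):
        self._rng = random.Random(seed)
        self._treatments = tuple(treatments)
        self._log = {}  # patient identifier -> allocated treatment

    def enrol(self, patient_id):
        """Register a patient first, and only then generate the allocation."""
        if patient_id in self._log:
            # Refusing repeat requests stops anyone re-drawing an unwelcome result.
            raise ValueError("patient already enrolled: " + str(patient_id))
        allocation = self._rng.choice(self._treatments)
        self._log[patient_id] = allocation
        return allocation

    def allocation_for(self, patient_id):
        """Look up what an enrolled patient was allocated (an audit trail)."""
        return self._log[patient_id]


centre = CentralAllocator(seed=42)
first = centre.enrol("P001")
second = centre.enrol("P002")
print("P001 receives:", first)
print("P002 receives:", second)
```

Because the allocation does not exist until the patient has been entered, and a second request for the same patient is refused, the doctor in the wart example would have no way of steering awkward patients away from the new treatment.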

Try it yourself

So next time you are reading a paper, after you have asked yourself what is the question the paper was trying to answer, just pause to consider whether the trial could have been randomised. If it was randomised how easy would it have been to tell which treatment the next patient was getting? If you are satisfied on both of these fronts then read on, and if not perhaps move on to another paper.


1) Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12

2) Hannaford PC. Is there sufficient evidence for us to encourage the widespread use of hormone replacement therapy to prevent disease? BJGP 1998;48(427):951-2

Further Reading

So what’s so special about randomisation? Kleijnen J, Gotzsche P, Kunz RA, Oxman AD, Chalmers I. Chapter 5 in Non-random reflections on Health Services Research (Eds Maynard A and Chalmers I) BMJ Publishing Group 1997, pp93-106.