Summary
Systematic reviews and meta-analyses have long been considered to be at the top of the evidence-based medicine hierarchy and are frequently used to inform clinical practice and future research. The number of published systematic reviews continues to increase annually, with some estimates suggesting that 11 systematic reviews are published daily in the medical literature 1, 2. Despite their growing numbers, a large proportion of systematic reviews and meta-analyses are unnecessary, misleading and poorly conducted and reported 3. One key issue that has previously been discussed in Anaesthesia is the interpretation of systematic reviews and meta-analyses with sparse data, which often result in false-positive (type-1 error) and false-negative (type-2 error) findings 4. Over the past decade, trial sequential analysis (TSA) has emerged as an attractive statistical method to address this issue. It combines conventional meta-analytical techniques with statistical monitoring boundaries that set thresholds for declaring significance, taking into account both the impact of multiple testing and the amount of information already available in the meta-analysis. Increasing numbers of meta-analyses, including one in a recent issue of Anaesthesia by Grape et al. 5, are now incorporating TSA into their methods. In this article, we aim to provide the reader with a basic understanding of the principles underlying TSA and its interpretation.
The random variation and imprecision in the results of meta-analyses with sparse data (i.e. small numbers of trials or events) are more likely to lead to incorrect conclusions 6, 7. Such meta-analyses are often regularly updated with data from further trials and are, therefore, subject to repeated significance testing.
This further increases the likelihood of a type-1 error, a phenomenon known as 'multiplicity due to repeated significance testing' 8, which is also evident in randomised controlled trials, where repeated testing of accumulating data increases the overall risk of a type-1 error 9. Previous work has suggested that the actual risk of a type-1 error in meta-analyses may range from 10% to 30%, meaning that between 1 and 3 out of 10 interventions may be falsely reported as beneficial (or useless) 7, 10.
A simple and useful way to start thinking about TSA is to draw an analogy with the methods and conduct of a randomised controlled trial. For a clinical trial, investigators derive a sample size calculation based on the following assumptions: (1) the event rate in the control group; (2) the anticipated effect size of the intervention; (3) the accepted risk of a type-1 error (typically < 5%); and (4) the desired statistical power (typically at least an 80% chance that an effect at least as large as the anticipated effect will be detected if it exists). Trial sequential analysis requires these same assumptions to derive a power calculation for a meta-analysis, which is often termed the 'required information size'. Much like a clinical trial, these assumptions should be pre-specified in the systematic review protocol.
Several methods, such as TSA, sequential meta-analysis using Whitehead's triangular test, Bayesian methods and the law of the iterated logarithm, have been proposed to mitigate the risk of misinterpreting random error in meta-analyses with sparse data, as they provide more conservative thresholds for declaring statistical significance 8, 11, 12.
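To make the analogy with a trial sample size calculation concrete, the sketch below turns the four assumptions listed above into a required information size for a dichotomous outcome. It is a minimal illustration only, using a standard fixed-effect two-group approximation with no adjustment for heterogeneity. The function name, the parameter names and the example values (a 20% control event rate and a 20% relative risk reduction) are our own assumptions for illustration and are not taken from any of the cited meta-analyses or from the TSA software.

```python
from math import ceil
from statistics import NormalDist


def required_information_size(control_event_rate, relative_risk_reduction,
                              alpha=0.05, power=0.80):
    """Approximate total participants needed (both groups combined).

    Fixed-effect approximation for a dichotomous outcome; heterogeneity
    between trials is NOT accounted for here (illustrative assumption).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type-1 error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type-2 error

    p_control = control_event_rate
    p_intervention = p_control * (1 - relative_risk_reduction)
    p_average = (p_control + p_intervention) / 2

    variance = p_average * (1 - p_average)  # variance of a single observation
    delta = p_control - p_intervention      # anticipated absolute risk difference

    return ceil(4 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Hypothetical assumptions: 20% control event rate, 20% relative risk reduction
print(required_information_size(0.20, 0.20))              # 5% type-1 error, 80% power
print(required_information_size(0.20, 0.20, power=0.90))  # same, but 90% power
```

Under these hypothetical assumptions the first call returns just under 2900 participants, and asking for more power or a smaller anticipated effect pushes the figure up, which is why pre-specifying the assumptions matters.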
Trial sequential analysis has previously been described as a hybrid position between frequentist and Bayesian approaches, with the sequential analysis arising from frequentist statistics, and the Bayesian component arising from a single a priori effect estimate of the intervention (although Bayesian analysis incorporates multiple prior distributions with different anticipated effect estimates, for example, sceptical, realistic and optimistic priors) 13, 14. As discussed previously, the risk of misinterpreting random error increases when data are sparse. Using the assumptions used to create a required information size, TSA considers this risk and adjusts significance thresholds accordingly. The monitoring thresholds are built as a representation of the strength of the evidence and rest on an assumption that the amount of evidence will continue to accumulate until either a monitoring boundary (or significance threshold) is crossed, or the required information size is reached. If the accrued information size is less than the required information size, a stricter significance threshold is applied. Trial sequential analysis can display this threshold with wider confidence intervals, often reported in manuscripts as TSA-adjusted confidence intervals, which add more transparency to the uncertainty of the point estimate. Similarly, as the accrued information size approaches the required information size, thresholds become more relaxed and TSA-adjusted confidence intervals narrower (Fig. 1). The cumulative Z value in TSA represents the summary test statistic of all included trials and a new Z value is calculated each time a new trial is added. The Z value is an estimation of the random error in the data and a greater Z value (i.e. a lower p value) makes it less likely that the data are spurious or taken from a population where the null hypothesis is true. 
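The way the cumulative Z value is updated as trials accrue can be sketched in a few lines. The example below is a simplified illustration that assumes a fixed-effect, inverse-variance model; the trial estimates and standard errors are hypothetical, and the function names are ours rather than part of the TSA software. It does not compute the monitoring boundaries themselves, which require the information fraction and numerical methods.

```python
from math import sqrt
from statistics import NormalDist


def cumulative_z(estimates, std_errors):
    """Return the cumulative Z value after each successive trial is added.

    Uses fixed-effect inverse-variance pooling (an assumption made for
    simplicity; a random-effects model would weight trials differently).
    """
    z_values = []
    sum_weights = 0.0
    sum_weighted_estimates = 0.0
    for estimate, se in zip(estimates, std_errors):
        weight = 1.0 / se ** 2
        sum_weights += weight
        sum_weighted_estimates += weight * estimate
        pooled = sum_weighted_estimates / sum_weights
        pooled_se = sqrt(1.0 / sum_weights)
        z_values.append(pooled / pooled_se)
    return z_values


def two_sided_p(z):
    """Conventional (unadjusted) two-sided p value for a Z value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical trials: mean differences in pain score and their standard errors
effects = [-0.9, -0.4, -0.8, -0.6]
standard_errors = [0.50, 0.40, 0.30, 0.25]

for k, z in enumerate(cumulative_z(effects, standard_errors), start=1):
    print(f"after trial {k}: Z = {z:.2f}, unadjusted p = {two_sided_p(z):.4f}")
```

Each printed line corresponds to one point on the cumulative Z curve. In a TSA, these points are compared with the monitoring boundary at the corresponding accrued information size rather than with the fixed conventional threshold of 1.96.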
An estimation of heterogeneity (or diversity) is required to account for differences in trial populations, study designs and interventions, which is similar to adjustments for variations across centres in a multi-centre trial. Such heterogeneity reduces the precision of the results and increases the required information size. D² is the measure of diversity used in TSA. It is a measure of between-trial variation, similar to the I² used in conventional meta-analysis, but may mathematically be a better alternative to I² when considering variation in any random-effects meta-analysis, particularly when data are sparse 15. Assumptions regarding the chosen values of D², including any sensitivity analyses with varying D² values, should ideally also be pre-specified.
Trial sequential analysis can also be used to construct futility boundaries. These were originally developed for interim analyses in randomised trials, to allow early termination of trials if unexpectedly large differences arose between treatment groups, saving time and resources and minimising participants' exposure to the inferior treatment. Such analyses should be considered at the design stage, and incorporated into the study protocol and/or statistical analysis plan before any trial is started. Similarly, if a meta-analysis of an intervention has concluded that there is no evidence of an effect, we need to know whether this was due to a lack of statistical power, or because the intervention is truly unlikely to have any effect. Analogous to interim analyses for clinical trials, TSA requires a pre-specified minimum desired effect size to construct futility boundaries, which provide a threshold for detecting the lack of an anticipated effect that is large enough to be clinically meaningful. In other words, they indicate when the anticipated effect could be considered as being unobtainable.
Above this threshold, there is still a possibility that a statistically significant effect will be found, but below this threshold, it is extremely unlikely that an effect as large as the anticipated effect, given the constraints of power and statistical thresholds, will be found. In such instances conducting future trials is futile. For readers wishing to learn more about the underlying principles and conduct of TSA, the software and manual can be freely downloaded from www.ctu.dk/tsa. Trial sequential analysis can be applied to analyses on dichotomous data and on the mean difference of continuous data but not on standardised mean differences. Figure 1 demonstrates the components of two-sided TSA. Unfortunately, meta-analyses of anaesthetic interventions commonly have few data to draw on. Imberger et al., in a review of 50 randomly selected meta-analyses of anaesthetic interventions, observed that the median number of included trials was 8, the median (IQR [range]) number of participants was 964 (523–1736 [99–11 172]) and the median number of participants with the outcome of interest was 202 (96–443 [26–5762]). After applying TSA, only 6 out of 50 (12%) meta-analyses had sufficient power of greater than 80% and only 16 out of 50 (32%) preserved their risk of a type-1 error of less than 5% 16. An example of the utility of TSA was highlighted in a recent meta-analysis, published in Anaesthesia, which aimed to evaluate the effect of paravertebral block on the prevalence of persistent postsurgical pain after breast surgery 17. Two previous meta-analyses had demonstrated that paravertebral block reduced the odds of persistent postoperative pain after 6 months, but only included two 18 and four 19 studies. Heesen et al. updated these and included seven studies 17, but observed no statistically significant risk reduction for chronic postoperative pain using conventional meta-analysis. 
Using TSA, they demonstrated that the available evidence was not sufficient to reach a conclusion. To detect a pre-specified relative risk reduction of 20% in chronic postoperative pain at 3 months, only 317/1734 (18%) of the required information size was reached. For clinical trialists, this approach is potentially useful as it can provide information on the required number of participants in future trials to 'plug the gap'. Similar examples have also been reported in other specialties 20.
Trial sequential analysis is a complex statistical tool that can be misused and has been criticised. Despite its possible benefits, its application in meta-analyses is not universal. Even within the Cochrane Collaboration, with its standardised methods, reviews on topics of clinical relevance to anaesthetists have not routinely applied TSA in the past, with some choosing to 21, 22 and others not 23, 24. In fact, the most recent Cochrane Scientific Committee Expert Panel recommended against the routine use of sequential methods for updated meta-analyses 25. The Panel argued that although systematic reviews are able to address the effect of an intervention on different outcomes and on different sub-groups, sequential methods such as TSA cannot accommodate multiple different thresholds for different outcomes. They are often based on a particular outcome that may not be of interest to all stakeholders. Others have argued along similar lines, commenting that TSA is likely to be performed on the primary outcome only, meaning that the risk of spurious findings will still persist for reported secondary outcomes 26. The Panel also argued that although similarities have been drawn between TSA and the conduct of a clinical trial, especially with regard to futility boundaries which assist data monitoring committees, meta-analyses are retrospective and observational by nature.
The meta-analyst is, therefore, unable to control for the trials that have already been performed which are eligible for the meta-analysis. It is impossible to create a retrospective sequential programme that would maintain the pre-specified assumptions of a TSA.
Knowledge and transparency of the assumptions used when performing TSA are critical. However, the complexity of statistical methods creates a veneer of certainty for naïve analysts and their readers, which can lead them to gloss over the many assumptions and judgements that must be made in the process of performing any meta-analysis. Variations in these assumptions can significantly affect the required information size. There may be disagreements about acceptable type-1 and type-2 error rates, what constitutes an anticipated effect size and whether it is of clinical relevance. Sceptical or conservative a priori effect estimates do not take into consideration the effect already obtained from accrued data and can lead to unrealistically large required information sizes. While controlling for type-1 errors, TSA may unintentionally increase the rate of type-2 errors, that is, falsely concluding that there is no effect when one exists.
In the meta-analysis by Grape et al., the authors performed TSA on their primary outcome of mean postoperative pain score at 2 h. They state that "trial sequential analysis indicated that firm evidence was reached and that dexmedetomidine was superior to remifentanil". The anticipated effect size appears to be a mean difference of −0.7 and variance of 1.17, which is the point estimate in their forest plot. The authors do not state whether or not this approach was defined a priori. The required information size to detect this difference was 657 participants, and the accrued information size in this meta-analysis (672 participants) has already surpassed this and has crossed the sequential monitoring boundary for benefit.
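The sensitivity of the required information size to its inputs can be explored with a short calculation. The sketch below assumes the usual two-group formula for a continuous outcome, inflated by a factor of 1/(1 − diversity) to allow for between-trial variation. The function and parameter names are ours. The diversity value of 89% is the rounded figure quoted later in this article, and applying it to the primary analysis is our assumption, so the output should land near, not exactly on, the reported 657 participants.

```python
from statistics import NormalDist


def required_information_size_continuous(mean_difference, variance,
                                         alpha=0.05, power=0.80,
                                         diversity=0.0):
    """Approximate total participants needed for a continuous outcome.

    diversity is the assumed between-trial diversity as a proportion
    (0 <= diversity < 1); 0 gives the unadjusted, fixed-effect figure.
    """
    if not 0.0 <= diversity < 1.0:
        raise ValueError("diversity must be in the range [0, 1)")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type-1 error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type-2 error

    unadjusted = 4 * (z_alpha + z_beta) ** 2 * variance / mean_difference ** 2
    return unadjusted / (1 - diversity)            # inflate for heterogeneity


# Assumptions as they appear in the text: mean difference -0.7, variance 1.17
print(required_information_size_continuous(-0.7, 1.17))                  # no heterogeneity
print(required_information_size_continuous(-0.7, 1.17, diversity=0.89))  # with diversity

# Changing a single assumption (a smaller anticipated difference of -0.5)
print(required_information_size_continuous(-0.5, 1.17, diversity=0.89))
```

Without any allowance for heterogeneity, the calculation gives roughly 75 participants. The diversity adjustment raises this to several hundred, and a modestly smaller anticipated effect raises it further, which illustrates how strongly the conclusion of 'firm evidence' depends on the chosen assumptions.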
To illustrate the point of how information sizes can vary with different assumptions, Fig. 2 shows a TSA where the required information size was calculated with the same type-1 error rate (5%) and power (80%), an anticipated mean difference in pain score of −0.5 and heterogeneity estimated with a diversity of 89%. The graph now shows that the required information size using these assumptions would be 1289, which is almost double the number accrued so far. The mean (95% CI) reduction in pain score with TSA was 0.7 (1.44 to −0.03). Given the possibly greater impact of meta-analyses on policy and practice, one could make a case that the thresholds for meta-analyses should be higher than those for clinical trials. If we repeat the primary TSA by Grape et al. but with 90% power instead, Fig. 3 shows a TSA where the required information size is now 885 participants. As the trial sequential monitoring boundary has been crossed for statistical significance, we can interpret the evidence as being conclusive. Perhaps the most important question is whether a mean reduction of 0.7 in pain score at two postoperative hours is an important outcome of relevance to patients.
In summary, TSA is becoming increasingly popular and provides more information around uncertainty and imprecision in meta-analyses with sparse data. We provide a brief checklist (Table 1) on what to consider when interpreting the results of a TSA, while also bearing in mind the concerns raised by the Cochrane Collaboration. We strongly encourage review authors to work with experienced methodologists when performing sequential methods, and editors to ensure that the assumptions underlying trial sequential analyses are transparent and clearly conveyed when reviewing and publishing manuscripts. It is a complex statistical tool, and, to quote the software engineer Grady Booch, "a fool with a tool is still a fool".
AS is a Trainee Fellow of Anaesthesia and is being supported by an NIHR Doctoral Research Fellowship (DRF-2017-10-094). AS is an editor of Anaesthesia and Co-ordinating Editor of the Cochrane Anaesthesia Review Group.