Authors
Loes M. Hollestein, Serigne Lo, Jo Leonardi-Bee, Saharon Rosset, Noam Shomron, Dominique-Laurent Couturier, Sonia Gran
Abstract
Research articles typically present the results of several hypothesis tests and often state that 'all tests with P-values < 0·05 were considered statistically significant'. This ignores the fact that multiple tests were performed, which can induce false-positive findings. Indeed, when multiple true null hypotheses are tested, the probability of rejecting at least one null hypothesis [referred to as the overall type I error rate or family-wise error rate (FWER)] increases with the number of tests. For instance, if 20 independent statistical tests are performed at the 0·05 significance level in a scenario in which all null hypotheses are true, the probability of rejecting at least one null hypothesis is almost 65%. This inflation of the type I error rate, known as the multiple testing problem or multiplicity, constitutes a real challenge for researchers and partly explains the lack of reproducibility of scientific findings.1

Many procedures have been developed to overcome multiplicity.2 Owing to its simplicity, the most widely used approach is the Bonferroni procedure, in which the type I error level for each test equals the target overall type I error level (usually 0·05) divided by the number of tests. This correction leads to an FWER close to the target overall type I error level when all tests are independent, but it is known to be overly conservative when the tested hypotheses are related, leading to an unnecessary loss of power (i.e. a lower probability of finding true associations). Therefore, multiplicity correction methods that take the dependence between tests into account are generally preferred in order to gain power (e.g. resampling methods such as bootstrap and permutation tests).3, 4 When the number of tests is very large, as in omics studies (e.g. genomics or transcriptomics), control of the false discovery rate (FDR; i.e. the expected proportion of falsely rejected null hypotheses among all rejected null hypotheses) is usually preferred to control of the FWER, as it allows notable gains in power.5 The choice of method depends on the type of study and the hypotheses to be tested. The aim of this editorial is to briefly discuss the use of multiplicity correction in different contexts and to state the multiplicity requirements for publication in the BJD.
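A minimal numerical sketch of this inflation and of the Bonferroni correction, using the figures from the example above (the code is illustrative and not part of the editorial):

```python
# Minimal sketch: family-wise error rate (FWER) inflation under multiple
# independent tests, and the Bonferroni correction. Numbers follow the
# example in the text (20 independent tests, alpha = 0.05).

alpha = 0.05   # per-test significance level
m = 20         # number of independent tests, all null hypotheses true

# Probability of at least one false positive among m independent tests:
# FWER = 1 - (1 - alpha)^m
fwer = 1 - (1 - alpha) ** m
print(f"Uncorrected FWER for {m} tests: {fwer:.3f}")   # ~0.642, 'almost 65%'

# Bonferroni: test each hypothesis at alpha / m to keep the FWER <= alpha.
alpha_bonf = alpha / m
fwer_bonf = 1 - (1 - alpha_bonf) ** m
print(f"Per-test level after Bonferroni: {alpha_bonf:.4f}")  # 0.0025
print(f"Resulting FWER: {fwer_bonf:.3f}")                    # ~0.049
```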
Sample sizes of clinical trials are based on a single primary endpoint or on coprimary endpoints.6, 7 A trial with coprimary endpoints is considered negative if the result for any of the coprimary endpoints is not significant. The use of multiple primary endpoints for a given sample size therefore induces a loss of power, but does not increase the type I error rate. In addition to the primary endpoint(s), a set of secondary and exploratory endpoints, for which no a priori sample size calculation was performed, is usually tested as well. In order to prevent false-positive findings among the secondary endpoints, a clear distinction should be made between the true secondary endpoints (which may support the primary endpoint and/or show additional effects after success of the primary endpoint) and the exploratory endpoints (hypothesis-generating endpoints, or endpoints with very low event rates).7 Hypothesis testing for exploratory endpoints is not recommended,6 but the type I error rate should be controlled for secondary endpoints, typically by means of an FWER approach.

If there is no effect on the primary endpoint(s), no effect on related secondary endpoints may be expected, so one may decide to stop statistical testing after a nonsignificant result (a fixed-sequence or serial-gatekeeping approach; see the first sketch below).8 Endpoints may also be grouped into families (e.g. a family of multiple effectiveness outcomes and a family of multiple quality-of-life scores). All endpoints within a family can then be tested with a correction for multiple comparisons, and one only proceeds to the next family when there is statistical success in the preceding family (a fixed-sequence approach applied to families).

Omics studies investigate the relationship between a particular type of sample molecule and a sample attribute. Examples are genome-wide association studies, in which a large set of single-nucleotide polymorphisms is tested for association with an outcome of interest (e.g. skin cancer), or RNA-Seq experiments, in which differences in gene or protein expression between conditions (e.g. treated vs. not treated) are investigated. As such studies typically involve hundreds to millions of (usually dependent) simultaneous tests, FWER control of the type I error would lead to a drastic loss of power, which explains why FDR approaches are preferred: they control the fraction of false discoveries among the rejected hypotheses.9 The most commonly used FDR multiplicity correction is the one introduced by Benjamini and Hochberg (sketched below), which is valid for independent10 or positively dependent test statistics,11 such as test statistics (positively) correlated because measurement errors affect all or some parameters of interest in a common way. As other dependence structures may be observed in practice, an FDR approach valid under more general dependence structures was later introduced by Benjamini and Yekutieli, at the price of some loss of power.11

False-positive findings may also occur in studies where subgroup analyses are performed without multiplicity adjustment (e.g. a meta-analysis stratified by timepoints of an outcome). As such analyses typically involve tests on correlated outcomes and/or comparisons repeatedly involving the same groups, a resampling-based FWER multiplicity correction (such as the permutation approach sketched below) would provide the greatest power. To maintain high power, a limited number of subgroup analyses should be prespecified in the protocol, and the subgroups chosen should be based on a clear hypothesis with a pre-existing biological rationale.
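A hypothetical sketch of the fixed-sequence (serial gatekeeping) procedure referred to above; the endpoint names and p-values are invented for illustration:

```python
# Hypothetical sketch of a fixed-sequence (serial gatekeeping) procedure:
# endpoints are tested in a prespecified order, each at the full level alpha;
# testing stops at the first non-significant result, which keeps the FWER
# at alpha without lowering the per-test level.

def fixed_sequence(ordered_results, alpha=0.05):
    """Return the endpoints declared significant under fixed-sequence testing.

    ordered_results: list of (endpoint_name, p_value) in prespecified order.
    """
    significant = []
    for name, p in ordered_results:
        if p < alpha:
            significant.append(name)
        else:
            break  # stop: later endpoints are not formally tested
    return significant

# Invented p-values, ordered by prespecified clinical importance:
results = [("primary", 0.010), ("secondary_1", 0.030),
           ("secondary_2", 0.200), ("secondary_3", 0.001)]
print(fixed_sequence(results))  # ['primary', 'secondary_1']
```

Note that secondary_3 is never tested, despite its small p-value, because the sequence stops at secondary_2.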
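A minimal sketch of the Benjamini-Hochberg step-up procedure described above; the p-values are invented for illustration:

```python
import numpy as np

# Minimal sketch of the Benjamini-Hochberg step-up procedure for FDR
# control (valid for independent or positively dependent test statistics).

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices sorting p ascending
    thresholds = q * np.arange(1, m + 1) / m   # threshold i/m * q for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])         # largest rank meeting its threshold
        reject[order[: k + 1]] = True          # step-up: reject all smaller p-values too
    return reject

# Invented p-values, e.g. from a small omics screen:
p_vals = [0.041, 0.001, 0.060, 0.008, 0.205, 0.039, 0.074, 0.042]
print(benjamini_hochberg(p_vals, q=0.05))  # rejects only 0.001 and 0.008
```

In practice a library routine would typically be used instead, e.g. multipletests(p_vals, alpha=0.05, method='fdr_bh') from statsmodels, whose 'fdr_by' option implements the more conservative Benjamini-Yekutieli variant mentioned above.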
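A sketch of a resampling-based FWER correction of the kind recommended above for subgroup analyses: a max-T permutation test for a two-group comparison on several correlated outcomes. The group sizes, effect sizes and number of permutations are illustrative assumptions, not taken from the editorial:

```python
import numpy as np
from scipy import stats

# Sketch of a max-T permutation test: an FWER correction that respects
# the correlation between outcomes, unlike a Bonferroni adjustment.

rng = np.random.default_rng(1)
n_per_group, n_outcomes = 30, 5

def correlated_normal(mean, n, k, rng):
    """Simulate n observations of k positively correlated outcomes."""
    shared = rng.normal(0.0, 1.0, (n, 1))  # common factor -> correlation
    return mean + 0.6 * shared + 0.8 * rng.normal(0.0, 1.0, (n, k))

x = correlated_normal(0.0, n_per_group, n_outcomes, rng)  # e.g. control
y = correlated_normal(0.5, n_per_group, n_outcomes, rng)  # e.g. treated

def abs_t(a, b):
    """Absolute two-sample t statistic for each outcome (column)."""
    return np.abs(stats.ttest_ind(a, b, axis=0).statistic)

observed = abs_t(x, y)
pooled = np.vstack([x, y])

n_perm = 5000
max_null = np.empty(n_perm)
for i in range(n_perm):
    idx = rng.permutation(pooled.shape[0])  # reshuffle the group labels
    max_null[i] = abs_t(pooled[idx[:n_per_group]],
                        pooled[idx[n_per_group:]]).max()

# Adjusted p-value per outcome: how often the permutation *maximum*
# exceeds the observed statistic. Comparing against the maximum over
# outcomes is what controls the FWER under their dependence structure.
p_adj = (1 + (max_null[:, None] >= observed[None, :]).sum(axis=0)) / (1 + n_perm)
print(np.round(p_adj, 4))
```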
If regression models are used for causal inference, hypotheses about the association between an exposure and an outcome are tested, and multiplicity should be addressed, if there is more than one outcome, using the methods mentioned above. Note that in parametric models (e.g. generalized linear models and survival models), the dependence between the tests of interest can usually be obtained under standard asymptotic normality assumptions, allowing the dependence between them (e.g. middle age vs. young age, and old age vs. young age) to be taken into account when performing FWER multiplicity corrections.2 This leads to a gain in power compared with Bonferroni-like multiplicity corrections.

When developing prediction models, the number of subjects (linear regression), cases (logistic regression) or events (survival models) determines the amount of statistical power and thus how many variables can be included in the model.12, 13 As a rule of thumb, 10 subjects, cases or events are needed per variable; with 50 events, for instance, at most five candidate variables should be considered. When a prediction model is developed with many candidate variables and too few events, there is a risk of fitting random error (i.e. overfitting), and the model may perform very poorly in another patient sample. In those situations, even more than 10 subjects, cases or events per variable may be required.14

Multiple comparisons can be foreseen at the design phase of a study, when the hypotheses are formulated. Therefore, the methods to correct for multiple comparisons should be prespecified in the protocol and/or the statistical analysis plan. The BJD requires that clinical trials and systematic reviews are preregistered, and encourages authors to publish trial protocols elsewhere and to submit them as a supplementary file. We encourage authors of any type of study to consider their multiple-testing strategy before the start of the study and to report the chosen strategy clearly in the methods.

Author contributions
Loes Maria Hollestein: Writing-original draft (lead); Writing-review & editing (lead).
Serigne Lo: Writing-original draft (equal); Writing-review & editing (equal).
Jo Leonardi-Bee: Writing-original draft (equal); Writing-review & editing (equal).
Saharon Rosset: Writing-original draft (equal); Writing-review & editing (equal).
Noam Shomron: Writing-original draft (equal); Writing-review & editing (equal).
Dominique-Laurent Couturier: Writing-original draft (equal); Writing-review & editing (equal).
Sonia Gran: Writing-original draft (equal); Writing-review & editing (equal).