MULTIPLE ways to correct for MULTIPLE comparisons in MULTIPLE types of studies

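Type I and type II errors, Bonferroni correction, Null hypothesis, Multiple comparisons problem, Statistics, Statistical power, Statistical hypothesis testing, Mathematics, Nominal level, Sample size determination, Null (SQL), p-value, Statistical significance, False discovery rate, Multiplicity (mathematics), Alternative hypothesis, Word error rate, Econometrics, Computer science, Artificial intelligence, Data mining, Confidence interval, Gene, Biochemistry, Mathematical analysis, Chemistry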
Authors
Loes M. Hollestein, Serigne Lo, Jo Leonardi-Bee, Saharon Rosset, Noam Shomron, Dominique-Laurent Couturier, Sonia Gran
Source
Journal: British Journal of Dermatology [Wiley]
Volume (issue): 185 (6): 1081-1083; cited by: 20
Identifier
DOI: 10.1111/bjd.20600
Abstract

Research articles typically present the results of several hypothesis tests and often state 'all tests with P-values < 0·05 were considered statistically significant'. This ignores that multiple tests were performed, which can induce false-positive findings. Indeed, when multiple true null hypotheses are tested, the probability of rejecting at least one null hypothesis [referred to as the overall type I error rate or family-wise error rate (FWER)] increases with the number of tests. For instance, if 20 independent statistical tests are performed at the 0·05 significance level in a scenario in which all null hypotheses are true, the probability of rejecting at least one null hypothesis is about 64% (1 − 0·95^20 ≈ 0·64). This inflation of the type I error rate, known as a multiple testing problem or multiplicity, constitutes a real challenge to researchers and partly explains the lack of reproducibility of scientific findings [1].

Many procedures have been developed to overcome multiplicity [2]. Due to its simplicity, the most widely used approach is the Bonferroni procedure, where the significance level for each test equals the target overall type I error level (usually 0·05) divided by the number of tests. This multiplicity correction leads to an FWER close to the target overall type I error level when all tests are independent, but it is known to be overly conservative when the tested hypotheses are related, leading to an unnecessary loss of power (i.e. lower probability of finding true associations). Therefore, multiplicity correction methods that take this dependence into account are generally preferred in order to gain power (e.g. resampling methods such as bootstrap and permutation tests) [3, 4]. When the number of tests is very large, as in omics studies (e.g. genomics or transcriptomics), control of the false discovery rate (FDR; i.e. the expected proportion of true null hypotheses among all rejected null hypotheses) is usually preferred to control of the FWER, as it allows notable gains in power [5].

The choice regarding which method to use depends on the type of study and the hypotheses to be tested. The aim of this editorial is to briefly discuss the use of multiplicity correction in different contexts and to state the multiplicity requirements for publication in the BJD.

Sample sizes of clinical trials are based on a single endpoint or coprimary endpoints [6, 7]. A trial with coprimary endpoints is considered negative if the result related to any of the coprimary endpoints is not significant. The use of coprimary endpoints for a given sample size induces a loss of power but does not increase the type I error rate. In addition to the primary endpoint(s), a set of secondary and exploratory endpoints, for which no a priori sample size calculation was performed, is usually tested as well. In order to prevent false-positive findings among the set of secondary endpoints, a clear distinction should be made between the true secondary endpoints (which may support the primary endpoint and/or show additional effects after success of the primary endpoint) and the exploratory endpoints (hypothesis generating or endpoints with very low event rates) [7]. Hypothesis testing for exploratory endpoints is not recommended [6], but the type I error rate should be controlled for secondary endpoints, typically by means of an FWER approach. If there is no effect on the primary endpoint(s), no effect on related secondary endpoints may be expected, so that one may decide to stop statistical testing after a nonsignificant result (a fixed-sequence or serial-gatekeeping approach) [8]. Endpoints may also be grouped into families (e.g. a family of multiple effectiveness outcomes and a family of multiple quality-of-life scores).
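The fixed-sequence idea described above can be illustrated with a minimal sketch. It assumes the endpoints are listed in their prespecified order and that each is tested at the full significance level; the function name and the p-values are hypothetical and serve only as an illustration, not as the procedure of any particular trial.

```python
def fixed_sequence_test(p_values, alpha=0.05):
    """Test hypotheses in their prespecified order, each at the full alpha.

    Testing stops at the first non-significant result: that hypothesis and
    every later one are left unrejected. Returns one boolean per hypothesis
    (True = rejected).
    """
    rejected = []
    gate_open = True
    for p in p_values:
        # Once one test fails, the gate stays closed for all later tests.
        gate_open = gate_open and p <= alpha
        rejected.append(gate_open)
    return rejected


# Hypothetical p-values: primary endpoint first, then ordered secondary endpoints.
ordered_p = [0.012, 0.030, 0.210, 0.004]
print(fixed_sequence_test(ordered_p))  # [True, True, False, False]
```

Note that the fourth endpoint is not declared significant despite its small p-value, because the sequence was stopped at the third endpoint.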
All endpoints within a family can be tested with a correction for multiple comparisons, and one may only proceed to the next family when there is statistical success in the preceding family (a fixed-sequence approach applied to families).

Omics studies investigate the relationship between a particular type of sample molecule and a sample attribute. Examples are genome-wide association studies, in which a large set of single-nucleotide polymorphisms is tested for association with an outcome of interest (e.g. skin cancer), or RNA-Seq experiments, in which differences in gene or protein expression between conditions (e.g. treated vs. not treated) are investigated. As such studies typically involve hundreds to millions of (usually dependent) simultaneous tests, FWER control of the type I error would lead to a drastic loss of power, explaining why FDR approaches are preferred, as they control the expected fraction of false discoveries among the rejected hypotheses [9]. The most commonly used FDR multiplicity correction is the one introduced by Benjamini and Hochberg, which is valid for independent [10] or positively dependent [11] test statistics, such as test statistics (positively) correlated due to measurement errors affecting all or some parameters of interest in a common way. As other dependence structures may be observed in practice, an FDR approach valid under more general dependence structures was later introduced by Benjamini and Yekutieli at the price of some loss of power [11].

False-positive findings may occur in studies where subgroup analyses are performed without multiplicity adjustment (e.g. a meta-analysis stratified by timepoints of an outcome). As tests of such analyses typically involve correlated outcomes and/or comparisons repeatedly involving the same groups, a resampling-based FWER multiplicity correction would provide the greatest power.
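One possible resampling-based FWER correction of the kind referred to above is a single-step max-T permutation adjustment in the style of Westfall and Young. The sketch below is an assumption-laden illustration, not the method prescribed by the editorial: it assumes two groups, several correlated outcomes per subject, a Welch-type t-statistic, and simulated data; all function names are hypothetical.

```python
import random
from statistics import mean, variance


def abs_t_stats(group_a, group_b):
    """Absolute Welch t-statistic for each outcome (one column per outcome)."""
    stats = []
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        se = (variance(col_a) / len(col_a) + variance(col_b) / len(col_b)) ** 0.5
        stats.append(abs(mean(col_a) - mean(col_b)) / se)
    return stats


def max_t_adjusted_p(group_a, group_b, n_perm=1000, seed=0):
    """Single-step max-T adjusted p-values from permuted group labels."""
    rng = random.Random(seed)
    observed = abs_t_stats(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        # Shuffle whole subjects so the correlation between outcomes is preserved.
        rng.shuffle(pooled)
        max_stat = max(abs_t_stats(pooled[:n_a], pooled[n_a:]))
        for j, t_obs in enumerate(observed):
            if max_stat >= t_obs:
                exceed[j] += 1
    # The +1 counts the observed labelling as one of the permutations.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Simulated (hypothetical) data: three outcomes sharing a subject-level effect.
data_rng = random.Random(42)


def simulate(n, shift):
    rows = []
    for _ in range(n):
        shared = data_rng.gauss(0, 1)  # induces positive correlation between outcomes
        rows.append(tuple(shared + data_rng.gauss(0, 0.5) + s for s in shift))
    return rows


treated = simulate(20, shift=(1.5, 0.0, 0.0))
control = simulate(20, shift=(0.0, 0.0, 0.0))
adjusted = max_t_adjusted_p(treated, control)
print([round(p, 4) for p in adjusted])
```

Because each observed statistic is compared with the permutation distribution of the maximum over all outcomes, the adjustment reflects the dependence between outcomes instead of assuming the worst case, which is where the gain in power over Bonferroni comes from.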
To maintain high power, a limited number of subgroup analyses should be prespecified in the protocol, where the subgroups chosen should be based on a clear hypothesis with a pre-existing biological rationale.

If regression models are used for causal inference, hypotheses of the association between an exposure and outcome are tested and multiplicity should be addressed, if there is more than one outcome, using the methods mentioned above. Note that in parametric models (e.g. generalized linear models and survival models), the correlation between the test statistics of interest can usually be estimated under standard asymptotic normality assumptions, allowing the dependence between comparisons (e.g. middle age vs. young age, and old age vs. young age) to be taken into account when performing FWER multiplicity corrections [2]. This leads to a gain in power compared with Bonferroni-like multiplicity corrections.

When developing prediction models, the number of subjects (linear regression), cases (logistic regression) or events (survival models) determines the amount of statistical power and thus how many variables can be included in the model [12, 13]. As a rule of thumb, 10 subjects, cases or events are needed per variable. When developing a prediction model with many variables and too few events, there is a risk of predicting random error (i.e. overfitting) and very poor performance of the prediction model in another patient sample. In those situations, even more than 10 subjects, cases or events per variable may be required [14].

Multiple comparisons can be foreseen at the design phase of a study, when multiple hypotheses are formulated. Therefore, methods to correct for multiple comparisons should be prespecified in the protocol and/or the statistical analysis plan. The BJD requires that clinical trials and systematic reviews are preregistered and encourages authors to publish trial protocols elsewhere and to submit them as a supplementary file. We encourage authors of any type of study to consider multiple-testing strategies before the start of the study and to clearly report the strategy of choice in the methods.

Loes Maria Hollestein: Writing-original draft (lead); Writing-review & editing (lead). Serigne Lo: Writing-original draft (equal); Writing-review & editing (equal). Jo Leonardi-Bee: Writing-original draft (equal); Writing-review & editing (equal). Saharon Rosset: Writing-original draft (equal); Writing-review & editing (equal). Noam Shomron: Writing-original draft (equal); Writing-review & editing (equal). Dominique-Laurent Couturier: Writing-original draft (equal); Writing-review & editing (equal). Sonia Gran: Writing-original draft (equal); Writing-review & editing (equal).
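The arithmetic behind the editorial's example of 20 independent tests, and the Bonferroni, Benjamini-Hochberg and Benjamini-Yekutieli corrections it names, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and the example p-values are hypothetical, and the adjusted p-value formulation is one common way of presenting these procedures.

```python
def fwer_independent(m, alpha=0.05):
    """P(at least one false rejection) for m independent true nulls tested at alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_adjust(p_values):
    """Bonferroni adjusted p-values: each p multiplied by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bh_adjust(p_values, dependence_factor=1.0):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1 = smallest p-value, m = largest
        running_min = min(running_min, p_values[i] * m * dependence_factor / rank)
        adjusted[i] = running_min
    return adjusted


def by_adjust(p_values):
    """Benjamini-Yekutieli: BH inflated by c(m) = 1 + 1/2 + ... + 1/m."""
    m = len(p_values)
    c_m = sum(1 / k for k in range(1, m + 1))
    return bh_adjust(p_values, dependence_factor=c_m)


# FWER for 20 independent tests, unadjusted and at the Bonferroni level 0.05 / 20.
print(round(fwer_independent(20), 3))
print(round(fwer_independent(20, alpha=0.05 / 20), 3))

# Hypothetical raw p-values from five tests.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
print([round(p, 4) for p in bonferroni_adjust(raw_p)])
print([round(p, 4) for p in bh_adjust(raw_p)])
print([round(p, 4) for p in by_adjust(raw_p)])
```

Comparing the three adjusted lists against a 0.05 threshold shows the trade-off discussed in the text: Bonferroni is the most conservative, Benjamini-Hochberg the least, and Benjamini-Yekutieli pays for its validity under general dependence with larger adjusted p-values.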
