Abstract
Clinical studies, ranging from observational case reports to randomized controlled trials, typically depend on well-defined and testable research hypotheses. Power calculations and estimates of the sample size needed to show an effect of a given size are important preconditions for testing the research hypothesis adequately and for deriving meaningful conclusions that generalize beyond the sample of patients in the study itself. The level of confidence about these generalizations is determined a priori by the researchers based on what they consider an acceptable level of error. Two types of error need to be considered: falsely concluding there is an effect when there is none (type I error), or falsely concluding there is no effect when one actually exists (type II error). In diagnostic terms, a type I error denotes a lack of specificity, and a type II error denotes a lack of sensitivity. Commonly accepted error levels for clinical studies are 5% for a type I error (alpha) and 20% or 10% for a type II error (beta). A power calculation helps determine the number of participants required to keep the risk of a type II error acceptably low when comparing 2 conditions; in other words, it ensures that the study is sufficiently sensitive to find a difference between 2 groups if one actually exists. The aim is to enroll enough participants for adequate power, but no more than necessary, so as not to waste resources or cause unnecessary inconvenience or suffering to patients.

The research hypotheses typically tested in clinical studies involve measuring the association of a clinically meaningful parameter, biomarker, or treatment (“independent” or “predictor” variable) with a clinical state or outcome (“dependent” variable). For example, is the concentration in plasma of a proinflammatory protein associated with disease prognosis? Does the administration of a new antihypertensive drug decrease the mean systolic blood pressure?
Once the acceptable levels of type I and II errors have been decided, we must then determine the effect size (ie, the magnitude of the difference) of a change in the independent variable on the dependent variable. An important consideration in clinical studies in particular is whether the independent variable will have an effect on the dependent variable that is both mechanistically plausible and large enough to be clinically relevant.

The prediction of clinical outcomes through microbial biomarkers represents an active and promising area of research [1: Doherty M.K., Ding T., Koumpouras C., et al. Fecal microbiota signatures are associated with response to ustekinumab therapy among Crohn's disease patients. MBio. 2018; 9] [2: He Y., Wu W., Zheng H.M., et al. Regional variation limits applications of healthy gut microbiome reference ranges and disease models. Nat Med. 2018; 24: 1532-1535] [3: Pascal V., Pozuelo M., Borruel N., et al. A microbial signature for Crohn's disease. Gut. 2017; 66: 813-822], and there has been increasing interest in the importance of the microbes in the human gut to health. The human gut hosts a variety of microbes including bacteria, archaea, fungi, other microbial eukaryotes, and viruses (including phages, viruses that attack bacteria or archaea). These microbes are collectively called the microbiota, and their genes are referred to as the microbiome [4: Debelius J., Song S.J., Vazquez-Baeza Y., et al. Tiny microbes, enormous impacts: what matters in gut microbiome studies? Genome Biol. 2016; 17: 217]. Gut microbiome diversity in particular has been linked to many significant biological changes with direct or indirect consequences for human health [5: Lozupone C.A., Stombaugh J.I., Gordon J.I., et al. Diversity, stability and resilience of the human gut microbiota. Nature. 2012; 489: 220-230], and thus understanding the impact of microbial diversity is an expanding area of research across clinical specialties, from psychiatry to gastroenterology [6: Knight R., Callewaert C., Marotz C., et al. The microbiome and human biology. Annu Rev Genomics Hum Genet. 2017; 18: 65-86].

For example, the study of inflammatory bowel diseases is of special interest owing to the dynamic nature of the disease and the observed variability across sampling sites [7: Gevers D., Kugathasan S., Denson L.A., et al. The treatment-naive microbiome in new-onset Crohn's disease. Cell Host Microbe. 2014; 15: 382-392] [8: Halfvarson J., Brislawn C.J., Lamendella R., et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat Microbiol. 2017; 2: 17004]. Both of these factors affect the magnitude of the differences in the microbiota. For example, biopsy samples collected from the ileum have the highest predictive power, with rectal biopsies and fecal samples following in that order. However, it has been shown that combining multiple longitudinal fecal samples for a single participant can make their effectiveness comparable with that of ileal biopsy samples (which are harder and more expensive to collect) [9: Vazquez-Baeza Y., Gonzalez A., Xu Z.Z., et al. Guiding longitudinal sampling in IBD cohorts. Gut. 2018; 67: 1743-1745]. Altogether, this points to temporal variability of microbial communities as a potential marker for inflammatory bowel disease, and possibly even for other conditions [10: Zaneveld J.R., McMinds R., Vega Thurber R. Stress and stability: applying the Anna Karenina principle to animal microbiomes. Nat Microbiol. 2017; 2: 17121].

Owing to such potential applications, the number of studies involving the microbiome is growing rapidly, performed not only by academics and industry but by clinicians as well. It is therefore important to recognize that several steps are critical for successfully conducting a clinical microbiome study. Recent reviews cover experimental design considerations such as sample collection and storage [11: Vandeputte D., Tito R.Y., Vanleeuwen R., et al. Practical considerations for large-scale gut microbiome studies. FEMS Microbiol Rev. 2017; 41: S154-S167], sample preparation techniques [4], and analytical pipelines [12: Nearing J.T., Douglas G.M., Comeau A.M., et al. Denoising the denoisers: an independent evaluation of microbiome sequence error-correction approaches. PeerJ. 2018; 6: e5364] [13: Allaband C., McDonald D., Vazquez-Baeza Y., et al. Microbiome 101: studying, analyzing, and interpreting gut microbiome data for clinicians. Clin Gastroenterol Hepatol. 2019; 17: 218-230].

Here we focus on the sample size and statistical power calculations required to estimate the number of participants needed for a simple microbiome study, using 2 examples to illustrate. For simplicity, we use the same study for both examples (QIITA [qiita.ucsd.edu] study id 1629) [8]. In the first example, we use a measure of diversity that uses a phylogenetic tree of the microbes, called Faith’s PD [14: Faith D.P., Baker A.M. Phylogenetic diversity (PD) and biodiversity conservation: some bioinformatics challenges. Evol Bioinform Online. 2007; 2: 121-128], as a dependent variable. PD describes the diversity of microbial communities within each sample (alpha diversity) while accounting for the phylogenetic relationships among those microbes. In the second example, we compare microbial diversity across 3 different clinical phenotypes/categories using a distance metric that reports dissimilarity between samples, called UniFrac (beta diversity) [15: Lozupone C.A., Knight R. The UniFrac significance test is sensitive to tree topology. BMC Bioinformatics. 2015; 16: 211]. UniFrac also uses a phylogenetic tree; measures of both alpha and beta diversity that instead use information only about lists of taxa are also available, but typically have less power because they ignore the information about relationships among organisms (ie, they would consider Escherichia coli and Salmonella to be as different from one another as either is from Bacillus). The first example is analogous to the standard sample size calculations used to power studies for clinical biomarkers, whereas the second example deals with the complexities of multivariate models and distance matrices.

A clinical study has been designed to compare the diversity of microbial communities in 2 different groups of patients with Crohn’s disease (CD). Although the location of CD tends to be stable, its phenotype can vary markedly over the course of the disease [16: Louis E., Collard A., Oger A.F., et al. Behaviour of Crohn's disease according to the Vienna classification: changing pattern over the course of the disease. Gut. 2001; 49: 777-782]. The research hypothesis of the study proposes that differences in the microbial communities of the gut are associated with CD phenotype.
To test this hypothesis, the study will compare patients with a nonstricturing, nonpenetrating phenotype (B1) and those with either a stricturing or penetrating phenotype (B2 and B3). To determine the number of participants required in each group to detect a difference in microbial diversity, we consult with the local statistician. To guide this calculation, the statistician is likely to ask the following questions.

A possible study design is to enroll consecutive patients with a definitive diagnosis of CD, with and without a stricturing or penetrating phenotype (group B1 vs B2/B3). For simplicity, we assume that all phenotypes are equally easy to recruit and that a ratio of 1:1 is recommended. At each visit, the patients should provide a stool sample, and the microbial communities will be studied using a marker gene approach (16S rRNA gene) as a starting point. (Note that for a pilot study, a marker gene study may be sufficient to test your research hypothesis; however, if the role of a particular strain or of functional genes and pathways is of interest, a different sequencing strategy may be required, and the cost of additional data generation [both direct processing costs and costs of data analysis] will have to be weighed against the information gained.)

In this particular case, the distribution of alpha diversity in our population is unknown, but a similar study has been conducted and the data are available in a public repository [17: Gonzalez A., Navas-Molina J.A., Kosciolek T., et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods. 2018; 15: 796-798]. Analysis of these data reveals that PD (one of several metrics available to calculate alpha diversity) follows an approximately normal distribution (with a slight negative skew) in a population of 100 patients with CD with B1 phenotype. PD has a mean of 13.5 and a standard deviation of 3.45 (Figure 1).
Next, we need to know how large a change in diversity would be considered significant from a clinical viewpoint (a tiny difference between groups might be statistically detectable with a large enough sample size, but not clinically meaningful). Therefore, we need to calculate the sample size required to detect a meaningful difference between the 2 groups. Typically, the answer to this question (how large a difference is clinically significant?) is not known. For example, suppose we consider a drop of one unit of PD as a starting point. Is this a small or a large effect? Given the standard deviation of the distribution (3.45), a difference of one unit corresponds to a Cohen’s D (difference over standard deviation) of 0.29. There is a general consensus that a standardized mean difference of less than 0.4 is a relatively small effect size [18: Kazis L.E., Anderson J.J., Meenan R.F. Effect sizes for interpreting changes in health status. Med Care. 1989; 27: S178-S189], and it is possibly wiser to choose a larger difference for our initial calculations. Indeed, in that study a statistically significant decrease of 1.5 units of PD (P < .001) was observed in patients receiving antibiotics compared with those who did not receive antibiotics. Thus, a difference in PD of ≥1.5 might be a good starting point. Let’s assume that we are satisfied with detecting a difference of 2 units (μ2 – μ1) of PD between the 2 groups.

Based on the distribution of PD reported in the study (QIITA study id 1629), a graph can be generated to show the statistical power achieved given a significance value and a sample size (Figure 1). From this graph, we observe that for an effect size of approximately 0.55, we see significant differences (dark blue circles) with as few as 10 patients. However, the statistical power for this sample size remains at <30%.
In diagnostic terms, this is equivalent to a test that is highly specific but not very sensitive, so that it often fails to detect a real effect (ie, to reject the null hypothesis when the null hypothesis is false). By convention, a statistical significance level of 5% and a statistical power of 80% are accepted values for the majority of studies. For this example, we would therefore recommend enrolling a total of 110 patients (55 per group) to detect differences in alpha diversity of ≥2 units. It is worth noting that the logistics involved in recruiting 55 patients with a particular clinical phenotype may prove challenging, if not impossible, within the timeline available for some pilot studies. In addition, properly accounting for additional factors such as medication, age, diet, or body mass index may further complicate this task. It is sensible, in these situations, to settle for a larger effect size; in the example provided, a total sample size of 50 patients may be sufficient for an effect size of 0.80 (ie, a mean difference of 3 Faith PD units) (Figure 1), at the risk of failing to detect real but smaller effects.

We have seen that alpha diversity matches the paradigm for standard sample size calculations used in most clinical studies. Calculating the number of participants required to test the research hypothesis that beta diversity differs between groups of patients or clinical phenotypes requires additional considerations. Here we need to consider pairwise distances between samples, using a suitable metric. The distance between samples reflects microbial composition (metric values approach 0 when the composition of the 2 samples is identical), and dissimilarities between groups reinforce the idea that differences in microbial composition per se (rather than differences in the number of types of microbes) are associated with a particular clinical phenotype.
The independent variable is again the phenotype, and the dependent variable is now the distance in microbial composition between samples from the 2 groups, with the null hypothesis that the distance between a pair of samples taken at random from the same group is the same as the distance between a pair of samples taken at random from different groups. Here, we calculate the minimum number of participants required to find statistically significant differences in beta diversity between patients with B1 and B3 phenotypes. We use a phylogenetic-based metric of beta diversity (unweighted UniFrac) to test our hypothesis. If we follow the same logic as before, we see that pairwise distances between patients with the B1 phenotype follow an approximately normal distribution with a mean UniFrac value of 0.55 ± 0.08, whereas pairwise distances between patients with B1 and B2/B3 phenotypes are larger, with a mean of 0.60 ± 0.07. In Figure 2, we see that the total number of patients required to see a moderate effect size (0.60) with a significance of 5% and statistical power of 80% is approximately 100 patients (50 per group).

The approach taken here is relevant for studies in which microbial diversity is measured at a single time point and in the situation where preliminary data are available. However, in those situations where preliminary data do not exist or come from small pilot studies, it is sensible to provide a range of estimates considering study feasibility and different effect sizes (calculated from larger standard deviations). Also, if the parameter of interest follows a non-normal distribution, different strategies have been proposed to analyze such data [19: Cundill B., Alexander N.D. Sample size calculations for skewed distributions. BMC Med Res Methodol. 2015; 15: 28]. Similarly, for studies that include repeated measures or a longitudinal component, sample size and power calculations are slightly more complex because the standard deviations and correlations (and type of correlation) among the repeated measurements must be specified [20: Guo Y., Pandis N. Sample-size calculation for repeated-measures and longitudinal studies. Am J Orthod Dentofacial Orthop. 2015; 147: 146-149]. Sample size and power calculations to determine significant differences in shifts of specific microbes are also a much more challenging problem, a topic extensively covered by Morton et al [21: Morton J.T., Marotz C., Washburne A., et al. Establishing microbial composition measurement standards with reference frames. Nat Commun. 2019; 10: 2719]. However, in both cases, the model of querying existing data for estimating effect size as demonstrated here remains a critical exercise.

In conclusion, determining the minimum number of participants needed for clinical microbiome studies is a largely unresolved subject. Here we propose selecting a suitable metric of alpha or beta diversity and then using the distribution of that metric in published, related studies as a first step to estimate the sample size required to achieve adequate power for a given effect size and significance, for both hypothesis-driven and more exploratory clinical microbiome studies. It is worth emphasizing that the challenge addressed in this review is how to use the distribution of a diversity metric in a specific group to design a study likely to draw meaningful conclusions.
This is different from classifying a single patient or sample based on their individual values for a given metric (eg, in the example provided, measuring the PD of a single sample will not be sufficient to classify that sample according to inflammatory bowel disease phenotype, so it would not be useful as a clinical test for inflammatory bowel disease phenotype even though the average PD differs between groups). Review panels and institutional review board committees require, for obvious reasons, a well-thought-out study plan with a detailed rationale that justifies the minimum number of participants to enroll in a study to draw significant clinical/biological conclusions (Appendix). In this context, we have attempted to reconcile the statistical power considerations used in studies for standard clinical parameters and the 2 most commonly measured dependent variables in clinical microbiome studies, alpha and beta diversity. The availability of public repositories of microbiome data that include sequences, well-curated metadata [17: Gonzalez A., Navas-Molina J.A., Kosciolek T., et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods. 2018; 15: 796-798], and computational resources is important to test hypotheses with minimal overhead. We anticipate that as the number of studies available in repositories increases, the development of tools that streamline power calculations from microbiome data repositories will prove useful to design and interpret clinical microbiome studies.

All data used were downloaded from Qiita, study ID 1629 and analysis id 25761. All analytical steps, including data download, can be found at https://github.com/knightlab-analyses/sample-size-stat-power-clinical-microbiome.

The sample size has been determined based on statistical power, effect size, time, and available resources requested in this grant. A total number of 110 patients is a realistic and achievable enrollment in our clinical setting.
The diversity of microbial communities is a good indicator of dysbiosis in patients with CD [1], and we have selected Faith’s PD as a suitable metric to calculate alpha diversity. In a similar study, we observed that this metric shows an approximately normal distribution with mean 13.5 and standard deviation 3.45. Thus, to find a significant reduction of 2 units of Faith’s PD (effect size, Cohen’s D: 0.55) with an alpha value (type I error) of 5% and a statistical power (1 – beta) of 80%, we will have to enroll 110 patients (55 with B1 phenotype and 55 with B2/B3 phenotype).
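
The arithmetic behind a statement like the one above can be checked with a short script. The following is a minimal sketch, not the procedure used to produce Figure 1: it assumes a two-sided, two-sample comparison of means with equal group sizes and uses the normal approximation rather than an exact t-test or a simulation. The function names (`cohens_d`, `n_per_group`, `approx_power`) are hypothetical and introduced only for illustration; the effect sizes are the ones quoted in the text.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Standard normal distribution used for all quantiles below.
STD_NORMAL = NormalDist()


def cohens_d(mean_difference, standard_deviation):
    """Standardized effect size: difference in means over standard deviation."""
    return mean_difference / standard_deviation


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-sample test.

    Assumption: normal approximation with equal group sizes (1:1 ratio).
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # type I error quantile
    z_beta = STD_NORMAL.inv_cdf(power)           # power = 1 - type II error
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power with n participants per group (same assumptions)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(effect_size * sqrt(n / 2) - z_alpha)


# One unit of Faith's PD with the standard deviation reported in the text.
print(f"Cohen's D for 1 PD unit: {cohens_d(1.0, 3.45):.2f}")

# Effect sizes quoted in the text for the alpha diversity example.
for d in (0.55, 0.80):
    n = n_per_group(d)
    print(f"effect size {d:.2f}: {n} per group, {2 * n} in total")

# Power at a very small sample size, for the effect size used in the text.
print(f"power with 5 per group at 0.55: {approx_power(0.55, 5):.2f}")
```

Under these assumptions the approximation gives about 52 participants per group for an effect size of 0.55 and 25 per group (50 in total) for 0.80, which is in the same range as the 55 per group and the total of 50 quoted above. Small gaps are expected, because the values in the text are read from Figure 1 and may rest on a different procedure, so the script is best treated as a sanity check rather than a replacement for the statistician's calculation.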