Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations
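Well, a p-value, in as few words as possible, is the probability, computed assuming the null hypothesis and all other model assumptions hold, of observing a difference at least as extreme as the one actually observed. It is not the probability that the observed difference arose by pure chance.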
Why Isn't the P Value Enough? Statistical significance means only that the P value fell at or below the chosen alpha level; it is not the probability that the observed difference between two groups is due to chance. If the P value is larger than the alpha level chosen eg.
With a sufficiently large sample, a statistical test will almost always demonstrate a significant difference, unless there is no effect whatsoever, that is, when the effect size is exactly zero; yet very small differences, even if significant, are often meaningless.
Thus, reporting only the significant P value for an analysis is not adequate for readers to fully understand the results. And to corroborate DarrenJames's comments regarding large sample sizes: when the sample is very large, a significant P value is likely to be found even when the difference in outcomes between groups is negligible and may not justify an expensive or time-consuming intervention over another.
The level of significance by itself does not predict effect size. Unlike significance tests, effect size is independent of sample size.
Statistical significance, on the other hand, depends upon both sample size and effect size. For this reason, P values are considered to be confounded because of their dependence on sample size. Sometimes a statistically significant result means only that a huge sample size was used.
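The dependence of the P value on sample size can be illustrated with a small sketch. It assumes a two-group comparison with equal group sizes, unit variance, and a known-variance normal (z) test; the standardized difference of 0.02 and the function names are illustrative choices, not values from the text.

```python
import math

def two_sided_p(z):
    # Two-sided P value for a standard normal test statistic z.
    return math.erfc(abs(z) / math.sqrt(2.0))

def p_for_two_group_z_test(d, n_per_group):
    # With unit variance in each group, the standard error of the mean
    # difference is sqrt(2 / n), so the test statistic is d / sqrt(2 / n).
    z = d * math.sqrt(n_per_group / 2.0)
    return two_sided_p(z)

d = 0.02  # hypothetical, practically negligible standardized difference
for n in (100, 10_000, 1_000_000):
    print(n, p_for_two_group_z_test(d, n))
```

The effect size is held fixed throughout; only the sample size changes, and the P value shrinks with it.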
I would argue that these each serve as important components in statistical analysis that cannot be compared in such terms, and should be reported together. The p-value is a statistic that indicates statistical significance (difference from the null distribution), whereas the effect size puts into words how much of a difference there is. As an example, say your supervisor, Bob, who is not very stats-friendly, is interested in seeing if there is a significant relationship between wt (weight) and mpg (miles per gallon).
Understanding Statistical Power and Significance Testing
However, your boss asks, well, how different is it?
It does not refer to your single use of the test, which may have been thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation 1.
P values are properly reported as inequalities e. This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the statistical result. Only when the P value is very small e. There is little practical difference among very small P values when the assumptions used to compute P values are not known with enough certainty to justify such precision, and most methods for computing P values are not numerically accurate below a certain point.
Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance. The effect being tested either exists or does not exist.
One should always use two-sided P values. Two-sided P values are designed to test hypotheses that the targeted effect measure equals a specific value e. When, however, the test hypothesis of scientific or practical interest is a one-sided dividing hypothesis, a one-sided P value is appropriate.
For example, consider the practical question of whether a new drug is at least as good as the standard drug for increasing survival time. This question is one-sided, so testing this hypothesis calls for a one-sided P value. Nonetheless, because two-sided P values are the usual default, it will be important to note when and why a one-sided P value is being used instead.
The disputed claims deserve recognition if one wishes to avoid such controversy. For example, it has been argued that P values overstate evidence against test hypotheses, based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis [ 377277 – 83 ].
Nonetheless, many other statisticians do not accept these quantities as gold standards, and instead point out that P values summarize crucial evidence needed to gauge the error rates of decisions based on statistical tests even though they are far from sufficient for making those decisions.
See also Murtaugh [ 88 ] and its accompanying discussion.
Common misinterpretations of P value comparisons and predictions
Some of the most severe distortions of the scientific literature produced by statistical testing involve erroneous comparison and synthesis of results from different studies or study subgroups. Among the worst are:
This belief is often used to claim that a literature supports no effect when the opposite is the case.
In reality, every study could fail to reach statistical significance and yet when combined show a statistically significant association and persuasive evidence of an effect. Thus, lack of statistical significance of individual studies should not be taken as implying that the totality of evidence supports no effect.
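A small numerical sketch shows how individually nonsignificant studies can combine into a significant result. It assumes four hypothetical studies with identical estimates and standard errors, normal test statistics, and simple inverse-variance (fixed-effect) pooling; none of these numbers come from the text.

```python
import math

def two_sided_p(z):
    # Two-sided P value for a standard normal test statistic z.
    return math.erfc(abs(z) / math.sqrt(2.0))

def fixed_effect_pool(estimates, std_errors):
    # Inverse-variance weighted average and its standard error.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical studies: each estimate is 1.5 with standard error 1.0.
estimates = [1.5, 1.5, 1.5, 1.5]
std_errors = [1.0, 1.0, 1.0, 1.0]

single_p = [two_sided_p(e / se) for e, se in zip(estimates, std_errors)]
pooled, pooled_se = fixed_effect_pool(estimates, std_errors)
pooled_p = two_sided_p(pooled / pooled_se)

print("individual P values:", single_p)
print("pooled estimate, SE, P:", pooled, pooled_se, pooled_p)
```

Every individual P value exceeds 0.05, yet the pooled P value is well below it, because pooling halves the standard error while leaving the estimate unchanged.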
When the same hypothesis is tested in two different populations and the resulting P values are on opposite sides of 0. Statistical tests are sensitive to many differences between study populations that are irrelevant to whether their results are in agreement, such as the sizes of compared groups in each population. As a consequence, two studies may provide very different P values for the same test hypothesis and yet be in perfect agreement e.
For example, suppose we had two randomized trials A and B of a treatment, identical except that trial A had a known standard error of 2 for the mean difference between treatment groups whereas trial B had a known standard error of 1 for the difference.
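The two-trial setup can be sketched numerically. The standard errors of 2 and 1 are from the example; the observed mean difference of 3 in both trials is an assumption added here for illustration, as is the use of a normal test.

```python
import math

def two_sided_p(z):
    # Two-sided P value for a standard normal test statistic z.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Assumed for illustration: both trials observe the same mean difference.
diff_a, se_a = 3.0, 2.0   # trial A: standard error 2
diff_b, se_b = 3.0, 1.0   # trial B: standard error 1

p_a = two_sided_p(diff_a / se_a)
p_b = two_sided_p(diff_b / se_b)

# Direct comparison of the two trials: test the difference of the estimates.
se_between = math.sqrt(se_a ** 2 + se_b ** 2)
p_between = two_sided_p((diff_a - diff_b) / se_between)

print("P for trial A:", p_a)
print("P for trial B:", p_b)
print("P for the difference between trials:", p_between)
```

Under these assumptions the two null P values fall on opposite sides of 0.05 even though the estimates are identical, and the direct test of the difference between trials shows no disagreement at all.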
Differences between results must be evaluated directly, for example by estimating and testing those differences to produce a confidence interval and a P value comparing the results (often called analysis of heterogeneity, interaction, or modification).
When the same hypothesis is tested in two different populations and the same P values are obtained, the results are in agreement.
Again, tests are sensitive to many differences between populations that are irrelevant to whether their results are in agreement. Two different studies may even exhibit identical P values for testing the same hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized experiment A observed a mean difference between treatment groups of 3.
If one observes a small P value, there is a good chance that the next study will produce a P value at least as small for the same hypothesis.
This is false even under the ideal condition that both studies are independent and all assumptions including the test hypothesis are correct in both studies. In general, the size of the new P value will be extremely sensitive to the study size and the extent to which the test hypothesis or other assumptions are violated in the new study [ 86 ]; in particular, P may be very small or very large depending on whether the study and the violations are large or small.
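The ideal condition described here can be checked by simulation. When the test hypothesis and all other assumptions hold, the P value of a continuous test statistic is uniformly distributed, so the chance that a new study gives a P value at or below a first study's value equals that value. The sketch below assumes a standard normal test statistic and a first-study P value of 0.03; both are illustrative choices.

```python
import math
import random

def two_sided_p(z):
    # Two-sided P value for a standard normal test statistic z.
    return math.erfc(abs(z) / math.sqrt(2.0))

def fraction_at_least_as_small(p_first, n_sim=100_000, seed=0):
    # Simulate replications in which the test hypothesis is true
    # (z ~ N(0, 1)) and count how often the new P value is <= p_first.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        z = rng.gauss(0.0, 1.0)
        if two_sided_p(z) <= p_first:
            hits += 1
    return hits / n_sim

print(fraction_at_least_as_small(0.03))
```

The simulated fraction stays close to the first P value itself, which is far from a good chance of replication.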
Finally, although it is (we hope) obviously wrong to do so, one sometimes sees the null hypothesis compared with another alternative hypothesis using a two-sided P value for the null and a one-sided P value for the alternative. This comparison is biased in favor of the null in that the two-sided test will falsely reject the null only half as often as the one-sided test will falsely reject the alternative (again, under all the assumptions used for testing).
Common misinterpretations of confidence intervals
Most of the above misinterpretations translate into an analogous misinterpretation for confidence intervals. A reported confidence interval is a range between two numbers.
The frequency with which an observed interval e. These further assumptions are summarized in what is called a prior distribution, and the resulting intervals are usually called Bayesian posterior or credible intervals to distinguish them from confidence intervals [ 18 ].
Symmetrically, the misinterpretation of a small P value as disproving the test hypothesis could be translated into: As with the P value, the confidence interval is computed from many assumptions, the violation of which may have led to the results. Even then, judgements as extreme as saying the effect size has been refuted or excluded will require even stronger conditions.
If two confidence intervals overlap, the difference between two estimates or studies is not significant. As with P values, comparison between groups requires statistics that directly test and estimate the differences across groups. Finally, as with P values, the replication properties of confidence intervals are usually misunderstood: This statement is wrong in several ways.
When the model is correct, precision of statistical estimation is measured directly by confidence interval width measured on the appropriate scale.
It is not a matter of inclusion or exclusion of the null or any other value. The first interval excludes the null value of 0, but is 30 units wide.
The second includes the null value, but is half as wide and therefore much more precise. Nonetheless, many authors agree that confidence intervals are superior to tests and P values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes compatible with the data—a shift recommended by many authors and a growing number of journals.
Another way to bring attention to non-null hypotheses is to present their P values; for example, one could provide or demand P values for those effect sizes that are recognized as scientifically reasonable alternatives to the null. As with P values, further cautions are needed to avoid misinterpreting confidence intervals as providing sharp answers when none are warranted.
The P values will vary greatly, however, among hypotheses inside the interval, as well as among hypotheses on the outside. Also, two hypotheses may have nearly equal P values even though one of the hypotheses is inside the interval and the other is outside. Thus, if we use P values to measure compatibility of hypotheses with data and wish to compare hypotheses with this measure, we need to examine their P values directly, not simply ask whether the hypotheses are inside or outside the interval.
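A short sketch makes the point concrete by computing the P value for several hypothesized effect sizes around a single estimate. The estimate of 5 with standard error 2, the normal approximation, and the chosen hypothesized values are all assumptions made for illustration.

```python
import math

Z_CRIT = 1.959963984540054  # two-sided 5% critical value of the standard normal

def two_sided_p(z):
    # Two-sided P value for a standard normal test statistic z.
    return math.erfc(abs(z) / math.sqrt(2.0))

def p_for_hypothesis(estimate, std_error, hypothesized):
    # P value for the hypothesis that the true effect equals `hypothesized`.
    return two_sided_p((estimate - hypothesized) / std_error)

estimate, std_error = 5.0, 2.0  # hypothetical study result
lower = estimate - Z_CRIT * std_error
upper = estimate + Z_CRIT * std_error
print("95% interval:", lower, upper)

for h in (5.0, 3.0, 1.2, 1.0, 0.0):
    inside = lower <= h <= upper
    print(h, "inside" if inside else "outside", p_for_hypothesis(estimate, std_error, h))
```

In this sketch the hypotheses 1.2 and 1.0 sit on opposite sides of the lower limit yet have nearly equal P values, while P values inside the interval range from just above 0.05 up to 1.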
This need is particularly acute when (as usual) one of the hypotheses under scrutiny is a null hypothesis.
Common misinterpretations of power
The power of a test to detect a correct alternative hypothesis is the pre-study probability that the test will reject the test hypothesis e.
The corresponding pre-study probability of failing to reject the test hypothesis when the alternative is correct is one minus the power, also known as the Type-II or beta error rate [ 84 ]. As with P values and confidence intervals, this probability is defined over repetitions of the same study design and so is a frequency probability.
One source of reasonable alternative hypotheses is the set of effect sizes that were used to compute power in the study proposal. Pre-study power calculations do not, however, measure the compatibility of these alternatives with the data actually observed, while power calculated from the observed data is a direct (if obscure) transformation of the null P value and so provides no test of the alternatives.
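The relation between power computed from the observed data and the null P value can be sketched for a two-sided normal test at the 0.05 level. The function names and the example statistics are illustrative assumptions; the point is only that both quantities depend on the same observed test statistic.

```python
import math

Z_CRIT = 1.959963984540054  # two-sided 5% critical value of the standard normal

def normal_cdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def two_sided_p(z):
    # Two-sided P value for a standard normal test statistic z.
    return math.erfc(abs(z) / math.sqrt(2.0))

def power_two_sided_z(effect_in_se_units):
    # Probability of rejecting at the 0.05 level when the true effect,
    # expressed in standard-error units, is `effect_in_se_units`.
    shift = abs(effect_in_se_units)
    return normal_cdf(shift - Z_CRIT) + normal_cdf(-shift - Z_CRIT)

# "Observed power" plugs the observed statistic in as if it were the truth,
# so it is determined entirely by the same z that determines the null P value.
for z_obs in (0.5, 1.0, Z_CRIT, 3.0):
    print(z_obs, two_sided_p(z_obs), power_two_sided_z(z_obs))
```

Under these assumptions a null P value of exactly 0.05 always corresponds to an observed power of about one half, whatever the data were, which is why the calculation adds no information about the alternatives.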
Thus, presentation of power does not obviate the need to provide interval estimates and direct tests of the alternatives. For these reasons, many authors have condemned use of power to interpret estimates and statistical tests [ 4292 – 97 ], arguing that (in contrast to confidence intervals) it distracts attention from direct comparisons of hypotheses and introduces new misinterpretations, such as:
If you accept the null hypothesis because the null P value exceeds 0.
It does not refer to your single use of the test or your error rate under any alternative effect size other than the one used to compute power. It can be especially misleading to compare results for two hypotheses by presenting a test or P value for one and power for the other.
Thus, claims about relative support or evidence need to be based on direct and comparable measures of support or evidence for both hypotheses, otherwise mistakes like the following will occur:
If the null P value exceeds 0. This claim seems intuitive to many, but counterexamples are easy to construct in which the null P value is between 0.
We will however now turn to direct discussion of an issue that has been receiving more attention of late, yet is still widely overlooked or interpreted too narrowly in statistical teaching and presentations: that the statistical model used to obtain the results is correct.
Too often, the full statistical model is treated as a simple regression or structural equation in which effects are represented by parameters denoted by Greek letters.
Yet these tests of fit themselves make further assumptions that should be seen as part of the full model. For example, all common tests and confidence intervals depend on assumptions of random selection for observation or treatment and random loss or missingness within levels of controlled covariates.
These assumptions have gradually come under scrutiny via sensitivity and bias analysis [ 98 ], but such methods remain far removed from the basic statistical training given to most researchers. Less often stated is the even more crucial assumption that the analyses themselves were not guided toward finding nonsignificance or significance (analysis bias), and that the analysis results were not reported based on their nonsignificance or significance (reporting bias and publication bias).
Selective reporting renders false even the limited ideal meanings of statistical significance, P values, and confidence intervals. Because author decisions to report and editorial decisions to publish results often depend on whether the P value is above or below 0. Although this selection problem has also been subject to sensitivity analysis, there has been a bias in studies of reporting and publication bias: It is usually assumed that these biases favor significance.
Addressing such problems would require far more political will and effort than addressing misinterpretation of statistics, such as enforcing registration of trials, along with open data and analysis code from all completed studies as in the AllTrials initiative, http: In the meantime, readers are advised to consider the entire context in which research reports are produced and appear when interpreting the statistics and conclusions offered by the reports.
Conclusions
Upon realizing that statistical tests are usually misinterpreted, one may wonder what if anything these tests do for science. They were originally intended to account for random variability as a source of error, thereby sounding a note of caution against overinterpretation of observed associations as true effects or as stronger evidence against null hypotheses than was warranted.
We have no doubt that the founders of modern statistical testing would be horrified by common treatments of their invention. But it has long been asserted that the harms of statistical testing in more uncontrollable and amorphous research settings such as social-science, health, and medical fields have far outweighed its benefits, leading to calls for banning such tests in research reports—again with one journal banning P values as well as confidence intervals [ 2 ].
Given, however, the deep entrenchment of statistical testing, as well as the absence of generally accepted alternative methods, there have been many attempts to salvage P values by detaching them from their use in significance tests. One approach is to focus on P values as continuous measures of compatibility, as described earlier.
Although this approach has its own limitations (as described in points 1, 2, 5, 9, 15, 18, 19), it avoids comparison of P values with arbitrary cutoffs such as 0.
Another approach is to teach and use correct relations of P values to hypothesis probabilities.
For example, under common statistical models, one-sided P values can provide lower bounds on probabilities for hypotheses about effect directions [ 4546, ]. Whether such reinterpretations can eventually replace common misinterpretations to good effect remains to be seen. A shift in emphasis from hypothesis testing to estimation has been promoted as a simple and relatively safe way to improve practice [ 56163, ] resulting in increasing use of confidence intervals and editorial demands for them; nonetheless, this shift has brought to the fore misinterpretations of intervals such as 19—23 above [ ].
Other approaches combine tests of the null with further calculations involving both null and alternative hypotheses; such calculations may, however, bring with them further misinterpretations similar to those described above for power, as well as greater complexity.
Meanwhile, in the hopes of minimizing harms of current practice, we can offer several guidelines for users and readers of statistics, and re-emphasize some key warnings from our list of misinterpretations:
Correct and careful interpretation of statistical tests demands examining the sizes of effect estimates and confidence limits, as well as precise P values, not just whether P values are above or below 0. Careful interpretation also demands critical examination of the assumptions and conventions used for the statistical analysis: not just the usual statistical assumptions, but also the hidden assumptions about how results were generated and chosen for presentation.
It is simply false to claim that statistically nonsignificant results support a test hypothesis, because the same results may be even more compatible with alternative hypotheses—even if the power of the test is high for those alternatives.
Interval estimates aid in evaluating whether the data are capable of discriminating among various hypotheses about effect sizes, or whether statistical results have been misrepresented as supporting one hypothesis when those results are better explained by other hypotheses (see points 4-6).
We caution however that confidence intervals are often only a first step in these tasks. To compare hypotheses in light of the data and the statistical model it may be necessary to calculate the P value or relative likelihood of each hypothesis.