No statistic inflames the passions of statisticians and scientists quite like the p value. The p value is, informally, a statistic used for assessing whether a "null hypothesis" (e.g., that the difference in performance between two conditions is 0) should be taken seriously. It is simultaneously the most used and most hated statistic in all of science. Use of p values has been called bad science, associated with cult-like and ritualistic behaviour, and pegged as a major cause of the so-called "crisis" that many of the sciences find themselves in.
In response to the controversy over p values, the American Statistical Association has today taken the unprecedented step of releasing a statement regarding a consensus viewpoint on the use of p values. This statement represents the input of the world's top experts on the topic. The whole statement is worth reading, but I'm going to focus on their six "principles" regarding p values, first defining the p value, then describing the principles and adding my own brief commentary.
Formally, the p value is the probability of finding a result as, or more, "discrepant" than the one observed, assuming that some hypothesis is true. Small values represent more discrepant observations. As an example, consider flipping a coin 10 times and suppose we observe a single heads-up flip out of 10. Under the hypothesis that the coin is fair, we would expect five heads-up flips. Observing one head, or none at all, is at least as discrepant as what we observed, and the same applies to nine or ten heads. The probability of obtaining 0, 1, 9, or 10 heads-up flips is small, assuming the coin is fair: 22 of the 1,024 equally likely sequences, or about 0.02.
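The coin-flip calculation can be checked with a few lines of Python. This is only a sketch: the function name `two_sided_binomial_p` is made up for illustration, and it assumes "discrepant" means "at least as far from the expected number of heads as the observed count", which is one reasonable reading for a fair coin.

```python
from math import comb

def two_sided_binomial_p(k, n, prob=0.5):
    """Probability of a head count at least as far from the expected count as k."""
    expected = n * prob
    observed_gap = abs(k - expected)
    # Sum the binomial probabilities of every outcome that is as far, or
    # farther, from the expected count than the outcome we observed.
    return sum(
        comb(n, i) * prob**i * (1 - prob) ** (n - i)
        for i in range(n + 1)
        if abs(i - expected) >= observed_gap
    )

p = two_sided_binomial_p(1, 10)
print(round(p, 4))  # 0.0215
```

For one head in ten flips, the outcomes that count are 0, 1, 9, and 10 heads, which is where the figure of about 0.02 comes from.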
Thus, if we observe one of those outcomes, it seems clear that this represents a discrepancy from what we would expect if the coin were fair, and therefore, in some sense, represents evidence relevant to judging whether the coin is fair or not. The question has always been: how should we use this information? There is a rich, interesting, and ongoing philosophical debate about this question that I do not have space to engage with here. I will simply say that in much of psychology, common practice is to check whether the p value is less than .05 (as it is for 0, 1, 9, or 10 heads out of 10 flips) and, if so, to declare the null hypothesis false. This is bad science, as we’ll discuss in the context of the ASA’s six principles.
The American Statistical Association’s six principles regarding p values
- P-values can indicate how incompatible the data are with a specified statistical model.
This principle arises from the typical definition of a p value and is relatively uncontroversial, with the possible exception of the phrase "how incompatible". The same p value does not necessarily indicate the same incompatibility, particularly for different sample sizes (e.g., a moderate discrepancy with a moderate sample size can yield the same p value as a very tiny discrepancy with a large sample size). But p values are useful for informally assessing the performance of a model and perhaps finding specific weaknesses in it.
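To make the sample-size point concrete, here is a rough sketch using a one-sample z test with a known standard deviation. The test, the function name `two_sided_z_p`, and the specific numbers (a discrepancy of 0.5 standard deviations with 16 observations versus 0.02 standard deviations with 10,000) are all assumptions chosen for illustration, not anything taken from the ASA statement.

```python
from math import erfc, sqrt

def two_sided_z_p(effect_in_sd, n):
    """Two-sided p value for a one-sample z test with known standard deviation."""
    z = effect_in_sd * sqrt(n)
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))

p_moderate = two_sided_z_p(0.5, 16)      # moderate discrepancy, modest sample
p_tiny = two_sided_z_p(0.02, 10_000)     # very tiny discrepancy, large sample
print(round(p_moderate, 4), round(p_tiny, 4))
```

Both calls give a z statistic of 2 and therefore the same p value, a little under .05, even though the underlying discrepancies differ by a factor of 25.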
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
There's not much to say about this except that it is true. Although p values are often misunderstood as "the probability of the null hypothesis", this simply isn't correct. This misunderstanding can lead to weakly supported conclusions being published.
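One way to see why the two quantities differ is a back-of-the-envelope Bayes calculation. The sketch below is hypothetical: it considers the event "the result was significant" rather than a specific p value, and the prior probability of the null, the significance threshold, and the power are all invented inputs. The point is only that the answer depends on things the p value does not contain.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | result is significant), by Bayes' rule."""
    false_positives = alpha * prior_null        # null true, yet significant
    true_positives = power * (1 - prior_null)   # null false, and significant
    return false_positives / (false_positives + true_positives)

# Same threshold (.05) and same power (.5), but different prior beliefs.
for prior_null in (0.5, 0.9):
    posterior = prob_null_given_significant(prior_null, alpha=0.05, power=0.5)
    print(prior_null, round(posterior, 3))
```

Under these made-up inputs, the probability that the null is true after a significant result is about .09 when the null starts at even odds and about .47 when it starts at 90%. Neither number is .05.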
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
One should, in general, beware of arbitrary thresholds. There are two possible issues here: first, the evidence is not substantially different on either side of the threshold. Declaring evidence for an effect when p=.0499 but not when p=.0501 is self-evidently silly. In practice, this leads to opportunistic interpretations of the evidence around the threshold. A value that just misses the threshold tends to be treated as much more convincing when one wants to argue for an effect than when the existence of the effect would be problematic for one's preferred theory.
Second, likelihoodists and Bayesians argue that if you want a threshold on the evidence, p might be the wrong statistic for that threshold.
- Proper inference requires full reporting and transparency.
The interpretation of statistical results can be substantially affected by information that is not conveyed in the p value. Many articles rush to report a p value that helps their argument without reporting details such as stopping rules and the extent of exploratory analysis on the data. Good, cumulative science requires that researchers report all the relevant aspects of an analysis, and possibly share the data themselves.
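Stopping rules are a good example of why this matters. The simulation below is a minimal sketch under assumptions of my own choosing: the null hypothesis is true, the data are standard normal, the test is a z test with known variance, and the function name `false_positive_rate` and its parameters are hypothetical. It compares a researcher who tests once at 100 observations with one who tests after every 10 observations and stops at the first p < .05.

```python
import random
from math import sqrt

def false_positive_rate(peek_every, max_n, n_sims=2000, z_crit=1.96, seed=1):
    """Share of null-true simulations that cross |z| > z_crit at some peek."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # the null is true: mean 0, sd 1
            # z statistic for the running mean, computed only at a peek
            if n % peek_every == 0 and abs(total / sqrt(n)) > z_crit:
                hits += 1
                break  # stop collecting data at the first "significant" peek
    return hits / n_sims

fixed_n = false_positive_rate(peek_every=100, max_n=100)  # one test at n = 100
peeking = false_positive_rate(peek_every=10, max_n=100)   # test after every 10
print(fixed_n, peeking)
```

The first rate should land near the nominal .05, while the second should come out noticeably higher. Two papers reporting the same p value can therefore mean quite different things depending on a stopping rule that the p value itself does not reveal.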
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
A small p value may simply mean that we collected a lot of data and have the resolution to see very small effects. Good scientific reporting requires a numerical, visual, or practical assessment of the effect size found in an experiment. The importance of a finding must be assessed in light of the theoretical context and is largely independent of the size of the p value itself. A small p value might revolutionize physics or represent uninteresting nuisance variation (Meehl’s “crud” factor). Careful consideration of the research context is essential.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
This is likely to be surprising to many users of p values, because a common understanding of p values is that they measure the evidence against the null hypothesis. This is, however, an incorrect interpretation of p values. The distinction is rather subtle: a p value is evidence, but is not a measure of evidence. Getting a sense of the evidence requires bringing something more to the table. A frequentist needs something like the concept of power and an alternative hypothesis, because they want to ensure that the experiment provides a good test of the null hypothesis (e.g., Mayo & Spanos, 2006). A Bayesian requires an alternative hypothesis against which to compare the null (e.g., Bayes factors).
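As a rough illustration of what "bringing something more to the table" looks like for a Bayesian, here is a sketch of a Bayes factor for the earlier coin example (one head in ten flips). The alternative hypothesis used here, that the coin's heads probability is uniformly distributed between 0 and 1, is an assumption made purely for illustration; a different alternative would give a different number. The function name is hypothetical.

```python
from math import comb

def bayes_factor_uniform_vs_fair(k, n):
    """Bayes factor for 'theta uniform on [0, 1]' against 'theta = 0.5'."""
    # Averaging the binomial likelihood over a uniform prior on theta gives
    # 1 / (n + 1), whatever the observed number of heads k.
    marginal_alternative = 1 / (n + 1)
    # Likelihood of k heads in n flips if the coin is exactly fair.
    likelihood_null = comb(n, k) * 0.5**n
    return marginal_alternative / likelihood_null

print(round(bayes_factor_uniform_vs_fair(1, 10), 2))  # 9.31
```

Under this particular alternative, the data are about nine times more probable than under a fair coin. That figure is a statement about two hypotheses compared with each other, which is exactly the extra ingredient the p value of about 0.02 lacks.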
The statement by the American Statistical Association wraps up by suggesting that given the problems that researchers have interpreting p values, other approaches might be entertained, including confidence intervals and Bayesian methods. These alternative approaches, of course, come with their own conceptual challenges, some discussed previously on this blog.
Hopefully with their six principles, the American Statistical Association has laid the groundwork for a productive conversation in the research community about the future direction of methods in the sciences.